public class SeqUtil extends Object
Modifier and Type | Field and Description |
---|---|
static String | FASTA_HEADER_DEFAULT_DELIM - Default 1st character for a FASTA file: ">" |
static String | ILLUMINA_FW_READ_IND - Illumina forward read indicator found in sequence header |
static String | ILLUMINA_RV_READ_IND - Illumina reverse read indicator found in sequence header |
Modifier and Type | Method and Description |
---|---|
static long | countNumReads(File seqFile) - Count the number of reads in the given sequence file by counting the number of lines and dividing by the number of lines per read (fasta=2, fastq=4). |
static String | getHeader(String line) - Extract the header from the first line of a read (parse to first space). |
static Set<String> | getHeaders(File seq) - Return the header of each read in the sequence file. |
static Set<String> | getHeaders(File fwRead, File rvRead) - Get valid headers found in both forward and reverse reads. |
static String | getIupacBase(String base) - Return regex version of IUPAC DNA substitution bases so that only ACGT values are used. |
static int | getNumLinesPerRead() - Return number of lines per read (for fasta return 2, for fastq return 4). |
static long | getNumReads(long numLines) - Return number of reads given number of lines in a sample. |
static Map<File,File> | getPairedReads(Collection<File> files) - Paired reads must have a unique file suffix to identify forward and reverse reads. |
static String | getReadDirectionSuffix(File file) - Return read direction indicator for forward or reverse read if found in the file name. |
static String | getReadDirectionSuffix(String fileName) - Return read direction indicator for forward or reverse read if found in the file name. |
static String | getSampleId(String value) - Method extracts Sample ID from the name param. |
static List<File> | getSeqFiles(Collection<File> files) - Return only sequence files for sample IDs found in the metadata file. If Config."metadata.required" = "Y", an error is thrown to list the files that cannot be matched to a metadata row. |
static List<String> | getSeqHeaderChars() - Get all sequence header characters for fasta and fastq files. |
static String | getSeqType() - Get sequence type. |
static void | initialize() - Initialize Config params set by SeqUtil. |
protected static void | initSeqParams() - Set "input.ignoreFiles", , and . Ignore the metadata file Config."metadata.filePath". If the Config pipeline contains biolockj.module.seq or biolockj.module.classifier BioModules, ignore non-(fasta/fastq) files found in Constants.INPUT_DIRS. |
static boolean | isFastA() - Return TRUE if input files are in FastA format. |
static boolean | isFastQ() - Return TRUE if input files are in FastQ format. |
static boolean | isForwardRead(String name) - Return TRUE if reads are unpaired or if name does not contain the reverse read file suffix: |
static boolean | isGzipped(String fileName) - Determine if file is gzipped based on its file extension. Any file ending with ".gz" is treated as a gzipped file. |
static Boolean | isMultiplexed() - Check current state of sequence data. |
static boolean | isSeqFile(File file) - Verify 1st character of sequence header and mask 1st sequence for valid DNA/RNA bases "acgtu". |
static boolean | isSeqModule(BioModule module) - Check the module to determine if it generated sequence file output. |
static boolean | piplineHasSeqInput() - Return TRUE if pipeline input files are sequence files. |
protected static void | registerPairedReadStatus() - Inspect the pipeline input files to determine if input includes paired reads. |
static String | reverseComplement(String dna) - Return the DNA reverse complement for the input dna parameter. |
static String | scanFirstLine(BufferedReader reader, File file) - Return the 1st non-empty line and move the BufferedReader pointer to this line. |
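The read-count rule in the countNumReads summary (total lines divided by lines per read, 2 for fasta and 4 for fastq) can be sketched as follows. This is an illustrative sketch only, not the BioLockJ implementation: the class name ReadCountSketch, the method countReads, and the choice to count every line (including blank ones) are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical sketch of "count lines, divide by lines per read". */
final class ReadCountSketch {
    static final int FASTA_LINES_PER_READ = 2; // header + sequence
    static final int FASTQ_LINES_PER_READ = 4; // header + sequence + "+" + quality

    /** Counts all lines from the reader and divides by linesPerRead (integer division). */
    static long countReads(Reader source, int linesPerRead) {
        long numLines = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            while (reader.readLine() != null) {
                numLines++;
            }
        } catch (IOException e) {
            // Assumption: wrap I/O failures so callers need no checked exception.
            throw new UncheckedIOException(e);
        }
        return numLines / linesPerRead;
    }

    public static void main(String[] args) {
        String fastq = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n";
        // 8 lines / 4 lines per read
        System.out.println(countReads(new StringReader(fastq), FASTQ_LINES_PER_READ)); // prints 2
    }
}
```

The rule only holds when each fasta sequence occupies a single line; wrapped multi-line fasta records would need a different counting strategy.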
public static final String FASTA_HEADER_DEFAULT_DELIM
public static final String ILLUMINA_FW_READ_IND
public static final String ILLUMINA_RV_READ_IND
public static long countNumReads(File seqFile) throws Exception
seqFile - Sequence file
Exception - if errors occur
public static String getHeader(String line)
To extract the header, trim to the Illumina read direction indicator, if it exists, otherwise to the 1st space. Method recognizes the headers: " 1:N:" and " 2:N:"
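The trimming rule described above might look like the sketch below. It is a minimal illustration, not the library's code: the class HeaderSketch and method extractHeader are hypothetical names, the two indicator strings are copied from the description rather than read from ILLUMINA_FW_READ_IND and ILLUMINA_RV_READ_IND, and keeping the leading header character ("@" or ">") is an assumption.

```java
/** Hypothetical sketch of header extraction from the first line of a read. */
final class HeaderSketch {
    // Assumed values, taken from the description text above.
    static final String FW_READ_IND = " 1:N:";
    static final String RV_READ_IND = " 2:N:";

    /**
     * Trims the line to the Illumina read direction indicator if one is present,
     * otherwise to the first space; returns the whole line if neither is found.
     */
    static String extractHeader(String line) {
        for (String indicator : new String[] { FW_READ_IND, RV_READ_IND }) {
            int idx = line.indexOf(indicator);
            if (idx >= 0) {
                return line.substring(0, idx);
            }
        }
        int space = line.indexOf(' ');
        return space >= 0 ? line.substring(0, space) : line;
    }

    public static void main(String[] args) {
        System.out.println(extractHeader("@M01:7:000-A1B2C:1:1101:155:133 1:N:0:1"));
        System.out.println(extractHeader(">seq7 some free-text description"));
    }
}
```

Under this rule the forward and reverse lines of one read pair reduce to the same string, which is what allows headers to be matched across paired files.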
line - Sequence line
public static Set<String> getHeaders(File seq) throws Exception
seq - Fasta or Fastq file
Exception - if unable to parse the seq file
public static Set<String> getHeaders(File fwRead, File rvRead) throws Exception
fwRead - Forward read sequence file
rvRead - Reverse read sequence file
Exception - if I/O errors occur
public static String getIupacBase(String base)
base - DNA base
public static int getNumLinesPerRead() throws Exception
Exception - if unable to determine sequence format
public static long getNumReads(long numLines) throws Exception
numLines - Total number of lines in sample
Exception - if number of reads cannot be determined
public static Map<File,File> getPairedReads(Collection<File> files) throws Exception
files - List of paired read files
ConfigViolationException - if unpaired reads are found and Config . = "Y"
Exception - if other errors occur
public static String getReadDirectionSuffix(File file) throws Exception
file - Sequence file
Exception - if errors occur
public static String getReadDirectionSuffix(String fileName) throws Exception
fileName - Sequence file name
Exception - if errors occur
public static String getSampleId(String value) throws Exception
If Config."metadata.fileNameColumn" is supplied, then possible return values are limited to the given sample IDs, or "" if the file is not in the filename column.
value - File name or sequence header
Exception - if unable to determine Sample ID
public static List<File> getSeqFiles(Collection<File> files) throws Exception
If Config."metadata.required" = "Y", an error is thrown to list the files that cannot be matched to a metadata row.
files - List of input files
Exception - if no input files are found
public static final List<String> getSeqHeaderChars() throws Exception
Exception - if unable to determine sequence type
public static String getSeqType() throws ConfigNotFoundException
ConfigNotFoundException - if property is undefined
public static void initialize() throws Exception
Exception - if runtime errors occur
public static boolean isFastA() throws Exception
Exception - if unable to determine sequence type
public static boolean isFastQ() throws Exception
Exception - if unable to determine sequence type
public static boolean isForwardRead(String name)
name - Name of file
public static boolean isGzipped(String fileName)
fileName - File name
public static Boolean isMultiplexed() throws ConfigFormatException
ConfigFormatException - if property assignment is invalid
public static boolean isSeqFile(File file) throws Exception
file - File
Exception - if errors occur
public static boolean isSeqModule(BioModule module)
module - BioModule
public static boolean piplineHasSeqInput() throws Exception
Exception - if errors occur
public static String reverseComplement(String dna) throws Exception
dna - DNA base sequence
Exception - if sequence contains a non-standard letter (only ACGT accepted)
public static String scanFirstLine(BufferedReader reader, File file) throws Exception
reader - BufferedReader reader for sequence file
file - Sequence file read by the BufferedReader (used for log and error messages)
Exception - if errors occur
protected static void initSeqParams() throws Exception
Ignore the metadata file Config."metadata.filePath".
If the Config pipeline contains biolockj.module.seq or biolockj.module.classifier BioModules, ignore non-(fasta/fastq) files found in Constants.INPUT_DIRS.
Exception - if Constants.INPUT_DIRS is undefined or a file reader I/O Exception occurs
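As an illustration of the reverseComplement(String dna) contract documented earlier (only ACGT accepted, an error for any other letter), a minimal sketch could look like this. It is not the BioLockJ implementation: the class name ReverseComplementSketch is hypothetical, and both the use of IllegalArgumentException and the rejection of lowercase input are assumptions.

```java
/** Hypothetical sketch of a strict ACGT reverse complement. */
final class ReverseComplementSketch {

    /** Walks the sequence from the last base to the first, appending each base's complement. */
    static String reverseComplement(String dna) {
        StringBuilder out = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            out.append(complement(dna.charAt(i)));
        }
        return out.toString();
    }

    /** Maps A<->T and C<->G; anything else is rejected (assumed exception type). */
    private static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:
                throw new IllegalArgumentException("Non-standard base: " + base);
        }
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AACG")); // prints CGTT
    }
}
```

Applying the method twice returns the original sequence, which makes a convenient sanity check for any implementation of this contract.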