NAME
CracTools::Utils - A set of useful functions
VERSION
version 1.251
SYNOPSIS
# Reverse complementing a sequence
my $seq = reverseComplement("ATGC");
# Reading a FASTQ file
my $it = seqFileIterator('file.fastq','fastq');
while(my $entry = $it->()) {
print "Sequence name : $entry->{name}
Sequence : $entry->{seq}
Sequence quality: $entry->{qual}","\n";
}
# Reading paired-end files easier
my $it = pairedEndSeqFileIterator($reads1,$reads2,$format);
while (my $entry = $it->()) {
print "Read_1 : $entry->{read1}->{seq}
Read_2 : $entry->{read2}->{seq}";
}
# Parsing a GFF file
my $it = gffFileIterator($file);
while (my $annot = $it->()) {
print "chr : $annot->{chr}
start : $annot->{start}
end : $annot->{end}";
}
DESCRIPTION
Bio::Lite is a set of subroutines that aims to address the same needs as the Bio-Perl distribution in a FAST and SIMPLE way.
Bio::Lite does not use complex data structures or objects, which would slow down execution.
All methods can be imported with a single "use Bio::Lite".
Bio::Lite is a lightweight single module with NO DEPENDENCIES.
UTILS
reverseComplement
Reverse complement the nucleotide sequence given as argument.
Example:
my $seq_revcomp = reverseComplement($seq);
reverseComplement is more than 100x faster than Bio-Perl revcom_as_string()
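The operation itself can be sketched in a few lines of plain Perl. This is only an illustration of the technique, not the module's own code; the name reverse_complement_sketch is hypothetical, and only the bases A, C, G and T (either case) are handled.

```perl
use strict;
use warnings;

# Hypothetical stand-in for illustration, not the module's implementation.
sub reverse_complement_sketch {
    my ($seq) = @_;
    my $rc = reverse $seq;           # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # replace each base by its complement
    return $rc;
}

print reverse_complement_sketch("ATGC"), "\n";   # prints GCAT
```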
reverse_tab
Arg [1] : String - a string of comma-separated values.
Example : $reverse = reverse_tab('2,1,1,1,0,0,1');
Description : Reverse the order of the values in the string given as argument.
For example : reverse_tab('1,2,0,1') returns : '1,0,2,1'.
ReturnType : String
Exceptions : none
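A minimal sketch of the same behavior, assuming the input is a plain comma-separated string. The name reverse_tab_sketch is hypothetical and the code is not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical stand-in: split on commas, reverse the list, join it back.
sub reverse_tab_sketch {
    my ($str) = @_;
    return join ',', reverse split /,/, $str;
}

print reverse_tab_sketch('1,2,0,1'), "\n";   # prints 1,0,2,1
```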
isVersionGreaterOrEqual($v1,$v2)
Return true if version number v1 is greater than or equal to v2.
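How the versions are compared is not specified here. The sketch below assumes dotted numeric versions compared field by field, with missing fields treated as 0; the module may use different rules. The name version_ge_sketch is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical comparison of dotted numeric versions (assumed semantics).
sub version_ge_sketch {
    my ($v1, $v2) = @_;
    my @a = split /\./, $v1;
    my @b = split /\./, $v2;
    while (@a || @b) {
        my $x = shift(@a) // 0;   # a missing field counts as 0
        my $y = shift(@b) // 0;
        return 1 if $x > $y;
        return 0 if $x < $y;
    }
    return 1;                     # all fields equal
}

print version_ge_sketch('1.10', '1.9') ? "yes\n" : "no\n";   # prints yes
```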
convertStrand
Convert strand from '+/-' standard to '1/-1' standard and the opposite.
Example:
say "Forward a: ",convertStrand('+');
say "Forward b: ",convertStrand(1);
say "Reverse a: ",convertStrand('-');
say "Reverse b: ",convertStrand(-1);
will print
Forward a: 1
Forward b: +
Reverse a: -1
Reverse b: -
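The two-way mapping shown above can be sketched with a single lookup table. The names below are hypothetical and the code is only an illustration; values outside the four known ones simply return undef here, which is an assumption.

```perl
use strict;
use warnings;

# Hypothetical lookup table covering both directions of the conversion.
my %STRAND_MAP = ('+' => 1, '-' => -1, '1' => '+', '-1' => '-');

sub convert_strand_sketch {
    my ($strand) = @_;
    return $STRAND_MAP{$strand};   # undef for an unknown value
}

print "Forward a: ", convert_strand_sketch('+'), "\n";   # prints Forward a: 1
print "Reverse b: ", convert_strand_sketch(-1), "\n";    # prints Reverse b: -
```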
removeChrPrefix
Remove the "chr" prefix from a given string
Example:
say "reference name: ",removeChrPrefix("chr1");
will print
reference name: 1
addChrPrefix
Add the "chr" prefix to the given string
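Both helpers can be sketched with a regular expression. The names are hypothetical; leaving an already prefixed name untouched in add_chr_prefix_sketch is a design choice of this sketch, not documented behavior of the module.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for illustration.
sub remove_chr_prefix_sketch {
    my ($name) = @_;
    $name =~ s/^chr//;    # strip a leading "chr" only
    return $name;
}

sub add_chr_prefix_sketch {
    my ($name) = @_;
    # Assumption: do not add the prefix twice.
    return $name =~ /^chr/ ? $name : "chr$name";
}

print "reference name: ", remove_chr_prefix_sketch("chr1"), "\n";   # prints reference name: 1
print "reference name: ", add_chr_prefix_sketch("1"), "\n";         # prints reference name: chr1
```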
ENCODING
encodePosListToBase64
Encode a (0-based) list of increasing positions into a string using the Base64 encoding scheme: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
my $encoded_list = CracTools::Utils::encodePosListToBase64(1,3,5,8,12,32);
my @decoded_list = CracTools::Utils::decodePosListInBase64($encoded_list);
decodePosListInBase64
Decode position list encoded by encodePosListToBase64.
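The exact encoding used by the module is not described here. As an illustration of the general idea only, the sketch below stores the positions as set bits in a bit vector and Base64-encodes the resulting bytes with the core MIME::Base64 module. This scheme is an assumption, the function names are hypothetical, and strings produced this way should not be expected to be interchangeable with those of encodePosListToBase64.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical scheme: one bit per position, then standard Base64.
sub encode_pos_list_sketch {
    my @positions = @_;
    my $bits = '';
    vec($bits, $_, 1) = 1 for @positions;   # set the bit at each position
    return encode_base64($bits, '');        # '' disables line breaks
}

sub decode_pos_list_sketch {
    my ($encoded) = @_;
    my $bits = decode_base64($encoded);
    # Collect every bit index that is set, in increasing order.
    return grep { vec($bits, $_, 1) } 0 .. 8 * length($bits) - 1;
}

my $encoded = encode_pos_list_sketch(1, 3, 5, 8, 12, 32);
print join(',', decode_pos_list_sketch($encoded)), "\n";   # prints 1,3,5,8,12,32
```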
PARSING
These are tools for reading (bio) files such as:
- Sequence files : FASTA, FASTQ
- Annotation files : GFF3, GTF2, BED6, BED12, ...
- Alignment files : SAM, BAM
seqFileIterator
Open FASTA or FASTQ files (which can be gzipped). seqFileIterator detects the format automatically from the file extension, but you can force it with a second parameter giving the format: 'fasta' or 'fastq'.
Example:
my $it = seqFileIterator('file.fastq','fastq');
while(my $entry = $it->()) {
print "Sequence name : $entry->{name}
Sequence : $entry->{seq}
Sequence quality: $entry->{qual}","\n";
}
Return: HashRef
{ name => 'sequence_identifier',
seq => 'sequence_value',
qual => 'sequence_quality', # only defined for FASTQ files
}
seqFileIterator is more than 50x faster than Bio-Perl Bio::SeqIO for FASTQ files.
seqFileIterator is 4x faster than Bio-Perl Bio::SeqIO for FASTA files.
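The iterator pattern used throughout this section is a closure that returns one record per call and undef at end of file. The sketch below shows that pattern for FASTQ only, assuming strict four-line records and reading from an in-memory filehandle. The name fastq_iterator_sketch is hypothetical and this is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure yielding one FASTQ record per call.
sub fastq_iterator_sketch {
    my ($fh) = @_;
    return sub {
        my $header = <$fh>;
        return unless defined $header;   # end of file
        my $seq  = <$fh>;
        my $plus = <$fh>;                # '+' separator line, ignored
        my $qual = <$fh>;
        return unless defined $qual;     # truncated record
        chomp($header, $seq, $qual);
        $header =~ s/^@//;               # drop the leading '@'
        return { name => $header, seq => $seq, qual => $qual };
    };
}

my $fastq = "\@read1\nACGT\n+\nIIII\n\@read2\nTTGA\n+\nHHHH\n";
open(my $fastq_fh, '<', \$fastq) or die "Cannot open in-memory FASTQ: $!";
my $fastq_it = fastq_iterator_sketch($fastq_fh);
while (my $entry = $fastq_it->()) {
    print "$entry->{name}\t$entry->{seq}\t$entry->{qual}\n";
}
```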
pairedEndSeqFileIterator
Open paired-end sequence files using seqFileIterator().
Paired-end files are generated by Next Generation Sequencing technologies (like Illumina) where two reads are sequenced from the same DNA fragment and saved in separate files.
Example:
my $it = pairedEndSeqFileIterator($reads1,$reads2,$format);
while (my $entry = $it->()) {
print "Read_1 : $entry->{read1}->{seq}
Read_2 : $entry->{read2}->{seq}";
}
Return: HashRef
{ read1 => 'see seqFileIterator() return',
read2 => 'see seqFileIterator() return'
}
pairedEndSeqFileIterator has no equivalent in Bio-Perl
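Pairing amounts to advancing two iterators in lockstep. In the sketch below, array-backed iterators stand in for seqFileIterator() so the example runs on its own. All names are hypothetical, and stopping as soon as either side is exhausted is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical array-backed iterator, standing in for a file iterator.
sub array_iterator_sketch {
    my @items = @_;
    return sub { return shift @items };
}

# Hypothetical pairing iterator: one entry from each source per call.
sub paired_iterator_sketch {
    my ($it1, $it2) = @_;
    return sub {
        my $r1 = $it1->();
        my $r2 = $it2->();
        return unless defined $r1 && defined $r2;   # stop at the shorter input
        return { read1 => $r1, read2 => $r2 };
    };
}

my $pair_it = paired_iterator_sketch(
    array_iterator_sketch({ name => 'r1/1', seq => 'ACGT' }, { name => 'r2/1', seq => 'GGCC' }),
    array_iterator_sketch({ name => 'r1/2', seq => 'TTAA' }, { name => 'r2/2', seq => 'CCGG' }),
);
while (my $entry = $pair_it->()) {
    print "Read_1 : $entry->{read1}->{seq}\tRead_2 : $entry->{read2}->{seq}\n";
}
```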
writeSeq
CracTools::Utils::writeSeq($filehandle,$format,$seq_name,$seq,$seq_qual)
Write the sequence in the output stream with the specified format.
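A minimal sketch of such a writer, following the argument order shown above. The name write_seq_sketch is hypothetical, and the exact output layout (single-line sequence, bare '+' separator) is an assumption rather than the module's documented behavior.

```perl
use strict;
use warnings;

# Hypothetical writer for 'fasta' and 'fastq' records.
sub write_seq_sketch {
    my ($fh, $format, $name, $seq, $qual) = @_;
    if ($format eq 'fasta') {
        print {$fh} ">$name\n$seq\n";
    } elsif ($format eq 'fastq') {
        print {$fh} "\@$name\n$seq\n+\n$qual\n";
    } else {
        die "Unknown format: $format";
    }
    return;
}

my $buffer = '';
open(my $buffer_fh, '>', \$buffer) or die "Cannot open in-memory buffer: $!";
write_seq_sketch($buffer_fh, 'fastq', 'read1', 'ACGT', 'IIII');
close $buffer_fh;
print $buffer;
```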
bedFileIterator
Manage the BED file format.
Example:
my $it = bedFileIterator($file);
while (my $annot = $it->()) {
print "chr : $annot->{chr}
start : $annot->{start}
end : $annot->{end}";
}
Return a hashref with the annotation parsed:
{ chr => 'field_1',
start => 'field_2',
end => 'field_3',
name => 'field_4',
score => 'field_5',
strand => 'field_6',
thick_start => 'field_7',
thick_end => 'field_8',
rgb => 'field_9',
blocks => [ {'size' => 'block size',
'start' => 'block start',
'end' => 'block start + block_size',
'ref_start' => 'block start on the reference',
'ref_end' => 'block end on the reference'}, ... ],
seek_pos => 'Seek position of this line in the file',
}
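To make the blocks entries concrete, here is a sketch that parses one tab-separated BED12 line into the structure above. It assumes the standard BED12 layout (block sizes in field 11, block starts relative to the feature start in field 12) and that ref_start and ref_end are obtained by adding the feature start. The name parse_bed_line_sketch is hypothetical and seek_pos is omitted.

```perl
use strict;
use warnings;

# Hypothetical BED line parser (assumed field semantics, see lead-in).
sub parse_bed_line_sketch {
    my ($line) = @_;
    chomp $line;
    my @f = split /\t/, $line;
    my %annot;
    @annot{qw(chr start end name score strand thick_start thick_end rgb)} = @f[0 .. 8];
    my @blocks;
    if (defined $f[11]) {
        my @sizes  = split /,/, $f[10];
        my @starts = split /,/, $f[11];
        for my $i (0 .. $#sizes) {
            push @blocks, {
                size      => $sizes[$i],
                start     => $starts[$i],
                end       => $starts[$i] + $sizes[$i],
                ref_start => $f[1] + $starts[$i],
                ref_end   => $f[1] + $starts[$i] + $sizes[$i],
            };
        }
    }
    $annot{blocks} = \@blocks;
    return \%annot;
}

my $bed = parse_bed_line_sketch("chr1\t1000\t5000\tgene1\t0\t+\t1200\t4800\t0\t2\t567,488,\t0,3512\n");
print "$bed->{chr}: second block spans $bed->{blocks}[1]{ref_start}-$bed->{blocks}[1]{ref_end}\n";
# prints chr1: second block spans 4512-5000
```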
gffFileIterator
Manage the GFF3 and GTF2 file formats.
Example:
my $it = gffFileIterator($file,'type');
while (my $annot = $it->()) {
print "chr : $annot->{chr}
start : $annot->{start}
end : $annot->{end}";
}
Return a hashref with the annotation parsed:
{ chr => 'field_1',
source => 'field_2',
feature => 'field_3',
start => 'field_4',
end => 'field_5',
score => 'field_6',
strand => 'field_7',
frame => 'field_8',
attributes => { 'attribute_id' => 'attribute_value', ...},
seek_pos => 'Seek position of this line in the file',
}
gffFileIterator is 5x faster than Bio-Perl Bio::Tools::GFF
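The returned structure can be illustrated with a sketch that parses a single GFF3 line. Only the GFF3 attribute syntax (key=value pairs separated by semicolons) is handled; GTF2 attributes use a different syntax and are not covered. The name parse_gff3_line_sketch is hypothetical and seek_pos is omitted.

```perl
use strict;
use warnings;

# Hypothetical GFF3 line parser, for illustration only.
sub parse_gff3_line_sketch {
    my ($line) = @_;
    chomp $line;
    my @f = split /\t/, $line;
    my %annot;
    @annot{qw(chr source feature start end score strand frame)} = @f[0 .. 7];
    my %attributes;
    for my $pair (split /;/, $f[8] // '') {
        my ($key, $value) = split /=/, $pair, 2;
        $attributes{$key} = $value if defined $key && length $key;
    }
    $annot{attributes} = \%attributes;
    return \%annot;
}

my $gff = parse_gff3_line_sketch("chr1\thavana\tgene\t11869\t14409\t.\t+\t.\tID=gene1;Name=DDX11L1\n");
print "$gff->{feature} $gff->{attributes}{Name} at $gff->{chr}:$gff->{start}-$gff->{end}\n";
# prints gene DDX11L1 at chr1:11869-14409
```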
vcfFileIterator
Manage the VCF file format.
Return a hashref with the annotation parsed:
{ chr => $chr,
pos => $pos,
id => $id,
ref => $ref,
alt => [ alt1, alt2, ...],
qual => $qual,
filter => $filter,
info => { AS => value,
DP => value,
...
},
}
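As an illustration of this structure, the sketch below parses the eight fixed VCF columns of one data line. The name parse_vcf_line_sketch is hypothetical; storing value-less INFO flags as 1 is an assumption of this sketch, and sample columns are ignored.

```perl
use strict;
use warnings;

# Hypothetical VCF data line parser (fixed columns only).
sub parse_vcf_line_sketch {
    my ($line) = @_;
    chomp $line;
    my ($chr, $pos, $id, $ref, $alt, $qual, $filter, $info_str) = split /\t/, $line;
    my %info;
    for my $item (split /;/, $info_str // '') {
        my ($key, $value) = split /=/, $item, 2;
        $info{$key} = defined $value ? $value : 1;   # assumption: flags become 1
    }
    return {
        chr    => $chr,
        pos    => $pos,
        id     => $id,
        ref    => $ref,
        alt    => [ split /,/, $alt ],
        qual   => $qual,
        filter => $filter,
        info   => \%info,
    };
}

my $vcf = parse_vcf_line_sketch("20\t14370\trs6054257\tG\tA,T\t29\tPASS\tDP=14;AF=0.5;DB\n");
print "$vcf->{chr}:$vcf->{pos} $vcf->{ref}>", join('/', @{ $vcf->{alt} }), " DP=$vcf->{info}{DP}\n";
# prints 20:14370 G>A/T DP=14
```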
chimCTFileIterator
Return a hashref with the chimera parsed:
{
sample => $sample,
chim_key => $chim_key,
name => $name,
chr1 => $chr1,
pos1 => $pos1,
strand1 => $strand1,
chr2 => $chr2,
pos2 => $pos2,
strand2 => $strand2,
chim_value => $chim_value,
spanning_junction => $spanning_junction,
spanning_PE => $spanning_PE,
class => $class,
comments => { comment_id => 'comment_value', ... },
extended_fields => { extended_field_id => 'extended_field_value', ... },
}
bamFileIterator
BE AWARE: this method is only available if the samtools binary is available.
Return an iterator over a BAM file using a samtools view pipe.
A region can be passed as a parameter to restrict the results. In this case the BAM file must be indexed.
Example:
my $fh = bamFileIterator("file.bam","17:43,971,748-44,105,700");
while(my $line = <$fh>) {
my $parsed_line = CracTools::SAMReader::SAMline->new($line);
# do some stuff
}
See also CracTools::SAMReader::SAMline if you need to parse SAM lines easily.
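A "samtools view pipe" can be sketched with Perl's list form of open, which avoids shell quoting issues. The function names below are hypothetical and the exact command line used by the module is not documented here. The example only builds and prints the command, so it runs without samtools installed.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for "samtools view".
sub samtools_view_command_sketch {
    my ($bam, $region) = @_;
    my @cmd = ('samtools', 'view', $bam);
    push @cmd, $region if defined $region;   # optional region restriction
    return @cmd;
}

# Hypothetical helper: open a read pipe on the command's output.
# Calling this requires the samtools binary to be in the PATH.
sub bam_pipe_sketch {
    my @cmd = samtools_view_command_sketch(@_);
    open(my $fh, '-|', @cmd) or die "Cannot run @cmd: $!";
    return $fh;
}

print join(' ', samtools_view_command_sketch("file.bam", "17:43,971,748-44,105,700")), "\n";
```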
getSeqFromIndexedRef
BE AWARE: this method is only available if the samtools binary is available.
Return a sequence from a given region of an indexed FASTA file.
Example:
my $fasta_seq = getSeqFromIndexedRef("file.fa","chr2",29012,10);
my $seq = getSeqFromIndexedRef("file.fa","chr2",29012,10,'raw');
PARSING LINES
parseBedLine
parseGFFLine
parseVCFLine
parseChimCTLine
parseSAMLineLite
parseCigarChain
Given a CIGAR chain (see SAM specification), return a parsed version as an Array ref of cigar elements represented as { nb => 10, op => 'M' }.
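The described output can be sketched with a global regular expression match over the operators defined in the SAM specification (M, I, D, N, S, H, P, = and X). The name parse_cigar_sketch is hypothetical; this is not the module's code.

```perl
use strict;
use warnings;

# Hypothetical CIGAR parser returning an array ref of { nb, op } elements.
sub parse_cigar_sketch {
    my ($cigar) = @_;
    my @elements;
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        push @elements, { nb => $1, op => $2 };
    }
    return \@elements;
}

my $cigar = parse_cigar_sketch('10M2I5M');
print join(' ', map { "$_->{nb}$_->{op}" } @$cigar), "\n";   # prints 10M 2I 5M
```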
FILES IO
getFileIterator
Generic method to parse files.
getReadingFileHandle
Return a file handle for reading the file given as argument. Display errors if the file cannot be opened, and handle gzipped files (based on the .gz file extension).
Example:
my $fh = getReadingFileHandle('file.txt.gz');
while(<$fh>) {
print $_;
}
close $fh;
getWritingFileHandle
Return a file handle for writing to the file given as argument. Display errors if the file cannot be opened, and handle gzipped files (based on the .gz file extension).
Example:
my $fh = getWritingFileHandle('file.txt.gz');
print $fh "Hello world\n";
close $fh;
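Choosing a handle from the .gz extension can be sketched with the core modules IO::Compress::Gzip and IO::Uncompress::Gunzip. This is one possible approach, not necessarily the one used by the module; the function names are hypothetical, and the example writes to a temporary directory.

```perl
use strict;
use warnings;
use IO::Compress::Gzip ();
use IO::Uncompress::Gunzip ();
use File::Temp qw(tempdir);

# Hypothetical reader: transparent decompression for *.gz files.
sub reading_fh_sketch {
    my ($file) = @_;
    my $fh;
    if ($file =~ /\.gz$/) {
        $fh = IO::Uncompress::Gunzip->new($file)
            or die "Cannot open $file: $IO::Uncompress::Gunzip::GunzipError";
    } else {
        open($fh, '<', $file) or die "Cannot open $file: $!";
    }
    return $fh;
}

# Hypothetical writer: transparent compression for *.gz files.
sub writing_fh_sketch {
    my ($file) = @_;
    my $fh;
    if ($file =~ /\.gz$/) {
        $fh = IO::Compress::Gzip->new($file)
            or die "Cannot open $file: $IO::Compress::Gzip::GzipError";
    } else {
        open($fh, '>', $file) or die "Cannot open $file: $!";
    }
    return $fh;
}

my $tmp_dir = tempdir(CLEANUP => 1);
my $gz_path = "$tmp_dir/hello.txt.gz";
my $out_fh  = writing_fh_sketch($gz_path);
print {$out_fh} "Hello world\n";
close $out_fh;

my $in_fh = reading_fh_sketch($gz_path);
while (my $line = <$in_fh>) {
    print $line;
}
close $in_fh;
```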
getLineFromSeekPos
getLineFromSeekPos($filehandle,$seek_pos);
Return the chomped line found at the given seek position.
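This pairs naturally with the seek_pos values returned by the file iterators above. A minimal sketch using Perl's built-in seek, with an in-memory filehandle so it runs on its own; the name line_from_seek_pos_sketch is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in: jump to a byte offset and read one line.
sub line_from_seek_pos_sketch {
    my ($fh, $seek_pos) = @_;
    seek($fh, $seek_pos, 0) or die "seek failed: $!";   # 0 = offset from file start
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

my $content = "first\nsecond\nthird\n";
open(my $content_fh, '<', \$content) or die "Cannot open in-memory file: $!";
print line_from_seek_pos_sketch($content_fh, 6), "\n";   # prints second
```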
AUTHORS
Nicolas PHILIPPE <nphilippe.research@gmail.com>
Jérôme AUDOUX <jaudoux@cpan.org>
Sacha BEAUMEUNIER <sacha.beaumeunier@gmail.com>
COPYRIGHT AND LICENSE
This software is Copyright (c) 2017 by IRMB/INSERM (Institute for Regenerative Medicine and Biotherapy / Institut National de la Santé et de la Recherche Médicale) and AxLR/SATT (Languedoc Roussillon / Societe d'Acceleration de Transfert de Technologie).
This is free software, licensed under:
The GNU Affero General Public License, Version 3, November 2007