NAME

CracTools::Utils - A set of useful functions

VERSION

version 1.25

SYNOPSIS

# Reverse complementing a sequence
my $seq = reverseComplement("ATGC");

# Reading a FASTQ file
my $it = seqFileIterator('file.fastq','fastq');
while(my $entry = $it->()) {
  print "Sequence name   : $entry->{name}
         Sequence        : $entry->{seq}
         Sequence quality: $entry->{qual}","\n";
}

# Reading paired-end files more easily
my $it = pairedEndSeqFileIterator($reads1,$reads2,$format);
while (my $entry = $it->()) {
  print "Read_1 : $entry->{read1}->{seq}
         Read_2 : $entry->{read2}->{seq}";
}

# Parsing a GFF file
my $it = gffFileIterator($file);
while (my $annot = $it->()) {
  print "chr    : $annot->{chr}
         start  : $annot->{start}
         end    : $annot->{end}";
}

DESCRIPTION

Bio::Lite is a set of subroutines that aims to answer the same kind of questions as the BioPerl distribution, in a FAST and SIMPLE way.

Bio::Lite does not use complex data structures or objects, which would slow down execution.

All methods can be imported with a single "use Bio::Lite".

Bio::Lite is a lightweight single module with NO DEPENDENCIES.

UTILS

reverseComplement

Reverse complement the nucleotide sequence given as argument.

Example:

my $seq_revcomp = reverseComplement($seq);

reverseComplement is more than 100x faster than BioPerl's revcom_as_string().
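
As an illustration of the operation itself, here is a minimal pure-Perl sketch. The name revcomp_sketch is hypothetical and this is not the module's own code; it assumes plain A/C/G/T input and leaves any other character (such as N) untouched.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
sub revcomp_sketch {
    my ($seq) = @_;
    my $rc = reverse $seq;           # scalar context: reverse the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # complement each base, keeping case
    return $rc;
}

print revcomp_sketch("ATGC"), "\n";  # prints GCAT
```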

reverse_tab

Arg [1] : String - a string of values separated by commas.
Example : $reverse = reverse_tab('2,1,1,1,0,0,1');
Description : Reverse the values of the string in argument.
              For example : reverse_tab('1,2,0,1') returns : '1,0,2,1'.
ReturnType  : String
Exceptions  : none
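
The behaviour described above can be sketched with split, reverse and join. The name reverse_tab_sketch is hypothetical and this is only an assumption about how such a function could be written, not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
sub reverse_tab_sketch {
    my ($str) = @_;
    # Split on commas, reverse the list, and glue it back together.
    return join ',', reverse split /,/, $str;
}

print reverse_tab_sketch('1,2,0,1'), "\n";  # prints 1,0,2,1
```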

isVersionGreaterOrEqual($v1,$v2)

Return true if version number v1 is greater than or equal to v2.
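
One way such a comparison could work is sketched below. The name version_ge_sketch is hypothetical, and the sketch assumes purely numeric, dot-separated version strings compared component by component; the module's actual rules may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
sub version_ge_sketch {
    my ($v1, $v2) = @_;
    my @a = split /\./, $v1;
    my @b = split /\./, $v2;
    while (@a || @b) {
        my $x = shift(@a) // 0;   # missing components count as 0
        my $y = shift(@b) // 0;
        return $x > $y ? 1 : 0 if $x != $y;
    }
    return 1;                     # all components equal
}

print version_ge_sketch('1.10', '1.9') ? "yes\n" : "no\n";  # prints yes
```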

convertStrand

Convert strand from '+/-' standard to '1/-1' standard and the opposite.

Example:

say "Forward a: ",convertStrand('+');
say "Forward b: ",convertStrand(1);
say "Reverse a: ",convertStrand('-');
say "Reverse b: ",convertStrand(-1);

will print

Forward a: 1
Forward b: +
Reverse a: -1
Reverse b: -
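
The two-way mapping shown above can be sketched as a simple lookup table. The names %STRAND and convert_strand_sketch are hypothetical, and this is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical lookup table, for illustration only.
my %STRAND = ('+' => 1, '-' => -1, '1' => '+', '-1' => '-');

sub convert_strand_sketch {
    my ($strand) = @_;
    return $STRAND{$strand};   # undef for unrecognised input
}

print "Forward a: ", convert_strand_sketch('+'), "\n";
print "Forward b: ", convert_strand_sketch(1),   "\n";
print "Reverse a: ", convert_strand_sketch('-'), "\n";
print "Reverse b: ", convert_strand_sketch(-1),  "\n";
```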

removeChrPrefix

Remove the "chr" prefix from a given string

Example:

say "reference name: ",removeChrPrefix("chr1");

will print

reference name: 1

addChrPrefix

Add the "chr" prefix to the given string
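
Both prefix helpers can be sketched with one regular expression each. The names below are hypothetical. Note that this sketch chooses not to add a second prefix when one is already present, which is an assumption and may not match the module's behaviour.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
sub remove_chr_sketch {
    my ($name) = @_;
    $name =~ s/^chr//;            # strip a leading "chr" if present
    return $name;
}

sub add_chr_sketch {
    my ($name) = @_;
    return $name =~ /^chr/ ? $name : "chr$name";
}

print "reference name: ", remove_chr_sketch("chr1"), "\n";  # prints 1
print "reference name: ", add_chr_sketch("X"), "\n";        # prints chrX
```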

ENCODING

encodePosListToBase64

Encode a (0-based) list of increasing positions to a string using the Base64 encoding scheme: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

my $encoded_list = CracTools::Utils::encodePosListToBase64(1,3,5,8,12,32);
my @decoded_list = CracTools::Utils::decodePosListInBase64($encoded_list);

decodePosListInBase64

Decode position list encoded by encodePosListToBase64.

PARSING

These tools read common bioinformatics file formats, such as:

Sequence files : FASTA, FASTQ
Annotation files : GFF3, GTF2, BED6, BED12, ...
Alignment files : SAM, BAM

seqFileIterator

Open FASTA or FASTQ files (which can be gzipped). seqFileIterator detects the format automatically from the file extension, but you can force it with a second parameter: 'fasta' or 'fastq'.

Example:

my $it = seqFileIterator('file.fastq','fastq');
while(my $entry = $it->()) {
  print "Sequence name   : $entry->{name}
         Sequence        : $entry->{seq}
         Sequence quality: $entry->{qual}","\n";
}

Return: HashRef

{ name => 'sequence_identifier',
  seq  => 'sequence_value',
  qual => 'sequence_quality', # only defined for FASTQ files
}

seqFileIterator is more than 50x faster than BioPerl's Bio::SeqIO for FASTQ files, and 4x faster for FASTA files.
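
The iterator pattern used here (a code reference that returns one hashref per call and a false value at end of file) can be sketched as follows. The name fastq_iterator_sketch is hypothetical; the sketch assumes strict four-line FASTQ records and reads from an in-memory filehandle so that it runs without any input file. It is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical closure-based FASTQ reader, for illustration only.
sub fastq_iterator_sketch {
    my ($fh) = @_;
    return sub {
        my $header = <$fh>;
        return unless defined $header;   # end of file
        my $seq  = <$fh>;
        my $plus = <$fh>;                # separator line, ignored
        my $qual = <$fh>;
        return unless defined $qual;     # truncated record
        chomp($header, $seq, $qual);
        $header =~ s/^@//;
        return { name => $header, seq => $seq, qual => $qual };
    };
}

my $data = "\@read1\nACGT\n+\nIIII\n\@read2\nTTGA\n+\nFFFF\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my $it = fastq_iterator_sketch($fh);
while (my $entry = $it->()) {
    print "$entry->{name}\t$entry->{seq}\t$entry->{qual}\n";
}
close $fh;
```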

pairedEndSeqFileIterator

Open Paired-End Sequence files using seqFileIterator()

Paired-end files are generated by Next Generation Sequencing technologies (like Illumina), where two reads are sequenced from the same DNA fragment and saved in separate files.

Example:

my $it = pairedEndSeqFileIterator($reads1,$reads2,$format);
while (my $entry = $it->()) {
  print "Read_1 : $entry->{read1}->{seq}
         Read_2 : $entry->{read2}->{seq}";
}

Return: HashRef

{ read1 => 'see seqFileIterator() return',
  read2 => 'see seqFileIterator() return'
}

pairedEndSeqFileIterator has no equivalent in Bio-Perl
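
The pairing idea can be sketched by wrapping two single-file iterators in a third closure. All names below are hypothetical, and array-backed iterators stand in for real sequence files; the sketch stops as soon as either input is exhausted, which is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a sequence file iterator.
sub array_iterator_sketch {
    my @items = @_;
    return sub { return shift @items; };   # undef once empty
}

# Hypothetical pairing closure, for illustration only.
sub paired_iterator_sketch {
    my ($it1, $it2) = @_;
    return sub {
        my $r1 = $it1->();
        my $r2 = $it2->();
        return unless defined($r1) && defined($r2);
        return { read1 => $r1, read2 => $r2 };
    };
}

my $it = paired_iterator_sketch(
    array_iterator_sketch({ seq => 'ACGT' }, { seq => 'GGCC' }),
    array_iterator_sketch({ seq => 'TTAA' }, { seq => 'CCGG' }),
);
while (my $entry = $it->()) {
    print "Read_1 : $entry->{read1}->{seq}\tRead_2 : $entry->{read2}->{seq}\n";
}
```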

writeSeq

CracTools::Utils::writeSeq($filehandle,$format,$seq_name,$seq,$seq_qual)

Write the sequence in the output stream with the specified format.
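
A minimal sketch of what writing one record in each format could look like is given below. The name write_seq_sketch is hypothetical; it assumes unwrapped single-line sequences and the format names 'fasta' and 'fastq', and it is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical writer, for illustration only.
sub write_seq_sketch {
    my ($fh, $format, $name, $seq, $qual) = @_;
    if ($format eq 'fasta') {
        print {$fh} ">$name\n$seq\n";
    }
    elsif ($format eq 'fastq') {
        print {$fh} "\@$name\n$seq\n+\n$qual\n";
    }
    else {
        die "Unknown format: $format";
    }
    return;
}

write_seq_sketch(\*STDOUT, 'fasta', 'seq1', 'ACGT');
write_seq_sketch(\*STDOUT, 'fastq', 'read1', 'ACGT', 'IIII');
```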

bedFileIterator

Manage the BED file format.

Example:

my $it = bedFileIterator($file);
while (my $annot = $it->()) {
  print "chr    : $annot->{chr}
         start  : $annot->{start}
         end    : $annot->{end}";
}

Return a hashref with the annotation parsed:

{ chr         => 'field_1',
  start       => 'field_2',
  end         => 'field_3',
  name        => 'field_4',
  score       => 'field_5',
  strand      => 'field_6',
  thick_start => 'field_7',
  thick_end   => 'field_8',
  rgb         => 'field_9',
  blocks      => [ {'size' => 'block size',
                    'start' => 'block start',
                    'end'   => 'block start + block_size',
                    'ref_start' => 'block start on the reference',
                    'ref_end'   => 'block end on the reference'}, ... ],
  seek_pos    => 'Seek position of this line in the file',
}
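
The relationship between the block fields can be sketched from the BED12 columns blockSizes and blockStarts. The name bed_blocks_sketch is hypothetical; the sketch assumes that 'ref_start' and 'ref_end' are obtained by adding the feature start to the block-relative coordinates, as the field descriptions above suggest.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
sub bed_blocks_sketch {
    my ($chrom_start, $block_sizes, $block_starts) = @_;
    my @sizes  = split /,/, $block_sizes;    # a trailing comma is ignored
    my @starts = split /,/, $block_starts;
    my @blocks;
    for my $i (0 .. $#sizes) {
        my $end = $starts[$i] + $sizes[$i];
        push @blocks, {
            size      => $sizes[$i],
            start     => $starts[$i],
            end       => $end,
            ref_start => $chrom_start + $starts[$i],
            ref_end   => $chrom_start + $end,
        };
    }
    return \@blocks;
}

my $blocks = bed_blocks_sketch(1000, '567,488,', '0,3512,');
print "block $_->{start}-$_->{end} on ref $_->{ref_start}-$_->{ref_end}\n"
    for @$blocks;
```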

gffFileIterator

Manage the GFF3 and GTF2 file formats.

Example:

my $it = gffFileIterator($file,'type');
while (my $annot = $it->()) {
  print "chr    : $annot->{chr}
         start  : $annot->{start}
         end    : $annot->{end}";
}

Return a hashref with the annotation parsed:

{ chr         => 'field_1',
  source      => 'field_2',
  feature     => 'field_3',
  start       => 'field_4',
  end         => 'field_5',
  score       => 'field_6',
  strand      => 'field_7',
  frame       => 'field_8',
  attributes  => { 'attribute_id' => 'attribute_value', ...},
  seek_pos    => 'Seek position of this line in the file',
}

gffFileIterator is 5x faster than BioPerl's Bio::Tools::GFF.
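
The 'attributes' hash comes from the ninth column, whose syntax differs between GFF3 (key=value pairs) and GTF2 (key "value" pairs). A minimal sketch of parsing both is shown below; the name gff_attributes_sketch and the format labels 'gff3' and 'gtf' are assumptions made for this example, not the module's API.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
sub gff_attributes_sketch {
    my ($field, $format) = @_;
    my %attr;
    for my $pair (split /\s*;\s*/, $field) {
        next unless length $pair;
        my ($key, $value);
        if ($format eq 'gtf') {
            # GTF2 style: gene_id "ENSG1"
            ($key, $value) = $pair =~ /^(\S+)\s+"?([^"]*)"?$/;
        }
        else {
            # GFF3 style: ID=gene1
            ($key, $value) = split /=/, $pair, 2;
        }
        $attr{$key} = $value if defined $key;
    }
    return \%attr;
}

my $gff3 = gff_attributes_sketch('ID=gene1;Name=abc', 'gff3');
my $gtf  = gff_attributes_sketch('gene_id "ENSG1"; transcript_id "ENST1";', 'gtf');
print "GFF3 ID: $gff3->{ID}\n";
print "GTF gene_id: $gtf->{gene_id}\n";
```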

vcfFileIterator

Manage the VCF file format.

Return a hashref with the annotation parsed:

{ chr     => $chr,
  pos     => $pos,
  id      => $id,
  ref     => $ref,
  alt     => [ alt1, alt2, ...],
  qual    => $qual,
  filter  => $filter,
  info    => { AS => value,
               DP => value,
               ...
             },
}

chimCTFileIterator

Return a hashref with the chimera parsed:

{
  sample            => $sample,
  chim_key          => $chim_key,
  name              => $name,
  chr1              => $chr1,
  pos1              => $pos1,
  strand1           => $strand1,
  chr2              => $chr2,
  pos2              => $pos2,
  strand2           => $strand2,
  chim_value        => $chim_value,
  spanning_junction => $spanning_junction,
  spanning_PE       => $spanning_PE,
  class             => $class,
  comments          => { comment_id => 'comment_value', ... },
  extended_fields   => { extended_field_id => 'extended_field_value', ... },
}

bamFileIterator

BE AWARE: this method is only available if the samtools binary is available.

Return an iterator over a BAM file using a samtools view pipe.

A region can be passed as a parameter to restrict the results. In this case the BAM file must be indexed.

Example:

my $fh = bamFileIterator("file.bam","17:43,971,748-44,105,700");
while(my $line = <$fh>) {
  my $parsed_line = CracTools::SAMReader::SAMline->new($line);
  # do some stuff
}

See also CracTools::SAMReader::SAMline if you need to parse SAM lines easily.

getSeqFromIndexedRef

BE AWARE: this method is only available if the samtools binary is available.

Return the sequence of a given region from an indexed FASTA file.

Example:

my $fasta_seq = getSeqFromIndexedRef("file.fa","chr2",29012,10);
my $seq       = getSeqFromIndexedRef("file.fa","chr2",29012,10,'raw');

PARSING LINES

parseBedLine

parseGFFLine

parseVCFLine

parseChimCTLine

parseSAMLineLite

parseCigarChain

Given a CIGAR chain (see SAM specification), return a parsed version as an Array ref of cigar elements represented as { nb => 10, op => 'M' }.
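
The returned structure can be illustrated with a short regular-expression sketch. The name parse_cigar_sketch is hypothetical, the operator set is the one defined in the SAM specification (MIDNSHP=X), and this is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
sub parse_cigar_sketch {
    my ($cigar) = @_;
    my @elements;
    # Each CIGAR element is a length followed by a one-letter operator.
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        push @elements, { nb => $1, op => $2 };
    }
    return \@elements;
}

my $cigar = parse_cigar_sketch('10M2I5M');
print "$_->{nb} x $_->{op}\n" for @$cigar;
```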

FILES IO

getFileIterator

Generic method to parse files.

getReadingFileHandle

Return a file handle for reading the file given as argument. Report an error if the file cannot be opened, and handle gzipped files (detected by the .gz file extension).

Example:

my $fh = getReadingFileHandle('file.txt.gz');
while(<$fh>) {
  print $_;
}
close $fh;
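
The extension-based dispatch can be sketched with the core module IO::Uncompress::Gunzip. This is an assumption about one possible approach (the module may instead pipe through an external gunzip); the name reading_handle_sketch is hypothetical. The example first writes a small gzipped file into a temporary directory so that it runs on its own.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);
use IO::Compress::Gzip qw(gzip $GzipError);
use File::Temp qw(tempdir);

# Hypothetical helper, for illustration only.
sub reading_handle_sketch {
    my ($file) = @_;
    if ($file =~ /\.gz$/) {
        my $gz = IO::Uncompress::Gunzip->new($file)
            or die "Cannot open $file: $GunzipError";
        return $gz;
    }
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    return $fh;
}

# Create a small gzipped file to read back.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/demo.txt.gz";
my $text = "line one\nline two\n";
gzip(\$text => $path) or die "gzip failed: $GzipError";

my $fh = reading_handle_sketch($path);
while (my $line = <$fh>) {
    print $line;
}
close $fh;
```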

getWritingFileHandle

Return a file handle for writing to the file given as argument. Report an error if the file cannot be opened, and handle gzipped files (detected by the .gz file extension).

Example:

my $fh = getWritingFileHandle('file.txt.gz');
print $fh "Hello world\n";
close $fh;

getLineFromSeekPos

getLineFromSeekPos($filehandle,$seek_pos);

Return the chomped line that starts at the given seek position.
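
Combined with the 'seek_pos' field returned by the annotation iterators, this lets you jump back to a line without re-reading the file. A minimal sketch using Perl's built-in seek and tell follows; the name line_from_seek_pos_sketch is hypothetical and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
sub line_from_seek_pos_sketch {
    my ($fh, $seek_pos) = @_;
    seek($fh, $seek_pos, 0) or die "seek failed: $!";  # 0 = from file start
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

my $data = "chr1\t10\t20\nchr2\t30\t40\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

# Record where each line starts while reading the file once.
my @positions;
while (1) {
    my $pos  = tell($fh);
    my $line = <$fh>;
    last unless defined $line;
    push @positions, $pos;
}

# Later, fetch the second line again directly.
print line_from_seek_pos_sketch($fh, $positions[1]), "\n";
```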

AUTHORS

  • Nicolas PHILIPPE <nphilippe.research@gmail.com>

  • Jérôme AUDOUX <jaudoux@cpan.org>

  • Sacha BEAUMEUNIER <sacha.beaumeunier@gmail.com>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2017 by IRMB/INSERM (Institute for Regenerative Medicine and Biotherapy / Institut National de la Santé et de la Recherche Médicale) and AxLR/SATT (Languedoc Roussillon / Societe d'Acceleration de Transfert de Technologie).

This is free software, licensed under:

The GNU Affero General Public License, Version 3, November 2007