Summary
GenOO is an open-source Perl framework that models biological entities as Perl objects and provides classes and methods for manipulating file types common in sequencing, such as SAM, BED and FASTQ, among others. Bioinformaticians can benefit from this framework because it lets them focus on the actual analysis instead of the boilerplate of managing the data at hand. In contrast to existing frameworks such as BioPerl, GenOO has been designed from scratch in a modular way with minimal requirements on external libraries.
Design
The GenOO framework has been developed around Moose, a widely used modern object system for Perl 5. Moose serves as the base for almost all GenOO classes, which makes the code significantly more concise, flexible and extensible. One of the core entities in Moose is the Role, and we use Roles to avoid deep inheritance trees. Object instantiation is mainly performed through factory classes, which makes the code easier to read and extend. Wherever possible we use the dependency injection design pattern to remove hard-coded dependencies from within the classes, so that they can be easily swapped or modified.
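The dependency injection idea can be illustrated with a minimal sketch. It is written in plain Perl rather than Moose so that it runs with core Perl only, and the names ListParser, RecordCounter and count_records are hypothetical and not part of GenOO.

```perl
use strict;
use warnings;

# Hypothetical parser that yields records from an in-memory list.
package ListParser;

sub new {
    my ($class, %args) = @_;
    return bless { records => [ @{ $args{records} } ] }, $class;
}

# Returns the next record, or undef when the list is exhausted.
sub next_record {
    my ($self) = @_;
    return shift @{ $self->{records} };
}

# Hypothetical consumer. The parser is passed in (injected) by the caller
# instead of being constructed inside the class, so any object that
# implements next_record can be used.
package RecordCounter;

sub new {
    my ($class, %args) = @_;
    die "parser is required\n" unless $args{parser};
    return bless { parser => $args{parser} }, $class;
}

sub count {
    my ($self) = @_;
    my $n = 0;
    $n++ while defined $self->{parser}->next_record;
    return $n;
}

package main;

# Wire the two objects together and count the records.
sub count_records {
    my (@records) = @_;
    my $parser = ListParser->new(records => \@records);
    return RecordCounter->new(parser => $parser)->count;
}

print count_records('read1', 'read2', 'read3'), "\n";
```

Because RecordCounter only relies on the next_record method, a parser for a different file format could be substituted without touching the counting code.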
To support further development and improvement we have implemented an extensive test suite. The suite is itself written as object-oriented code and covers most of the framework’s functionality, providing a safety net for future refactoring and development.
Basic Concepts
The backbone of the GenOO framework is the Region role, which corresponds to a generic area on a reference sequence. The role requires the classes that consume it to implement specific attributes such as strand, rname (reference name), start, stop and copy_number. Provided these attributes are implemented, Region supplies more advanced methods, such as the distance from another region, at no extra cost. The role is consumed by several classes within the framework and provides common ground for code integration. Practically any entity that contains the notion of a region can be compared to any other entity that shares this notion (e.g. a gene can be compared to an aligned sequencing read).
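As a rough illustration of how a handful of attributes is enough to derive positional methods, the following plain-Perl sketch represents regions as hashes. The function names (new_region, overlaps, distance) are hypothetical, and the conventions used here (inclusive coordinates, a distance of 0 for overlapping regions, undef for regions on different references) are assumptions that may differ from GenOO's actual behaviour.

```perl
use strict;
use warnings;

# Build a region as a plain hash, insisting on the attributes listed in the text.
sub new_region {
    my (%args) = @_;
    for my $attr (qw(strand rname start stop copy_number)) {
        die "missing attribute: $attr\n" unless defined $args{$attr};
    }
    return {%args};
}

# True when both regions lie on the same reference and share at least
# one position (coordinates are assumed to be inclusive).
sub overlaps {
    my ($r1, $r2) = @_;
    return 0 if $r1->{rname} ne $r2->{rname};
    return ($r1->{start} <= $r2->{stop} && $r2->{start} <= $r1->{stop}) ? 1 : 0;
}

# Gap between two regions: 0 if they overlap, undef if they are on
# different references (assumed convention).
sub distance {
    my ($r1, $r2) = @_;
    return undef if $r1->{rname} ne $r2->{rname};
    return 0 if overlaps($r1, $r2);
    return $r1->{stop} < $r2->{start}
        ? $r2->{start} - $r1->{stop}
        : $r1->{start} - $r2->{stop};
}

# A "gene" and a "read" are compared through the same interface.
my $gene = new_region(strand => 1, rname => 'chr1', start => 100, stop => 200, copy_number => 1);
my $read = new_region(strand => 1, rname => 'chr1', start => 250, stop => 280, copy_number => 3);
print distance($gene, $read), "\n";   # gap between positions 200 and 250
```

Any entity that exposes the same attributes, whatever it represents biologically, could be passed to these functions unchanged.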
Going one step further, the GenomicRegion class consumes the Region role and sets the reference sequence to a particular chromosome. GenomicRegion also supports a species attribute, which enables simultaneous genomic analysis of different biological species. GenomicRegion is the base for more advanced classes that model specific genomic elements, such as genes and gene transcripts.
Regarding the genomic group of classes, the Transcript class corresponds to a gene transcript/isoform and can be an independent object or, more commonly, belong to a Gene object. A Gene, in essence, is defined as a collection of Transcript objects, and the two classes can query each other to extract required information. Transcripts are divided into protein-coding and non-coding ones. Protein-coding transcripts have methods that extract the sequences and coordinates of the coding region (CDS), the 5' UTR (UTR5) and the 3' UTR (UTR3). Genes, on the other hand, are not divided into protein-coding and non-coding ones. Instead, one can ask whether a gene has coding potential, in which case the gene scans through its transcripts and checks whether any of them is coding.
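The coding-potential check described above amounts to a scan over the transcripts of a gene. A minimal sketch in plain Perl follows; the hash layout, the is_coding flag and the function name has_coding_transcript are hypothetical stand-ins for the corresponding GenOO objects and methods.

```perl
use strict;
use warnings;

# A gene is modelled as a hash holding a list of transcripts; each
# transcript carries a hypothetical 'is_coding' flag.
sub has_coding_transcript {
    my ($gene) = @_;
    for my $transcript (@{ $gene->{transcripts} }) {
        return 1 if $transcript->{is_coding};   # one coding isoform is enough
    }
    return 0;
}

my $gene = {
    name        => 'example_gene',
    transcripts => [
        { id => 'tx1', is_coding => 0 },
        { id => 'tx2', is_coding => 1 },
    ],
};

print has_coding_transcript($gene) ? "coding potential\n" : "no coding potential\n";
```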
An important structure within the genomic group of classes is the Spliceable role. Spliceable provides functionality for entities that undergo splicing: it supports several advanced methods, such as the extraction of exonic and intronic elements, and facilitates management of the resulting complex structure. Spliceable is primarily consumed by Transcript, but it is also consumed by UTR5, UTR3 and CDS. This way one can, for example, ask only for the introns that are contained within the 3' UTR of a transcript ($transcript->utr3->introns).
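To make the relationship between exons, introns and sub-regions such as the 3' UTR concrete, here is a small plain-Perl sketch. Exons are given as [start, stop] pairs with inclusive coordinates, and the function names introns_from_exons and clip_exons are hypothetical; GenOO's own implementation may be organised differently.

```perl
use strict;
use warnings;

# Introns are the gaps between consecutive exons (inclusive coordinates).
sub introns_from_exons {
    my (@exons) = @_;
    my @sorted = sort { $a->[0] <=> $b->[0] } @exons;
    my @introns;
    for my $i (1 .. $#sorted) {
        my $intron_start = $sorted[ $i - 1 ][1] + 1;
        my $intron_stop  = $sorted[$i][0] - 1;
        push @introns, [ $intron_start, $intron_stop ] if $intron_start <= $intron_stop;
    }
    return @introns;
}

# Restrict exons to a sub-region (e.g. the genomic span of the 3' UTR),
# trimming exons that cross its boundaries and dropping those outside it.
sub clip_exons {
    my ($from, $to, @exons) = @_;
    my @clipped;
    for my $exon (@exons) {
        my $start = $exon->[0] > $from ? $exon->[0] : $from;
        my $stop  = $exon->[1] < $to   ? $exon->[1] : $to;
        push @clipped, [ $start, $stop ] if $start <= $stop;
    }
    return @clipped;
}

my @exons = ([100, 200], [300, 400], [500, 600]);

# All introns of the transcript: 201-299 and 401-499.
print join(', ', map { "$_->[0]-$_->[1]" } introns_from_exons(@exons)), "\n";

# Introns within a 3' UTR assumed to span 350-600: only 401-499.
print join(', ', map { "$_->[0]-$_->[1]" } introns_from_exons(clip_exons(350, 600, @exons))), "\n";
```

Applying the same intron logic to the clipped exon list is what allows a sub-region to answer the same questions as the whole transcript.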
In a High-Throughput Sequencing (HTS) analysis, a user usually needs to perform operations on groups or collections of regions. The RegionCollection role addresses this need. It defines the interface for classes that serve as a collection of regions and keeps the specific engine implementation hidden.
Perhaps one of the most common queries in HTS analysis is for the regions in a collection that fulfill certain positional criteria (e.g. SNPs, reads or transcripts that overlap the region chrX:10000-20000). Implemented with a naive, brute-force approach, this query can be very expensive. The DoubleHashArray engine, named after the data structure it uses, tackles this computational problem. BED, SAM and other formats can be automatically read and converted into a collection of regions that uses this engine. However, the engine keeps all data in memory and is therefore suitable only for relatively small data sets, mostly for prototyping and draft solutions.
For larger data sets there is a purely database-oriented collection engine named GenOO::Data::DB::DBIC. Currently, the implemented classes support database tables that have at least the following columns: strand, rname, start, stop, copy_number, sequence, cigar (the CIGAR string of the SAM format), mdz (the MD:Z tag of the SAM format) and number_of_best_hits. We believe that this covers most uses; if not, the user can easily extend the classes to support any table schema, provided that it supports all columns/attributes defined in Region. GenOO::Data::DB::DBIC is based on DBIx::Class, a modern Perl module that provides an extensible and flexible object-relational mapper. DBIx::Class supports most major databases, such as SQLite, MySQL, PostgreSQL and Oracle.
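Returning to the in-memory engine, one plausible reading of the name DoubleHashArray is a hash keyed by strand, holding a hash keyed by reference name, holding an array of regions sorted by start. The plain-Perl sketch below follows that reading; the layout and the function names build_index and overlapping are assumptions made for illustration and are not taken from the GenOO source.

```perl
use strict;
use warnings;

# Index regions as $index->{strand}{rname} = [ regions sorted by start ].
sub build_index {
    my (@regions) = @_;
    my %index;
    push @{ $index{ $_->{strand} }{ $_->{rname} } }, $_ for @regions;
    for my $by_rname (values %index) {
        for my $list (values %$by_rname) {
            @$list = sort { $a->{start} <=> $b->{start} } @$list;
        }
    }
    return \%index;
}

# Return the regions on the given strand and reference that overlap
# [start, stop] (inclusive). Only one array is scanned, and the scan
# stops as soon as a region starts beyond the query.
sub overlapping {
    my ($index, $strand, $rname, $start, $stop) = @_;
    my $by_rname = $index->{$strand}   or return;
    my $list     = $by_rname->{$rname} or return;
    my @hits;
    for my $region (@$list) {
        last if $region->{start} > $stop;
        push @hits, $region if $region->{stop} >= $start;
    }
    return @hits;
}

my $index = build_index(
    { strand => 1,  rname => 'chrX', start => 9990,  stop => 10010 },
    { strand => 1,  rname => 'chrX', start => 15000, stop => 15050 },
    { strand => 1,  rname => 'chrX', start => 25000, stop => 25050 },
    { strand => -1, rname => 'chrX', start => 12000, stop => 12050 },
    { strand => 1,  rname => 'chr1', start => 10500, stop => 10550 },
);

# Regions on the plus strand that overlap chrX:10000-20000.
my @hits = overlapping($index, 1, 'chrX', 10000, 20000);
print scalar(@hits), " overlapping regions\n";
```

The two hash lookups discard every region on other strands and references before any coordinate is compared, which is where most of the saving over a brute-force scan comes from. The remaining scan is still linear in the worst case, so a binary search or an interval tree would be a natural refinement.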