Documentation

Create a pan genome from a set of GFF files with WTSI defaults
Take in GFF files and output the proteome
Iteratively run CD-HIT
Given a spreadsheet of gene presence and absence calculate some statistics
Take in the group statistics spreadsheet and the location of the gene multifasta files and create a core alignment.
Perform the post analysis on the pan genome
Take in a tree and a spreadsheet and output a reordered spreadsheet
Take in a FASTA file of proteins and blast against itself
Take in multi-FASTA files of nucleotides and align each file with PRANK or MAFFT
Take in a groups file and the protein fasta files and output selected data
Create a pan genome from a set of GFF files
Take in a groups file and a set of GFF files and transfer the consensus annotation
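
The reordering step listed above (a tree and a spreadsheet in, a reordered spreadsheet out) can be illustrated with a minimal Python sketch. It is not taken from the tool's source. The function name `reorder_columns`, the assumption that the sample order has already been read from the tree, and the layout with a fixed number of leading columns followed by one column per sample are all hypothetical.

```python
import csv
import io


def reorder_columns(csv_text, sample_order, fixed_columns=1):
    """Return csv_text with its sample columns rearranged to follow sample_order.

    Assumptions (not from the original tool): the first `fixed_columns`
    columns are kept in place, every row has as many cells as the header,
    and samples missing from `sample_order` are appended in their
    original order.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header = rows[0]
    position = {name: i for i, name in enumerate(header)}

    # Column indexes of the samples, in the requested order.
    ordered = [
        position[s]
        for s in sample_order
        if s in position and position[s] >= fixed_columns
    ]
    seen = set(ordered)
    # Sample columns not mentioned in the ordering keep their relative order.
    rest = [i for i in range(fixed_columns, len(header)) if i not in seen]
    new_index = list(range(fixed_columns)) + ordered + rest

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([row[i] for i in new_index])
    return out.getvalue()
```

In this sketch the ordering is passed in as a plain list, so extracting it from a tree file is left to whichever tree parser is in use.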

Modules

Create a pan genome
Output a FASTA file which represents the binary presence and absence of genes in the accessory genome
Take in a clusters file from CD-HIT and the FASTA file and output a FASTA file without full clusters
Take in a groups file and the original FASTA files and create plots and stats
Take in a group file and associated GFF files for the isolates and update the group name to the gene name
Given a spreadsheet of gene presence and absence calculate some statistics
A role to create a BED file from a GFF file
Take in a FASTA file and chunk it up into smaller pieces.
A role to read a clusters file from CD-HIT
Take in multiple FASTA files containing proteomes, concatenate them and output a single FASTA file, filtering out sequences with more than 5% X's
Given a spreadsheet of gene presence and absence calculate some statistics
Common command line settings
Take in FASTA files of proteins and cluster them
Take in GFF files and output the proteome
Take in a multifasta file of nucleotides, convert to proteins and align with PRANK
Take in a FASTA file of proteins and blast against itself
Take in a groups file and the protein fasta files and output selected data
Take in FASTA files of proteins and cluster them
Take in the group statistics spreadsheet and the location of the gene multifasta files and create a core alignment.
Perform the post analysis on the pan genome
Take in a tree and a spreadsheet and output a reordered spreadsheet
Take in a groups file and a set of GFF files and transfer the consensus annotation
Parse a GFF file and efficiently extract the ordered gene IDs on each contig
Exceptions for input data
Wrapper around NCBI's blastp command
Wrapper to run CD-HIT
Check external executables are available and are the correct version
Wrapper to run FastTree
Take in multi-FASTA files of nucleotides and align each file with PRANK or MAFFT
Iteratively run CD-HIT
Wrapper to run MAFFT
Wrapper around NCBI's makeblastdb command
Wrapper around MCL which takes in blast results and outputs clustered results
Perform the post analysis
Wrapper to run PRANK
Take in a spreadsheet produced by the pipeline and identify the core genes.
Take in a GFF file and create protein sequences in FASTA format
Take in GFF files and create protein sequences in FASTA format
Take in a clusters file from CD-HIT and the FASTA file and output a FASTA file without full clusters
Take in FASTA files, remove sequences with too many unknowns and return a list of the new files
Parse a GFF and efficiently extract ID -> Gene Name
Add labels to the groups
Add labels to the groups
Take the clusters file from CD-HIT and use it to inflate the output of MCL
Run CD-HIT iteratively with reducing thresholds, removing full clusters each time
Execute a set of commands locally
A role to add job runner functionality
Take in an ordering of genes and a directory and return an ordered list of file locations
Merge multifasta alignment files with equal numbers of sequences.
Take in GFF files and create a matrix of what genes are beside what other genes
Take in blast results and find the percentage identity graph
Create an EMBL file for the header with the locations of the genes in the multifasta alignment of core genes
Given two sets of isolates and a group file, output what is unique to each set and what they have in common
A role containing some common methods for EMBL header files
Create a tab/EMBL file with the features for drawing pretty pictures
Take in a group and create a multifasta file
Take in GFF files and a groups file and output one multifasta file per group with nucleotide sequences.
Take a multifasta nucleotide file and output it as proteins.
Take in a list of groups and create a multifasta file for each group
Take in a set of GFF files and a groups file and output one multifasta file per group with nucleotide sequences.
Create raw output files of group counts for turning into plots
Output the groups of the union of a set of input isolates
Run an all-against-all BLAST in parallel
A role for parsing a GFF file efficiently
Post analysis of pan genomes
Take in a mixture of FASTA and GFF input files and output FASTA proteomes only
Create a matrix with presence and absence
Generate a report based on Kraken output
Take in GFF files and add a suffix where a gene ID is seen twice
Take in a tree file and a spreadsheet and output a spreadsheet with reordered columns
Take in a tree file and return an ordering of the samples
Take in a FASTA file and create a hash with the length of each sequence
Sort a FASTA file by name
Read and write a spreadsheet
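
Several of the FASTA utilities listed above (a hash of sequence lengths, removal of sequences with too many unknowns, sorting by name) are simple enough to illustrate with a short Python sketch. This is not the modules' implementation. The function names are hypothetical, and so are the assumptions that the sequence name is the first word of the header line, that unknown residues are written as X, that a sequence is kept when at most 5% of its residues are X, and that empty sequences are dropped.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            parts = line[1:].split()
            # Assumption: the name is the first word of the header line.
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def sequence_lengths(records):
    """Map each sequence name to the length of its sequence."""
    return {name: len(seq) for name, seq in records}


def filter_unknowns(records, max_fraction=0.05):
    """Keep sequences whose fraction of X residues is at most max_fraction."""
    kept = []
    for name, seq in records:
        if seq and seq.upper().count("X") / len(seq) <= max_fraction:
            kept.append((name, seq))
    return kept


def sort_by_name(records):
    """Return the records sorted alphabetically by sequence name."""
    return sorted(records, key=lambda record: record[0])
```

A typical use would chain the helpers: parse the text once, then pass the resulting records to whichever of the three functions is needed.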