Documentation
Create a pan genome from a set of proteome FASTA files
Take in GFF files and output the proteome
Perform the post analysis on the pan genome
Take in a tree and a spreadsheet and output a reordered spreadsheet
Take in a FASTA file of proteins and blast against itself
Take in the groups file and output some summary plots
Take in a groups file and the protein fasta files and output selected data
Take in a groups file and a set of GFF files and transfer the consensus annotation
Modules
Create a pan genome
Take in a groups file and the original FASTA files and create plots and stats
Take in a groups file and the associated GFF files for the isolates and update each group name to the gene name
Take in a FASTA file and chunk it up into smaller pieces.
A role to read a clusters file from CD-HIT
Take in multiple FASTA files containing proteomes, concatenate them and output a single FASTA file, filtering out sequences with more than 5% X's
Take in FASTA files of proteins and cluster them
Take in GFF files and output the proteome
Perform the post analysis on the pan genome
Take in a tree and a spreadsheet and output a reordered spreadsheet
Take in a FASTA file of proteins and blast against itself
Take in the groups file and output some summary plots
Take in a groups file and the protein fasta files and output selected data
Take in a groups file and a set of GFF files and transfer the consensus annotation
Exceptions for input data
Wrapper around NCBI's blastp command
Wrapper to run cd-hit
Wrapper around NCBI's makeblastdb command
Wrapper around MCL which takes in blast results and outputs clustered results
Wrapper around Muscle for sequence alignment
Perform the post analysis
Wrapper around Segmasker for low complexity filtering
Take in a GFF file and create protein sequences in FASTA format
Take in GFF files and create protein sequences in FASTA format
Take in a clusters file from CD-HIT and the FASTA file and output a FASTA file without full clusters
Take in fasta files, remove sequences with too many unknowns and return a list of the new files
Parse a GFF and efficiently extract ID -> Gene Name
Add labels to the groups
Add labels to the groups
Take the clusters file from cd-hit and use it to inflate the output of MCL
Execute a set of commands using LSF
Execute a set of commands locally
A role to add job runner functionality
Given two sets of isolates and a groups file, output what is unique to each set and what is common to both
Take in a group and create a multifasta file
Take in an array of files and output raw tab files for analysis by R to turn into plots
Take in GFF files and a groups file and output one multifasta file per group with nucleotide sequences.
Take in a list of groups and create a multifasta file for each group
Take in a set of GFF files and a groups file and output one multifasta file per group with nucleotide sequences.
Create raw output files of group counts for turning into plots
Output a fasta file with one gene per group
Output the groups of the union of a set of input isolates
Run an all-against-all BLAST in parallel
Take in an array of frequencies of groups and output a plot
Post analysis of pan genomes
Take in a mixture of FASTA and GFF input files and output FASTA proteomes only
Take in a tree file and a spreadsheet and output a spreadsheet with reordered columns
Take in a tree file and return an ordering of the samples
Take in a fasta file and create a hash with the length of each sequence
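
The last entry above (a hash of sequence lengths keyed by sequence ID) can be illustrated with a minimal sketch. This is not the toolkit's own implementation; the function name `sequence_lengths` and the choice to use the first whitespace-delimited token of the header as the ID are assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional


def sequence_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA record ID to the total length of its sequence.

    Hypothetical helper: the ID is assumed to be the first
    whitespace-delimited token after '>' on the header line.
    """
    lengths: Dict[str, int] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            lengths[current_id] = 0
        elif current_id is not None:
            # sequence data may be wrapped over several lines
            lengths[current_id] += len(line)
    return lengths


# Example input: two records, the first wrapped over two lines
example = [">gene_1 hypothetical protein", "MKT", "LLA", "", ">gene_2", "MA"]
print(sequence_lengths(example))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large FASTA file is read one line at a time rather than loaded whole.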