Documentation
Create a pan genome from a set of proteome FASTA files
Take in GFF files and output the proteome
Iteratively run CD-HIT
Take in the group statistics spreadsheet and the location of the gene multifasta files and create a core alignment.
Perform the post analysis on the pan genome
Take in a tree and a spreadsheet and output a reordered spreadsheet
Take in a FASTA file of proteins and blast against itself
Take in a multifasta file of nucleotides, convert to proteins and align with muscle
Take in a groups file and the protein fasta files and output selected data
Create a pan genome from a set of proteome FASTA files
Take in a groups file and a set of GFF files and transfer the consensus annotation
Modules
Create a pan genome
Take in a groups file and the original FASTA files and create plots and stats
Take in a groups file and the associated GFF files for the isolates and update the group name to the gene name
A role to create a BED file from a GFF
Take in a FASTA file and chunk it up into smaller pieces.
A role to read a clusters file from CD-HIT
Take in multiple FASTA files containing proteomes, concatenate them and output a single FASTA file, filtering out sequences with more than 5% X's
Common command line settings
Take in FASTA files of proteins and cluster them
Take in GFF files and output the proteome
Iteratively run CD-HIT
Take in a FASTA file of proteins and blast against itself
Take in a multifasta file of nucleotides, convert to proteins and align with muscle
Take in a groups file and the protein fasta files and output selected data
Take in FASTA files of proteins and cluster them
Take in the group statistics spreadsheet and the location of the gene multifasta files and create a core alignment.
Perform the post analysis on the pan genome
Take in a tree and a spreadsheet and output a reordered spreadsheet
Take in a groups file and a set of GFF files and transfer the consensus annotation
Parse a GFF and efficiently extract ordered gene IDs on each contig
Exceptions for input data
Wrapper around NCBI's blastp command
Wrapper to run CD-HIT
Iteratively run CD-HIT
Wrapper around NCBI's makeblastdb command
Wrapper around MCL which takes in blast results and outputs clustered results
Perform the post analysis
Wrapper to run prank
Take in a multifasta file of nucleotides, convert to proteins and align with muscle
Take in a spreadsheet produced by the pipeline and identify the core genes.
Take in a GFF file and create protein sequences in FASTA format
Take in GFF files and create protein sequences in FASTA format
Take in a clusters file from CD-HIT and the FASTA file and output a FASTA file without full clusters
Take in fasta files, remove sequences with too many unknowns and return a list of the new files
Parse a GFF and efficiently extract ID -> Gene Name
Add labels to the groups
Add labels to the groups
Take the clusters file from CD-HIT and use it to inflate the output of MCL
Run CD-HIT iteratively with reducing thresholds, removing full clusters each time
Execute a set of commands locally
Use GNU Parallel
A role to add job runner functionality
Take in an ordering of genes and a directory and return an ordered list of file locations
Merge multifasta alignment files with equal numbers of sequences.
Take in GFF files and create a matrix of what genes are beside what other genes
Take in blast results and find the percentage identity graph
Given two sets of isolates and a groups file, output what is unique to each set and what is common to both
Create a tab/embl file with the features for drawing pretty pictures
Take in a group and create a multifasta file
Take in GFF files and a groups file and output one multifasta file per group with nucleotide sequences.
Take a multifasta nucleotide file and output it as proteins.
Take in a list of groups and create a multifasta file for each group
Take in a set of GFF files and a groups file and output one multifasta file per group with nucleotide sequences.
Create raw output files of group counts for turning into plots
Output the groups of the union of a set of input isolates
Run all against all blast in parallel
A role for parsing a GFF file efficiently
Post analysis of pan genomes
Take in a mixture of FASTA and GFF input files and output FASTA proteomes only
Generate a report based on Kraken output
Take in GFF files and add a suffix where a gene ID is seen twice
Take in a tree file and a spreadsheet and output a spreadsheet with reordered columns
Take in a tree file and return an ordering of the samples
Take in a fasta file and create a hash with the length of each sequence
Sort a FASTA file by name
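
Several of the modules listed above describe small FASTA utilities: dropping sequences with more than 5% X's, building a hash of sequence lengths, and sorting records by name. The following is a minimal Python sketch of those three ideas, not the distribution's own Perl code. The function names (parse_fasta, filter_unknowns, sequence_lengths, sort_by_name) are hypothetical, and the exact behaviour of the real modules (for example how headers are split or how ties are sorted) is an assumption.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the name is the first whitespace-delimited token.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_unknowns(records: Iterable[Record], max_fraction: float = 0.05) -> List[Record]:
    """Keep records whose fraction of 'X' residues does not exceed max_fraction."""
    kept = []
    for name, seq in records:
        if seq and seq.upper().count("X") / len(seq) <= max_fraction:
            kept.append((name, seq))
    return kept


def sequence_lengths(records: Iterable[Record]) -> Dict[str, int]:
    """Map each sequence name to the length of its sequence."""
    return {name: len(seq) for name, seq in records}


def sort_by_name(records: Iterable[Record]) -> List[Record]:
    """Return the records ordered by sequence name."""
    return sorted(records, key=lambda record: record[0])
```

A typical use would be to parse a proteome, pass the records through filter_unknowns, and then call sequence_lengths or sort_by_name on the result. The 5% threshold comes from the module description above. Everything else here is illustrative.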