Hashest
Use hashes to estimate MLST
Usage
hashest-index.pl: indexes a fasta file
Fasta file have deflines in the format of >locus_allele
where locus is a string and allele is an int
Usage: hashest-index.pl [options] *.fasta [*.gbk...]
--k kmer length [default: 16]
--version print version and exit
--help This useful help menu
hashest-search.pl: reports an MLST profile for a genome assembly
Usage: hashest-search.pl [options] *.fasta [*.gbk...] > out.tsv
--db Database from hashest-index.pl
--numcpus Number of threads to use [default: 1]
--dump Dump the database instead of analyzing anything
--help This useful help menu
- Step 1: get a fasta file or set of fasta files with alleles in the format of
>locus_allele, e.g.,>abcZ_1. - Step 2: run
hashest-index.plon the set of fasta file(s) to create a new index. The database is described in its own section below. - Step 3: analyze an assembly against the new index with
hashest-search.pl.
hashest-search results in a tsv stdout output.
Columns are loci, rows are assemblies, and values are alleles.
Tildes (~) represent multiple allele matches and probably multiple copies/variations of a gene.
Question marks (?) indicate a match to a locus via a hash match, but no allele match was found.
Installation
Requires perl with threads and BioPerl
cd ~/bin
git clone git@github.com:lskatz/hashest.git
export PATH=$PATH:~/bin/hashest/scripts
Algorithm
Inspired by Gustle
Uses native perl md5 hashing.
- Index the database
- hash the first k nucleotides of each allele in the database
- save whole sequence of the alleles too
- Save to index file
- Search the database
- hash a sliding window of a genome assembly of k length
- Find the right locus: match hash to locus
- Find the right allele of the locus: match sequence to alleles of locus
- If multiple cpus given, multiple assemblies will be analyzed at the same time, each single threaded.
Database structure
Database is in a Perl storable object, similar to a Python pickle. The data structure has these keys
- locusArray => [array of locus names]
- locus => associative array of
hash=>locusname - allele => associative array of
locus=>[sequence]=>[locus, allele] - settings => information about the database. Stores
k,hashing(hashing ismd5_hexin v0.2 and later).