Hashest

Use hashes to estimate multilocus sequence typing (MLST) profiles from genome assemblies.

Usage

hashest-index.pl: indexes a fasta file
  Fasta files have deflines in the format of >locus_allele
    where locus is a string and allele is an int
  Usage: hashest-index.pl [options] *.fasta [*.gbk...]
  --k       kmer length [default: 16]
  --version print version and exit
  --help    This useful help menu

hashest-search.pl: reports an MLST profile for a genome assembly
  Usage: hashest-search.pl [options] *.fasta [*.gbk...] > out.tsv
    --db      Database from hashest-index.pl
    --numcpus Number of threads to use [default: 1]
    --dump    Dump the database instead of analyzing anything 
    --help    This useful help menu

Step 1: Get a FASTA file or set of FASTA files containing alleles whose deflines follow the format >locus_allele, e.g., >abcZ_1.
Step 2: Run hashest-index.pl on the FASTA file(s) to create a new index. The database is described in its own section below.
Step 3: Analyze an assembly against the new index with hashest-search.pl.
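
To check that a FASTA file follows the >locus_allele convention before indexing, a small parser can help. The following is a minimal sketch, not code from hashest: `parse_defline` is a hypothetical helper, and taking the allele from the last underscore-delimited field (so that locus names may contain underscores) is an assumption of this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of hashest): split a defline such as
# ">abcZ_1" into its locus (string) and allele (integer) parts.
sub parse_defline {
    my ($defline) = @_;
    # Assumption: the allele is the final underscore-delimited field,
    # so a locus name may itself contain underscores.
    if ($defline =~ /^>(\S+)_(\d+)\s*$/) {
        return ($1, $2 + 0);
    }
    return;    # empty list: defline is not in >locus_allele form
}

for my $defline (">abcZ_1", ">adk_12", ">badheader") {
    my ($locus, $allele) = parse_defline($defline);
    if (defined $locus) {
        print "$defline\tlocus=$locus\tallele=$allele\n";
    } else {
        print "$defline\tnot in >locus_allele format\n";
    }
}
```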

hashest-search.pl writes a tab-separated (TSV) table to stdout. Columns are loci, rows are assemblies, and values are alleles. A tilde (~) means that multiple alleles matched, which probably indicates multiple copies or variants of the gene. A question mark (?) means that the locus was found via a hash match but no allele sequence matched.
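
When post-processing the table, it can be useful to separate clean allele calls from ambiguous ones. The sketch below is illustrative only: `classify_cell` is a hypothetical helper, the sample table is made up, and the layout (a header row of locus names with the assembly name in the first column) is an assumption rather than documented output.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: classify one cell of the results table.
# Any cell containing "~" is treated as a multiple-allele match and
# any cell containing "?" as a locus hit without an allele match.
sub classify_cell {
    my ($cell) = @_;
    return "multiple"  if index($cell, "~") >= 0;
    return "no_allele" if index($cell, "?") >= 0;
    return "allele";
}

# Made-up example table; the exact layout is assumed, not documented.
my @lines = (
    "assembly\tabcZ\tadk\taroE",
    "sample1.fasta\t1\t~\t?",
);

my @header = split /\t/, shift @lines;
for my $line (@lines) {
    my ($assembly, @cells) = split /\t/, $line;
    for my $i (0 .. $#cells) {
        my $locus = $header[$i + 1];    # skip the assembly column
        print join("\t", $assembly, $locus, $cells[$i],
                   classify_cell($cells[$i])), "\n";
    }
}
```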

Installation

Requires Perl built with threads support, plus BioPerl.

cd ~/bin
git clone git@github.com:lskatz/hashest.git
export PATH=$PATH:~/bin/hashest/scripts

Algorithm

Inspired by Gustle

Uses native perl md5 hashing.

Index the database
- Hash the first k nucleotides of each allele in the database
- Save the whole sequence of each allele too
- Save everything to the index file

Search the database
- Hash a sliding window of length k along the genome assembly
- Find the right locus: match the hash to a locus
- Find the right allele of the locus: match the sequence to the alleles of that locus
- If multiple CPUs are given, multiple assemblies are analyzed at the same time, each single-threaded
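
The index and search steps above can be sketched in a few lines of Perl. This is a simplified illustration, not the hashest source: the allele sequences and the assembly are made up, k is set to 4 only to keep the toy data short, `search_assembly` is a hypothetical name, and only the forward strand is scanned.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $k = 4;    # tiny k for illustration; hashest defaults to 16

# Made-up toy alleles keyed by locus_allele.
my %alleles = (
    abcZ_1 => "ACGTACGTTT",
    abcZ_2 => "ACGTACGTCC",
    adk_1  => "GGCATTAGCA",
);

# Index: hash the first k nucleotides of each allele and keep the
# full sequence so that the exact allele can be resolved later.
my (%locus, %allele);
for my $id (sort keys %alleles) {
    my ($name, $number) = $id =~ /^(\S+)_(\d+)$/;
    my $seq = $alleles{$id};
    $locus{ md5_hex(substr($seq, 0, $k)) } = $name;
    $allele{$name}{$seq} = [$name, $number];
}

# Search: hash every k-length window of the assembly, map the hash to
# a locus, then compare the sequence at that position to each allele.
sub search_assembly {
    my ($assembly) = @_;
    my %found;
    for my $pos (0 .. length($assembly) - $k) {
        my $name = $locus{ md5_hex(substr($assembly, $pos, $k)) };
        next unless defined $name;
        $found{$name} //= [];    # locus seen, allele not yet known
        for my $seq (keys %{ $allele{$name} }) {
            if (substr($assembly, $pos, length $seq) eq $seq) {
                push @{ $found{$name} }, $allele{$name}{$seq}[1];
            }
        }
    }
    return \%found;
}

my $hits = search_assembly("TTTTACGTACGTCCAAAAGGCATTAGCATT");
for my $name (sort keys %$hits) {
    my @numbers = @{ $hits->{$name} };
    print "$name\t", (@numbers ? join(",", @numbers) : "locus hit, no allele"), "\n";
}
```

In this sketch, a locus with an empty allele list corresponds to the question-mark case in the results table, and a locus with more than one allele corresponds to the tilde case.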

Database structure

The database is a Perl Storable object, similar to a Python pickle. The data structure has these keys:

locusArray => [array of locus names]
locus => associative array of hash=>locusname
allele => associative array of locus => [sequence] => [locus, allele]
settings => information about the database. Stores k, hashing (hashing is md5_hex in v0.2 and later).
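
To make the layout concrete, the sketch below builds a tiny structure with these four keys, writes it with Storable, and reads it back. It is an interpretation rather than the actual hashest code: the allele sequence is made up, the nesting of `allele` as locus => sequence => [locus, allele] is one reading of the description above, and the names of the keys inside `settings` (`k`, `hashing`) are assumptions.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

my $k   = 4;               # tiny k for illustration only
my $seq = "ACGTACGTTT";    # made-up allele sequence for abcZ_1

# Assumed layout, following the key descriptions in this section.
my %db = (
    locusArray => ["abcZ"],
    locus      => { md5_hex(substr($seq, 0, $k)) => "abcZ" },
    allele     => { abcZ => { $seq => ["abcZ", 1] } },
    settings   => { k => $k, hashing => "md5_hex" },
);

# Round trip through a temporary Storable file.
my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;
nstore(\%db, $path);
my $loaded = retrieve($path);

print "k       = $loaded->{settings}{k}\n";
print "hashing = $loaded->{settings}{hashing}\n";
print "loci    = @{ $loaded->{locusArray} }\n";
```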
