NAME
n50 - A script to calculate N50 from one or multiple FASTA/FASTQ files.
VERSION
version 1.4.2
SYNOPSIS
n50.pl [options] [FILE1 FILE2 FILE3...]
DESCRIPTION
This program parses a list of FASTA/FASTQ files and calculates, for each one, the number of sequences, the sum of sequence lengths, and the N50, N75, N90 and auN*. Results can be printed in different formats: by default, only the N50 is printed for a single file, and all metrics are printed in TSV format for multiple files.
*: See https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity
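To illustrate what these metrics mean, here is a minimal Python sketch that computes them from a list of sequence lengths. It is not the script's own Perl code: the function names nx and aun are hypothetical, and auN follows the definition in the post linked above (sum of squared lengths divided by total length). Tools differ in how they treat ties and thresholds, so n50.pl may report slightly different values.

```python
def nx(lengths, x=50):
    """Return the N{x} of a list of sequence lengths (hypothetical helper).

    Sequences are sorted from longest to shortest; N{x} is the length of the
    sequence at which the running total first reaches x% of the total size.
    """
    if not 0 < x < 100:
        raise ValueError("x must satisfy 0 < x < 100")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    return 0  # no sequences at all


def aun(lengths):
    """Return auN: the sum of squared lengths divided by the total length."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length * length for length in lengths) / total


contigs = [100, 80, 50, 30, 20, 10]  # made-up lengths, 290 bp in total
print(nx(contigs, 50), nx(contigs, 75), nx(contigs, 90), aun(contigs))
# prints: 80 50 20 70.0
```

The same nx function covers a custom N{e} metric by passing a different value of x.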
PARAMETERS
- -o, --sortby
  Sort by field: 'N50' (default), 'min', 'max', 'seqs', 'size', 'path'. By default the order is descending for numeric fields and ascending for 'path'. See -r, --reverse.
- -r, --reverse
  Reverse the sort order (see -o).
- -f, --format
  Output format: default, tsv, csv, json, custom, screen. See below for format-specific switches. Specify "list" to list the available formats.
- -e
  Also calculate a custom N{e} metric. Expects an integer 0 < e < 100.
- -s, --separator
  Separator to be used in 'tsv' output. Default: tab. The 'tsv' format prints a header line, followed by one line for each input file with: file path (as received), total number of sequences, total size in bp, N50, and the minimum and maximum sequence lengths.
- -b, --basename
  Instead of printing the path of each file, print only the filename, stripping any relative or absolute path. See -a. Warning: if you read multiple files with the same basename, only one will be printed. This is the intended behaviour, and you will only receive a warning.
- -a, --abspath
  Instead of printing the path of each file as supplied by the user (which can be relative), print the absolute path. Overrides -b (basename). See -b.
- -u, --noheader
  When used with the 'tsv' output format, suppresses the header line.
- -n, --nonewline
  If used with the 'default' (or 'csv') output format, will NOT print the newline character after the N50 for a single file. Useful in bash scripting:

      n50=$(n50.pl filename);
- -t, --template
  String to be used with the 'custom' format. It is used as a template for each sample, replacing {new} with a newline, {tab} with a tab, and {N50}, {seqs}, {size}, {path} with the sample's N50, number of sequences, total size in bp and file path respectively (the latter respects --basename if used).
- -q, --thousands-sep
  Add the thousands separator to all printed numbers. Enabled by default with --format screen (-x).
- -p, --pretty
  If used with the 'json' output format, formats the JSON in pretty-print mode. Example:

      {
         "file1.fa" : {
            "size" : 290,
            "N50" : 290,
            "seqs" : 2
         },
         "file2.fa" : {
            "N50" : 456,
            "size" : 456,
            "seqs" : 2
         }
      }
- -h, --help
  Display this full help message and quit, even if other arguments are supplied.
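The sorting rules described for -o and -r can be sketched in Python as follows. This is only an illustration of the documented behaviour (numeric fields descending, 'path' ascending, --reverse flipping the order), not the script's implementation; sort_stats and NUMERIC_FIELDS are hypothetical names, and the rows reuse the values from the tsv example in this document.

```python
# Fields that the documentation describes as numeric (sorted descending by default).
NUMERIC_FIELDS = {"N50", "min", "max", "seqs", "size"}


def sort_stats(rows, sortby="N50", reverse=False):
    """Sort per-file statistics the way -o/--sortby and -r/--reverse are described."""
    descending = sortby in NUMERIC_FIELDS  # 'path' defaults to ascending
    if reverse:
        descending = not descending
    return sorted(rows, key=lambda row: row[sortby], reverse=descending)


rows = [
    {"path": "test2.fa", "seqs": 8, "size": 825, "N50": 189, "min": 4, "max": 256},
    {"path": "reads.fa", "seqs": 5, "size": 247, "N50": 100, "min": 6, "max": 102},
    {"path": "small.fa", "seqs": 6, "size": 130, "N50": 65, "min": 4, "max": 65},
]

# Equivalent in spirit to "-o max -r": smallest maximum length first.
print([row["path"] for row in sort_stats(rows, "max", reverse=True)])
# prints: ['small.fa', 'reads.fa', 'test2.fa']
```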
Output formats
These are the values for --format.
- tsv (tab separated values)

      #path     seqs  size  N50  min  max
      test2.fa  8     825   189  4    256
      reads.fa  5     247   100  6    102
      small.fa  6     130   65   4    65
- csv (comma separated values)
  Same as --format tsv and --separator ,:

      #path,seqs,size,N50,min,max
      test.fa,8,825,189,4,256
      reads.fa,5,247,100,6,102
      small_test.fa,6,130,65,4,65
- screen (screen friendly)
  Use -x as a shortcut for --format screen. Enables --thousands-sep (-q) by default.

    .-----------------------------------------------------------------------------------------.
    | File          | Seqs | Total bp | N50    | min   | max    | N75   | N90   | auN         |
    +---------------+------+----------+--------+-------+--------+-------+-------+-------------+
    | big.fa        | 4    | 18,359   | 11,840 | 2,167 | 11,840 | 2,176 | 2,167 | 8923.21,984 |
    | sim1.fa       | 39   | 18,864   | 679    | 20    | 971    | 408   | 313   | 733.51,389  |
    | sim2.fa       | 21   | 7,530    | 493    | 68    | 989    | 330   | 174   | 575.47,012  |
    | test.fa       | 8    | 825      | 189    | 4     | 256    | 168   | 168   | 260.99,515  |
    '---------------+------+----------+--------+-------+--------+-------+-------+-------------'
- json (JSON)
  Use -j as a shortcut for --format json.

      {
         "data/sim1.fa" : {
            "seqs" : 39,
            "N50" : 679,
            "max" : 971,
            "N90" : 313,
            "min" : 20,
            "size" : 18864,
            "auN" : 733.51389,
            "N75" : 408
         },
         "data/sim2.fa" : {
            "max" : 989,
            "seqs" : 21,
            "N50" : 493,
            "N90" : 174,
            "min" : 68,
            "auN" : 575.47012,
            "N75" : 330,
            "size" : 7530
         }
      }
- custom
  Prints the output using the template string provided with -t TEMPLATE. Fields are in the {field_name} format. {new}/{n}/\n is the newline, {tab}/{t}/\t is a tab. All the keys of the JSON object are valid fields: {seqs}, {N50}, {min}, {max}, {size}.
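The template substitution used by the custom format can be sketched in Python as below. This is an illustrative reimplementation, not the script's code: render and SPECIAL are hypothetical names, \n and \t are assumed to arrive as literal two-character sequences, and leaving unknown placeholders untouched is an assumption rather than documented behaviour.

```python
import re

# Placeholders that expand to whitespace, as listed for the custom format.
SPECIAL = {"new": "\n", "n": "\n", "tab": "\t", "t": "\t"}


def render(template, stats):
    """Expand {field} placeholders in a template using one file's statistics."""
    def substitute(match):
        name = match.group(1)
        if name in SPECIAL:
            return SPECIAL[name]
        if name in stats:
            return str(stats[name])
        return match.group(0)  # assumption: unknown placeholders are kept as-is

    # Literal backslash sequences typed on the command line become real whitespace.
    text = template.replace("\\n", "\n").replace("\\t", "\t")
    return re.sub(r"\{(\w+)\}", substitute, text)


stats = {"path": "data/sim1.fa", "N50": 679, "size": 18864, "seqs": 39}
line = render("{path}{tab}N50={N50};Sum={size}{new}", stats)
print(repr(line))
# prints: 'data/sim1.fa\tN50=679;Sum=18864\n'
```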
EXAMPLE USAGES
Screen friendly table (-x is a shortcut for --format screen), sorted by N50 descending (default):
n50.pl -x files/*.fa
Screen friendly table, sorted by the longest sequence length (--sortby max), ascending (--reverse):
n50.pl -x -o max -r files/*.fa
Tabular (tsv) output is the default for multiple files:
n50.pl -o max -r files/*.fa
A custom output format:
n50.pl data/*.fa -f custom -t '{path}{tab}N50={N50};Sum={size}{new}'
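The JSON output (--format json) is convenient for downstream scripts. As a hedged sketch, the Python snippet below parses a report shaped like the json example in this document and picks the file with the highest N50. The helper name best_by_n50 and the sample variable are hypothetical, and the sample text is abridged from that example.

```python
import json

# Abridged copy of the json example shown under "Output formats".
sample = """
{
   "data/sim1.fa" : { "seqs" : 39, "N50" : 679, "size" : 18864 },
   "data/sim2.fa" : { "seqs" : 21, "N50" : 493, "size" : 7530 }
}
"""


def best_by_n50(report_text):
    """Return the path of the file with the largest N50 in a JSON report."""
    report = json.loads(report_text)
    return max(report, key=lambda path: report[path]["N50"])


print(best_by_n50(sample))
# prints: data/sim1.fa
```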
COPYRIGHT
Copyright (C) 2017-2019 Andrea Telatin
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
CITING
Telatin A, Fariselli P, Birolo G. SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files. Bioengineering 2021, 8, 59. https://doi.org/10.3390/bioengineering8050059
CONTRIBUTING, BUGS
The repository of this project is available at https://github.com/quadram-institute-bioscience/seqfu/. The Perl module is inside the "n50" subdirectory.
AUTHOR
Andrea Telatin <andrea@telatin.com>
COPYRIGHT AND LICENSE
This software is Copyright (c) 2018-2020 by Andrea Telatin.
This is free software, licensed under:
The MIT (X11) License