NAME

n50 - A script to calculate N50 from one or multiple FASTA/FASTQ files.

VERSION

version 1.4.0

SYNOPSIS

n50.pl [options] [FILE1 FILE2 FILE3...]

DESCRIPTION

This program parses a list of FASTA/FASTQ files calculating for each one the number of sequences, the sum of sequences lengths and the N50, N75, N90 and auN*. It will print the result in different formats, by default only the N50 is printed for a single file and all metrics in TSV format for multiple files.

*: See https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity

NAME

n50 - A script to calculate N50 from one or multiple FASTA/FASTQ files.

VERSION

version 1.3.0

PARAMETERS

-o, --sortby

Sort by field: 'N50' (default), 'min', 'max', 'seqs', 'size', 'path'. By default will be descending for numeric fields, ascending for 'path'. See -r, --reverse.

-r, --reverse

Reverse sort (see: -o);

-f, --format

Output format: default, tsv, json, custom, screen. See below for format specific switches. Specify "list" to list available formats.

-e

Also calculate a custom N{e} metric. Expecting an integer 0 < e < 100.

-s, --separator

Separator to be used in 'tsv' output. Default: tab. The 'tsv' format will print a header line, followed by a line for each file given as input with: file path, as received, total number of sequences, total size in bp, and finally N50.

-b, --basename

Instead of printing the path of each file, will only print the filename, stripping relative or absolute paths to it. See -a. Warning: if you are reading multiple files with the same basename, only one will be printed. This is the intended behaviour and you will only receive a warning.

-a, --abspath

Instead of printing the path of each file, as supplied by the user (can be relative), it will the absolute path. Will override -b (basename). See -b.

-u, --noheader

When used with 'tsv' output format, will suppress header line.

-n, --nonewline

If used with 'default' (or 'csv' output format), will NOT print the newline character after the N50 for a single file. Useful in bash scripting:

n50=$(n50.pl filename);
-t, --template

String to be used with 'custom' format. Will be used as template string for each sample, replacing {new} with newlines, {tab} with tab and {N50}, {seqs}, {size}, {path} with sample's N50, number of sequences, total size in bp and file path respectively (the latter will respect --basename if used).

-q, --thousands-sep

Add the thousands separator in all the printed numbers. Enabled by default with --format screen (-x).

-p, --pretty

If used with 'json' output format, will format the JSON in pretty print mode. Example:

{
  "file1.fa" : {
    "size" : 290,
    "N50"  : 290,
    "seqs" : 2
 },
  "file2.fa" : {
    "N50"  : 456,
    "size" : 456,
    "seqs" : 2
 }
}
-h, --help

Will display this full help message and quit, even if other arguments are supplied.

Output formats

These are the values for --format.

tsv (tab separated values)
#path       seqs  size  N50   min   max
test2.fa    8     825   189   4     256
reads.fa    5     247   100   6     102
small.fa    6     130   65    4     65
csv (comma separated values)

Same as --format tsv and --separator ,:

#path,seqs,size,N50,min,max
test.fa,8,825,189,4,256
reads.fa,5,247,100,6,102
small_test.fa,6,130,65,4,65
screen (screen friendly)

Use -x as shortcut for --format screen. Enables --thousands-sep (-q) by default.

.-----------------------------------------------------------------------------------------.
| File          | Seqs | Total bp | N50    | min   | max    | N75   | N90   | auN         |
+---------------+------+----------+--------+-------+--------+-------+-------+-------------+
| big.fa        |    4 |   18,359 | 11,840 | 2,167 | 11,840 | 2,176 | 2,167 | 8923.21,984 |
| sim1.fa       |   39 |   18,864 |    679 |    20 |    971 |   408 |   313 |  733.51,389 |
| sim2.fa       |   21 |    7,530 |    493 |    68 |    989 |   330 |   174 |  575.47,012 |
| test.fa       |    8 |      825 |    189 |     4 |    256 |   168 |   168 |  260.99,515 |
'---------------+------+----------+--------+-------+--------+-------+-------+-------------'
json (JSON)

Use -j as shortcut for --format json.

{
   "data/sim1.fa" : {
      "seqs" : 39,
      "N50" : 679,
      "max" : 971,
      "N90" : 313,
      "min" : 20,
      "size" : 18864,
      "auN" : 733.51389,
      "N75" : 408
   },
   "data/sim2.fa" : {
      "max" : 989,
      "seqs" : 21,
      "N50" : 493,
      "N90" : 174,
      "min" : 68,
      "auN" : 575.47012,
      "N75" : 330,
      "size" : 7530
   }
}
custom

Will print the output using the template string provided with -t TEMPLATE. Fields are in the {field_name} format. {new}/{n}/\n is the newline, {tab}/{t}/\t is a tab. All the keys of the JSON object are valid fields: {seqs}, {N50}, {min}, {max}, {size}.

EXAMPLE USAGES

Screen friendly table (-x is a shortcut for --format screen), sorted by N50 descending (default):

n50.pl -x files/*.fa

Screen friendly table, sorted by total contig length (--sortby max) ascending (--reverse):

n50.pl -x -o max -r files/*.fa

Tabular (tsv) output is default:

n50.pl -o max -r files/*.fa

A custom output format:

n50.pl data/*.fa -f custom -t '{path}{tab}N50={N50};Sum={size}{new}'

COPYRIGHT

Copyright (C) 2017-2019 Andrea Telatin

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

AUTHOR

Andrea Telatin <andrea@telatin.com>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2018-2020 by Andrea Telatin.

This is free software, licensed under:

The MIT (X11) License

AUTHOR

Andrea Telatin <andrea@telatin.com>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2018-2020 by Andrea Telatin.

This is free software, licensed under:

The MIT (X11) License