NAME

fu-len - Filter FASTA/FASTQ files by sequence length

VERSION

version 1.7.0

SYNOPSIS

fu-len [options] FILE1 [FILE2 ...]

DESCRIPTION

fu-len filters sequences from FASTA/FASTQ files by length. It can also reformat sequences (FASTA conversion and line wrapping) and rename them. Input files may be gzipped, and '-' as the filename reads from standard input.
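
The input handling described above can be illustrated with a short Python sketch. This is not fu-len's Perl code: the function name `open_input` is hypothetical, and detecting gzip by its two magic bytes (rather than by file extension) is an assumption made for the example.

```python
import gzip
import sys
from typing import TextIO


def open_input(path: str) -> TextIO:
    """Open a sequence file for reading as text.

    '-' means standard input. Gzip detection via the magic bytes
    0x1f 0x8b is an assumption of this sketch, not documented behaviour.
    """
    if path == "-":
        return sys.stdin
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")  # transparently decompress
    return open(path, "r")
```

A caller would loop over the file arguments and pass each one to `open_input`, so plain, gzipped, and piped input all yield the same kind of text stream.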

OPTIONS

Input/Output Control

-m, --min INT

Minimum length to keep a sequence. Sequences shorter than this will be filtered out.

-x, --max INT

Maximum length to keep a sequence. Sequences longer than this will be filtered out.
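
Taken together, --min and --max define an inclusive range: a sequence is kept when its length is at least the minimum and at most the maximum. A minimal Python sketch of that rule follows. It is an illustration, not fu-len's implementation; the helper name `keep_by_length` is hypothetical, and treating an unset bound as "no limit" is an assumption.

```python
from typing import Optional


def keep_by_length(seq: str, min_len: Optional[int] = None,
                   max_len: Optional[int] = None) -> bool:
    """Return True if the sequence passes the length filter.

    Hypothetical helper: an unset bound (None) is assumed to mean "no limit".
    """
    length = len(seq)
    if min_len is not None and length < min_len:
        return False  # shorter than --min: filtered out
    if max_len is not None and length > max_len:
        return False  # longer than --max: filtered out
    return True


if __name__ == "__main__":
    reads = {"short": "ACGT", "ok": "ACGTACGT", "long": "ACGT" * 5}
    kept = [name for name, seq in reads.items()
            if keep_by_length(seq, min_len=5, max_len=10)]
    print(kept)  # only the 8 bp read falls inside 5..10
```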

-f, --fasta

Force output in FASTA format, regardless of input format.

-w, --fasta-width INT

Wrap FASTA sequence lines to the specified width. If not specified, sequences will be written as single lines.
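
To make the effect of --fasta-width concrete, here is a small Python sketch of fixed-width wrapping. The function name `wrap_sequence` is hypothetical, and using a width of 0 to stand for "option not given" is an assumption of this example rather than documented behaviour.

```python
def wrap_sequence(seq: str, width: int = 0) -> str:
    """Split seq into lines of at most `width` characters.

    A width of 0 (assumed here to mean "option not given")
    leaves the sequence on a single line.
    """
    if width <= 0:
        return seq
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


if __name__ == "__main__":
    print(">seq1")
    print(wrap_sequence("ACGTACGTAC", width=4))
```

With a width of 4, the 10-base sequence is written as two full lines and a final line holding the remaining two bases.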

Sequence Naming

-n, --namescheme STR

Choose how sequence names should be generated. Available schemes:

  • raw - Use original sequence names (default)

  • num - Number sequences sequentially (see --prefix)

  • file - Use input filename as prefix followed by sequence number

-p, --prefix STR

Prefix to use for sequence names when using the 'num' name scheme.

-s, --separator STR

Separator to use between prefix and number (default: '.').
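
The three name schemes can be summarised with a short Python sketch. It rests on several assumptions that this page does not confirm: the counter is 1-based, the 'file' scheme uses the file's base name up to the first dot, and the separator applies to the 'file' scheme as well as to 'num'. The function name `make_name` is hypothetical.

```python
import os


def make_name(scheme: str, original: str, index: int,
              prefix: str = "", separator: str = ".",
              filename: str = "") -> str:
    """Build an output sequence name for the given scheme.

    Assumptions (not confirmed by the documentation): the caller passes
    a 1-based index, and 'file' uses the base name up to the first dot.
    """
    if scheme == "raw":
        return original  # keep the input name unchanged
    if scheme == "num":
        return f"{prefix}{separator}{index}"
    if scheme == "file":
        base = os.path.basename(filename).split(".")[0]
        return f"{base}{separator}{index}"
    raise ValueError(f"unknown name scheme: {scheme}")


if __name__ == "__main__":
    print(make_name("num", "read1", 3, prefix="seq"))
    print(make_name("file", "read1", 2, filename="/data/sample.fq.gz"))
```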

Sequence Annotation

-l, --len

Add sequence length as a comment to each sequence header.

-c, --strip-comment

Remove existing sequence comments.

Other Options

-v, --verbose

Print verbose information to STDERR.

--version

Print version information and exit.

EXAMPLES

Filter sequences by length:

# Keep sequences between 100 and 1000 bp
fu-len -m 100 -x 1000 input.fa > filtered.fa

Convert FASTQ to wrapped FASTA:

# Convert to FASTA and wrap to 60 characters per line
fu-len -f -w 60 input.fastq > output.fa

Number sequences with custom prefix:

# Add sequential numbers and length information
fu-len -n num -p 'seq' -l input.fa > numbered.fa

Process multiple files:

# Keep sequences of at least 500 bp from both files and force FASTA output
fu-len -m 500 -f file1.fq file2.fa > combined.fa

NOTES

When processing multiple files, be aware that:

  • Duplicate sequence names can cause errors

  • Mixing FASTA and FASTQ files without --fasta may cause formatting issues

  • Memory usage increases when checking for duplicate names
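
The memory cost of duplicate checking follows from how such a check generally works: every name seen so far has to be remembered until the run ends. A minimal Python sketch of the idea, not fu-len's own code; the function name `check_unique` is hypothetical, and raising an error on the first repeat (rather than warning) is an assumption.

```python
from typing import Iterable, Set


def check_unique(names: Iterable[str]) -> Set[str]:
    """Return the set of names seen, raising ValueError on the first repeat.

    Each distinct name stays in memory for the whole run, so memory
    use grows with the number of sequences across all input files.
    """
    seen: Set[str] = set()
    for name in names:
        if name in seen:
            raise ValueError(f"duplicate sequence name: {name}")
        seen.add(name)
    return seen
```

Renaming sequences with the 'num' or 'file' scheme is one way to avoid clashes when the same read names occur in several input files.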

MODERN ALTERNATIVE

This suite of tools has been superseded by SeqFu, a compiled program providing faster and safer tools for sequence analysis. The Perl suite is still maintained because Perl scripts can be more portable in some environments.

SeqFu is available at https://github.com/telatin/seqfu2 and can be installed from Bioconda:

conda install -c bioconda seqfu

CITING

Telatin A, Fariselli P, Birolo G. SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files. Bioengineering 2021, 8, 59. https://doi.org/10.3390/bioengineering8050059

AUTHOR

Andrea Telatin <andrea@telatin.com>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2018-2027 by Quadram Institute Bioscience.

This is free software, licensed under:

The MIT (X11) License