NAME
fu-uniq - Dereplicate sequences and generate abundance information
VERSION
version 1.7.0
SYNOPSIS
fu-uniq [options] input.fa > uniq.fa
DESCRIPTION
fu-uniq dereplicates DNA sequences: it collapses identical sequences into a single record and reports how many times each one was seen, using USEARCH-style size labels (;size=N;). Sequences are compared by exact match, and the naming and layout of the output records can be customized.
Key features:
- Dereplicates sequences while maintaining abundance information
- Supports USEARCH-style size annotations
- Flexible sequence naming options
- Handles both FASTA and FASTQ inputs
- Processes gzipped files automatically
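The idea of dereplication can be sketched in a few lines of Python. This is only an illustration of the concept, not fu-uniq's own code (fu-uniq is a Perl script); the function name `dereplicate` and the in-memory list of (name, sequence) pairs are assumptions made for the example. Sequences are upper-cased before comparison, following the note that comparison is case-insensitive.

```python
def dereplicate(records):
    """Collapse identical sequences and count their occurrences.

    `records` is an iterable of (name, sequence) pairs (a hypothetical
    input format). Returns (first_name, sequence, count) tuples in order
    of first appearance.
    """
    clusters = {}  # upper-cased sequence -> [first_name, first_sequence, count]
    for name, seq in records:
        key = seq.upper()
        if key in clusters:
            clusters[key][2] += 1
        else:
            clusters[key] = [name, seq, 1]
    return [tuple(entry) for entry in clusters.values()]


reads = [("r1", "ACGT"), ("r2", "TTGA"), ("r3", "acgt")]
for name, seq, count in dereplicate(reads):
    print(f">{name};size={count};")
    print(seq)
```

Here `r1` and `r3` collapse into one record that keeps the name of the first occurrence, while `r2` stays a singleton.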
OPTIONS
Sequence Processing
- -k, --keepname
Use the name of the first occurrence of each unique sequence as the cluster name. This is useful for maintaining meaningful identifiers. Default: ON
- -m, --min-size N
Only output sequences that appear at least N times. This helps filter out rare sequences or potential sequencing errors. Default: 0 (no filtering)
- --size-as-comment
Add size information as a comment rather than part of the sequence name. This affects the format of the output headers. Default: OFF
Example with option OFF: >seq1;size=10;
Example with option ON: >seq1 size=10;
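The effect of --min-size and --size-as-comment can be sketched as follows. This is a minimal Python illustration, not the tool's implementation; the helper names `format_header` and `filter_by_min_size` and the (name, size) pairs are hypothetical. The two header layouts are taken from the examples given for --size-as-comment.

```python
def format_header(name, size, size_as_comment=False):
    """Build a FASTA header carrying a USEARCH-style size annotation."""
    if size_as_comment:
        # Size is placed after a space, i.e. in the comment field
        return f">{name} size={size};"
    # Size is appended to the sequence name itself
    return f">{name};size={size};"


def filter_by_min_size(clusters, min_size=0):
    """Keep (name, size) pairs whose size is at least `min_size`."""
    return [(name, size) for name, size in clusters if size >= min_size]


clusters = [("seq1", 10), ("seq2", 1)]
for name, size in filter_by_min_size(clusters, min_size=2):
    print(format_header(name, size))
    print(format_header(name, size, size_as_comment=True))
```

With a threshold of 0, every cluster passes the filter, which corresponds to the documented default of no filtering.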
Output Formatting
- -p, --prefix STR
Prefix for sequence names when not using --keepname. Default: 'seq'
- -s, --separator STR
Character(s) to separate prefix from sequence number. Default: '.'
- -w, --line-width N
Width for wrapping FASTA sequence lines. Use 0 for single-line sequences. Default: 80
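How the three formatting options interact can be illustrated with a short Python sketch. It is not taken from fu-uniq; the helpers `cluster_name` and `wrap_sequence` are hypothetical, and numbering clusters from 1 is an assumption, since the starting number is not stated here.

```python
def cluster_name(index, prefix="seq", separator="."):
    """Join prefix, separator and cluster number (numbering from 1 is assumed)."""
    return f"{prefix}{separator}{index}"


def wrap_sequence(seq, width=80):
    """Split a sequence into lines of `width` characters; 0 keeps one line."""
    if width <= 0:
        return [seq]
    return [seq[i:i + width] for i in range(0, len(seq), width)]


print(cluster_name(1))                     # default prefix and separator
print(cluster_name(3, "cluster", "_"))     # as with -p 'cluster' -s '_'
print(wrap_sequence("ACGTACGTAC", 4))      # three lines: 4 + 4 + 2 bases
print(wrap_sequence("ACGTACGTAC", 0))      # single line
```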
EXAMPLES
Basic deduplication:
# Find unique sequences and add abundance information
fu-uniq input.fa > uniq.fa
Keep only abundant sequences:
# Keep sequences that appear at least 10 times
fu-uniq -m 10 input.fa > abundant.fa
Custom sequence naming:
# Use custom prefix and separator
fu-uniq -p 'cluster' -s '_' input.fa > clusters.fa
Process multiple files:
# Combine and deduplicate multiple files
fu-uniq file1.fa file2.fa > combined_uniq.fa
Add size as comment:
# Place size information in sequence comment
fu-uniq --size-as-comment input.fa > commented.fa
NOTES
- Memory usage scales with the number of unique sequences.
- Sequence comparison is case-insensitive.
- Size annotations in input files (;size=N;) are respected and combined.
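A rough Python sketch of how existing size annotations might be combined is given below. It is an assumption-laden illustration, not fu-uniq's code: the regular expression, the helper names `split_size` and `combine_sizes`, and the choice to count an unannotated record as size 1 are all hypothetical. It keeps one dictionary entry per unique sequence, which is why memory grows with the number of unique sequences rather than with the number of reads.

```python
import re

# Matches a USEARCH-style annotation such as ";size=10;" (trailing ";" optional)
SIZE_RE = re.compile(r";size=(\d+);?")


def split_size(header):
    """Return (name_without_annotation, size); size is assumed 1 if absent."""
    match = SIZE_RE.search(header)
    if match is None:
        return header, 1
    name = header[:match.start()] + header[match.end():]
    return name, int(match.group(1))


def combine_sizes(records):
    """Sum sizes per sequence, comparing sequences case-insensitively."""
    totals = {}  # one entry per unique sequence
    for header, seq in records:
        _, size = split_size(header)
        key = seq.upper()
        totals[key] = totals.get(key, 0) + size
    return totals


records = [("a;size=3;", "ACGT"), ("b", "acgt"), ("c;size=2;", "TTTT")]
print(combine_sizes(records))
```

In this example the pre-counted record `a` (size 3) and the plain record `b` describe the same sequence, so their sizes add up to 4.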
MODERN ALTERNATIVE
This suite of tools has been superseded by SeqFu, a compiled program providing faster and safer tools for sequence analysis. The suite is still maintained because Perl scripts can be more portable than compiled binaries in some environments.
SeqFu is available at https://github.com/telatin/seqfu2 and can be installed with BioConda:
conda install -c bioconda seqfu
CITING
Telatin A, Fariselli P, Birolo G. SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files. Bioengineering 2021, 8, 59. https://doi.org/10.3390/bioengineering8050059
AUTHOR
Andrea Telatin <andrea@telatin.com>
COPYRIGHT AND LICENSE
This software is Copyright (c) 2018-2027 by Quadram Institute Bioscience.
This is free software, licensed under:
The MIT (X11) License