JSONL::Subset
A Perl module to extract a percentage of lines from a JSONL file. Useful for sampling large datasets.
Installation
perl Makefile.PL
make
make test
make install
Usage
use JSONL::Subset qw(subset_jsonl);
subset_jsonl(
infile => "data.jsonl",
outfile => "subset.jsonl",
percent => 10,
mode => "random", # or "start", "end"
seed => 42,
streaming => 1
);
Or from the command line:
jsonl-subset --in data.jsonl --out sample.jsonl --percent 5 --mode random --seed 42 --streaming
Options
infile
Path to the file you want to import from.
outfile
Path to where you want to save the export.
percent
Percentage of lines to retain. Must specify percent XOR lines.
lines
Number of lines to retain. Must specify lines XOR percent.
mode
- random returns random lines
- start returns lines from the start
- end returns lines from the end
seed
Only used with random, for reproducability. (optional)
streaming
If set, infile will be streamed line by line. This makes the process take less RAM, but more wall time.
Recommended for large JSONL files.