JSONL::Subset

A Perl module that extracts a subset of lines from a JSONL file, given either as a percentage or as an absolute line count. Useful for sampling large datasets.

Installation

perl Makefile.PL
make
make test
make install

Usage

use JSONL::Subset qw(subset_jsonl);

subset_jsonl(
    infile    => "data.jsonl",
    outfile   => "subset.jsonl",
    percent   => 10,
    mode      => "random", # or "start", "end"
    seed      => 42,
    streaming => 1
);

Or from the command line:

jsonl-subset --in data.jsonl --out sample.jsonl --percent 5 --mode random --seed 42 --streaming

Options

infile

Path to the input JSONL file.

outfile

Path where the resulting subset is written.

percent

Percentage of lines to retain. Specify exactly one of percent or lines, never both.

lines

Number of lines to retain. Specify exactly one of lines or percent, never both.
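
The percent/lines rule can be pictured with a minimal sketch. The helper resolve_count and its total argument are hypothetical names used only for illustration, and rounding down is an assumption; the module's own validation and rounding may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of JSONL::Subset: work out how many lines
# to keep from a file of `total` lines, given exactly one of `percent`
# or `lines`.
sub resolve_count {
    my (%args) = @_;
    my $total = $args{total};

    # Exactly one of the two options must be defined (XOR).
    die "Must specify percent XOR lines\n"
        unless (defined($args{percent}) xor defined($args{lines}));

    my $count = defined $args{percent}
        ? int($total * $args{percent} / 100)   # assumption: round down
        : $args{lines};

    # Never ask for more lines than the file contains.
    $count = $total if $count > $total;
    return $count;
}
```

Under these assumptions, resolve_count(total => 1000, percent => 10) yields 100, and passing both options (or neither) dies with an error.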

mode

Selection mode: "random", "start", or "end".

seed

Optional. Seed for the random number generator. Only used when mode is "random", to make the selection reproducible.
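
To show how mode and seed could interact, here is a rough sketch. The helper pick_indices is hypothetical, and the reading of "start" as the first lines and "end" as the last lines is an assumption; the module's actual selection logic may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of JSONL::Subset: choose $count zero-based
# line indices out of $total according to $mode.
sub pick_indices {
    my ($total, $count, $mode, $seed) = @_;

    # Assumption: "start" keeps the first lines, "end" keeps the last ones.
    return (0 .. $count - 1)               if $mode eq 'start';
    return ($total - $count .. $total - 1) if $mode eq 'end';
    die "Unknown mode: $mode\n" unless $mode eq 'random';

    # Seeding makes rand() produce the same sequence on every run.
    srand($seed) if defined $seed;

    # Fisher-Yates shuffle of all indices, then keep the first $count.
    my @idx = (0 .. $total - 1);
    for (my $i = $#idx; $i > 0; $i--) {
        my $j = int(rand($i + 1));
        @idx[$i, $j] = @idx[$j, $i];
    }

    # Sort so the kept lines stay in their original file order.
    return sort { $a <=> $b } @idx[0 .. $count - 1];
}
```

Calling pick_indices(100, 10, 'random', 42) twice returns the same ten indices, which is the kind of reproducibility a seed is meant to provide.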

streaming

If set, infile is read line by line rather than loaded into memory all at once. This uses less RAM but takes more wall-clock time.

Recommended for large JSONL files.
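
The RAM versus wall-time trade-off can be illustrated with a two-pass sketch: one pass to count lines, a second to copy the chosen ones. The helpers count_lines and copy_selected are hypothetical, and the two-pass design is an assumption made for illustration, not a description of how the module streams internally.

```perl
use strict;
use warnings;

# Hypothetical helper: count lines while holding only one line in memory.
sub count_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    my $n = 0;
    while (defined(my $line = <$fh>)) {
        $n++;
    }
    close $fh;
    return $n;
}

# Hypothetical helper: copy only the lines whose zero-based index is a
# true key in the hash reference $keep.
sub copy_selected {
    my ($in, $out, $keep) = @_;
    open my $src, '<', $in  or die "Cannot open $in: $!\n";
    open my $dst, '>', $out or die "Cannot open $out: $!\n";
    my $i = 0;
    while (defined(my $line = <$src>)) {
        print {$dst} $line if $keep->{$i};
        $i++;
    }
    close $src;
    close $dst or die "Cannot close $out: $!\n";
    return;
}
```

In this sketch the file is read twice, which is where the extra wall time would come from, while memory use stays roughly constant regardless of file size.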