NAME
Flat::Profile - Streaming-first profiling for CSV/TSV flat files
SYNOPSIS
use Flat::Profile;
my $p = Flat::Profile->new();
my $it = $p->iter_rows(
    path       => "data.csv",
    has_header => 1,
    delimiter  => ",",
    encoding   => "UTF-8",
);
while (my $row = $it->next_row) {
    # $row is an arrayref
}
my $report = $p->profile_file(
    path        => "data.csv",
    has_header  => 1,
    delimiter   => ",",
    null_empty  => 1,
    null_tokens => ["NULL", "NA"],
    example_cap => 10,
    max_errors  => 1000,
);
DESCRIPTION
Flat::Profile is part of the Flat::* series. It profiles CSV/TSV inputs in a streaming-first manner and is aimed at practical ETL and legacy data workflows.
Design goals:
Streaming-first (single pass, predictable memory)
Practical diagnostics (ragged rows, null policy, examples)
Stable report format intended to feed Flat::Schema / Flat::Validate
METHODS
new
my $p = Flat::Profile->new();
Constructor. Accepts named arguments, but none are used yet; they are reserved for future configuration.
iter_rows
my $it = $p->iter_rows(%args);
Returns an iterator object (Flat::Profile::Iterator).
Required named arguments:
path
Common named arguments:
has_header (boolean)
delimiter ("," or "\t")
encoding (default "UTF-8")
profile_file
my $report = $p->profile_file(%args);
Profiles a CSV/TSV file in a single streaming pass and returns a hashref report.
Key named arguments include:
path (required)
has_header
delimiter
encoding
null_empty (default true)
null_tokens (arrayref; default empty)
example_cap (default 10)
max_errors (error threshold at which profiling stops; default 1000)
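The documented defaults can be summarized as a small merge step. This is a minimal sketch: with_profile_defaults is a hypothetical helper, not part of Flat::Profile, and it assumes profile_file shares the "UTF-8" encoding default documented for iter_rows.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Flat::Profile's API): merges caller
# arguments over the defaults listed in this documentation.
sub with_profile_defaults {
    my (%args) = @_;
    die "path is required\n" unless defined $args{path};

    my %defaults = (
        encoding    => "UTF-8",   # assumption: same default as iter_rows
        null_empty  => 1,
        null_tokens => [],
        example_cap => 10,
        max_errors  => 1000,
    );

    # Later keys win, so caller-supplied values override the defaults.
    return ( %defaults, %args );
}

my %effective = with_profile_defaults(
    path       => "data.csv",
    has_header => 1,
    max_errors => 50,
);
```

Here %effective carries max_errors => 50 from the caller, while example_cap and null_empty keep their documented defaults.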
NULL SEMANTICS
By default, empty string is treated as null:
null_empty => 1 # default
To treat empty string as a value:
null_empty => 0
You can also treat specific exact tokens as null:
null_tokens => ["NULL", "N/A"]
Notes:
Token matching is exact (no trimming, case-sensitive) in v1.
undef is always treated as null.
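The rules above can be sketched as a single predicate. This is an illustration only: is_null is a hypothetical helper name, and the module's internal implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the documented null policy.
sub is_null {
    my ($value, %opt) = @_;
    my $null_empty  = exists $opt{null_empty} ? $opt{null_empty} : 1;  # default true
    my $null_tokens = $opt{null_tokens} || [];                         # default empty

    return 1 if !defined $value;               # undef is always null
    return 1 if $null_empty && $value eq "";   # empty string, when enabled
    for my $token (@$null_tokens) {
        return 1 if $value eq $token;          # exact match: no trimming, case-sensitive
    }
    return 0;
}

my @tokens = ("NULL", "N/A");
for my $v ("NULL", "null", " NULL", "") {
    printf "%-8s %s\n", "'$v'", is_null($v, null_tokens => \@tokens) ? "null" : "value";
}
```

With these tokens, "NULL" and the empty string count as null, while "null" and " NULL" count as values because matching is exact.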
RAGGED ROWS
Flat::Profile tracks width mismatches relative to an expected width:
If has_header is true, expected width is the header width.
Otherwise, expected width is the first data row width.
Row numbers in ragged examples use data-row numbering (header excluded): the first data row is row_number 1.
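The expected-width rule and the row numbering can be sketched as follows. This is a simplified illustration over in-memory rows rather than a stream; find_ragged and the keys of the returned hashes are hypothetical and do not describe the report's actual ragged structure.

```perl
use strict;
use warnings;

# Hypothetical helper: $rows is an arrayref of arrayrefs, with the header
# first when has_header is true. Returns one hashref per mismatched row.
sub find_ragged {
    my ($rows, %opt) = @_;
    my @data = @$rows;
    my $expected;

    if ($opt{has_header} && @data) {
        my $header = shift @data;
        $expected = scalar @$header;       # header width is the expected width
    }

    my @ragged;
    my $row_number = 0;                    # data-row numbering, header excluded
    for my $row (@data) {
        $row_number++;
        my $width = scalar @$row;
        $expected = $width unless defined $expected;   # first data row sets it
        next if $width == $expected;
        push @ragged, {
            row_number => $row_number,
            width      => $width,
            expected   => $expected,
        };
    }
    return @ragged;
}

my @found = find_ragged(
    [ [qw(id name city)], [1, "Ann", "Oslo"], [2, "Bob"], [3, "Cy", "Rome", "x"] ],
    has_header => 1,
);
printf "row %d: width %d, expected %d\n", @{$_}{qw(row_number width expected)} for @found;
```

In this example the short row is reported as row 2 and the long row as row 3, since the header line is not counted.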
REPORT FORMAT
profile_file() returns a hashref with stable top-level metadata including:
report_version
generated_at (UTC timestamp string)
perl_version
module_version
header (arrayref or undef)
rows (data rows processed; header excluded)
ragged (counts + capped examples)
columns (arrayref of per-column stats)
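A consumer may want to confirm that a report carries the keys listed above before relying on it. The following is a minimal sketch; missing_report_keys is a hypothetical helper, and the key list only covers the names documented here, so real reports may contain more.

```perl
use strict;
use warnings;

# Top-level keys named in this documentation.
my @REPORT_KEYS = qw(
    report_version generated_at perl_version module_version
    header rows ragged columns
);

# Hypothetical helper: returns the documented keys absent from a report.
# Uses exists rather than defined because header may legitimately be undef.
sub missing_report_keys {
    my ($report) = @_;
    return grep { !exists $report->{$_} } @REPORT_KEYS;
}

# Example with a deliberately incomplete stand-in hashref.
my @missing = missing_report_keys({ report_version => 1, header => undef });
print "missing: @missing\n";
```

In practice the argument would be the hashref returned by profile_file().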
AUTHOR
Sergio de Sousa
Issues: https://github.com/sergio-desousa/Flat-Profile/issues
LICENSE
Licensed under the same terms as Perl 5 (dual Artistic/GPL).