NAME

Flat::Schema - Deterministic schema contracts for flat files

WHY THIS EXISTS (IN ONE PARAGRAPH)

In real ETL work, yesterday's CSV becomes today's "contract" whether you meant it or not. Flat::Schema makes that contract explicit: generate a deterministic schema from what you observed, record ambiguity as issues, and give the next step (validation) something stable to enforce.

SYNOPSIS

Basic usage:

use Flat::Profile;
use Flat::Schema;

my $profile = Flat::Profile->profile_file(
    file => "data.csv",
);

my $schema = Flat::Schema->from_profile(
    profile => $profile,
);

print Flat::Schema->new()->to_json(schema => $schema);

With overrides:

my $schema = Flat::Schema->from_profile(
    profile   => $profile,
    overrides => [
        { column_index => 0, set => { type => 'integer', nullable => 0 } },
        { column_index => 3, set => { name => 'created_at', type => 'datetime' } },
    ],
);

DESCRIPTION

Flat::Schema consumes reports produced by Flat::Profile and generates a deterministic, inspectable schema contract describing what tabular data should look like.

It is the second module in the Flat::* series:

The schema is a canonical Perl data structure that:

REAL-WORLD USE CASES (THE STUFF YOU ACTUALLY DO)

1) Vendor “helpfully” changes a column (integer → text)

You ingest daily files and one day a numeric column starts containing values like N/A, unknown, or ERR-17. Your pipeline should not silently coerce this into zero or drop rows.

Workflow:

  1. Profile last-known-good
  2. Generate schema (your contract)
  3. Validate future drops against the schema

A typical override when you decide "we accept this as string now":

my $schema = Flat::Schema->from_profile(
    profile   => $profile,
    overrides => [
        { column_index => 7, set => { type => 'string' } },
    ],
);

Flat::Schema will record that the override conflicts with what it inferred, and that record is useful during incident review.

2) Columns that are “nullable in real life” even if today they are not

Data often arrives complete in a sample window and then starts missing values in production. In v1, nullability is intentionally simple:

nullable = true iff null_count > 0

If you know a field is nullable even if today it isn't, force it:

overrides => [
    { column_index => 2, set => { nullable => 1 } },  # allow missing later
],

3) Timestamp confusion: date vs datetime vs “whatever the exporter did”

When temporal evidence mixes, Flat::Schema chooses predictability over cleverness.

This prevents “maybe parseable” data from becoming quietly wrong later.

4) “Header row roulette” and naming cleanup

You may get headers like Customer ID, customer_id, CUSTID, or no header at all. Schema stores both:

If you need normalized naming for downstream systems:

overrides => [
    { column_index => 0, set => { name => 'customer_id' } },
],

5) Reproducible artifacts for tickets, audits, and “what changed?”

Sometimes the most important feature is being able to paste the schema into a ticket, diff it in Git, or keep it as a build artifact.

Flat::Schema’s serializers are deterministic by design. If the schema changes, it is because the inputs changed (profile or overrides), not because hash order shifted.

SCHEMA STRUCTURE (AT A GLANCE)

A generated schema contains:

{
    schema_version => 1,
    generator      => { name => "Flat::Schema", version => "0.01" },
    profile        => { ... },
    columns        => [ ... ],
    issues         => [ ... ],
}

Each column contains:

{
    index      => 0,
    name       => "id",
    type       => "integer",
    nullable   => 0,
    length     => { min => 1, max => 12 },  # optional
    overrides  => { ... },                  # optional
    provenance => {
        basis         => "profile",
        rows_observed => 1000,
        null_count    => 0,
        null_rate     => { num => 0, den => 1000 },
        overrides     => [ "type", "nullable" ],  # optional
    },
}

TYPE INFERENCE (v1)

Type inference is based solely on evidence provided by Flat::Profile.

Scalar widening order:

boolean → integer → number → string

Temporal handling:

date + datetime → datetime
temporal + non-temporal → string (with warning)

Mixed evidence is widened and recorded as an issue.

NULLABILITY INFERENCE (v1)

Rules:

USER OVERRIDES (v1)

Overrides are applied after inference.

Supported fields:

Overrides:

Overrides referencing unknown columns cause a hard error.

DETERMINISTIC SERIALIZATION

Flat::Schema includes built-in deterministic JSON and YAML serializers.

Same input profile + same overrides → identical JSON/YAML.

This is required for reproducible pipelines and meaningful diffs.

STATUS

Implemented in v1:

Future releases may expand the type lattice, constraint modeling, and schema evolution.

AUTHOR

Sergio de Sousa sergio@serso.com

LICENSE

This library is free software; you may redistribute it and/or modify it under the same terms as Perl itself.