NAME
App::CSVUtils - CLI utilities related to CSV
VERSION
This document describes version 0.999_1 of App::CSVUtils (from Perl distribution App-CSVUtils), released on 2022-12-21.
DESCRIPTION
This distribution contains the following CLI utilities:
FUNCTIONS
gen_csv_util
Usage:
gen_csv_util(%args) -> bool
Generate a CSV utility.
This routine is used to generate a CSV utility in the form of a Rinci function (code and metadata). You can then produce a CLI from the Rinci function simply using Perinci::CmdLine::Gen or, if you use Dist::Zilla, Dist::Zilla::Plugin::GenPericmdScript or, if on the command-line, gen-pericmd-script.
To create a CSV utility, you specify a name (e.g. csv_dump; must be a valid unqualified Perl identifier/function name) and optionally summary, description, and other metadata like links or even add_meta_props. Then you specify one or more of on_* or before_* or after_* arguments to supply handlers (coderefs) for your CSV utility at various hook points.
THE HOOKS
All code for hooks should accept a single argument r. r is a stash (hashref) of various data, the keys of which will depend on which hook point being called. You can also add more keys to store data or for flow control (see hook documentation below for more details).
The order of the hooks, in processing chronological order:
on_begin
Called when utility begins, before reading CSV. You can use this hook e.g. to process arguments, set output filenames (if you allow custom output filenames).
before_read_input
Called before opening any input CSV file. This hook is still called even if your utility sets
reads_csvto false.At this point, the
input_filenamesstash key (as well as other keys likeinput_filename,input_filenum, etc) has not been set. You can use this hook e.g. to set a custominput_filenames.before_open_input_files
Called before an input CSV file is about to be opened, including for stdin (
-). You can use this hook e.g. to check/preprocess input file. Flow control is available by setting$r->{wants_skip_files}to skip reading all the input file and go directly to theafter_read_inputhook.before_open_input_file
Called before an input CSV file is about to be opened, including for stdin (
-). For the first file, called afterbefore_open_input_filehook. You can use this hook e.g. to check/preprocess input file. Flow control is available by setting$r->{wants_skip_file}to skip reading a single input file and go to the next file, or$r->{wants_skip_files}to skip reading the rest of the files and go directly to theafter_read_inputhook.on_input_header_row
Called when receiving header row. Will be called for every input file, and called even when user specify
--no-input-header, in which case the header row will be the generated["field1", "field2", ...]. You can use this hook e.g. to add/remove/rearrange fields.Note that for stdin input (
-), if reading is repeated via flow control, then this hook will return the previously saved header row of stdin the first time it is read. This is done because we can't seek stdin back to the beginning.on_input_data_row
Called when receiving each data row. You can use this hook e.g. to modify the row or print output (for line-by-line transformation or filtering).
Note that for stdin input (
-), if reading is repeated via flow control, then this reading will continue from the last time it is last read since we can't seek stdin back to the beginning. That is, if the first time the stdin is exhausted, then the subsequent times, reading will reach EOF immediately.after_close_input_file
Called after each input file is closed, including for stdin (
-) (although for stdin, the handle is not actually closed). Flow control is possible by setting$r->{wants_repeat_file}to repeat reading the file or$r->{wants_skip_files}to skip reading the rest of the files and go straight to theafter_close_input_fileshook.after_close_input_files
Called after the last input file is closed, after the last
after_close_input_filehook, including for stdin (-) (although for stdin, the handle is not actually closed). Flow control is possible by setting$r->{wants_repeat_files}to repeat reading all the input files.after_read_input
Called after the last row of the last CSV file is read and the last file is closed. This hook is still called, if you set
reads_csvoption to false. At this point the stash keys related to CSV reading have all been cleared, includinginput_filenames,input_filename,input_fh, etc.You can use this hook e.g. to print output if you buffer the output.
on_end
Called when utility is about to exit. You can use this hook e.g. to return the final result.
THE STASH
The common keys that r will contain:
gen_args, hash. The arguments used to generate the CSV utility.util_args, hash. The arguments that your CSV utility accepts. Parsed from command-line arguments (or configuration files, or environment variables).name, str. The name of the CSV utility. Which can also be retrieved viagen_args.code_print, coderef. Routine provided for you to print something. Accepts a string. Takes care of opening the output files for you.code_print_row, coderef. Routine provided for you to print a data row. You pass the row (either arrayref or hashref). Takes care of opening the output files for you, as well as printing header row the first time, if needed.code_print_header_row, coderef. Routine provided for you to print header row. You don't need to pass any arguments. Will only print the header row once per output file if output header is enabled, even if called multiple times.
If you are accepting CSV data, the following keys will also be available (in on_input_header_row and on_input_data_row hooks):
input_parser, a Text::CSV_XS instance for input parsing.input_filenames, array of str.input_filename, str. The name of the current input file being read (-if reading from stdin).input_filenum, uint. The number of the current input file, 1 being the first file, 2 for the second, and so on.input_fh, the handle to the current file being read.input_rownum, uint. The number of rows that have been read (reset after each input file).input_data_rownum, uint. The number of data rows that have been read (reset after each input file). This will be equal toinput_rownumless 1 if input file has header.input_row, aos (array of str). The current input CSV row as an arrayref.input_row_as_hashref, hos (hash of str). The current input CSV row as a hashref, with field names as hash keys and field values as hash values. This will only be calculated if utility wants it. Utility can express so by setting$r->{wants_input_row_as_hashref}to true, e.g. in theon_beginhook.input_header_row_count, uint. Contains the number of actual header rows that have been read. If CLI user specifies--no-input-header, this will stay at zero. Will be reset for each CSV file.input_data_row_count, int. Contains the number of actual data rows that have read. Will be reset for each CSV file.
If you are outputting CSV, the following keys will be available:
output_emitter, a Text::CSV_XS instance for output.output_filenames, array of str.output_filename, str, name of current output file.output_filenum, uint, the number of the current output file, 1 being the first file, 2 for the second, and so on.output_fh, handle to the current output file.output_rownum, uint. The number of rows that have been outputted (reset after each output file).output_data_rownum, uint. The number of data rows that have been outputted (reset after each output file). This will be equal toinput_rownumless 1 if input file has header.
For other hook-specific keys, see the documentation for associated hook point.
ACCEPTING ADDITIONAL COMMAND-LINE OPTIONS/ARGUMENTS
As mentioned above, you will get additional command-line options/arguments in $r->{util_args} hashref. Some options/arguments are already added by gen_csv_util, e.g. input_filename or input_filenames along with input_sep_char, etc (when your utility declares reads_csv), output_filename or output_filenames along with overwrite, output_sep_char, etc (when your utility declares writes_csv).
If you want to accept additional arguments/options, you specify them in add_args (hashref, with key being Each option/argument has to be specified first via add_args (as hashref, with key being argument name and value the argument specification as defined in Rinci::function)). Some argument specifications have been defined in App::CSVUtils and can be used. See existing utilities for examples.
READING CSV DATA
To read CSV data, normally your utility would provide handler for the on_input_data_row hook and sometimes additionally on_input_header_row.
OUTPUTTING STRING OR RETURNING RESULT
To output string, usually you call the provided routine $r->{code_print}. This routine will open the output files for you.
You can also return enveloped result directly by setting $r->{result}.
OUTPUTTING CSV DATA
To output CSV data, usually you call the provided routine $r->{code_print_row}. This routine accepts a row (arrayref or hashref). This routine will open the output files for you when needed, as well as print header row automatically.
You can also buffer rows from input to e.g. $r->{output_rows}, then call $r->{code_print_row} repeatedly in the after_read_input hook to print all the rows.
READING MULTIPLE CSV FILES
To read multiple CSV files, you first specify reads_multiple_csv. Then, you can supply handler for on_input_header_row and on_input_data_row as usual. If you want to do something before/after each input file, you can also supply handler for before_open_input_file or after_close_input_file.
WRITING TO MULTIPLE CSV FILES
Similarly, to write to many CSv files, you first specify writes_multiple_csv. Then, you can supply handler for on_input_header_row and on_input_data_row as usual. To switch to the next file, set $r->{wants_switch_to_next_output_file} to true, in which case the next call to $r->{code_print_row} will close the current file and open the next file.
CHANGING THE OUTPUT FIELDS
When calling $r->{code_print_row}, you can output whatever fields you want. By convention, you can set $r->{output_fields} and $r->{output_fields_idx} to let other handlers know about the output fields. For example, see the implementation of csv-concat.
This function is not exported by default, but exportable.
Arguments ('*' denotes required arguments):
add_args => hash
(No description)
add_args_rels => hash
(No description)
add_meta_props => hash
Add additional Rinci function metadata properties.
after_close_input_file => code
(No description)
after_close_input_files => code
(No description)
after_read_input => code
(No description)
before_open_input_file => code
(No description)
before_open_input_files => code
(No description)
before_read_input => code
(No description)
description => str
(No description)
examples => array
(No description)
links => array[hash]
(No description)
name* => perl::identifier::unqualified_ascii
(No description)
on_begin => code
(No description)
on_end => code
(No description)
on_input_data_row => code
(No description)
on_input_header_row => code
(No description)
reads_csv => bool (default: 1)
Whether utility reads CSV data.
reads_multiple_csv => bool
Whether utility accepts CSV data.
Setting this option to true will implicitly set the
reads_csvoption to true, obviously.summary => str
(No description)
writes_csv => bool (default: 1)
Whether utility writes CSV data.
writes_multiple_csv => bool
Whether utility outputs CSV data.
Setting this option to true will implicitly set the
writes_csvoption to true, obviously.
Return value: (bool)
compile_eval_code
Usage:
$coderef = compile_eval_code($str, $label);
Compile string code $str to coderef in 'main' package, without use strict or use warnings. Die on compile error.
FAQ
My CSV does not have a header?
Use the --no-header option. Fields will be named field1, field2, and so on.
My data is TSV, not CSV?
Use the --tsv option.
I have a big CSV and the utilities are too slow or eat too much RAM!
These utilities are not (yet) optimized, patches welcome. If your CSV is very big, perhaps a C-based solution is what you need.
HOMEPAGE
Please visit the project's homepage at https://metacpan.org/release/App-CSVUtils.
SOURCE
Source repository is at https://github.com/perlancar/perl-App-CSVUtils.
SEE ALSO
Similar CLI bundles for other format
App::TSVUtils, App::LTSVUtils, App::SerializeUtils.
Other CSV-related utilities
xls2csv and xlsx2csv from Spreadsheet::Read
import-csv-to-sqlite from App::SQLiteUtils
Query CSV with SQL using fsql from App::fsql
Other non-Perl-based CSV utilities
Python
csvkit, https://csvkit.readthedocs.io/en/latest/
AUTHOR
perlancar <perlancar@cpan.org>
CONTRIBUTOR
Adam Hopkins <violapiratejunky@gmail.com>
CONTRIBUTING
To contribute, you can send patches by email/via RT, or send pull requests on GitHub.
Most of the time, you don't need to build the distribution yourself. You can simply modify the code, then test via:
% prove -l
If you want to build the distribution (e.g. to try to install it locally on your system), you can install Dist::Zilla, Dist::Zilla::PluginBundle::Author::PERLANCAR, Pod::Weaver::PluginBundle::Author::PERLANCAR, and sometimes one or two other Dist::Zilla- and/or Pod::Weaver plugins. Any additional steps required beyond that are considered a bug and can be reported to me.
COPYRIGHT AND LICENSE
This software is copyright (c) 2022, 2021, 2020, 2019, 2018, 2017, 2016 by perlancar <perlancar@cpan.org>.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
BUGS
Please report any bugs or feature requests on the bugtracker website https://rt.cpan.org/Public/Dist/Display.html?Name=App-CSVUtils
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.