NAME
dbmapreduce - reduce all input rows with the same key
SYNOPSIS
dbmapreduce [-dMS] [-k KeyField] [-f CodeFile] [-C filter-code] [ReduceCommand [ReduceArguments...]]
DESCRIPTION
Group input data by KeyField, then reduce each group. The reduce function can be an external program given by ReduceCommand and ReduceArguments, or an Perl subroutine given by CODE.
This program thus implements Google-style map/reduce, but executed sequentially. With the -M
option, the reducer is given multiple groups (as with Google), but not in any guaranteed order (while Google guarantees they arive in lexically sorted order). However without it, a stronger invariant is provided: each reducer is given exactly one group.
By default the KeyField is the first field in the row. Unlike Hadoop streaming, the -k KeyField option can explicitly name where the key is in any column of each input row.
The KeyField used for each Reduce is added to the beginning of each row of reduce output.
The reduce function can do anything, emitting one or more output records. However, it must have the same output field separator as the input data. (In some cases this case may not occur, for example, input data with -FS and a reduce function of dbstats. This bug needs to be fixed in the future.)
Reduce functions default to be shell commands. However, with -C
, one can use arbitrary Perl code (see the -C
option below for details). the -f
option is useful to specify complex Perl code somewhere other than the command line.
Finally, as a special case, if there are no rows of input, the reducer will be invoked once with the empty value (if it's an external reducer) or with undef (if it's a subroutine). It is expected to generate the output header, and it may generate no data rows itself, or a null data row of its choosing.
Assumptions and requirements:
By default, data can be provided in arbitrary order and the program consumes O(number of unique tags) memory, and O(size of data) disk space.
With the -S option, data must arrive group by tags (not necessarily sorted), and the program consumes O(number of tags) memory and no disk space. The program will check and abort if this precondition is not met.
With two -S's, program consumes O(1) memory, but doesn't verify that the data-arrival precondition is met.
Although painful internally, the field seperators of the input and the output can be different. (Early versions of this tool prohibited such variation.)
OPTIONS
- -k or --key KeyField
-
specify which column is the key for grouping (default: the first column)
- -S or --pre-sorted
-
Assume data is already grouped by tag. Provided twice, it removes the validiation of this assertion.
- -M or --multiple-ok
-
Assume the ReduceCommand can handle multiple grouped keys, and the ReduceCommand is responsible for outputting the with each output row. (By default, a separate ReduceCommand is run for each key, and dbmapreduce adds the key to each output row.)
- -K or --pass-current-key
-
Pass the current key as an argument to the external, non-map-aware ReduceCommand. This is only done optionally since some external commands do not expect an extra argument. (Internal, non-map-aware Perl reducers are always given the current key as an argument.)
- -C FILTER-CODE or --filter-code=FILTER-CODE
-
Provide FILTER-CODE, Perl code that generates and returns a Fsdb::Filter object that implements the reduce function. The provided code should be an anonymous sub that creates a Fsdb Filter that implements the reduce object.
The reduce object will then be called with --input and --output paramters that hook it into a the reduce with queues.
One sample fragment that works is just:
dbcolstats(qw(--nolog duration))
So this command:
cat DATA/stats.fsdb | \ dbmapreduce -k experiment -C 'dbcolstats(qw(--nolog duration))'
is the same as the example
cat DATA/stats.fsdb | \ dbmapreduce -k experiment dbcolstats duration
excpet that with
-C
there is no forking and so things run faster.If
dbmapreduce
is invoked from within Perl, then one can use a code SUB as well: dbmapreduce(-k => 'experiment', -C => sub { dbcolstats(qw(--nolong duration)) });The reduce object must consume all input as a Fsdb stream, and close the output Fsdb stream. (If this assumption is not met the map/reduce will be aborted.)
For non-map-reduce-aware filters, when the filter-generator code runs,
@_[0]
will be the current key. - -f CODE-FILE or --code-file=CODE-FILE
-
Includes CODE-FILE in the program. This option is useful for more complicated perl reducer functions.
Thus, if reducer.pl has the code.
sub make_reducer { my($current_key) = @_; dbcolstats(qw(--nolog duration)); }
Then the command
cat DATA/stats.fsdb | dbmapreduce -k experiment -f reducer.pl -C make_reducer
does the same thing as the example.
- -w or --warnings
-
Enable warnings in user supplied code. Warnings are issued if an external reducer fails to consume all input. (Default to include warnings.)
This module also supports the standard fsdb options:
- -d
-
Enable debugging output.
- -i or --input InputSource
-
Read from InputSource, typically a file name, or
-
for standard input, or (if in Perl) a IO::Handle, Fsdb::IO or Fsdb::BoundedQueue objects. - -o or --output OutputDestination
-
Write to OutputDestination, typically a file name, or
-
for standard output, or (if in Perl) a IO::Handle, Fsdb::IO or Fsdb::BoundedQueue objects. - --autorun or --noautorun
-
By default, programs process automatically, but Fsdb::Filter objects in Perl do not run until you invoke the run() method. The
--(no)autorun
option controls that behavior within Perl. - --help
-
Show help.
- --man
-
Show full manual.
SAMPLE USAGE
Input:
#fsdb experiment duration
ufs_mab_sys 37.2
ufs_mab_sys 37.3
ufs_rcp_real 264.5
ufs_rcp_real 277.9
Command:
cat DATA/stats.fsdb | dbmapreduce -k experiment dbcolstats duration
Output:
#fsdb experiment mean stddev pct_rsd conf_range conf_low conf_high conf_pct sum sum_squared min max n
ufs_mab_sys 37.25 0.070711 0.18983 0.6353 36.615 37.885 0.95 74.5 2775.1 37.2 37.3 2
ufs_rcp_real 271.2 9.4752 3.4938 85.13 186.07 356.33 0.95 542.4 1.4719e+05 264.5 277.9 2
# | dbmapreduce -k experiment dbstats duration
SEE ALSO
Fsdb. dbmultistats dbrowsplituniq
CLASS FUNCTIONS
A few notes about the internal structure: dbmapreduce uses two to four threads to run. An optional thread $self-
{_in_thread}> sorts the input. The main thread then reads input and groups input by key. Each group is passed to a secondary thread $self-
{_reducer_thread}> that invokes the reducer on each group and does any ouptut. If the reducer is not map-aware, then we create a final postprocessor thread that adds the key back to the output. Either the reducer or the postprocessor thread do output.
new
$filter = new Fsdb::Filter::dbmapreduce(@arguments);
Create a new dbmapreduce object, taking command-line arguments.
set_defaults
$filter->set_defaults();
Internal: set up defaults.
parse_options
$filter->parse_options(@ARGV);
Internal: parse command-line arguments.
setup
$filter->setup();
Internal: setup, parse headers.
_setup_reducer
_setup_reducer
(internal) One thread that runs the reducer thread and produces output. _reducer_queue
is sends the new key, then a Fsdb stream, then EOF (undef) for each group. We setup the output, suppress all but the first header, and add in the keys if necessary.
_multikey_aware_reducer
_multikey_aware_reducer
Handle a map-aware reduce process. We assume the caller suppresses signaling about key transition, so we just run the user's reducer in our thread.
_multikey_ignorant_reducer
_multikey_ignorant_reducer
Handle a map-ignorant reduce process. We handle multiple keys. We create a postprocessor thread to add the key back in to the output and do our finish(). We then become a middle-process, just handling key transitions and invoking new reducers.
_multikey_ignorant_postprocessor
_multikey_ignorant_postprocessor
Post-process a map-ignorant reduce process. Add back in our key to each output. The one scary bit is we reuse the reducer to postprocessor queue past our EOF signal (undef).
_open_new_key
_open_new_key
(internal)
_close_old_key
_close_old_key
Internal, finish a tag.
_key_to_string
$self->_key_to_string($key)
Convert a key (maybe undef) to a string for status messages.
run
$filter->run();
Internal: run over each rows.
finish
$filter->finish();
Internal: write trailer.
AUTHOR and COPYRIGHT
Copyright (C) 1991-2011 by John Heidemann <johnh@isi.edu>
This program is distributed under terms of the GNU general public license, version 2. See the file COPYING with the distribution for details.