NAME
MCE::Grep - Parallel grep model similar to the native grep function
VERSION
This document describes MCE::Grep version 1.901
SYNOPSIS
    ## Exports mce_grep, mce_grep_f, and mce_grep_s
    use MCE::Grep;

    ## Array or array_ref
    my @a = mce_grep { $_ % 5 == 0 } 1..10000;
    my @b = mce_grep { $_ % 5 == 0 } \@list;

    ## Important; pass an array_ref for deeply nested input data
    my @c = mce_grep { $_->[1] % 2 == 0 } [ [ 0, 1 ], [ 0, 2 ], ... ];
    my @d = mce_grep { $_->[1] % 2 == 0 } \@deeply_list;

    ## File path, glob ref, IO::All::{ File, Pipe, STDIO } obj, or scalar ref
    ## Workers read the file directly without involving the manager process
    my @e = mce_grep_f { /pattern/ } "/path/to/file";  # efficient

    ## Involves the manager process, therefore slower
    my @f = mce_grep_f { /pattern/ } $file_handle;
    my @g = mce_grep_f { /pattern/ } $io;
    my @h = mce_grep_f { /pattern/ } \$scalar;

    ## Sequence of numbers (begin, end [, step, format])
    my @i = mce_grep_s { $_ % 3 == 0 } 1, 10000, 5;
    my @j = mce_grep_s { $_ % 3 == 0 } [ 1, 10000, 5 ];

    my @k = mce_grep_s { $_ % 3 == 0 } {
        begin => 1, end => 10000, step => 5, format => undef
    };
DESCRIPTION
This module provides a parallel grep implementation via Many-Core Engine. MCE incurs a small overhead from passing data between processes, so a fast code block will run faster natively. However, the relative overhead will likely diminish as the complexity of the code block increases.
    my @m1 =     grep { $_ % 5 == 0 } 1..1000000;          ## 0.065 secs
    my @m2 = mce_grep { $_ % 5 == 0 } 1..1000000;          ## 0.194 secs
Chunking, enabled by default, greatly reduces the overhead behind the scenes. The time for mce_grep below also includes the time for data exchanges between the manager and worker processes. The benefit of parallelization grows as the code block consumes more CPU time.
    my @m1 =     grep { /[2357][1468][9]/ } 1..1000000;    ## 0.353 secs
    my @m2 = mce_grep { /[2357][1468][9]/ } 1..1000000;    ## 0.218 secs
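The effect of chunking can be illustrated without MCE. The following is a minimal sketch in plain Perl, not MCE's implementation: chunked_grep is a hypothetical helper that hands the code block a whole chunk at a time, so the per-dispatch cost is paid once per chunk rather than once per element. The chunk size of 500 is an arbitrary choice for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MCE: split the input into fixed-size
# chunks and grep each chunk with a single call, as one worker would.
sub chunked_grep {
    my ($code, $chunk_size, @input) = @_;
    my @result;

    while (my @chunk = splice(@input, 0, $chunk_size)) {
        # One dispatch per chunk instead of one per element.
        # The code block sees each element in $_, as with native grep.
        push @result, grep { $code->() } @chunk;
    }

    return @result;
}

my @native  = grep { /[2357][1468][9]/ } 1..10000;
my @chunked = chunked_grep(sub { /[2357][1468][9]/ }, 500, 1..10000);

print scalar(@chunked), " matches\n";
```

Processing the chunks in order keeps the result identical to native grep; a parallel implementation additionally has to reassemble the chunks in their original order.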
Even faster is mce_grep_s; useful when input data is a range of numbers. Workers generate sequences mathematically among themselves without any interaction from the manager process. Two arguments are required for mce_grep_s (begin, end). Step defaults to 1 if begin is smaller than end, otherwise -1.
    my @m3 = mce_grep_s { /[2357][1468][9]/ } 1, 1000000;  ## 0.165 secs
Although this document is about MCE::Grep, the MCE::Stream module can write results immediately without waiting for all chunks to complete. This is made possible by passing the reference to an array (in this case @m4 and @m5).
    use MCE::Stream default_mode => 'grep';

    my @m4; mce_stream \@m4, sub { /[2357][1468][9]/ }, 1..1000000;

       ## Completed in 0.203 secs. This is amazing considering the
       ## overhead for passing data between the manager and workers.

    my @m5; mce_stream_s \@m5, sub { /[2357][1468][9]/ }, 1, 1000000;

       ## Completed in 0.120 secs. Like with mce_grep_s, specifying a
       ## sequence specification turns out to be faster due to lesser
       ## overhead for the manager process.
A common scenario is grepping for pattern(s) inside a massive log file. Notice how parallelism increases as complexity increases for the pattern. Testing was done against a 300 MB file containing 250k lines.
    use MCE::Grep;

    my @m; open my $LOG, "<", "/path/to/log/file" or die "$!\n";

    @m = grep { /pattern/ } <$LOG>;                      ##  0.756 secs
    @m = grep { /foobar|[2357][1468][9]/ } <$LOG>;       ## 24.681 secs

    ## Parallelism with mce_grep. This involves the manager process
    ## due to processing a file handle.

    @m = mce_grep { /pattern/ } <$LOG>;                  ##  0.997 secs
    @m = mce_grep { /foobar|[2357][1468][9]/ } <$LOG>;   ##  7.439 secs

    ## Even faster with mce_grep_f. Workers access the file directly
    ## with zero interaction from the manager process.

    my $LOG = "/path/to/file";

    @m = mce_grep_f { /pattern/ } $LOG;                  ##  0.112 secs
    @m = mce_grep_f { /foobar|[2357][1468][9]/ } $LOG;   ##  6.840 secs
PARSING HUGE FILES
Because MCE::Grep does not know the pattern inside the code block, it cannot quickly determine whether a chunk contains a match before processing it line by line. Use the following snippet as a template to achieve better performance. Also, take a look at examples/egrep.pl, included with the distribution.
    use MCE::Loop;

    MCE::Loop->init(
        max_workers => 8, use_slurpio => 1
    );

    my $pattern  = 'karl';
    my $hugefile = 'very_huge.file';

    my @result = mce_loop_f {
        my ($mce, $slurp_ref, $chunk_id) = @_;

        ## Quickly determine if a match is found.
        ## Process slurped chunk only if true.

        if ($$slurp_ref =~ /$pattern/m) {
            my @matches;

            ## The following is fast on Unix. Performance degrades
            ## drastically on Windows beyond 4 workers.

            open my $MEM_FH, '<', $slurp_ref;
            binmode $MEM_FH, ':raw';
            while (<$MEM_FH>) { push @matches, $_ if (/$pattern/); }
            close   $MEM_FH;

            ## Therefore, use the following construct on Windows
            ## instead. Keep only one of the two loops.

            while ( $$slurp_ref =~ /([^\n]+\n)/mg ) {
                my $line = $1; # save $1 to not lose the value
                push @matches, $line if ($line =~ /$pattern/);
            }

            ## Gather matched lines.

            MCE->gather(@matches);
        }

    } $hugefile;

    print join('', @result);
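The two-stage idea in the template can be tried on an in-memory string, without MCE or a large file. This is a minimal sketch under stated assumptions: grep_chunk is a hypothetical helper name, and the sample strings stand in for slurped chunks. Like the Windows construct in the template, the line regex skips empty lines and a final line that lacks a trailing newline.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MCE: return the lines of a slurped
# chunk that match $pattern, skipping the per-line scan entirely when
# the chunk as a whole contains no match.
sub grep_chunk {
    my ($slurp_ref, $pattern) = @_;

    # Stage 1: a single regex pass over the entire chunk.
    return () unless $$slurp_ref =~ /$pattern/m;

    # Stage 2: walk the chunk line by line and test each line.
    my @matches;
    while ($$slurp_ref =~ /([^\n]+\n)/g) {
        my $line = $1;    # copy $1 before the next match overwrites it
        push @matches, $line if $line =~ /$pattern/;
    }
    return @matches;
}

my $hit  = "alpha\nkarl was here\nbeta\nask karl\n";
my $miss = "alpha\nbeta\ngamma\n";

my @found = grep_chunk(\$hit,  'karl');
my @none  = grep_chunk(\$miss, 'karl');

print @found;
```

The saving comes from stage 1: for chunks with no match, one regex pass replaces thousands of per-line matches.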
OVERRIDING DEFAULTS
The following lists the options that may be overridden when loading the module.
    use Sereal qw( encode_sereal decode_sereal );

    use MCE::Grep
        max_workers => 4,                # Default 'auto'
        chunk_size => 100,               # Default 'auto'
        tmp_dir => "/path/to/app/tmp",   # $MCE::Signal::tmp_dir
        freeze => \&encode_sereal,       # \&Storable::freeze
        thaw => \&decode_sereal,         # \&Storable::thaw
        init_relay => 0,                 # Default undef; MCE 1.882+
        use_threads => 0,                # Default undef; MCE 1.882+
    ;
From MCE 1.8 onwards, Sereal 3.015+ is loaded automatically if available. Specify Sereal => 0 to use Storable instead.
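As a configuration fragment, the Sereal option is passed on the use line like the other options above. It requires the MCE distribution to be installed.

```perl
# Fall back to Storable for serialization even if Sereal is installed.
use MCE::Grep Sereal => 0;
```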
CUSTOMIZING MCE
The init function accepts a hash of MCE options. The gather option, if specified, is ignored due to being used internally by the module.
When init is called in scalar context (API available since 1.897), MCE::Grep->finish is called automatically upon leaving the scope or program.
    use MCE::Grep;

    my $guard = MCE::Grep->init(
        chunk_size => 1, max_workers => 4,

        user_begin => sub {
            print "## ", MCE->wid, " started\n";
        },

        user_end => sub {
            print "## ", MCE->wid, " completed\n";
        }
    );

    my @a = mce_grep { $_ % 5 == 0 } 1..100;

    print "\n", "@a", "\n";

-- Output
## 2 started
## 3 started
## 1 started
## 4 started
## 3 completed
## 4 completed
## 1 completed
## 2 completed
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
API DOCUMENTATION
Input data may be defined using a list or an array reference. Unlike MCE::Loop, Flow, and Step, specifying a hash reference as input data isn't allowed.
    ## Array or array_ref
    my @a = mce_grep { /[2357]/ } 1..1000;
    my @b = mce_grep { /[2357]/ } \@list;

    ## Important; pass an array_ref for deeply nested input data
    my @c = mce_grep { $_->[1] =~ /[2357]/ } [ [ 0, 1 ], [ 0, 2 ], ... ];
    my @d = mce_grep { $_->[1] =~ /[2357]/ } \@deeply_list;

    ## Not supported
    my @z = mce_grep { ... } \%hash;
For mce_grep_f, the fastest input is a file path (/path/to/file). Workers communicate the next offset position among themselves with zero interaction by the manager process.
IO::All { File, Pipe, STDIO } is supported since MCE 1.845.
    my @c = mce_grep_f { /pattern/ } "/path/to/file";  # faster
    my @d = mce_grep_f { /pattern/ } $file_handle;
    my @e = mce_grep_f { /pattern/ } $io;              # IO::All
    my @f = mce_grep_f { /pattern/ } \$scalar;
- MCE::Grep->run_seq ( sub { code }, $beg, $end [, $step, $fmt ] )
- mce_grep_s { code } $beg, $end [, $step, $fmt ]
Sequence may be defined as a list, an array reference, or a hash reference. The functions require both begin and end values to run. Step and format are optional. The format is passed to sprintf (% may be omitted below).
    my ($beg, $end, $step, $fmt) = (10, 20, 0.1, "%4.1f");

    my @f = mce_grep_s { /[1234]\.[5678]/ } $beg, $end, $step, $fmt;
    my @g = mce_grep_s { /[1234]\.[5678]/ } [ $beg, $end, $step, $fmt ];

    my @h = mce_grep_s { /[1234]\.[5678]/ } {
        begin => $beg, end => $end, step => $step, format => $fmt
    };
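To see which values the code block receives for a given sequence specification, the expansion can be sketched in plain Perl. This is an illustration only, not how MCE generates sequences internally: expand_sequence is a hypothetical helper, and computing each value from its index (rather than by repeated addition) is an assumption made here to avoid accumulating floating-point error.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MCE: expand (begin, end [, step, format])
# into a list, formatting each value with sprintf when a format is given.
sub expand_sequence {
    my ($beg, $end, $step, $fmt) = @_;

    # Step defaults to 1 if begin is smaller than end, otherwise -1.
    $step = ($beg < $end) ? 1 : -1 unless defined $step;
    die "step must be non-zero\n" if $step == 0;

    # The leading % may be omitted from the format.
    $fmt = "%$fmt" if defined $fmt && $fmt !~ /^%/;

    # Number of items. The small epsilon guards against a floating-point
    # quotient landing just below a whole number (an assumption of this sketch).
    my $count = int(($end - $beg) / $step + 1e-9) + 1;

    # Derive each value from its index to avoid accumulated rounding error.
    return map {
        my $n = $beg + $_ * $step;
        defined $fmt ? sprintf($fmt, $n) : $n;
    } 0 .. $count - 1;
}

my @f    = grep { /[1234]\.[5678]/ } expand_sequence(10, 20, 0.1, "4.1f");
my @down = expand_sequence(5, 1);

print "@f\n";
print "@down\n";
```

Because the format is applied before the code block runs, the regex matches against strings such as "11.5" rather than raw floating-point numbers.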
An iterator reference may be specified for input_data. Iterators are described under section "SYNTAX for INPUT_DATA" at MCE::Core.
    my @a = mce_grep { $_ % 3 == 0 } make_iterator(10, 30, 2);
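The make_iterator function is not defined in this document. The following is one possible definition, written as a plain-Perl sketch: a closure that returns the next value or values and an empty list once the sequence is exhausted. That the caller passes a chunk size as the first argument is an assumption here; MCE::Core describes the exact calling convention. The sketch handles positive steps only.

```perl
use strict;
use warnings;

# Hypothetical definition of make_iterator: returns a closure yielding
# the next value(s) of the sequence, or an empty list when exhausted.
sub make_iterator {
    my ($first, $last, $step) = @_;
    my $next = $first;

    return sub {
        # Assumption: the caller may pass a chunk size; default to 1.
        my $chunk_size = $_[0] || 1;
        my @items;

        while (@items < $chunk_size && $next <= $last) {
            push @items, $next;
            $next += $step;
        }
        return @items;
    };
}

# Drain the iterator by hand, four items at a time.
my $iter = make_iterator(10, 30, 2);
my @seen;
while (my @chunk = $iter->(4)) {
    push @seen, @chunk;
}

my @a = grep { $_ % 3 == 0 } @seen;
print "@a\n";
```

The closure keeps its position in $next between calls, which is what lets a consumer pull input lazily instead of receiving the whole list up front.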
MANUAL SHUTDOWN
Workers remain persistent as much as possible after running. Shutdown occurs automatically when the script terminates. Call finish when workers are no longer needed.
    use MCE::Grep;

    MCE::Grep->init(
        chunk_size => 20, max_workers => 'auto'
    );

    my @a = mce_grep { ... } 1..100;

    MCE::Grep->finish;
AUTHOR
Mario E. Roy, <marioeroy AT gmail DOT com>