=head1 NAME
PDL::ParallelCPU - Parallel Processor MultiThreading Support in PDL (Experimental)
=head1 DESCRIPTION
PDL
has
support (currently experimental)
for
splitting up numerical processing
between multiple parallel processor threads (or pthreads) using the I<set_autopthread_targ>
and I<set_autopthread_size> functions.
This can improve processing performance (by greater than 2-4X in most cases)
by taking advantage of multi-core and/or multi-processor machines.
=head1 SYNOPSIS
set_autopthread_targ(4);
set_autopthread_size(5);
$x
= zeroes(5000,5000);
$y
=
$x
+ 5;
$actualPthreads
= get_autopthread_actual();
=head1 Terminology
The
use
of the term I<threading> can be confusing
with
PDL, because it can refer to I<PDL threading>,
as
defined
in the L<PDL::Threading> docs, or to I<processor multi-threading>.
To reduce confusion
with
the existing PDL threading terminology, this document uses
B<pthreading> to refer to I<processor multi-threading>, which is the
use
of multiple processor threads
to
split
up numerical processing into parallel operations.
=head1 Functions that control PDL PThreads
This is a brief listing and description of the PDL pthreading functions, see the L<PDL::Core> docs
for
detailed information.
=over 5
=item set_autopthread_targ
Set the target number of processor-threads (pthreads)
for
multi-threaded processing. Setting auto_pthread_targ
to 0 means that
no
pthreading will occur.
See L<PDL::Core|set_autopthread_targ>
for
details.
=item set_autopthread_size
Set the minimum size (in Meg-elements or 2**20 elements) of the largest PDL involved in a function where auto-pthreading will
be performed. For small PDLs, it probably isn't worth starting multiple pthreads, so this function
is used to define a minimum threshold where auto-pthreading won't be attempted.
See L<PDL::Core|set_autopthread_size>
for
details.
=item get_autopthread_actual
Get the actual number of pthreads executed
for
the
last
pdl processing function.
See L<PDL::get_autopthread_actual>
for
details.
=back
=head1 Global Control of PDL PThreading using Environment Variables
PDL PThreading can be globally turned on, without modifying existing code by setting
environment variables B<PDL_AUTOPTHREAD_TARG> and B<PDL_AUTOPTHREAD_SIZE>
before
running a PDL script.
These environment variables are checked
when
PDL starts up and calls to I<set_autopthread_targ> and
I<set_autopthread_size> functions made
with
the environment variable's
values
.
For example,
if
the environment var B<PDL_AUTOPTHREAD_TARG> is set to 3, and B<PDL_AUTOPTHREAD_SIZE> is
set to 10, then any pdl script will run as
if
the following lines were at the top of the file:
set_autopthread_targ(3);
set_autopthread_size(10);
=head1 How It Works
The auto-pthreading process works by analyzing threaded array dimensions in PDL operations
and splitting up processing based on the thread dimension sizes and desired number of
pthreads (i.e. the pthread target or pthread_targ). The offsets and increments that PDL uses to step
thru the data in memory are modified
for
each
pthread so
each
one sees a different set of data
when
performing processing.
B<Example>
$x
= sequence(20,4,3);
set_autopthread_targ(2);
set_autopthread_size(0);
$c
= maximum(
$x
);
For the above example, the I<maximum> function
has
a signature of C<(a(n); [o]c())>, which means that the first
dimension of
$x
(size 20) is a I<Core> dimension of the I<maximum> function. The other dimensions of
$x
(size 4,3)
are I<threaded> dimensions (i.e. will be threaded-over in the I<maximum> function.
The auto-pthreading algorithm examines the threaded dims of size (4,3) and picks the 4 dimension,
since it is evenly divisible by the autopthread_targ of 2. The processing of the maximum function is then
split
into two pthreads on the size-4 dimension,
with
dim indexes 0,2 processed by one pthread
and dim indexes 1,3 processed by the other pthread.
=head1 Limitations
=head2 Must have POSIX Threads Enabled
Auto-PThreading only works
if
your PDL installation was compiled
with
POSIX threads enabled. This is normally
the case
if
you are running on linux, or other unix variants.
=head2 Non-Threadsafe Code
Not all the libraries that PDL intefaces to are thread-safe, i.e. they aren't written to operate
in a multi-threaded environment without crashing or causing side-effects. Some examples in the PDL
core is the I<fft> function and the I<pnmout> functions.
To operate properly
with
these types of functions, the PPCode flag B<NoPthread>
has
been introduced to indicate
a function as I<not> being pthread-safe. See L<PDL::PP> docs
for
details.
=head2 Size of PDL Dimensions and PThread Target
Due to the way a PDL is
split
-up
for
operation using multiple pthreads, the size of a dimension
must be evenly divisible by the pthread target. For example,
if
a PDL
has
threaded dimension sizes
of (4,3,3) and the I<auto_pthread_targ>
has
been set to 2, then the first threaded dimension (size 4) will
be picked to be
split
up into two pthreads of size 2 and 2. However,
if
the threaded dimension sizes are
(3,3,3) and the I<auto_pthread_targ> is still 2, then pthreading won't occur, because
no
threaded dimensions
are divisible by 2.
The algorithm that picks the actual number of pthreads
has
some smarts (but could probably be improved)
to adjust down from the I<auto_pthread_targ> to get a number of pthreads that can evenly divide one of the
threaded dimensions. For example,
if
a PDL
has
threaded dimension sizes of (9,2,2) and the
I<auto_pthread_targ> is 4, the algorithm will see that
no
dimension is divisible by 4, then adjust
down the target to 3, resulting in splitting up the first threaded dimension (size 9) into 3 pthreads.
=head2 Speed improvement might be less than you expect.
If you have a 8 core machine and call I<auto_pthread_targ>
with
8 to generate 8 parallel pthreads, you
probably won't get a 8X improvement in speed, due to memory bandwidth issues. Even though you have 8 separate
CPUs crunching away on data, you will have (
for
most common machine architectures) common RAM that now becomes
your bottleneck. For simple calculations (e.g simple additions) you can run into a performance limit at about
4 pthreads. For more complex calculations the limit will be higher.
=head1 COPYRIGHT
Copyright 2011 John Cerney. You can distribute and/or
modify this document under the same terms as the current Perl license.