NAME
PDL::Parallel::threads - sharing PDL data between Perl threads
SYNOPSIS
use PDL;
use PDL::Parallel::threads qw(retrieve_pdls share_pdls);
# Technically, this is pulled in for you by PDL::Parallel::threads,
# but using it in your code pulls in the named functions like async.
use threads;
# Also, technically, you can use PDL::Parallel::threads with
# single-threaded programs, and even with perls that are not
# compiled with thread support.
# Create some shared PDL data
zeroes(1_000_000)->share_as('My::shared::data');
# Create an ndarray and share its data
my $test_data = sequence(100);
share_pdls(some_name => $test_data); # allows multiple at a time
$test_data->share_as('some_name'); # or use the PDL method
# Or work with memory mapped files:
share_pdls(other_name => 'mapped_file.dat');
# Kick off some processing in the background
async {
    my ($shallow_copy, $mapped_ndarray)
        = retrieve_pdls('some_name', 'other_name');
    # thread-local memory
    my $other_ndarray = sequence(20);
    # Modify the shared data:
    $shallow_copy++;
};
# ... do some other stuff ...
# Rejoin all threads
for my $thr (threads->list) {
    $thr->join;
}
use PDL::NiceSlice;
print "First ten elements of test_data are ",
$test_data(0:9), "\n";
DESCRIPTION
This module provides a means to share PDL data between different Perl threads. In contrast to PDL's posix thread support (see PDL::ParallelCPU), this module lets you work with Perl's built-in threading model. In contrast to Perl's threads::shared, this module focuses on sharing data, not variables.
Because this module focuses on sharing data, not variables, it does not use attributes to mark shared variables. Instead, you must explicitly share your data by using the "share_pdls" function or "share_as" PDL method that this module introduces. Both associate a name with your data, which you then use in other threads to retrieve the data with the "retrieve_pdls" function. Once your thread has access to the ndarray data, any modifications will operate directly on the shared memory, which is exactly what shared data is supposed to do. When you are completely done using a piece of data, you need to explicitly remove the data from the shared pool with the "free_pdls" function. Otherwise your data will continue to consume memory until the originating thread terminates, or put differently, you will have a memory leak.
This module lets you share two sorts of ndarray data. You can share data for an ndarray that is based on actual physical memory, such as the result of "zeroes" in PDL::Core. You can also share data using memory mapped files. (Note: PDL v2.4.11 and higher support memory mapped ndarrays on all major platforms, including Windows.) There are other sorts of ndarrays whose data you cannot share. You cannot directly share ndarrays that have not been physicalised, though a simple "make_physical" in PDL::Core, "sever" in PDL::Core, or "copy" in PDL::Core will give you an ndarray based on physical memory that you can share. Also, certain functions wrap external data into ndarrays so you can manipulate them with PDL methods. For example, see "plmap" in PDL::Graphics::PLplot and "plmeridians" in PDL::Graphics::PLplot. These you cannot share directly, but making a physical copy with "copy" in PDL::Core will give you something that you can safely share.
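For example, a slice has not been physicalised and still flows back to its parent, so take a physical copy (or sever it) before sharing. Here is a minimal sketch of that pattern; the ndarray and the shared name 'roi' are purely illustrative:
use PDL;
use PDL::Parallel::threads qw(retrieve_pdls);
my $image = sequence(100, 100);
# The bare slice carries data flow back to $image; share a physical copy instead
my $roi = $image->slice('0:49,0:49')->copy;
$roi->share_as('roi');
# Later, possibly from another thread in the same package:
my $roi_view = retrieve_pdls('roi');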
Physical Memory
The mechanism by which this module achieves data sharing of physical memory is remarkably cheap. It's even cheaper than a simple affine transformation. The sharing works by creating a new shell of an ndarray for each call to "retrieve_pdls" and setting that ndarray's memory structure to point back to the same locations as the original (shared) ndarray. This means that you can share ndarrays that are created with standard constructors like "zeroes" in PDL::Core, "pdl" in PDL::Core, and "sequence" in PDL::Basic, or which are the result of operations and function evaluations for which there is no data flow, such as "cat" in PDL::Core (but not "dog" in PDL::Core), arithmetic, "copy" in PDL::Core, and "sever" in PDL::Core. When in doubt, sever your ndarray before sharing and everything should work.
There is an important nuance to sharing physical memory: The memory will always be freed when the originating thread terminates, even if it terminates cleanly. This can lead to segmentation faults when one thread exits and frees its memory before another thread has had a chance to finish calculations on the shared data. It is best to use barrier synchronization to avoid this (via PDL::Parallel::threads::SIMD), or to share data solely from your main thread.
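As a sketch of the simpler of those two approaches, the main thread below owns the shared data and joins every worker before the data can go away. The shared name 'work' and the worker body are illustrative only:
use PDL;
use threads;
use PDL::Parallel::threads qw(retrieve_pdls free_pdls);
# Share from the main thread only, so the memory outlives the workers
zeroes(10)->share_as('work');
my @workers = map {
    threads->create(sub {
        my $work = retrieve_pdls('work');
        $work->set($_[0], $_[0] ** 2); # writes land in the shared memory
    }, $_);
} 0 .. 9;
# Join every worker before the originating (main) thread frees the data
$_->join for @workers;
my $result = retrieve_pdls('work');
print "$result\n";
free_pdls('work');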
Memory Mapped Data
The mechanism by which this module achieves data sharing of memory mapped files is exactly how you would share data across threads or processes using PDL::IO::FastRaw. However, there are a couple of important caveats to using memory mapped ndarrays with PDL::Parallel::threads. First, you must load PDL::Parallel::threads before loading PDL::IO::FastRaw:
# Good
use PDL::Parallel::threads qw(retrieve_pdls);
use PDL::IO::FastRaw;
# BAD
use PDL::IO::FastRaw;
use PDL::Parallel::threads qw(retrieve_pdls);
This is necessary because PDL::Parallel::threads has to perform a few internal tweaks to PDL::IO::FastRaw before you load its functions into your local package.
Furthermore, any memory mapped file must have a header file associated with the data file. That is, if the data file is foo.dat, you must have a header file called foo.dat.hdr. This is overly restrictive, and in the future the module may perform more internal tweaks to PDL::IO::FastRaw to store whatever options were used to create the original ndarray. In the meantime, be sure that you have a header file for your raw data file.
There is much less nuance to sharing memory mapped data across threads compared to directly sharing physical memory as discussed above. When you ask for a thread-local copy of that file, you get your very own fully baked memory-mapped ndarray that gets freed when the ndarray goes out of scope. This means you cannot get memory leaks. Furthermore, the data underlying the ndarray comes from a file and not from a shared space in RAM. That means there is no "originating thread", and you cannot trigger a segmentation fault by trying to access memory that has disappeared, because... there's nothing that can disappear.
You may ask yourself why loading this module must come before loading the FastRaw module. The reason is that PDL::IO::FastRaw exports a few functions into your namespace, and PDL::Parallel::threads modifies one of those exported functions. If you pull in FastRaw before this module, this module won't have been able to work its magic on FastRaw first, and the functions in your package won't be the ones needed for proper sharing of memory mapped data. Put differently, the earlier you can manage to use PDL::Parallel::threads, the better.
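Putting the pieces together, here is a minimal sketch of sharing a memory mapped file. The file name data.dat is purely illustrative; writefraw from PDL::IO::FastRaw is used here because it writes the required data.dat.hdr header alongside the data file:
# Load order matters: this module before PDL::IO::FastRaw
use PDL;
use PDL::Parallel::threads qw(retrieve_pdls share_pdls free_pdls);
use PDL::IO::FastRaw;
use threads;
# Create the raw data file and its header (data.dat and data.dat.hdr)
writefraw(sequence(1000), 'data.dat');
# Share the file under a name...
share_pdls(mapped => 'data.dat');
# ...and let a worker map its own thread-local copy
threads->create(sub {
    my $mapped = retrieve_pdls('mapped');
    $mapped->slice('0:9') .= 0; # writes go straight to the file
})->join;
free_pdls('mapped');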
Package and Name Munging
PDL::Parallel::threads lets you associate your data with a specific text name. Put differently, it provides a global namespace for data. Users of the C programming language will immediately notice that this means there is plenty of room for developers using this module to choose the same name for their data. Without some combination of discipline and help, it would be easy for shared memory names to clash. One solution to this would be to require users (i.e. you) to choose names that include their current package, such as My-Module-workspace or, following perlpragma, My::Module/workspace instead of just workspace. This is sometimes called name mangling. Well, I decided that this is such a good idea that PDL::Parallel::threads does the second form of name mangling for you automatically! Of course, you can opt out, if you wish.
The basic rule is that the package name is prepended to the name of the shared memory as long as the name is composed only of word characters, i.e. names matching /^\w+$/. Here's an example demonstrating how this works:
package Some::Package;
use PDL;
use PDL::Parallel::threads 'retrieve_pdls';
# Stored under '??foo'
sequence(20)->share_as('??foo');
# Shared as 'Some::Package/foo'
zeroes(100)->share_as('foo');
sub do_something {
    # Retrieve 'Some::Package/foo'
    my $copy_of_foo = retrieve_pdls('foo');
    # Retrieve '??foo':
    my $copy_of_weird_foo = retrieve_pdls('??foo');
    # ...
}
# Move to a different package:
package Other::Package;
use PDL::Parallel::threads 'retrieve_pdls';
sub something_else {
    # Retrieve 'Some::Package/foo'
    my $copy_of_foo = retrieve_pdls('Some::Package/foo');
    # Retrieve '??foo':
    my $copy_of_weird_foo = retrieve_pdls('??foo');
    # ...
}
The upshot of all of this is that if you use some module that also uses PDL::Parallel::threads, namespace clashes are highly unlikely to occur as long as you (and the author of that other module) use simple names, like the sort of thing that works for variable names.
FUNCTIONS
This module provides three stand-alone functions and adds one new PDL method.
share_pdls
Shares ndarray data across threads using the given names.
share_pdls (name => ndarray|filename, name => ndarray|filename, ...)
This function takes key/value pairs where the value is the ndarray to store or the file name to memory map, and the key is the name under which to store the ndarray or file name. You can later retrieve the memory (or an ndarray mapped to the given file name) with the "retrieve_pdls" method.
Sharing an ndarray with physical memory increments the data's reference count; you can decrement the reference count by calling "free_pdls" on the given name. In general this ends up doing what you mean, freeing memory only when you are really done using it. You do not need to worry about reference counting for memory mapped data, as there is always a persistent copy on disk.
my $data1 = zeroes(20);
my $data2 = ones(30);
share_pdls(foo => $data1, bar => $data2);
This can be combined with constructors and fat commas to allocate a collection of shared memory that you may need to use for your algorithm:
share_pdls(
    main_data => zeroes(1000, 1000),
    workspace => zeroes(1000),
    reduction => zeroes(100),
);
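Once every thread is done with those work spaces, the same names can be handed to "free_pdls" (described below) so the shared memory does not linger until the originating thread exits:
free_pdls('main_data', 'workspace', 'reduction');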
share_pdls does not pay attention to bad values. There is no technical reason for this: it simply hadn't occurred to me until I had to write the bad-data documentation. Expect it to happen in a forthcoming release. :-)
share_as
Method to share an ndarray's data across threads under the given name.
$pdl->share_as(name)
This PDL method lets you directly share an ndarray. It does exactly the same thing as "share_pdls", but its invocation is a little different:
# Directly share some constructed memory
sequence(20)->share_as('baz');
# Share individual ndarrays:
my $data1 = zeroes(20);
my $data2 = ones(30);
$data1->share_as('foo');
$data2->share_as('bar');
Like many other PDL methods, this method returns the just-shared ndarray. This can lead to some amusing ways of storing partial calculations partway through a long chain:
my $results = $input->sumover->share_as('pre_offset') + $offset;
# Now you can get the result of the sumover operation
# before that offset was added, by calling:
my $pre_offset = retrieve_pdls('pre_offset');
This method achieves the same end as "share_pdls"; it exists because it can make for easier-to-read code (There's More Than One Way To Do It). In general I recommend using the share_as method when you only need to share a single ndarray's memory space.
share_as does not pay attention to bad values. There is no technical reason for this: it simply hadn't occurred to me until I had to write the bad-data documentation. Expect it to happen in a forthcoming release. :-)
retrieve_pdls
Obtain ndarrays providing access to the data shared under the given names.
my ($copy1, $copy2, ...) = retrieve_pdls (name, name, ...)
This function takes a list of names and returns a list of ndarrays that provide access to the data shared under those names. In scalar context the function returns the ndarray corresponding to the first named data set, which is usually what you mean when you use a single name. If you specify multiple names but call the function in scalar context, you will get a warning indicating that you probably meant something different.
my $local_copy = retrieve_pdls('foo');
my @both_ndarrays = retrieve_pdls('foo', 'bar');
my ($foo, $bar) = retrieve_pdls('foo', 'bar');
retrieve_pdls does not pay attention to bad values. There is no technical reason for this: it simply hadn't occurred to me until I had to write the bad-data documentation. Expect it to happen in a forthcoming release. :-)
free_pdls
Frees the shared memory (if any) associated with the named shared data.
free_pdls(name, name, ...)
This function marks the memory associated with the given names as no longer being shared, handling all reference counting and other low-level stuff. You generally won't need to worry about the return value. But if you care, you get a list of values---one for each name---where a successful removal gets the name and an unsuccessful removal gets an empty string.
So, if you say free_pdls('name1', 'name2') and both removals were successful, you will get ('name1', 'name2') as the return values. If there was trouble removing name1 (because there is no memory associated with that name), you will get ('', 'name2') instead. This means you can handle trouble with Perl greps and other conditionals:
my @to_remove = qw(name1 name2 name3 name4);
my @results = free_pdls(@to_remove);
if (not grep {$_ eq 'name2'} @results) {
    print "That's weird; did you remove name2 already?\n";
}
if (not $results[2]) {
    print "Couldn't remove name3 for some reason\n";
}
This function simply removes an ndarray's memory from the shared pool. It does not interact with bad values in any way. But then again, it does not interfere with or screw up bad values, either.
DIAGNOSTICS
- You called share_pdls with an odd number of arguments, which means that you could not have supplied key/value pairs. Double-check that every ndarray (or filename) that you supply is preceded by its shared name.
- You tried to share some data under $name, but some data is already associated with that name. Typo? You can avoid namespace clashes with other modules by using simple names and letting PDL::Parallel::threads mangle the name internally for you.
- "... the ndarray does not have any allocated memory."
  You tried to share an ndarray that does not have any memory associated with it.
- "... the ndarray has no datasv, which means it's probably a special ndarray."
  You tried to share an ndarray that has no datasv. This usually happens when you try to wrap an ndarray around some externally provided data. It may also happen when you've managed to get data from PDL::IO::FastRaw and you've used the wrong loading order (see "Memory Mapped Data"), or perhaps when you try to share data that you've mapped using PDL::IO::FlexRaw.
- "... the ndarray's data does not come from the datasv."
  You tried to share an ndarray that has a funny internal structure, in which the data does not point to the buffer portion of the datasv. I'm not sure how that could happen without triggering a more specific error, so I hope you know what's going on if you get this. :-)
- "... there is no associated header file"
  The header file must have the name "$to_store.hdr". If it doesn't, this module won't be able to map the file.
- "... you do not have permissions to read the associated header file"
  There seems to be a permissions issue and this module cannot open the header file associated with your mapped data. Check the permissions?
- "... you do not have write permissions for that file"
  Yes, ostensibly you can work with a memory mapped file that is read-only, but that's complicated and I didn't want to have to figure out how to mark your shared ndarray as read-only. Patches welcome!
- "... the file does not exist"
  The file to memory map doesn't exist. Typo, perhaps?
- share_pdls only knows how to store memory mapped files and raw data ndarrays. It'll croak if you try to share other kinds of ndarrays, and it'll throw this error if you try to share anything else, like a hashref.
- "retrieve_pdls: '$name' was created in a thread that has ended or is detached"
  In some other thread, you added some data to the shared pool. If that thread ended without you freeing that data (or the thread has become a detached thread), then we cannot know if the data is available. You should always free your data from the data pool when you're done with it, to avoid this error.
- "retrieve_pdls could not find data associated with '$name'"
  Pretty simple: either data has never been added under this name, or data under this name has been removed.
- "retrieve_pdls: requested many ndarrays... in scalar context?"
  This is just a warning. You requested multiple ndarrays (sent multiple names), but you called the function in scalar context. Why do such a thing?
LIMITATIONS
You cannot share memory mapped files that require features of PDL::IO::FlexRaw. That is a cool module that lets you pack multiple ndarrays into a single file, but simple cross-thread sharing is not trivial and is not (yet) supported.
If you are dealing with a physical ndarray (i.e. not memory mapped), you have to be a bit careful about how the memory gets freed. If you don't call free_pdls on the data, it will persist in memory until the end of the originating thread, which means you have a classic memory leak. On the other hand, if another thread creates a thread-local copy of the data before the originating thread ends, but then tries to access the data after the originating thread ends, you will get a segmentation fault.
Finally, you must load PDL::Parallel::threads before loading PDL::IO::FastRaw if you wish to share your memory mapped ndarrays. Also, you must have a .hdr file for your data file, which is not strictly necessary when using mapfraw. Hopefully that limitation will be lifted in forthcoming releases of this module.
BUGS
None known at this point.
SEE ALSO
PDL::ParallelCPU, MPI, PDL::Parallel::MPI, OpenCL, threads, threads::shared
AUTHOR, COPYRIGHT, LICENSE
This module was written by David Mertens. The documentation is copyright (C) David Mertens, 2012. The source code is copyright (C) Northwestern University, 2012. All rights reserved.
This module is distributed under the same terms as Perl itself.
DISCLAIMER OF WARRANTY
Parallel computing is hard to get right, and it can be exacerbated by errors in the underlying software. Please do not use this software in anything that is mission-critical unless you have tested and verified it yourself. I cannot guarantee that it will perform perfectly under all loads. I hope this is useful and I wish you well in your usage thereof, but BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.