NAME
Archive::Tar::Stream - pure perl IO-friendly tar file management
VERSION
Version 0.01
SYNOPSIS
Archive::Tar::Stream grew from a requirement to process very large archives containing email backups, where the IO hit for unpacking a tar file, repacking parts of it, and then unlinking all the files was prohibitive.
Archive::Tar::Stream takes two file handles, one purely for reads, one purely for writes. It does no seeking, it just unpacks individual records from the input filehandle, and packs records to the output filehandle.
This module does not attempt to do any file handle management or compression for you. External zcat and gzip are quite fast and use separate cores.
use Archive::Tar::Stream;
my $ts = Archive::Tar::Stream->new(outfh => $fh);
$ts->AddFile($name, -s $fh, $fh);
# remove large non-jpeg files from a tar.gz
use IO::File;
my $infh = IO::File->new("zcat $infile |") || die "oops";
my $outfh = IO::File->new("| gzip > $outfile") || die "double oops";
my $ts = Archive::Tar::Stream->new(infh => $infh, outfh => $outfh);
$ts->StreamCopy(sub {
    my ($header, $outpos, $fh) = @_;
    # we want all small files
    return 'KEEP' if $header->{size} < 64 * 1024;
    # and any other jpegs
    return 'KEEP' if $header->{name} =~ m/\.jpg$/i;
    # no, seriously
    return 'EDIT' unless $fh;
    return 'KEEP' if mimetype_of_filehandle($fh) eq 'image/jpeg';
    # ok, we don't want other big files
    return 'SKIP';
});
SUBROUTINES/METHODS
new
my $ts = Archive::Tar::Stream->new(%args);
Args:
infh - filehandle to read from
outfh - filehandle to write to
inpos - initial offset in infh
outpos - initial offset in outfh
safe_copy - boolean
Offsets are for informational purposes only, but can be useful if you are tracking offsets of items within your tar files separately. All read and write functions update these offsets. If you don't provide offsets, they will default to zero.
Safe Copy is the default - you have to explicitly turn it off. If Safe Copy is set, every file is first extracted from the input filehandle and stored in a temporary file before appending to the output filehandle. This uses slightly more IO, but guarantees that a truncated input file will not corrupt the output file.
SafeCopy
$ts->SafeCopy(0);
Sets the "safe_copy" field mentioned above to the given boolean value.
InPos
OutPos
Read-only accessors for the internal position trackers of the two tar streams: InPos for the input filehandle, OutPos for the output filehandle.
AddFile
Adds a file to the output filehandle, adding sensible defaults for all the extra header fields.
Requires: outfh
my $header = $ts->AddFile($name, $size, $fh, %extra);
See TARHEADER for documentation of the header fields.
You must provide 'size' due to the non-seeking nature of this library, but "-s $fh" is usually fine.
Returns the complete header that was written.
AddLink
my $header = $ts->AddLink($name, $linkname, %extra);
Adds a symlink to the output filehandle.
See TARHEADER for documentation of the header fields.
Returns the complete header that was written.
StreamCopy
Streams all records from the input filehandle and provides an easy way to write them to the output filehandle.
Requires: infh
Optional: outfh - required if you return 'KEEP'
$ts->StreamCopy(sub {
    my ($header, $outpos, $fh) = @_;
    # ...
    return 'KEEP';
});
The chooser function can either return a single 'action' or a tuple of action and a new header.
The action can be:
KEEP - copy this file as is (possibly changed header) to output tar
EDIT - re-call $Chooser with filehandle
SKIP - skip over the file and call $Chooser on the next one
EXIT - skip and also stop further processing
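As an illustration of returning an action together with a changed header, here is a minimal sketch of a chooser that drops temporary files and moves everything else under a new directory. It assumes the "tuple" is returned as a plain Perl list and that modifying the passed-in header and returning it is acceptable. The "archive/" prefix and the ".tmp" rule are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical chooser: the ".tmp" rule and "archive/" prefix are
# illustrative only, not anything the module requires.
my $chooser = sub {
    my ($header, $outpos, $fh) = @_;

    # a single action: drop temporary files
    return 'SKIP' if $header->{name} =~ m/\.tmp$/;

    # an action plus a header: keep the file under a new name
    $header->{name} = "archive/$header->{name}";
    return ('KEEP', $header);
};

# With a stream object set up as in the SYNOPSIS:
# $ts->StreamCopy($chooser);
```

Because the chooser is an ordinary code reference, it can be exercised with hand-built header hashes before being pointed at a real archive.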
EDIT mode:
The file will be copied to a temporary file and the filehandle passed to $Chooser. It can truncate, rewrite, or edit the file as needed. So long as it updates $header->{size} and returns it as $newheader, it's all good.
You don't have to change the file, of course; EDIT is also useful simply as a way to view the contents of some files as you stream them.
A standard usage pattern looks like this:
$ts->StreamCopy(sub {
    my ($header, $outpos, $fh) = @_;
    # simple checks
    return 'KEEP' if do_want($header);
    return 'SKIP' if dont_want($header);
    return 'EDIT' unless $fh;
    # checks that require a filehandle
});
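To make the "checks that require a filehandle" step concrete, the following sketch keeps small files outright and keeps large files only if they start with the JPEG magic bytes (FF D8 FF). The 64KB threshold and the magic-byte test are choices made for this example. Whether the module positions or rewinds the temporary filehandle around the call is not documented here, so the sketch rewinds it itself both before and after reading.

```perl
use strict;
use warnings;

# Hypothetical chooser; threshold and JPEG test are illustrative.
my $chooser = sub {
    my ($header, $outpos, $fh) = @_;

    # simple check on the header alone
    return 'KEEP' if $header->{size} < 64 * 1024;

    # first call has no filehandle: ask to be called again with one
    return 'EDIT' unless $fh;

    # second call: peek at the first three bytes of the content
    binmode($fh);
    seek($fh, 0, 0);                 # assumption: rewind to be safe
    my $got = read($fh, my $magic, 3);
    seek($fh, 0, 0);                 # leave the handle at the start again

    return 'KEEP' if $got && $got == 3 && $magic eq "\xFF\xD8\xFF";
    return 'SKIP';
};

# $ts->StreamCopy($chooser);
```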
ReadBlocks
Requires: infh
my $raw = $ts->ReadBlocks($nblocks);
Reads 'n' blocks of 512 bytes from the input filehandle and returns them as a single scalar.
Returns undef at EOF on the input filehandle. Any further calls after undef is returned will die. This is to avoid naive programmers creating infinite loops.
nblocks is optional, and defaults to 1.
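The general technique of pulling whole 512 byte blocks off a filehandle without seeking can be sketched in plain Perl as below. This is not the module's implementation: read_blocks is a hypothetical stand-alone function, and dying on a partial block is an assumption made for the sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: read $nblocks blocks of 512 bytes from $fh.
# Returns undef at a clean EOF and dies on a partial block.
sub read_blocks {
    my ($fh, $nblocks) = @_;
    $nblocks ||= 1;
    my $want = 512 * $nblocks;
    my $buf  = '';

    # pipes may return short reads, so loop until the buffer is full
    while (length($buf) < $want) {
        my $n = read($fh, $buf, $want - length($buf), length($buf));
        die "read error: $!" unless defined $n;
        last if $n == 0;             # EOF
    }

    return if length($buf) == 0;     # nothing left: undef in scalar context
    die "truncated input" if length($buf) < $want;
    return $buf;
}
```

A caller would typically loop with while (defined(my $raw = read_blocks($fh))) { ... } so that an all-zero block, which is a true-length string, does not end the loop early.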
WriteBlocks
Requires: outfh
my $pos = $ts->WriteBlocks($buffer, $nblocks);
Write blocks to the output filehandle. If the buffer is too short, it will be padded with zero bytes. If it's too long, it will be truncated.
nblocks is optional, and defaults to 1.
Returns the position of the header in the output stream.
ReadHeader
Requires: infh
my $header = $ts->ReadHeader(%Opts);
Read a single 512 byte header off the input filehandle and convert it to a TARHEADER format hashref. Returns undef at the end of the file.
If the option (SkipInvalid => 1) is passed, it will skip over blocks which fail to pass the checksum test.
WriteHeader
Requires: outfh
my $newheader = $ts->WriteHeader($header);
Writes a single 512 byte header, built from the TARHEADER format hashref, to the output filehandle.
Returns a copy of the header with _pos set to the position in the output file.
ParseHeader
my $header = $ts->ParseHeader($block);
Parse a single block of raw bytes into a TARHEADER format header. $block must be exactly 512 bytes.
Returns undef if the block fails the checksum test.
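For readers unfamiliar with the checksum test, the ustar format stores the checksum as octal digits in the 8 bytes at offset 148, and computes it as the sum of all 512 bytes with that field treated as eight spaces. The sketch below shows that check in isolation. checksum_ok is a hypothetical helper, not part of this module, and it only covers the unsigned-byte variant of the sum.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $block is 512 bytes long and its stored
# checksum matches the computed one.
sub checksum_ok {
    my ($block) = @_;
    return 0 unless length($block) == 512;

    # stored checksum: octal digits, usually followed by NUL and space
    my ($stored) = substr($block, 148, 8) =~ /^\s*([0-7]+)/
        or return 0;

    # compute over a copy with the checksum field blanked to spaces
    my $copy = $block;
    substr($copy, 148, 8) = ' ' x 8;
    my $sum = unpack('%32C*', $copy);    # sum of all byte values

    return $sum == oct($stored) ? 1 : 0;
}
```

An all-zero block, such as the end-of-archive padding, has no octal digits in the checksum field and so fails this test.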
BlankHeader
my $header = $ts->BlankHeader(%extra);
Create a header with sensible defaults. That means time() for mtime, 0777 for mode, etc.
It then applies any 'extra' fields from %extra to generate a final header. Also validates the keys in %extra to make sure they're all known keys.
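The defaults-plus-validation pattern described above might look roughly like the following. This is a stand-alone sketch, not the module's code: blank_header is a hypothetical function, and dying on an unknown key is an assumption about how validation failures are reported.

```perl
use strict;
use warnings;

# Default values, mirroring the fields listed under TARHEADER format.
my %DEFAULTS = (
    name => '', mode => 0777, uid => 0, gid => 0, size => 0,
    typeflag => '0', linkname => '', uname => '', gname => '',
    devmajor => 0, devminor => 0, prefix => '',
);

# Hypothetical helper: build a header hashref from defaults and %extra.
sub blank_header {
    my (%extra) = @_;
    my %header = (%DEFAULTS, mtime => time());

    for my $key (keys %extra) {
        # reject anything that is not a known header field
        die "unknown header key: $key\n" unless exists $header{$key};
        $header{$key} = $extra{$key};
    }
    return \%header;
}
```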
CreateHeader
my $block = $ts->CreateHeader($header);
Creates a 512 byte block from the TARHEADER format header.
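As a rough picture of what turning a header hash into a 512 byte block involves, the sketch below packs the standard ustar fields with Perl's pack and then fills in the checksum. It is an independent illustration under the ustar layout, not this module's implementation, and make_header_block is a hypothetical name. It expects every field from the TARHEADER hash to be supplied.

```perl
use strict;
use warnings;

# ustar field widths: 500 bytes of fields plus 12 bytes of NUL padding.
my $TEMPLATE =
    'a100 a8 a8 a8 a12 a12 a8 a1 a100 a6 a2 a32 a32 a8 a8 a155 x12';

# Hypothetical helper: build one 512 byte header block from %h.
sub make_header_block {
    my (%h) = @_;
    my $block = pack($TEMPLATE,
        $h{name},
        sprintf('%07o', $h{mode}),
        sprintf('%07o', $h{uid}),
        sprintf('%07o', $h{gid}),
        sprintf('%011o', $h{size}),
        sprintf('%011o', $h{mtime}),
        ' ' x 8,                     # checksum placeholder: eight spaces
        $h{typeflag},
        $h{linkname},
        'ustar', '00',               # magic and version
        $h{uname}, $h{gname},
        sprintf('%07o', $h{devmajor}),
        sprintf('%07o', $h{devminor}),
        $h{prefix},
    );

    # checksum is the byte sum of the block with the field set to spaces
    my $sum = unpack('%32C*', $block);
    substr($block, 148, 8) = sprintf("%06o\0 ", $sum);
    return $block;
}
```

Numeric fields are stored as zero-padded octal text, which is why the size field caps plain ustar entries at 8GB minus one byte.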
CopyBytes
$ts->CopyBytes($bytes);
Copies bytes from input to output filehandle, rounded up to block size, so only whole blocks are actually copied.
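The rounding itself is simple arithmetic. A minimal sketch, with padded_size as a hypothetical helper name that is not part of the module:

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes actually occupied on the stream
# by $bytes of content, rounded up to a whole number of 512 byte blocks.
sub padded_size {
    my ($bytes) = @_;
    return 512 * int(($bytes + 511) / 512);
}

# padded_size(0)   is 0
# padded_size(1)   is 512
# padded_size(512) is 512
# padded_size(513) is 1024
```

This is also why position trackers such as OutPos would be expected to advance in multiples of 512, whatever the size recorded in the header.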
DumpBytes
$ts->DumpBytes($bytes);
Just like CopyBytes, but it doesn't write anywhere. Reads full blocks off the input filehandle, rounding up to block size.
FinishTar
$ts->FinishTar();
Writes 5 blocks of zero bytes to the output file, which satisfies GNU tar that it has found the end of the archive.
Don't use this if you're planning on concatenating multiple files together.
CopyToTempFile
my $fh = $ts->CopyToTempFile($header->{size});
Creates a temporary file (with File::Temp) and fills it with the contents of the file on the input stream. It reads entire blocks, and discards the padding.
CopyFromFh
$ts->CopyFromFh($fh, $header->{size});
Copies the contents of the filehandle to the output stream, padding out to block size.
TARHEADER format
This is the "BlankHeader" output, which includes all the fields in a standard tar header:
my %hash = (
    name => '',
    mode => 0777,
    uid => 0,
    gid => 0,
    size => 0,
    mtime => time(),
    typeflag => '0', # this is actually the STANDARD plain file format, phooey. Not 'f' like Tar writes
    linkname => '',
    uname => '',
    gname => '',
    devmajor => 0,
    devminor => 0,
    prefix => '',
);
You can read more about the tar header format produced by this module on wikipedia: http://en.wikipedia.org/wiki/Tar_(file_format)#UStar_format or here: http://www.mkssoftware.com/docs/man4/tar.4.asp
Type flags:
'0' Normal file
(ASCII NUL) Normal file (now obsolete)
'1' Hard link
'2' Symbolic link
'3' Character special
'4' Block special
'5' Directory
'6' FIFO
'7' Contiguous file
Obviously some module wrote 'f' as the type - I must have found that during original testing. That's bogus though.
AUTHOR
Bron Gondwana, <perlcode at brong.net>
BUGS
Please report any bugs or feature requests to bug-archive-tar-stream at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Archive-Tar-Stream. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Archive::Tar::Stream
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
LATEST COPY
The latest copy of this code, including development branches, can be found at
http://github.com/brong/Archive-Tar-Stream/
LICENSE AND COPYRIGHT
Copyright 2011 Opera Software Australia Pty Limited
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.