NAME

File::Unpack - An aggressive bz2/gz/zip/tar/cpio/rpm/deb/cab/lzma/7z/rar/... archive unpacker, based on mime-types

VERSION

Version 0.36

SYNOPSIS

This perl module comes with an executable script:

/usr/bin/file_unpack -h

/usr/bin/file_unpack [-1] [-m] ARCHIVE...

File::Unpack is an aggressive unpacker for archive files. Aggressive means that it can recursively descend into freshly unpacked files if they are archives themselves. It also uncompresses files where needed. File::Unpack will extract as much readable text (ascii or any other encoding) as possible. Most of the currently known archive file formats are supported.

use File::Unpack;

my $log;
my $u = File::Unpack->new(logfile => \$log);

my $m = $u->mime('/etc/init.d/rc');
print "$m->[0]; charset=$m->[1]\n";
# text/x-shellscript; charset=us-ascii

print "$_->{name}\n" for @{$u->mime_handler()};
# application/%rpm
# application/%tar+gzip
# application/%tar+bzip2
# ...

$u->unpack("inputfile.tar.bz2");
while ($log =~ m{^\s*"(.*?)":}g) # it's JSON.
  {
    print "$1\n"; 	# report all files unpacked
  }

...

unpack() examines the contents of an archive file or directory using an extensive mime-type analysis. The contents are unpacked recursively to the given destination directory; a listing of the unpacked files is reported through the built-in logging facility during unpacking. Most common archive file formats are handled directly; more can easily be added as mime-type helper plugins.

SUBROUTINES/METHODS

new

my $u = File::Unpack->new(destdir => '.', logfile => \*STDOUT, maxfilesize => '100M', verbose => 1);

Creates an unpacker instance. The parameter destdir must be a writable location; all output files and directories are placed inside this destdir. Subdirectories will be created in an attempt to reflect the structure of the input. Destdir defaults to the current directory; relative paths are resolved immediately, so that chdir() after calling new is harmless.

The parameter logfile can be a reference to a scalar, a filename, or a file descriptor. The logfile starts with a JSON formatted prolog, where all lines start with printable characters. For each file unpacked, a one line record is appended, starting with a single whitespace ' ', and terminated by "\n". The format is a JSON-encoded "key": {value},\n pair, where key is the filename, and value is a hash including 'mime', 'size', and other information. The logfile is terminated by an epilog, where each line starts with a printable character. As part of the epilog, a dummy file named "\" with an empty hash is added to the list. It should be ignored while parsing. By default, the logfile is sent to STDOUT.
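The line-oriented layout described above can be parsed without decoding the whole log. The following is a minimal sketch, not the module's own parser: the sample log is hypothetical (only the shape of the record lines follows the description; the prolog and epilog contents are made up), and parse_log_records is a name invented for this example.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical log excerpt. Only the shape of the record lines follows the
# description above; the prolog and epilog contents are made up.
my $sample_log = <<'END_LOG';
{ "input": "demo.tar.gz", "unpacked": {
 "demo/README": {"mime": "text/plain", "size": 120},
 "demo/bin/tool": {"mime": "text/x-shellscript", "size": 2048},
"\\": {}}}
END_LOG

# Return a hash ref mapping each unpacked file name to its record.
sub parse_log_records {
    my ($text) = @_;
    my %files;
    for my $line (split /\n/, $text) {
        # File records start with one space and end with a comma;
        # everything else is prolog or epilog and is skipped.
        next unless $line =~ /^ (".*),$/;
        my $record = JSON::PP->new->decode('{' . $1 . '}');
        @files{ keys %$record } = values %$record;
    }
    return \%files;
}

my $files = parse_log_records($sample_log);
printf "%s (%s, %d bytes)\n", $_, $files->{$_}{mime}, $files->{$_}{size}
    for sort keys %$files;
```

Because the dummy "\" entry sits on an epilog line in this sample, the record filter never sees it.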

The parameter maxfilesize is a safeguard against compressed sparse files. Such files could easily fill up any available disk space when unpacked. Files hitting this limit will be silently truncated. Check the logfile records or epilog to see if this has happened. BSD::Resource is used to manipulate RLIMIT_FSIZE.

The parameter one_shot can optionally be set to non-zero, to limit unpacking to one step of unpacking. Unpacking of well-known compressed archives like e.g. '.tar.bz2' is considered one step only. Whether uncompressing is considered an extra step depends on the configured mime helpers.

exclude

exclude(add => ['.svn', '*.orig' ], del => '.svn', force => 1)

Defines the exclude-list for unpacking. This list is advisory for the mime-handlers. The exclude-list items are shell glob patterns, where '*' or '?' never match '/'.

You can use force to have any of these removed after unpacking. Use (vcs => 1) to exclude a long list of known version control system directories, use (vcs => 0) to remove them. The default is exclude(empty => 1), which is the same as exclude(empty_file => 1, empty_dir => 1) -- having the obvious meaning.

(re => 1) returns the active exclude-list as a regexp pattern. Otherwise exclude always returns the list as an array ref.

Symbolic links are always excluded.
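The glob semantics of the exclude-list ('*' and '?' never match '/') can be illustrated as follows. This is a sketch under assumptions, not the module's implementation: glob_to_regexp and is_excluded are hypothetical helpers, and the sketch assumes each pattern is matched against a whole name rather than a substring.

```perl
use strict;
use warnings;

# Hypothetical helper: translate one shell glob into an anchored regexp
# in which '*' and '?' do not match '/'.
sub glob_to_regexp {
    my ($glob) = @_;
    my $re = join '', map {
          $_ eq '*' ? '[^/]*'
        : $_ eq '?' ? '[^/]'
        :             quotemeta($_)
    } split //, $glob;
    return qr/\A$re\z/;
}

# True if $name matches any of the given glob patterns.
sub is_excluded {
    my ($name, @globs) = @_;
    for my $glob (@globs) {
        return 1 if $name =~ glob_to_regexp($glob);
    }
    return 0;
}

for my $name ('.svn', 'main.c.orig', 'sub/main.c.orig', 'main.c') {
    printf "%-16s %s\n", $name,
        is_excluded($name, '.svn', '*.orig') ? 'excluded' : 'kept';
}
```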

unpack

$u->unpack($archive, [$destdir])

Determines the contents of an archive and recursively extracts its files. An archive may be the pathname of a file or directory. The extracted contents will be stored in "destdir/$subdir/$dest_name", where dest_name is the filename component of archive without any leading pathname components, possibly with a suffix stripped or added. (Subdir defaults to ''.) If archive is a directory, then dest_name will also be a directory. If archive is a file, the type of dest_name depends on the type of packing: if the archive expands to multiple files, dest_name will be a directory; otherwise it will be a file. If a file of the same name already exists in the destination subdir, an additional subdir component is created to avoid any conflicts.

For each extracted file, a record is written to the logfile. When unpacking is finished, the logfile contains one valid JSON structure. Unpack achieves this by writing suitable prolog and epilog lines to the logfile. The logfile can also be parsed line by line. Each file record is one line, starts with a ' ' whitespace, and ends in a ',' comma. Everything else is prolog or epilog.

The actual unpacking is dispatched to mime-type specific handlers, selected using mime. A mime-handler can either be built-in code, or an external program (or shell-script) found in a directory registered with mime_handler_dir. The standard place for external handlers is /usr/share/File-Unpack/helper; it can be changed by the environment variable FILE_UNPACK_HELPER_DIR or the new parameter helper_dir.

A mime-handler is called with 6 parameters: source_path, destfile, destination_path, mimetype, description, and config_dir. Note that destination_path is a freshly created empty working directory, even if the unpacker is expected to unpack only a single file. The unpacker is called after chdir into destination_path, so you usually do not need to evaluate the third parameter.

The directory config_dir contains unpack configuration in .sh, .js and possibly other formats. A mime-handler may use this information, but need not. All data passed into new is reflected there, as well as the active exclude-list. Using the config information can help a mime-handler to skip unwanted work or otherwise optimize unpacking.

unpack monitors the available filesystem space in destdir. If there is less space than configured with minfree, a warning can be printed and unpacking is optionally paused. It also monitors the mime-handler's progress reading the archive at source_path and reports percentages to STDERR (if verbose is 1 or more).

After the mime-handler is finished, unpack examines the files it created. If it created no files in destdir, an error is reported, and the source_path may be passed to other unpackers, or finally be added to the log as is.

If the mime-handler wants to express that source_path is already unpacked as far as possible and should be added to the log without any error messages, it creates a symbolic link destdir pointing to source_path.

The system considers replacing the directory with a file, if all of the following conditions are met:

  • There is exactly one file in the directory.

  • The file name is identical with the directory name, except for one changed or removed suffix-word. (*.tar.gz -> *.tar; or *.tgz -> *.tar)

  • The file must not already exist in the parent directory.
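The three conditions above can be expressed as a small check. This is only a sketch: differs_by_one_suffix and may_replace_dir_with_file are hypothetical names, and the reading of "suffix-word" as the last dot-separated word of the name is an assumption.

```perl
use strict;
use warnings;
use File::Basename qw(basename dirname);
use File::Spec;

# True if $file equals $dir with its last dot-separated word removed
# (foo.tar.gz -> foo.tar) or changed (foo.tgz -> foo.tar).
sub differs_by_one_suffix {
    my ($dir, $file) = @_;
    my @d = split /\./, $dir;
    my @f = split /\./, $file;
    return 0 if @d < 2;
    my $stem = join '.', @d[0 .. $#d - 1];
    return 1 if $file eq $stem;                        # suffix removed
    return 1 if @f == @d && $f[-1] ne $d[-1]
             && join('.', @f[0 .. $#f - 1]) eq $stem;  # suffix changed
    return 0;
}

# Check all three conditions for the directory at path $dir.
sub may_replace_dir_with_file {
    my ($dir) = @_;
    opendir(my $dh, $dir) or return 0;
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return 0 unless @entries == 1;                     # exactly one entry
    return 0 unless -f File::Spec->catfile($dir, $entries[0]);
    return 0 unless differs_by_one_suffix(basename($dir), $entries[0]);
    return 0 if -e File::Spec->catfile(dirname($dir), $entries[0]);
    return 1;
}

for my $pair (['foo.tar.gz', 'foo.tar'], ['foo.tgz', 'foo.tar'], ['foo.tar.gz', 'bar.tar']) {
    printf "%-12s %-8s %s\n", @$pair,
        differs_by_one_suffix(@$pair) ? 'suffix rule holds' : 'suffix rule fails';
}
```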

unpack prepares 20 empty subdirectory levels and chdirs the unpacker in there. This number can be adjusted using new(dot_dot_safeguard => 20). A directory 20 levels up from the current working dir has mode 0 while the mime-handler runs. unpack can optionally chmod(0) the parent of the subdirectory after it chdirs the unpacker inside. Use new(jail_chmod0 => 1) for this, default is off. If enabled, a mime-handler trying to place files outside of the specified destination_path may receive 'permission denied' conditions.

These are special hacks to keep badly constructed tar-balls, cpio-, or zip-archives at bay.

Please note that this can help against archives containing relative paths (like starting with '../../../foo'), but will be ineffective with absolute paths (starting with '/foo'). It is the responsibility of mime-handlers to not create absolute paths; unpack should not be run as the root user, to minimize the risk of compromising the root filesystem.
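The idea behind the nested working directory can be sketched as follows. This is an illustration only: make_safeguard_dirs and the directory name '_' are placeholders invented here, and the mode 0 step is left out.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Create $levels nested, empty directories below $base and return the
# deepest one. The directory name '_' is a placeholder of this sketch.
sub make_safeguard_dirs {
    my ($base, $levels) = @_;
    my $deep = File::Spec->catdir($base, ('_') x $levels);
    make_path($deep);
    return $deep;
}

my $base = tempdir(CLEANUP => 1);
my $work = make_safeguard_dirs($base, 20);

# A relative path that climbs fewer than 20 levels still ends below $base.
my $escape = File::Spec->catdir($work, ('..') x 5);
print "working directory: $work\n";
print "five levels up is still a directory below the base\n" if -d $escape;
```

An archive member named '../../../foo' would thus land in one of the throwaway levels rather than next to the destination directory.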

A missing mime-handler is skipped, and subsequent handlers may take effect. A mime-handler is expected to return an exit status of 0 upon success. If it runs into a problem, it should print lines starting with the affected filenames to stderr. Such errors are recorded in the log with the unpacked archive, and as far as files were created, also with these files.
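An external mime-handler following these conventions could look roughly like the sketch below. It is a hypothetical gzip helper written for illustration, not one shipped with the module; it assumes that destfile names the output file and relies on the current directory already being destination_path.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical handler body. The six parameters arrive in the order
# documented above; only the first two are needed here.
sub run_handler {
    my ($src, $destfile, $destdir, $mimetype, $descr, $configdir) = @_;

    # The current directory is already the destination directory,
    # so a relative output name is sufficient.
    unless (gunzip($src => $destfile)) {
        # Error lines start with the affected filename.
        print STDERR "$src: gunzip failed: $GunzipError\n";
        return 1;
    }
    return 0;    # exit status 0 signals success
}

# Act as a handler only when called with the six expected arguments.
exit run_handler(@ARGV) if @ARGV == 6;
```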

Symbolic links are ignored while unpacking.

run

$u->run([argv0, ...], @redir, ... { init => sub ..., in, out, err, watch, every, prog, ... })

A general purpose fork-exec wrapper, based on IPC::Run. STDIN is closed, unless you specify an in => as described in IPC::Run. STDERR and STDOUT are both printed to STDOUT, prefixed with 'E: ' and 'O: ' respectively, unless you specify out =>, err =>, or out_err => ... for both.

Using redirection operators in @redir takes precedence over the above in/out/err redirections. See also IPC::Run. If you use the options in/out/err, you should restrict your redirection operators to the forms '<', '0<', '1>', '2>', or '>&' due to limitations in the precedence logic. Piping via '|' is properly recognized, but background execution '&' may confuse the precedence logic.

This run method is completely independent of the rest of File::Unpack. It works both as a static function and as a method call. It is used internally by unpack, but is exported to be of use elsewhere.

Init is run after construction of redirects. Calling chdir() in init thus has no effect on redirects with relative paths.

Return value in scalar context is the first nonzero result code, if any. In list context all return values are returned.

fmt_run_shellcmd

File::Unpack::fmt_run_shellcmd( $m->{argvv} )

Static function to pretty print the return value $m of method find_mime_handler(). It formats a command array used with run() as a properly escaped shell command string.

mime_handler_dir mime_handler

$u->mime_handler_dir($dir, ...)

$u->mime_handler($mime_name, $suffix_regexp, \@argv, @redir, ...)

Registers one or more directories where external mime-helper programs are found. The words helper and handler are used as synonyms here; helper usually refers to an external program, whereas handler usually refers to a builtin shell command. Multiple directories can be registered. They are searched in reverse order, i.e. the last one added takes precedence. Any external mime-handler takes precedence over built-in code.

The suffix_regexp is used to derive the destination name from the source name. It is not used for selecting helpers.

Helpers are mapped to mime-types by their mime_name. The name can be constructed from the mimetype by replacing the '/' with a '=' character, and by using the word 'ANY' as a wildcard component. The '=' character is interpreted as an implicit '=ANY+' if needed.

 Examples:

  Mimetype                   handler names tried from top to bottom
  -----------------------------------------------------------------
  image/png                  image=png
                              image=ANY
                               image
                                ANY=png
                                 ANY=ANY
                                  ANY

  application/vnd.oasis+zip  application=vnd.oasis+zip
                              application=ANY+zip
                               application=ANYzip
                                application=zip
                                 application=ANY
                                      ...
  

A trailing '=ANY' is implicit, as shown by these examples. The rules for precedence are these:

  • Search in the latest directory is exhausted first, then the previously added directory is considered in turn, until all directories have been traversed, or until a matching helper is found.

  • A matching name with wildcards has lower precedence than a matching name without.

  • A wildcard before the '=' sign lowers precedence more than one after it.
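For a simple mimetype without a '+' component, the lookup order from the first example table can be generated as in the sketch below. handler_name_candidates is a hypothetical function; it only reproduces the image/png row and makes no attempt at the '+' forms, whose full order is elided in the table.

```perl
use strict;
use warnings;

# Hypothetical helper: list the handler names tried for a simple
# "major/minor" mimetype, most specific first, as in the image/png row.
sub handler_name_candidates {
    my ($mimetype) = @_;
    my ($major, $minor) = split m{/}, $mimetype, 2;
    return (
        "$major=$minor",    # exact name, no wildcard
        "$major=ANY",       # wildcard after the '='
        $major,             # trailing '=ANY' is implicit
        "ANY=$minor",       # wildcard before the '=' ranks lower
        "ANY=ANY",
        "ANY",
    );
}

print "$_\n" for handler_name_candidates('image/png');
```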

The mapping takes place when mime_handler_dir is called. Adding helper scripts to a directory afterwards has no effect. mime_handler does not do any implicit expansions. Call it multiple times with the same handler command and different names if needed. The default argument list is "%(src)s %(destfile)s %(destdir)s %(mime)s %(descr)s %(configdir)s" -- this is applied if neither args nor redirections are given. See also unpack for more semantics and how a handler should behave.
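The "%(name)s" placeholders in the default argument list could be filled in along the following lines. This is a sketch of the substitution idea only; expand_argv is a hypothetical helper and says nothing about how the module itself expands or quotes these arguments.

```perl
use strict;
use warnings;

# Hypothetical helper: split a template into words and replace each
# %(name)s placeholder with the matching value. Returning a list keeps
# values containing spaces intact as single arguments.
sub expand_argv {
    my ($template, %value) = @_;
    return map {
        my $word = $_;
        $word =~ s{%\((\w+)\)s}{ $value{$1} // die "unknown placeholder '$1'\n" }ge;
        $word;
    } split ' ', $template;
}

my @argv = expand_argv(
    '%(src)s %(destfile)s %(destdir)s %(mime)s %(descr)s %(configdir)s',
    src       => '/tmp/in/my archive.tgz',
    destfile  => 'my archive.tar',
    destdir   => '/tmp/out/work',
    mime      => 'application/x-tar+gzip',
    descr     => 'gzip compressed data',
    configdir => '/tmp/out/config',
);
print scalar(@argv), " arguments\n";
print "  [$_]\n" for @argv;
```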

Both methods return an ARRAY-ref of HASHes describing all known (old and newly added) mime handlers.

list

Returns an ARRAY of preformatted patterns and mime-handlers.

Example:

printf @$_ for $u->list(); 

find_mime_handler

$u->find_mime_handler($mimetype)

Returns a mime-handler suitable for unpacking the given $mimetype. If called in list context, a second return value indicates which mime handlers would be suitable, but could not be found in the system.

minfree

$u->minfree(factor => 10, bytes => '100M', percent => '3%', warning => sub { .. })

THESE TESTS ARE TO BE IMPLEMENTED.

Guard the filesystem (destdir) against becoming full during unpack. Before unpacking each source archive, the free space is measured and compared against three conditions:

  • The archive size multiplied with the given factor must fit into the filesystem.

  • The given number of bytes (in optional K, M, G, or T units) must be free.

  • The filesystem must have at least the given free percentage. The '%' character is optional.

The warning method is called if any of the above conditions fail. Its signature is: &warning->($pathname, $full_percentage, $free_bytes, $free_inodes); It is expected to print an appropriate warning message, and delay a few seconds. It should return 0 to cause a retry. It should return nonzero to continue unpacking. The default warning method prints a message to STDERR, waits 30 seconds, and returns 0.

The filesystem may still become full and unpacking may fail, if e.g. factor was chosen lower than the average compression ratio of the archives.
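Since these tests are documented as not yet implemented, the following is only a sketch of how the three conditions might be evaluated. parse_bytes and space_conditions_met are hypothetical names, the caller is assumed to supply the filesystem figures, and the K, M, G, T units are assumed to be powers of 1024.

```perl
use strict;
use warnings;

# Assumption of this sketch: units are binary multiples.
my %UNIT = (K => 1024, M => 1024**2, G => 1024**3, T => 1024**4);

# Convert a size such as '100M' or '4096' into a number of bytes.
sub parse_bytes {
    my ($spec) = @_;
    my ($number, $unit) = $spec =~ /\A(\d+)([KMGT])?\z/i
        or die "cannot parse size '$spec'\n";
    return $number * ($unit ? $UNIT{ uc $unit } : 1);
}

# Return 1 if all three conditions hold, 0 otherwise.
sub space_conditions_met {
    my (%arg) = @_;
    (my $percent = $arg{percent}) =~ s/%\z//;    # the '%' is optional
    my $free_percent = 100 * $arg{free_bytes} / $arg{total_bytes};
    return $arg{archive_size} * $arg{factor} <= $arg{free_bytes}
        && $arg{free_bytes} >= parse_bytes($arg{bytes})
        && $free_percent >= $percent ? 1 : 0;
}

my $ok = space_conditions_met(
    archive_size => parse_bytes('10M'),
    free_bytes   => parse_bytes('1G'),
    total_bytes  => parse_bytes('10G'),
    factor       => 10,
    bytes        => '100M',
    percent      => '3%',
);
print $ok ? "enough space\n" : "warning callback would be invoked\n";
```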

mime

$u->mime($filename)

$u->mime(file => $filename)

$u->mime(buf => "#!/bin ...", file => "what-was-read")

$u->mime(fd => \*STDIN, file => "what-was-opened")

Determines the mimetype (and optionally additional information) of a file. The file can be specified by filename, by a provided buffer, or by an open file descriptor. For the latter two cases, specifying a filename is optional, and used only for diagnostics.

mime uses libmagic by Christos Zoulas exposed via File::LibMagic and also uses the shared-mime-info database from freedesktop.org exposed via File::MimeInfo::Magic, if available. Either one is sufficient, but having both is better. LibMagic sometimes says 'text/x-pascal', although we have a .desktop file, or says 'text/plain', but has contradicting details in its description.

File::MimeInfo::Magic::magic is consulted where the libmagic output is dubious. E.g. when the description says something interesting like 'Debian binary package (format 2.0)' but the mimetype says 'application/octet-stream'. The combination of both libraries gives us excellent reliability in the critical field of mime-type recognition.

This implementation also features multi-level mime-type recognition for efficient unpacking. When e.g. unpacking a large bzipped tar archive, this saves us from creating a huge temporary tar-file which unpack would extract in a second step. The multi-level recognition returns 'application/x-tar+bzip2' in this case, and allows for a mime-handler to e.g. pipe the bzip2 contents into tar (which is exactly what 'tar jxvf' does, making a very simple and efficient mime-handler).

mime returns a 3 or 4 element arrayref with mimetype, charset, description, diff; where diff is only present when the libmagic and shared-mime-info methods disagree.

In case of 'text/plain', an additional rule based on the file name suffix is used to allow recognition of well-known plain text pack formats. We return 'text/x-suffix-XX+plain', where XX is one of the recognized suffixes (in all lower case and without the dot). E.g. a plain mmencoded file has no header and looks like 'text/plain' to all the known magic libraries. We recognize the suffixes .mm, .b64, and .base64 for this (case-insensitive). A similar rule exists for 'application/octet-stream'. It may trigger e.g. for lzma compressed files which fail to provide a magic number.
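The suffix rule for 'text/plain' can be pictured with the sketch below. refine_by_suffix is a hypothetical function written for illustration; it covers only the three suffixes named above and leaves every other mimetype untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: refine 'text/plain' by file name suffix.
# Other mimetypes, and unknown suffixes, are returned unchanged.
sub refine_by_suffix {
    my ($mimetype, $filename) = @_;
    return $mimetype unless $mimetype eq 'text/plain';
    if ($filename =~ /\.(mm|b64|base64)\z/i) {
        # Suffix is reported in lower case and without the dot.
        return 'text/x-suffix-' . lc($1) . '+plain';
    }
    return $mimetype;
}

for my $case (['text/plain', 'Mail.B64'], ['text/plain', 'notes.txt'], ['image/png', 'logo.b64']) {
    printf "%-12s %-10s -> %s\n", @$case, refine_by_suffix(@$case);
}
```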

Examples:

[ 'text/x-perl', 'us-ascii', 'a /usr/bin/perl -w script text']

[ 'text/x-mpegurl', 'utf-8', 'M3U playlist text', 
  [ 'text/plain', 'application/x-mpegurl']]

[ 'application/x-tar+bzip2', 'binary', 
  "bzip2 compressed data, block size = 900k\nPOSIX tar archive (GNU)", ...]

AUTHOR

Juergen Weigert, <jnw at cpan.org>

BUGS

The implementation of mime is an ugly hack. We suffer from the existence of multiple file magic databases, and multiple conflicting implementations. With perl we have at least 5 modules for this; here we use two.

The builtin list of mime-handlers is incomplete. Please submit your handler code.

Please report any bugs or feature requests to bug-file-unpack at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-Unpack. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

RELATED MODULES

While designing File::Unpack, a range of other perl modules were examined. Many modules provide valuable service to File::Unpack and became dependencies or are recommended. Others exposed drawbacks during closer examination and may find some of their wheels re-invented here.

Used Modules

File::LibMagic

This is the preferred mimetype engine. It disregards the suffix, recognizes more types than any of the alternatives, and uses exactly the same engine as /usr/bin/file in openSUSE systems. It also returns charset and description information. We cross-reference the description with the mimetype to detect weaknesses, and consult File::MimeInfo::Magic and some logic of our own, e.g. for detecting LZMA compression, which fails to provide any recognizable magic. Required if you use mime; otherwise not a hard requirement.

File::MimeInfo::Magic

Uses both magic information and file suffixes to determine the mimetype. Its magic() function is used in a few cases where File::LibMagic fails. E.g. as of June 2010, libmagic does not recognize 'image/x-targa'. File::MimeInfo::Magic may be slower, but it features the shared-mime-info database from freedesktop.org. Recommended if you use mime.

String::ShellQuote

Used to call external mime-handlers. Required.

BSD::Resource

Used to reliably restrict the maximum file size. Recommended.

File::Path

mkpath(). Required.

Cwd

fast_abs_path(). Required.

JSON

Used for formatting the logfile. Required.

Modules Not Used

Archive::Extract

Archive::Extract tries first to determine what type of archive you are passing it, by inspecting its suffix. 'Maybe this module should use something like "File::Type" to determine the type, rather than blindly trust the suffix'. [quoted from perldoc]

Set $Archive::Extract::PREFER_BIN to 1, which will prefer the use of command line programs and won't consume so much memory. Default: use "Archive::Tar".

Archive::Zip

If you are just going to be extracting zips (and/or other archives) you are recommended to look at using Archive::Extract. [quoted from perldoc] It is pure perl, so it's a lot slower than your '/usr/bin/zip'.

Archive::Tar

It is pure perl, so it's a lot slower than your "/bin/tar". It is heavy on memory, all will be read into memory. [quoted from perldoc]

File::MMagic, File::MMagic::XS, File::Type

Compared to File::LibMagic and File::MimeInfo::Magic, these three are inferior. They often say 'text/plain' or 'application/octet-stream' where the latter two report useful mimetypes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc File::Unpack

You can also look for information at:

SOURCE REPOSITORY

https://developer.berlios.de/projects/perl-file-unpck

svn co https://svn.berlios.de/svnroot/repos/perl-file-unpck/trunk/File-Unpack

ACKNOWLEDGEMENTS

Mime-type recognition relies heavily on libmagic by Christos Zoulas. I had long hesitated implementing File::Unpack, but set to work when I discovered that File::LibMagic brings your library to perl. Thanks Christos. And thanks for tcsh too.

LICENSE AND COPYRIGHT

Copyright 2010,2011 Juergen Weigert.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.