NAME

File::Unpack - An aggressive archive file unpacker, based on mime-types

VERSION

Version 0.18

SYNOPSIS

use File::Unpack;

my $u = File::Unpack->new();

my $m = $u->mime('/etc/init.d/rc');
print "$m->[0]; charset=$m->[1]\n";
# text/x-shellscript; charset=us-ascii

...

Examines the contents of an archive file or directory by extensive mime-type analysis. The contents are unpacked recursively to the given destination directory; a listing of the unpacked files is reported through the built-in logging facility during unpacking. The mime-type handlers are customizable, as are the exclude patterns.

SUBROUTINES/METHODS

Beware: This module is unfinished. Documentation might be ahead of implementation. The test suite defines what is actually available.

new(destdir => '.', logfile => \*STDOUT, maxfilesize => '100M', verbose => 1)

Creates an unpacker instance. Destdir must be writable; all output files and directories are placed inside destdir. Subdirectories will be created in an attempt to reflect the structure of the input. Destdir defaults to the current directory; relative paths are resolved immediately, so that a later chdir() has no effect on destdir.

The parameter logfile can be a reference to a scalar, a filename, or a filedescriptor. The logfile starts with a JSON formatted prolog, where all lines start with printable characters. For each file unpacked, a one-line record is appended, starting with a single whitespace ' ' and terminated by "\n". Each record is formatted as a JSON " key: value\n" pair, where key is the filename, and value is a hash including mime, size, and other information. The logfile is terminated by an epilog, where each line starts with a printable character. By default, the logfile is sent to STDOUT.

The parameter maxfilesize is a safeguard against compressed sparse files. Such files could easily fill up any available disk space when unpacked. Files hitting this limit will be silently truncated. Check the logfile records or epilog to see if this has happened. BSD::Resource is used to manipulate RLIMIT_FSIZE.
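
Size limits such as maxfilesize => '100M' combine a number with a unit letter. The following is a minimal sketch of how such a string could be turned into a byte count. parse_size is a hypothetical helper, not part of File::Unpack, and the use of binary multiples (1K = 1024 bytes) is an assumption; the module may interpret the units differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of File::Unpack): convert '100M' to bytes.
# Assumption: units are binary multiples, i.e. 1K = 1024 bytes.
sub parse_size {
    my ($text) = @_;
    my %shift = (K => 10, M => 20, G => 30, T => 40);
    my ($num, $unit) = $text =~ /^\s*(\d+(?:\.\d+)?)\s*([KMGT]?)\s*$/i
        or die "cannot parse size '$text'\n";
    my $factor = $unit eq '' ? 1 : 2 ** $shift{uc $unit};
    return int($num * $factor);
}

print parse_size('100M'), "\n";
```

A string without a unit letter is taken as a plain byte count; anything else that does not match the pattern raises an error instead of being silently treated as zero.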

exclude(add => ['.svn', '*.orig' ], del => '.svn', force => 1)

Defines the exclude-list for unpacking. This list is advisory for the mime-handlers. The exclude-list items are shell glob patterns, where '*' or '?' never match '/'.
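
The rule that '*' and '?' never match '/' can be illustrated by translating a glob pattern into a regexp. This is only a sketch under stated assumptions: glob_to_regex is a hypothetical helper, it handles just '*' and '?' (no character classes), and it is not the code the module uses internally.

```perl
use strict;
use warnings;

# Hypothetical helper: translate one shell glob pattern into an anchored
# regexp in which '*' and '?' never match '/'.
sub glob_to_regex {
    my ($glob) = @_;
    my $re = '';
    for my $ch (split //, $glob) {
        if    ($ch eq '*') { $re .= '[^/]*' }       # any run of non-slash characters
        elsif ($ch eq '?') { $re .= '[^/]' }        # exactly one non-slash character
        else               { $re .= quotemeta $ch } # everything else is literal
    }
    return qr/^$re\z/;
}

my $orig = glob_to_regex('*.orig');
for my $name ('Makefile.orig', 'src/Makefile.orig', '.svn') {
    printf "%-18s %s\n", $name, $name =~ $orig ? 'excluded' : 'kept';
}
```

In this sketch a pattern is matched against the whole string it is given, so '*.orig' matches 'Makefile.orig' but not 'src/Makefile.orig'.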

You can use force to have any of these removed after unpacking. Use (vcs => 1) to exclude a long list of known version control system directories; use (vcs => 0) to remove them. The default is exclude(empty => 1), which is the same as (empty_file => 1, empty_dir => 1) -- having the obvious meaning.

(re => 1) returns the active exclude-list as a regexp pattern. Otherwise exclude always returns the list as an array ref.

unpack($archive, [$destdir])

Determines the contents of an archive and recursively extracts its individual files. An archive may be the pathname of a file or directory. The extracted contents will be stored in "destdir/$subdir/$dest_name", where dest_name is the filename component of $archive without any leading pathname components, possibly with a suffix stripped or added. (Subdir defaults to ''.) If $archive is a directory, then dest_name will also be a directory. If $archive is a file, the type of dest_name depends on the type of packing: If the archive expands to multiple files, dest_name will be a directory, otherwise it will be a file. If a file of the same name already exists in the destination subdir, an additional subdir component is created to avoid any conflicts. For each extracted file, a record is written to the logfile. When unpacking is finished, the logfile contains one valid JSON structure. Unpack achieves this by writing suitable prolog and epilog lines to the logfile.

The actual unpacking is dispatched to mime-type specific mime-handlers, selected using mime. A mime-handler can either be built-in code, or an external program (or shell-script) found in a directory registered with use_mime_handler_dir.

A mime-handler is called with 6 parameters: source_path, destdir, destfile, mimetype, description, and config_dir. Note that destdir (the destination_path) is a freshly created empty working directory, even if the unpacker is expected to unpack only a single file.

The config_dir contains unpack configuration in .sh, .js and possibly other formats. A mime-handler should use this information, but need not. All data passed into new is reflected there, as well as the active exclude-list. A mime-handler can use the config information to skip unwanted work or otherwise optimize unpacking.

unpack monitors the available filesystem space in destdir. If there is less space than configured with minfree, a warning can be printed and unpacking is optionally paused. It also monitors the mime-handler's progress reading the archive at source_path and reports percentages to STDERR (if verbose is 1 or more).

After the mime-handler is finished, the system considers replacing the directory with a file, under the following conditions:

  • There is exactly one file in the directory.

  • The file name is identical to the directory name, except for one changed or removed suffix-word. (*.tar.gz -> *.tar; or *.tgz -> *.tar)

  • The file must not already exist in the parent directory.
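
The three conditions above can be sketched as a pure function over names. Both helpers below are hypothetical and not part of File::Unpack; the sketch assumes that suffix-words are separated by '.', that the caller has already listed the directory entries, and that the single entry is a plain file.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $file equals $dir apart from one changed or
# removed suffix-word (suffix-words are assumed to be separated by '.').
sub differs_by_one_suffix {
    my ($dir, $file) = @_;
    my @d = split /\./, $dir;
    my @f = split /\./, $file;
    return 0 if @d < 2;                        # directory name has no suffix
    my $stem = join '.', @d[0 .. $#d - 1];     # directory name minus last suffix
    return 1 if $file eq $stem;                # suffix removed: a.tar.gz -> a.tar
    return 0 unless @f == @d;                  # otherwise same number of words
    return (join('.', @f[0 .. $#f - 1]) eq $stem && $f[-1] ne $d[-1]) ? 1 : 0;
}

# Hypothetical helper combining the three conditions. $entries lists the
# names inside the directory, $siblings the names in the parent directory.
sub may_replace_dir_with_file {
    my ($dir, $entries, $siblings) = @_;
    return 0 unless @$entries == 1;                        # exactly one file
    my $file = $entries->[0];
    return 0 unless differs_by_one_suffix($dir, $file);    # name rule
    return 0 if grep { $_ eq $file } @$siblings;           # no clash in parent
    return 1;
}

print may_replace_dir_with_file('pkg.tar.gz', ['pkg.tar'], ['pkg.tar.gz'])
    ? "replace\n" : "keep\n";
```

Keeping the check free of filesystem access makes it easy to test; a real implementation would gather $entries and $siblings with opendir/readdir first.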

A mime-handler trying to place files outside of the specified destination_path may receive 'permission denied' conditions.

# In this case, the handler
# can ask to be rerun with write permission to a certain number of parent
# directories by printing one or multiple strings of the form "../../file"
# and exiting with a nonzero status. Unpack will respond by creating the
# needed number of additional subdirectories, each named '_' (two in this
# example: "./_/_" ), and will call the handler again with this extended
# destination_path.

unpack prepares 20 empty subdirectory levels and chdirs the unpacker in there. This number can be adjusted using new(dot_dot_safeguard => 20). A directory 20 levels up from the current working dir has mode 0 while the mime-handler runs. This is a special hack to cope with badly constructed tar balls. This helps against relative paths, but not against absolute paths. It is the responsibility of mime-handlers to not create absolute paths.

A missing mime-handler is skipped. A mime-handler is expected to return an exit status of 0 upon success. If it runs into a problem, it should print lines starting with the affected filenames to stderr. Such errors are recorded in the log with the unpacked archive, and as far as files were created, also with these files.

run([argv0, ...], @redir, ... { init => sub ..., in, out, err, watch, every, prog, ... })

A general purpose fork-exec wrapper, based on IPC::Run. STDIN is closed, unless you specify an in => as described in IPC::Run. STDERR and STDOUT are both printed to STDOUT, prefixed with 'E: ' and 'O: ' respectively, unless you specify out =>, err =>, or out_err => ... for both.

Using redirection operators in @redir takes precedence over the above in/out/err redirections. See also IPC::Run. If you use the options in/out/err, you should restrict your redirection operators to the forms '<', '0<', '1>', '2>', or '>&' due to limitations in the precedence logic. Piping via '|' is properly recognized, but background execution '&' may confuse the precedence logic.

This run method is completely independent of the rest of File::Unpack. It works both as a static function and as a method call. It is used internally by unpack, but is exported to be of use elsewhere.

Init is run after construction of redirects. Calling chdir() in init thus has no effect on redirects with relative paths.

Return value in scalar context is the first nonzero result code, if any. In list context all return values are returned.

File::Unpack::fmt_run_shellcmd($m->{argvv})

Static function to pretty print the return value $m of method find_mime_handler(). It formats a command array used with run() as a properly escaped shell command string.

use_mime_handler_dir($dir, ...)

use_mime_handler($mime_name, $suffix_regexp, \@argv, @redir, ...)

Registers one or more directories where external mime-handler programs are found. Multiple directories can be registered; they are searched in reverse order, i.e. last added takes precedence. Any external mime-handler takes precedence over built-in code. An array ref to the new list of directories is returned.

Helpers are mapped to mime-types by their name. The name can be constructed from the mimetype by replacing the '/' with a '=' character, and by using the word 'ANY' as a wildcard component. The '=' character is interpreted as an implicit '=ANY+' if needed.

 Examples:

  Mimetype                   handler names tried in sequence
  ----------------------------------------------------------
  image/png                  image=png
                             image=ANY
                             image
                             ANY=ANY
                             ANY

  application/vnd.oasis+zip  application=vnd.oasis+zip
                             application=ANY+zip
                             application=ANYzip
                             application=zip
                             application=ANY
                             ...

A trailing '=ANY' is implicit, as shown by these examples. The rules for determining precedence are these:

  • Search in one directory is exhausted before the next is considered.

  • A matching name with wildcards has lower precedence than a matching name without.

  • A wildcard before the '=' sign lowers precedence more than one after it.
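
For a simple mimetype without a '+' component, the lookup order from the image/png example can be sketched as follows. handler_name_candidates is a hypothetical helper, not part of File::Unpack; it deliberately does not attempt the additional '+' expansions shown for application/vnd.oasis+zip, since that sequence is only partially listed.

```perl
use strict;
use warnings;

# Hypothetical helper: candidate handler names for a mimetype without a '+'
# component, in the order of the image/png example.
sub handler_name_candidates {
    my ($mimetype) = @_;
    my ($major, $minor) = split m{/}, $mimetype, 2;
    die "expected 'major/minor', got '$mimetype'\n" unless defined $minor;
    return (
        "$major=$minor",   # exact match, no wildcard
        "$major=ANY",      # wildcard after the '='
        $major,            # short form; the trailing '=ANY' is implicit
        'ANY=ANY',         # wildcard before the '='
        'ANY',             # short form of the catch-all
    );
}

print join("\n", handler_name_candidates('image/png')), "\n";
```

A caller would try each name in turn within one handler directory before moving on to the next directory, in line with the first precedence rule.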

The mapping takes place when use_mime_handler_dir is called; later additions are not recognized. use_mime_handler does not do any implicit expansions. Call it multiple times with the same command and different names if needed. The default argument list is "%(src)s %(destdir)s %(destfile)s %(mime)s %(descr)s %(configdir)s" -- this is applied if neither args nor redirections are given.

find_mime_handler($mimetype)

Returns a mime-handler suitable for unpacking the given $mimetype. If called in list context, a second return value indicates which mime-handlers would be suitable, but could not be found in the system.

minfree(factor => 10, bytes => '100M', percent => '3%', warning => sub { .. })

Guard the filesystem (destdir) against becoming full during unpack. Before unpacking each source archive, the free space is measured and compared against three conditions:

  • The archive size multiplied with the given factor must fit into the filesystem.

  • The given number of bytes in optional K, M, G, or T units must be free.

  • The filesystem must have at least the given free percentage. The '%' character is optional.
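
The three conditions can be written down as one predicate. This is a sketch only: enough_space is a hypothetical helper, all sizes are assumed to be already converted to bytes, "fit into the filesystem" is read as "fit into the free space", and measuring the free space itself is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: true if all three free-space conditions hold.
# All sizes are in bytes; $opt{percent} is a plain number such as 3.
sub enough_space {
    my ($archive_size, $free_bytes, $total_bytes, %opt) = @_;
    return 0 if $archive_size * $opt{factor} > $free_bytes;        # factor rule
    return 0 if $free_bytes < $opt{bytes};                         # absolute minimum
    return 0 if 100 * $free_bytes < $opt{percent} * $total_bytes;  # percentage rule
    return 1;
}

# 50 MiB archive, 2 GiB free on a 20 GiB filesystem.
my %limits = (factor => 10, bytes => 100 * 2**20, percent => 3);
print enough_space(50 * 2**20, 2 * 2**30, 20 * 2**30, %limits) ? "ok\n" : "low\n";
```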

The warning method is called with the following parameters: &warning->($pathname, $full_percentage, $free_bytes, $free_inodes); It is expected to print an appropriate warning message, and delay a few seconds. It should return 0 to cause a retry. It should return nonzero to continue unpacking. The default warning method prints a message to STDERR, waits 30 seconds, and returns 0.
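
A custom warning callback following this calling convention might look like the sketch below. make_warning_callback is a hypothetical helper; it assumes the four parameters arrive in the order documented above and gives up retrying after a fixed number of calls.

```perl
use strict;
use warnings;

# Sketch of a custom warning callback. It returns 0 (retry) for the first
# $max_retries calls, then 1 (continue unpacking regardless).
sub make_warning_callback {
    my ($max_retries, $delay_seconds) = @_;
    my $calls = 0;
    return sub {
        my ($pathname, $full_percentage, $free_bytes, $free_inodes) = @_;
        warn sprintf "%s: %s%% full, %d bytes and %d inodes free\n",
            $pathname, $full_percentage, $free_bytes, $free_inodes;
        return 1 if ++$calls > $max_retries;   # stop retrying, carry on
        sleep $delay_seconds;                  # give the admin time to react
        return 0;                              # ask for a retry
    };
}

my $warning = make_warning_callback(3, 5);
# Intended use, per the signature above: minfree(..., warning => $warning)
```

Unlike the default described above, which always returns 0, this variant cannot wait forever on a filesystem that never frees up.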

The filesystem may still become full and unpacking may fail, e.g. if factor was chosen lower than the compression ratio of the unpacked archives.

mime($filename)

mime(file => $filename)

mime(buf => "#!/bin ...", file => "what-was-read")

mime(fd => \*STDIN, file => "what-was-opened")

Determines the mimetype (and optionally additional information) of a file. The file can be specified by filename, by a provided buffer or an opened filedescriptor. For the latter two cases, specifying the filename is optional, and used for diagnostics.

mime uses Christos Zoulas' excellent libmagic exposed via File::LibMagic and the shared-mime-info database from freedesktop.org exposed via File::MimeInfo::Magic, if available. Either one is sufficient, but having both is better. LibMagic sometimes says 'text/x-pascal', although we have a .desktop file, or says 'text/plain', but has contradicting details in its description.

File::MimeInfo::Magic::magic is consulted where the libmagic output is dubious.

This implementation also features multi-level mime-type recognition for efficient unpacking. If we'd recognize a large bzipped tar ball only as bzip, we'd unpack a huge temporary tar-file, consuming the same amount of disk space as its content, which unpack would extract in a second step. The multi-level recognition returns 'application/x-tar+bzip2' in this case, and allows for a mime-handler to e.g. pipe the bzip2 contents into tar (which is exactly what 'tar jxvf' does, making a very simple and efficient mime-handler).

mime returns a 3 or 4 element arrayref with mimetype, charset, description, diff; where diff is only present when both methods disagree.
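
Because the fourth element is optional, callers should not assume it exists. The sketch below shows one way to consume such a result; describe_mime is a hypothetical helper, and the sample data is a literal copied from the Examples of this section rather than the output of a real call.

```perl
use strict;
use warnings;

# Hypothetical helper: render a result of mime() as one line. The fourth
# element (diff) is only present when the two detection methods disagree.
sub describe_mime {
    my ($m) = @_;
    my ($type, $charset, $descr, $diff) = @$m;
    my $line = "$type; charset=$charset";
    $line .= ' (candidates: ' . join(', ', @$diff) . ')'
        if ref $diff eq 'ARRAY';
    return $line;
}

# Literal sample taken from the Examples in this section.
print describe_mime(
    [ 'text/x-mpegurl', 'utf-8', 'M3U playlist text',
      [ 'text/plain', 'application/x-mpegurl' ] ]
), "\n";
```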

Examples:

[ 'text/x-perl', 'us-ascii', 'a /usr/bin/perl -w script text']

[ 'text/x-mpegurl', 'utf-8', 'M3U playlist text', 
  [ 'text/plain', 'application/x-mpegurl']]

[ 'application/x-tar+bzip2', 'binary', 
  "bzip2 compressed data, block size = 900k\nPOSIX tar archive (GNU)", ...]

AUTHOR

Juergen Weigert, <jw at suse.de>

BUGS

The implementation of mime is an ugly hack. We suffer from the existence of multiple file magic databases, and multiple conflicting implementations. With perl we have at least 5 modules for this; here we use two.

The builtin list of mime-handlers is incomplete. Please submit your handler code.

Please report any bugs or feature requests to bug-file-unpack at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-Unpack. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

RELATED MODULES

While designing File::Unpack, a range of other perl modules were examined. Many modules provide valuable service to File::Unpack and became dependencies or are recommended. Others exposed drawbacks during closer examination and may find some of their wheels re-invented here.

Used Modules

File::LibMagic

This is the preferred mimetype engine. It disregards the suffix, recognizes more types than any of the alternatives, and uses exactly the same engine as /usr/bin/file in your openSUSE system. It also returns charset and description information. We cross-reference the description with the mimetype to detect weaknesses, and consult File::MimeInfo::Magic and some logic of our own, e.g. for detecting LZMA compression, which fails to provide any recognizable magic. Required if you use mime; not a hard requirement.

File::MimeInfo::Magic

Uses both magic information and file suffixes to determine the mimetype. Its magic() function is used in the few cases where File::LibMagic fails. E.g. as of June 2010, libmagic does not recognize 'image/x-targa'. File::MimeInfo::Magic may be slower, but it features the shared-mime-info database from freedesktop.org. Recommended if you use mime.

String::ShellQuote

Used to call external mime-handlers. Required.

BSD::Resource

Used to reliably restrict the maximum file size. Recommended.

File::Path

mkpath(). Required.

Cwd

fast_abs_path(). Required.

JSON

Used for formatting the logfile. Required.

Modules Not Used

Archive::Extract

Archive::Extract tries first to determine what type of archive you are passing it, by inspecting its suffix. It does not do this by using Mime magic. Maybe this module should use something like "File::Type" to determine the type, rather than blindly trust the suffix. [quoted from perldoc]

Set $Archive::Extract::PREFER_BIN to 1, which will prefer the use of command line programs and won't consume so much memory. Default: use "Archive::Tar".

Archive::Zip

If you are just going to be extracting zips (and/or other archives) you are recommended to look at using Archive::Extract. [quoted from perldoc] It is pure perl, so it's a lot slower than your '/usr/bin/zip'.

Archive::Tar

It is pure perl, so it's a lot slower then your "/bin/tar". It is heavy on memory, all will be read into memory. [quoted from perldoc]

File::MMagic, File::MMagic::XS, File::Type

Compared to File::LibMagic and File::MimeInfo::Magic, these three are inferior. They often say 'text/plain' or 'application/octet-stream' where the latter two report useful mimetypes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc File::Unpack

You can also look for information at:

SOURCE REPOSITORY

https://developer.berlios.de/projects/perl-file-unpck

svn co https://svn.berlios.de/svnroot/repos/perl-file-unpck/trunk/File-Unpack

ACKNOWLEDGEMENTS

Mime-type recognition relies heavily on libmagic by Christos Zoulas. I had long hesitated implementing File::Unpack, but set to work when I discovered that File::LibMagic brings your library to perl. Thanks Christos. And thanks for tcsh too.

LICENSE AND COPYRIGHT

Copyright 2010 Juergen Weigert.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.