NAME
Net::ChooseFName - Perl extension for choosing a name of a local mirror of a net (e.g., FTP or HTTP) resource.
SYNOPSIS
use Net::ChooseFName;
$namer = Net::ChooseFName->new(max_length => 64); # Copies to CD ok
$name = $namer->find_name_by_response($LWP_response);
$name = $namer->find_name_by_response($LWP_response, $as_if_content_type);
$name = $namer->find_name_by_url($url, $suggested_name,
$content_type, $content_encoding);
$name = $namer->find_name_by_url($url, $suggested_name, $content_type);
$name = $namer->find_name_by_url($url, $suggested_name);
$name = $namer->find_name_by_url($url);
$namer_returns_undef = Net::ChooseFName->failer(); # Funny constructor
DESCRIPTION
This module helps to choose a local file name for a remote resource (e.g., one downloaded from the Internet). It turns out that this is a tricky business; keep in mind that most servers are misconfigured, most URLs are malformed, and most filesystems are limited w.r.t. possible filenames. As a result, most downloaders fail to work in some situations, since they choose names which are not supported on particular filesystems, or not useful for file:///-related work.
Because of the many possible twists and ramifications, this module is designed to be as configurable as possible. One way to configure it is a rich system of options which influence different steps of the process. To cover cases when options are not flexible enough, the process is broken into many steps; each step is easily overridable by subclassing Net::ChooseFName.
The defaults are chosen to be as safe as possible without getting in the way too much. For example, since % is a special character on DOSish shells, to simplify working from the command line on such systems, we avoid this character in generated file names. Similarly, since MacOS has problems with filenames with 8-bit characters, we avoid them too; since many Unix programs have problems with spaces in file names, we massage them into underscores; the length of the longest file path component is restricted to 255 chars.
Note that in many situations it is advisable to make these restrictions yet stronger. For example, for copying to CD one should restrict names yet more (max_length => 64); for copying to MSDOS file systems enable the option '8+3' => 1.
[In the description of methods the $self argument is omitted.]
Principal methods
- new(OPT1 => $val1, ...)
-
Constructor method. Creates an object with the given options. Default values for the unspecified options are (the comments list the methods in which each option is used):
  protect =>                      # protect_characters()
        # $1 should contain the match
        qr/([?*|\"<>\\:?#\x00-\x1F\x7F-\xFF\[\]])/,
  protect_pref => '@',            # protect_characters(), protect_directory()
  root => '.',                    # find_directory()
  dir_mode => 0775,               # directory_found()
  mkpath => 1,                    # directory_found()
  max_suff_len => 4,              # split_suffix()   'jpeg'
  keepsuff_same_mediatype => 1,   # choose_suffix()
  type_suff =>                    # choose_suffix()
        {'text/ftp-dir-listing' => '.dirl'},
  keep_suff => { 'text/plain' => 1, 'application/octet-stream' => 1 },
  short_suffices =>               # eight_plus_three()
        {jpeg => 'jpg', html => 'htm',
         'tar.bz2' => 'tbz', 'tar.gz' => 'tgz'},
  suggest_disposition => 1,       # find_name_by_response()
  suggested_only_basename => 1,   # find_name_by_response(), raw_name()
  fix_url_backslashes => 1,       # protect_characters()
  max_length => 255,              # fix_dups(), fix_component()
  cache_name => 1,                # name_found()
  queryless_types =>              # url_takes_query()
        { map(($_ => 1),          # http://filext.com/detaillist.php?extdetail=DJV 2005/01
              qw(image/djvu image/x-djvu image/dejavu image/x-dejavu
                 image/djvw image/x.djvu image/vnd.djvu ))},
  queryless_ext => { 'djvu' => 1, 'djv' => 1 },   # url_takes_query()
The option type_suff is special in that the user-specified value is added to this hash rather than replacing it. Similarly, the value of the option html_suff is used to populate the value for text/html of this hash.
Other options have undef as the default value. Their effects are documented in the documentation of the methods they affect. With the exception of known_names, these options are booleans.
  html_suff             # new()
  known_names           # known_names() name_found(); hash ref or undef
  only_known            # known_names()
  hierarchical          # raw_name(), find_directory()
  use_query             # raw_name()
  8+3                   # fix_basename(), fix_component()
  keep_space            # fix_component()
  keep_dots             # fix_component()
  tolower               # fix_component()
  dir_query             # find_directory()
  site_dir              # find_directory()
  ignore_existing_files # fix_dups
  keep_nosuff, type_suff_no_enc, type_suff_fallback,
  type_suff_fallback_no_enc # choose_suffix()
Summary of the options most useful in applications (with defaults if applicable):
  html_suff                     # Suffix for HTML (dot will be prepended)
  root => '.',                  # Where to put files?
  mkpath => 1,                  # Create directories with chosen names?
  max_length => 255,            # Maximal length of a path component
  ignore_existing_files         # Should the filename be "new"?
  cache_name => 1,              # Return the same filename on the same URL,
                                # even if file jumped to existence?
  hierarchical                  # Only the last component of URL path matters?
  suggested_only_basename => 1, # Should suggested name be relative to the path?
  use_query                     # Do not ignore the query part of URL?
                                # Value is used as (literal) prefix of query
  dir_query                     # Make the non-query part of URL a directory?
  site_dir                      # Put the hostname part of URL into directory?
  keepsuff_same_mediatype       # Preserve the file extensions matching type?
  8+3                           # Is the filesystem DOSish?
  keep_space                    # Map spaces in URL to spaces in filenames?
  tolower                       # Translate filenames to lowercase?
  type_suff, type_suff_no_enc, type_suff_fallback, type_suff_fallback_no_enc,
  keep_suff, keep_nosuff        # Hashes indexed by lowercased types;
                                # Allow tuning choosing the suffix
- find_name_by_url($url, $suggested_name, $type, $enc)
-
This method returns a suitable filename for the resource given its URL. Optional arguments are a suggested name (possibly, it will be modified according to options of the object), the content-type, and the content-encoding of the resource. If multiple content-encodings are required, specify them as an array reference.
A chain of helper methods ("Transformation chain") is called to apply certain transformations to the name. undef is returned if any of the helper methods (except known_names() and protect_query()) returns an undefined value; the caller is free to interpret this as "load to memory", if appropriate. These helper methods are listed in the following section.
- find_name_by_response($response [, $content_type])
-
This method returns a name given an LWP response object (and, optionally, an overriding Content-Type). If the option suggest_disposition is TRUE, it uses the header Content-Disposition from the response as the suggested name, then passes the fields from the response object to the method find_name_by_url().
Transformation chain
- url_2resource($url [, $type, $encoding])
-
This method returns $url modified by removing the parts related to access to parts of the resource. In particular, the fragment part is removed, as well as the query part if url_is_queryless() returns TRUE.
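The rule above can be illustrated with a stand-alone sketch. This is not the module's code: the helper name sketch_url_2resource is hypothetical, the URL is split with plain regular expressions, and the two lookup tables are only a subset of the queryless_types and queryless_ext defaults.

```perl
use strict;
use warnings;

# Subset of the documented defaults, copied here for illustration only.
my %queryless_types = map { $_ => 1 } qw(image/djvu image/x-djvu image/vnd.djvu);
my %queryless_ext   = (djvu => 1, djv => 1);

# Hypothetical helper: drop the fragment, and drop the query too when the
# type or the extension says that the query only selects a part of the resource.
sub sketch_url_2resource {
    my ($url, $type) = @_;
    $url =~ s/#.*\z//s;                               # the fragment always goes
    my ($path) = $url =~ /\A([^?]*)/;                 # everything before the query
    my ($ext)  = $path =~ /\.([^.\/]+)\z/;            # extension of the last component
    my $queryless = (defined $type && $queryless_types{lc $type})
                 || (defined $ext  && $queryless_ext{lc $ext});
    $url =~ s/\?.*\z//s if $queryless;
    return $url;
}

print sketch_url_2resource('http://example.com/book.djvu?page=3#top'), "\n";
print sketch_url_2resource('http://example.com/cgi?id=1#x'), "\n";
```

The first call drops both the query and the fragment; the second keeps the query, since neither a type nor a known extension marks the URL as queryless.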
- known_names($url, $suggested, $type, $enc)
-
The method find_name_by_url() will return the return value of this method (unless undef) immediately. Unless overridden, this method returns the value of the hash option known_names indexed by the $url. (By default this hash is empty.)
If the option only_known is true, it is a fatal error if $url is not a key of this hash.
- raw_name($url, $suggested, $type, $enc)
-
Returns the 0th approximation to the filename of the resource; the return value has two parts: the principal part, and the query string (undef if it should not be used).
If $suggested is undefined, returns the path part of the $url, and the query part (if present and if the option use_query is TRUE). Otherwise either returns $suggested, or (if the options suggested_only_basename and hierarchical are both true) returns the path part of the $url with the last component changed to $suggested; the query part is ignored in this case. In the latter case, if the option suggested_basename is TRUE, only the last path component of $suggested is used.
- protect_characters($f, $query, $url, $suggested, $type, $enc)
-
Returns the filename $f with necessary character-by-character translations performed. Unless overridden, it translates backslashes to slashes if the option fix_url_backslashes is TRUE, replaces characters matched by the regular expression in the option protect by their hexadecimal representation (with the leader being the value of the option protect_pref), and replaces percent signs by the value of the option protect_pref.
- protect_query($f, $query, $url, $suggested, $type, $enc)
-
Returns $query with necessary character-by-character translations performed. Unless overridden, it translates slashes, backslashes, and characters matched by the regular expression in the option protect by their hexadecimal representation (with the leader being the value of the option protect_pref), and replaces percent signs by the value of the option protect_pref.
- find_directory($f, $query, $url, $suggested, $type, $enc)
-
Returns a triple of the appropriate directory name, the relative filename, and a string to append to the filename, based on processed-so-far filename $f and the $query string.
Unless overridden, does the following: unless the option hierarchical is TRUE, all but the last path component of $f are ignored. If the option site_dir is TRUE, the host part of the URL (as well as the port part, if non-standard) is prepended to the filename. The leading backslash is always stripped, and the option root is used as the leading components of the directory name. If $query is defined, and the option dir_query is true, $f is used as the last component of the directory, and $query as the file name (with the value of the option use_query prepended).
(Dirname is assumed to be /-terminated.)
- protect_directory($dirname, $f, $append, $url, $suggested, $type, $enc)
-
Returns the provisional directory part of the filename. Unless overridden, replaces empty components by the string empty preceded by the value of the protect_pref option; then applies the method fix_component() to each component of the directory.
- directory_found($dirname, $f, $append, $url, $suggested, $type, $enc)
-
A callback to process the calculated directory name. Unless overridden, it creates the directory (with permissions per the option dir_mode) if the option mkpath is TRUE.
Actually, the directory name is the return value, so this is the last chance to change the directory name...
- split_suffix($f, $dirname, $append, $url, $suggested, $type, $enc)
-
Breaks the last component $f of the filename into a pair of basename and suffix, which are returned. $dirname consists of other components of the filename, $append is the string to append to the basename in the future.
The suffix may be empty, and is supposed to contain the leading dot (if applicable); it may contain more than one dot. Unless overridden, the suffix consists of all trailing non-empty started-by-dot groups with length no more than given by the option max_suff_len (not including the leading dot).
- choose_suffix($f, $suff, $dirname, $append, $url, $suggested, $type, $enc)
-
Returns a pair of basename and appropriate suffix for a file. $f is the basename of the file, $suff is its suffix, $dirname consists of other components of file names, $append is the string to append to the basename.
Different strategies applicable to this problem are:
keep the file extension;
replace by the "best" extension for this $type (and $enc);
replace by the user-specified type-specific extension.
Each of these has two variants: whether we want the encodings reflected in the suffix, or not. Unless overridden, choosing the strategy/variant consists of several rounds.
In the first round, choose the user-specified suffix if $type is defined and is (lowercased) in the option hashes type_suff or type_suff_no_enc (choosing the variant based on which hash matched). Keep the current suffix if $type is not defined, or if the option keepsuff_same_mediatype is TRUE and the current suffix of the file matches $type and $enc (per the database of known types and encodings).
The second round runs if none of these was applicable. Choose the user-specified suffix if $type is (lowercased) in the hashes type_suff_fallback or type_suff_fallback_no_enc (choosing the variant as above); keep the current suffix if the type (lowercased) is in the hashes keep_nosuff or keep_suff (depending on whether $suff is empty or not).
If none of these was applicable, the last round chooses the appropriate suffix by the database of known types and encodings; if not found, the existing suffix is preserved.
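The order of the rounds may be easier to follow in code. The following is a simplified stand-alone sketch, not the module's implementation: it ignores encodings and the _no_enc variants, returns only the suffix (not the pair of basename and suffix), and uses two tiny hypothetical tables (%SUFF_OF_TYPE and %TYPE_OF_SUFF) in place of the database of known types.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the "database of known types and encodings".
my %SUFF_OF_TYPE = ('text/html' => '.html', 'image/jpeg' => '.jpg', 'image/png' => '.png');
my %TYPE_OF_SUFF = ('.html' => 'text/html', '.htm' => 'text/html',
                    '.jpg'  => 'image/jpeg', '.jpeg' => 'image/jpeg',
                    '.png'  => 'image/png');

sub sketch_choose_suffix {
    my ($suff, $type, %opt) = @_;
    return $suff unless defined $type;            # round 1: no type, keep the suffix
    $type = lc $type;

    # Round 1: user-specified suffix, or a suffix which already matches the type.
    my $user = $opt{type_suff} || {};
    return $user->{$type} if defined $user->{$type};
    return $suff if $opt{keepsuff_same_mediatype}
                 && ($TYPE_OF_SUFF{lc $suff} || '') eq $type;

    # Round 2: fallback suffix, or "keep whatever is there" for this type.
    my $fallback = $opt{type_suff_fallback} || {};
    return $fallback->{$type} if defined $fallback->{$type};
    my $keep = $opt{ length($suff) ? 'keep_suff' : 'keep_nosuff' } || {};
    return $suff if $keep->{$type};

    # Round 3: consult the table; preserve the existing suffix if nothing is found.
    return defined $SUFF_OF_TYPE{$type} ? $SUFF_OF_TYPE{$type} : $suff;
}

my %defaults = (keepsuff_same_mediatype => 1,
                type_suff => {'text/ftp-dir-listing' => '.dirl'},
                keep_suff => {'text/plain' => 1, 'application/octet-stream' => 1});
print sketch_choose_suffix('.jpeg', 'image/jpeg', %defaults), "\n";   # kept in round 1
print sketch_choose_suffix('.php',  'text/html',  %defaults), "\n";   # replaced in round 3
```

With the options shown (copied from the documented defaults), a .jpeg suffix on an image/jpeg resource is kept, while a .php suffix on a text/html resource is replaced by the table entry for that type.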
- fix_basename($f, $dirname, $suff, $url, $suggested, $type, $enc)
-
Returns a pair of basename and suffix for a file. $f is the last component of the name of the file, $dirname consists of other components. Unless overridden, this method replaces an empty basename by "index" and applies the fix_component() method to the basename; finally, if the '8+3' option is set, it converts the filename and suffix to a name suitable for 8+3 filesystems.
- fix_dups($f, $dirname, $suff, $url, $suggested, $type, $enc)
-
Given a basename, extension, and the directory part of the filename, modifies the basename (if needed) to avoid duplicates; should return the complete file name (combining the dirname, basename, and suffix). Unless overriden, appends a number to the basename (shortening basename if needed) so that the result is unique.
This is a prime candidate for overriding (e.g., to ask user for confirmation of overwrite).
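As an illustration of the default strategy (append a number, shortening the basename if needed), here is a stand-alone sketch. It is written under assumptions: the helper name is hypothetical, names already taken are looked up in a hash passed by the caller instead of on the filesystem, the numbering starts at 1 with no separator, and the length limit is applied to basename plus suffix.

```perl
use strict;
use warnings;

# Hypothetical helper; $taken is a hash ref of complete names already in use.
# $dirname is assumed to be "/"-terminated.
sub sketch_fix_dups {
    my ($base, $dirname, $suff, $max_length, $taken) = @_;
    for (my $n = 0; ; $n++) {
        my $tail = ($n ? $n : '') . $suff;        # '', then 1, 2, ... before the suffix
        my $room = $max_length - length($tail);   # what is left for the basename
        $room = 1 if $room < 1;
        my $name = $dirname . substr($base, 0, $room) . $tail;
        return $name unless $taken->{$name};
    }
}

my %taken = ('mirror/index.html' => 1, 'mirror/index1.html' => 1);
print sketch_fix_dups('index', 'mirror/', '.html', 255, \%taken), "\n";
# prints "mirror/index2.html"
```

A subclass overriding fix_dups() could follow the same shape but, for example, ask the user before reusing an existing name.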
- name_found($url, $f, $dirname, $suff, $suggested, $type, $enc)
-
The callback method to register the found name. Unless overridden, behaves as follows: if the option cache_name is TRUE, stores the found name in the known_names hash. Otherwise just returns the found name.
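Two of the simpler steps of the chain, protect_characters() and split_suffix(), can be sketched stand-alone from their descriptions above. This is an illustration only, not the module's code: the helper names are hypothetical, the two-digit uppercase hexadecimal form is an assumption, and the sketch of split_suffix() assumes that the basename must stay non-empty.

```perl
use strict;
use warnings;

# Sketch of the documented default of protect_characters(): backslashes become
# slashes, protected characters become <prefix><hex code>, "%" becomes <prefix>.
sub sketch_protect_characters {
    my ($f, $pref) = @_;
    $f =~ tr{\\}{/};
    $f =~ s/([?*|"<>\\:#\x00-\x1F\x7F-\xFF\[\]])/sprintf '%s%02X', $pref, ord $1/ge;
    $f =~ s/%/$pref/g;
    return $f;
}

# Sketch of the documented default of split_suffix(): peel off trailing
# ".xxx" groups of at most $max characters (not counting the dot).
sub sketch_split_suffix {
    my ($f, $max) = @_;
    my $suff = '';
    while ($f =~ /\A(.+)(\.[^.]{1,$max})\z/s) {
        ($f, $suff) = ($1, $2 . $suff);
    }
    return ($f, $suff);
}

print sketch_protect_characters('a\\b:c%20d', '@'), "\n";      # prints "a/b@3Ac@20d"
print join(' | ', sketch_split_suffix('archive.tar.gz', 4)), "\n";  # prints "archive | .tar.gz"
```

Note how a percent-escaped character such as %20 and a protected character such as : end up in the same prefixed form.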
Helper methods
- fix_component($component, $isdir)
-
Returns a suitably modified value of a path component of a filename. The non-overridden method unescapes and massages embedded SPACE characters: it removes leading and trailing ones, and converts the rest to _ unless the option keep_space is TRUE; removes trailing dots unless the option keep_dots is TRUE; translates to lowercase if the option tolower is TRUE; truncates to max_length if this option is set; and applies the eight_plus_three() method if the option '8+3' is set.
- eight_plus_three($fname, $suffix)
-
Returns the value of filename modified for filesystems with 8+3 restriction on the filename (such as DOS). If $suffix is not given, calculates it from $fname; otherwise $suffix should include the leading dot, and $fname should have $suffix already removed. (Some parts of info may be moved between suffix and filename if judged appropriate.)
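A rough stand-alone sketch of such a conversion follows, using the documented default of the short_suffices option. It is an assumption-laden illustration: the helper name is hypothetical, the rules (replace extra dots, then truncate to 8 and 3 characters) are a guess at a reasonable policy, and no information is moved between the suffix and the filename.

```perl
use strict;
use warnings;

# Documented default of the short_suffices option.
my %short_suffices = (jpeg => 'jpg', html => 'htm', 'tar.bz2' => 'tbz', 'tar.gz' => 'tgz');

# Hypothetical helper: $suffix includes the leading dot, or is empty.
sub sketch_eight_plus_three {
    my ($fname, $suffix) = @_;
    (my $s = $suffix) =~ s/\A\.//;                        # work without the leading dot
    $s = $short_suffices{lc $s} if exists $short_suffices{lc $s};
    $s =~ s/\.//g;                                        # at most one dot in the result
    $s = substr($s, 0, 3);
    $fname =~ tr/./_/;
    $fname = substr($fname, 0, 8);
    return length($s) ? "$fname.$s" : $fname;
}

print sketch_eight_plus_three('archive', '.tar.gz'), "\n";    # prints "archive.tgz"
print sketch_eight_plus_three('photograph', '.jpeg'), "\n";   # prints "photogra.jpg"
```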
- url_takes_query($url [, $type, $encoding])
-
This method returns TRUE if the query part of the URL is selecting a part of the resource (i.e., if it behaves as a fragment part, and it is the client which should process this part). Such URLs are detected by $type (should be in the hash option queryless_types), or by the extension of the last path component (should be in the hash option queryless_ext).
Net::ChooseFName::Failer class
A class which behaves as Net::ChooseFName, but always returns undef. For convenience, the constructor is duplicated as a class method failer() in the class Net::ChooseFName.
EXPORT
None by default.
BUGS
Documentation keeps mentioning "unless overridden"... Of course it is a generic remark applicable to any method of any class; however, please remember that methods of this class are designed to be overridden.
There is no protection against a wanted directory name being already taken by a file.
There is no restriction on length of overall file name, only on length of a component name.
SEE ALSO
LWP=libwww-perl
AUTHOR
Ilya Zakharevich <ilyaz@cpan.org>
COPYRIGHT AND LICENSE
Copyright (C) 2005 by Ilya Zakharevich <ilyaz@cpan.org>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.2 or, at your option, any later version of Perl 5 you may have available.