NAME

Net::ChooseFName - Perl extension for choosing a name of a local mirror of a net (e.g., FTP or HTTP) resource.

SYNOPSIS

  use Net::ChooseFName;
  $namer = Net::ChooseFName->new(max_length => 64);	# Copies to CD ok

  $name = $namer->find_name_by_response($LWP_response);
  $name = $namer->find_name_by_response($LWP_response, $as_if_content_type);

  $name = $namer->find_name_by_url($url, $suggested_name,
				   $content_type, $content_encoding);
  $name = $namer->find_name_by_url($url, $suggested_name, $content_type);
  $name = $namer->find_name_by_url($url, $suggested_name);
  $name = $namer->find_name_by_url($url);

  $namer_returns_undef = Net::ChooseFName->failer();	# Funny constructor

DESCRIPTION

This module helps to pick a local file name for a remote resource (e.g., one downloaded from the Internet). This turns out to be a tricky business; keep in mind that most servers are misconfigured, most URLs are malformed, and most filesystems are limited w.r.t. possible filenames. As a result, most downloaders fail in some situations, since they choose names which are not supported on particular filesystems, or which are not useful for file:///-related work.

Because of the many possible twists and ramifications, this module is designed to be as configurable as possible. One way to configure it is a rich system of options which influence different steps of the process. To cover cases when options are not flexible enough, the process is broken into many steps; each step is easily overridable by subclassing Net::ChooseFName.

The defaults are chosen to be as safe as possible while not getting in the way too much. For example, since % is a special character in DOSish shells, we avoid this character in generated file names to simplify working from the command line on such systems. Similarly, since MacOS has problems with filenames with 8-bit characters, we avoid them too; since many Unix programs have problems with spaces in file names, we massage them into underscores; the length of the longest file path component is restricted to 255 chars.

Note that in many situations it is advisable to make these restrictions stronger still. For example, for copying to CD one should restrict names further (max_length => 64); for copying to MSDOS file systems, enable the option '8+3' => 1.

[In the description of methods the $self argument is omitted.]

Principal methods

new(OPT1 => $val1, ...)

Constructor method. Creates an object with the given options. Default values for the unspecified options are as follows (the comments list the methods in which each option is used):

  protect	=>		# protect_characters()
				# $1 should contain the match
		   qr/([?*|\"<>\\:?#\x00-\x1F\x7F-\xFF\[\]])/,
  protect_pref	=> '@',		# protect_characters(), protect_directory()
  root		=> '.',		# find_directory()
  dir_mode	=> 0775,	# directory_found()
  mkpath	=> 1,		# directory_found()
  max_suff_len	=> 4,		# split_suffix()	'jpeg'
  keepsuff_same_mediatype => 1,	# choose_suffix()
  type_suff	=>		# choose_suffix()
		   {'text/ftp-dir-listing' => '.dirl'},
  keep_suff	=> { 'text/plain' => 1,	# choose_suffix()
		     'application/octet-stream' => 1 },
  short_suffices =>		# eight_plus_three()
		   {jpeg => 'jpg', html => 'htm',
		    'tar.bz2' => 'tbz', 'tar.gz' => 'tgz'},
  suggest_disposition => 1,	# find_name_by_response()
  suggested_only_basename => 1,	# find_name_by_response(), raw_name()
  fix_url_backslashes => 1,	# protect_characters()
  max_length	=> 255,		# fix_dups(), fix_component()
  cache_name	=> 1,		# name_found()
  queryless_types =>		# url_takes_query()
	 { map(($_ => 1),	# http://filext.com/detaillist.php?extdetail=DJV 2005/01
	       qw(image/djvu image/x-djvu image/dejavu image/x-dejavu
		  image/djvw image/x.djvu image/vnd.djvu ))},
  queryless_ext	=> { 'djvu' => 1, 'djv' => 1 }, # url_takes_query()

The option type_suff is special: the user-specified value is added to this hash rather than replacing it. Similarly, the value of the option html_suff is used to populate the text/html entry of this hash.
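
The merging rule for type_suff can be pictured with a small standalone sketch. It does not call the module; merged_type_suff() is a hypothetical helper, and applying html_suff after the user-supplied hash is an assumption made only for this illustration.

```perl
use strict;
use warnings;

# Built-in default, as listed among the constructor defaults.
my %default_type_suff = ('text/ftp-dir-listing' => '.dirl');

# Hypothetical helper: combine the default hash with user-supplied options.
sub merged_type_suff {
    my (%opt) = @_;
    # User entries are added to the default hash instead of replacing it.
    my %merged = (%default_type_suff, %{ $opt{type_suff} || {} });
    # html_suff populates the text/html entry; a dot is prepended.
    # (Assumption: it is applied after the user-supplied type_suff.)
    $merged{'text/html'} = ".$opt{html_suff}" if defined $opt{html_suff};
    return \%merged;
}

my $suff = merged_type_suff(
    type_suff => { 'image/jpeg' => '.jpg' },
    html_suff => 'htm',
);
print "$_ => $suff->{$_}\n" for sort keys %$suff;
```

The default entry for text/ftp-dir-listing survives next to the two user-supplied entries.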

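Other options have undef as the default value. Their effects are documented in the descriptions of the methods they affect. With the exception of known_names, use_query (whose value is used as a literal prefix of the query), and the hash-valued suffix options (type_suff_no_enc, type_suff_fallback, type_suff_fallback_no_enc, keep_nosuff), these options are booleans.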

html_suff			# new()
known_names			# known_names(), name_found(); hash ref or undef
only_known			# known_names()
hierarchical			# raw_name(), find_directory()
use_query			# raw_name()
8+3				# fix_basename(), fix_component()
keep_space			# fix_component()
keep_dots			# fix_component()
tolower			# fix_component()
dir_query			# find_directory()
site_dir			# find_directory()
ignore_existing_files		# fix_dups()

keep_nosuff, type_suff_no_enc, type_suff_fallback,
type_suff_fallback_no_enc	# choose_suffix()

Summary of the options most useful in applications (with defaults if applicable):

  html_suff			# Suffix for HTML (dot will be prepended)
  root		=> '.',		# Where to put files?
  mkpath	=> 1,		# Create directories with chosen names?
  max_length	=> 255,		# Maximal length of a path component
  ignore_existing_files		# Should the filename be "new"?
  cache_name	=> 1,		# Return the same filename on the same URL,
				#   even if file jumped to existence?
  hierarchical			# Keep the directory structure of the URL path?
				#   (Otherwise only the last component matters.)
  suggested_only_basename => 1,	# Should suggested name be relative to the path?
  use_query			# Do not ignore the query part of URL?
				# Value is used as (literal) prefix of query
  dir_query			# Make the non-query part of URL a directory?
  site_dir			# Put the hostname part of URL into directory?
  keepsuff_same_mediatype	# Preserve the file extensions matching type?
  8+3				# Is the filesystem DOSish?
  keep_space			# Map spaces in URL to spaces in filenames?
  tolower			# Translate filenames to lowercase?

  type_suff, type_suff_no_enc, type_suff_fallback, type_suff_fallback_no_enc,
  keep_suff, keep_nosuff	# Hashes indexed by lowercased types;
				# Allow tuning of the suffix choice

find_name_by_url($url, $suggested_name, $type, $enc)

This method returns a suitable filename for the resource given its URL. Optional arguments are a suggested name (which may be modified according to the options of the object), the content-type, and the content-encoding of the resource. If multiple content-encodings are required, specify them as an array reference.

A chain of helper methods ("Transformation chain") is called to apply certain transformations to the name. undef is returned if any of the helper methods (except known_names() and protect_query()) return undefined values; the caller is free to interpret this as "load to memory", if appropriate. These helper methods are listed in the following section.
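
A minimal calling sketch, using only the constructor options and the method documented here. The URL and the root directory name are hypothetical, and mkpath is switched off so that the experiment does not create directories; the resulting name is printed rather than predicted, since it depends on the installed database of known types.

```perl
use strict;
use warnings;
use Net::ChooseFName;

# mkpath => 0 keeps this experiment from creating directories on disk.
my $namer = Net::ChooseFName->new(
    root         => 'mirror',    # hypothetical target directory
    mkpath       => 0,
    hierarchical => 1,
    site_dir     => 1,
);

# Hypothetical URL; no network access is performed.
my $url = 'http://www.example.com/docs/manual.ps';

# No suggested name (undef); one content-encoding, given as an array reference.
my $name = $namer->find_name_by_url($url, undef,
                                    'application/postscript', ['gzip']);

if (defined $name) {
    print "would save to: $name\n";
} else {
    # undef may be interpreted as "load to memory".
    print "no name chosen; keeping the content in memory\n";
}
```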

find_name_by_response($response [, $content_type])

This method returns a name given an LWP response object (and, optionally, an overriding Content-Type). If the option suggest_disposition is TRUE, it uses the Content-Disposition header of the response as the suggested name; it then passes the fields from the response object to the method find_name_by_url().

Transformation chain

url_2resource($url [, $type, $encoding])

This method returns $url modified by removing the parts related to access to parts of the resource. In particular, the fragment part is removed, as well as the query part if url_takes_query() returns TRUE.
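
The effect can be sketched without the module. url_2resource_sketch() is a hypothetical stand-in that works on plain strings; the decision whether the query only selects a part of the resource is passed in as a flag instead of being computed.

```perl
use strict;
use warnings;

# Hypothetical stand-in: strip the fragment, and the query too when the
# caller says that the query merely selects a part of the resource.
sub url_2resource_sketch {
    my ($url, $query_selects_part) = @_;
    $url =~ s/#.*//s;                            # fragment part
    $url =~ s/\?.*//s if $query_selects_part;    # query part
    return $url;
}

my $u = 'http://www.example.com/book.djvu?page=3#top';   # hypothetical URL
print url_2resource_sketch($u, 1), "\n";   # fragment and query removed
print url_2resource_sketch($u, 0), "\n";   # only the fragment removed
```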

known_names($url, $suggested, $type, $enc)

The method find_name_by_url() will immediately return the return value of this method (unless it is undef). Unless overridden, this method returns the value of the hash option known_names indexed by the $url. (By default this hash is empty.)

If the option only_known is true, it is a fatal error if $url is not a key of this hash.
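
A standalone sketch of this lookup, assuming the options are kept in a plain hash reference; known_names_sketch() and the error text are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the default lookup in the known_names hash.
sub known_names_sketch {
    my ($opt, $url) = @_;
    my $known = $opt->{known_names} || {};
    return $known->{$url} if exists $known->{$url};
    # With only_known set, an unknown URL is a fatal error.
    die "URL '$url' is not among the known names\n" if $opt->{only_known};
    return undef;    # let the rest of the chain choose a name
}

my %opt = (
    known_names => { 'http://www.example.com/' => './index.html' },
    only_known  => 1,
);
print known_names_sketch(\%opt, 'http://www.example.com/'), "\n";
my $ok = eval { known_names_sketch(\%opt, 'http://www.example.com/other'); 1 };
print "unknown URL rejected\n" unless $ok;
```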

raw_name($url, $suggested, $type, $enc)

Returns the 0th approximation to the filename of the resource; the return value has two parts: the principal part, and the query string (undef if it should not be used).

If $suggested is undefined, returns the path part of the $url, and the query part (if present and if option use_query is TRUE). Otherwise either returns $suggested, or (if options suggested_only_basename and hierarchical are both true) returns the path part of the $url with the last component changed to $suggested; the query part is ignored in this case. In the latter case, if option suggested_basename is TRUE, only the last path component of $suggested is used.

protect_characters($f, $query, $url, $suggested, $type, $enc)

Returns the filename $f with necessary character-by-character translations performed. Unless overridden, it translates backslashes to slashes if the option fix_url_backslashes is TRUE, replaces characters matched by the regular expression in the option protect by their hexadecimal representation (with the leader being the value of the option protect_pref), and replaces percent signs by the value of the option protect_pref.

protect_query($f, $query, $url, $suggested, $type, $enc)

Returns $query with necessary character-by-character translations performed. Unless overridden, it translates slashes, backslashes, and characters matched by the regular expression in the option protect to their hexadecimal representation (with the leader being the value of the option protect_pref), and replaces percent signs by the value of the option protect_pref.
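
The two translations can be approximated in a few lines. This is a sketch, not the module's code: the character class is modelled on the default protect option, the two-digit uppercase hexadecimal format is an assumption, and both function names are hypothetical.

```perl
use strict;
use warnings;

# Modelled on the defaults of the options protect and protect_pref.
my $protect = qr/([?*|"<>\\:#\x00-\x1F\x7F-\xFF\[\]])/;
my $pref    = '@';

# Hypothetical stand-in for the filename translation.
sub protect_characters_sketch {
    my ($f, $fix_url_backslashes) = @_;
    $f =~ tr{\\}{/} if $fix_url_backslashes;               # backslash -> slash
    $f =~ s/$protect/sprintf '%s%02X', $pref, ord $1/ge;   # e.g. ? -> @3F
    $f =~ s/%/$pref/g;                                     # %20 -> @20
    return $f;
}

# Hypothetical stand-in for the query translation: slashes and
# backslashes are protected as well.
sub protect_query_sketch {
    my ($q) = @_;
    $q =~ s{([/\\])}{sprintf '%s%02X', $pref, ord $1}ge;
    $q =~ s/$protect/sprintf '%s%02X', $pref, ord $1/ge;
    $q =~ s/%/$pref/g;
    return $q;
}

print protect_characters_sketch('a\\b/c?d%20e', 1), "\n";   # a/b/c@3Fd@20e
print protect_query_sketch('a/b=c%2Fd'), "\n";              # a@2Fb=c@2Fd
```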

find_directory($f, $query, $url, $suggested, $type, $enc)

Returns a triple of the appropriate directory name, the relative filename, and a string to append to the filename, based on processed-so-far filename $f and the $query string.

Unless overridden, does the following: unless the option hierarchical is TRUE, all but the last path component of $f are ignored. If the option site_dir is TRUE, the host part of the URL (as well as the port part, if non-standard) is prepended to the filename. The leading backslash is always stripped, and the option root is used as the leading components of the directory name. If $query is defined and the option dir_query is TRUE, $f is used as the last component of the directory, and $query as the file name (with the value of the option use_query prepended).

(Dirname is assumed to be /-terminated.)
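
A simplified standalone sketch of these rules. It leaves out the port handling, treats the leading separator as a slash, and assumes that a query which is not turned into a file name becomes the string to append (prefixed by the value of use_query); find_directory_sketch() and its argument layout are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in; returns (directory, relative filename, append).
sub find_directory_sketch {
    my ($opt, $f, $query, $host) = @_;
    $f =~ s{^.*/}{} unless $opt->{hierarchical};   # keep the last component only
    $f =~ s{^/+}{};                                # strip the leading separator
    $f = "$host/$f" if $opt->{site_dir};           # host becomes a directory
    my $q;
    $q = (defined $opt->{use_query} ? $opt->{use_query} : '') . $query
        if defined $query;
    # dir_query: $f names a directory, the query names the file.
    return ("$opt->{root}/$f/", $q, '') if defined $q and $opt->{dir_query};
    my ($dir, $file) = $f =~ m{^(.*/)?([^/]*)\z}s;
    $dir = '' unless defined $dir;
    return ("$opt->{root}/$dir", $file, defined $q ? $q : '');
}

my @flat = find_directory_sketch({ root => '.' }, '/docs/a.html');
my @tree = find_directory_sketch(
    { root => '.', hierarchical => 1, site_dir => 1 },
    '/docs/a.html', undef, 'www.example.com');
my @dirq = find_directory_sketch(
    { root => '.', hierarchical => 1, dir_query => 1, use_query => '@' },
    '/cgi/search', 'q=perl');
print join(' | ', @$_), "\n" for \@flat, \@tree, \@dirq;
```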

protect_directory($dirname, $f, $append, $url, $suggested, $type, $enc)

Returns the provisional directory part of the filename. Unless overridden, replaces empty components by the string empty preceded by the value of the protect_pref option; then applies the method fix_component() to each component of the directory.

directory_found($dirname, $f, $append, $url, $suggested, $type, $enc)

A callback to process the calculated directory name. Unless overridden, it creates the directory (with permissions per option dir_mode) if the option mkpath is TRUE.

Actually, the directory name is the return value, so this is the last chance to change the directory name...

split_suffix($f, $dirname, $append, $url, $suggested, $type, $enc)

Breaks the last component $f of the filename into a pair of basename and suffix, which are returned. $dirname consists of other components of the filename, $append is the string to append to the basename in the future.

Suffix may be empty, and is supposed to contain the leading dot (if applicable); it may contain more than one dot. Unless overridden, the suffix consists of all trailing non-empty dot-started groups, each of length no more than given by the option max_suff_len (not including the leading dot).
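
The default splitting rule can be expressed as a single regular expression. The sketch below is standalone; split_suffix_sketch() is a hypothetical name, and corner cases (such as names starting with a dot) may be treated differently by the module itself.

```perl
use strict;
use warnings;

# Hypothetical stand-in: split off all trailing ".xxx" groups whose length
# (not counting the dot) is between 1 and $max_suff_len.
sub split_suffix_sketch {
    my ($f, $max_suff_len) = @_;
    $max_suff_len = 4 unless defined $max_suff_len;   # documented default
    my ($base, $suff) = $f =~ /^(.*?)((?:\.[^.]{1,$max_suff_len})*)\z/s;
    return ($base, $suff);
}

for my $name (qw(archive.tar.gz photo.jpeg long.extension README)) {
    my ($base, $suff) = split_suffix_sketch($name);
    print "$name: basename '$base', suffix '$suff'\n";
}
```

With the default max_suff_len of 4, archive.tar.gz splits into archive and .tar.gz, while long.extension keeps an empty suffix because extension is longer than 4 characters.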

choose_suffix($f, $suff, $dirname, $append, $url, $suggested, $type, $enc)

Returns a pair of basename and appropriate suffix for a file. $f is the basename of the file, $suff is its suffix, $dirname consists of other components of file names, $append is the string to append to the basename.

Different strategies applicable to this problem are:

  • keep the file extension;

  • replace by the "best" extension for this $type (and $enc);

  • replace by the user-specified type-specific extension.

Each of these has two variants: whether we want the encodings reflected in the suffix, or not. Unless overridden, choosing the strategy/variant consists of several rounds.

In the first round, choose the user-specified suffix if $type is defined, and is (lowercased) in the option-hashes type_suff or type_suff_no_enc (choosing the variant based on which hash matched). Keep the current suffix if $type is not defined, or if the option keepsuff_same_mediatype is TRUE and the current suffix of the file matches $type and $enc (per the database of known types and encodings).

The second round runs if none of these was applicable. Choose the user-specified suffix if $type is (lowercased) in the hashes type_suff_fallback or type_suff_fallback_no_enc (choosing the variant as above); keep the current suffix if the type (lowercased) is in the hashes keep_nosuff or keep_suff (depending on whether $suff is empty or not).

If none of these was applicable, the last round chooses the appropriate suffix by the database of known types and encodings; if not found, the existing suffix is preserved.
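
The order of the rounds can be summarized as a decision function. This standalone sketch only names the chosen strategy; the lookup in the database of known types is replaced by a flag supplied by the caller, the order of checks within a round is an assumption, and choose_strategy_sketch() and its return labels are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in; returns a label for the chosen strategy.
# $matches says whether the current suffix already fits $type and $enc.
sub choose_strategy_sketch {
    my ($opt, $type, $suff, $matches) = @_;
    return 'keep' unless defined $type;
    my $t = lc $type;

    # Round 1: explicit user choice, or a suffix which already fits.
    return 'user'        if exists $opt->{type_suff}{$t};
    return 'user_no_enc' if exists $opt->{type_suff_no_enc}{$t};
    return 'keep'        if $opt->{keepsuff_same_mediatype} and $matches;

    # Round 2: user fallbacks, then the keep_suff/keep_nosuff hashes.
    return 'user'        if exists $opt->{type_suff_fallback}{$t};
    return 'user_no_enc' if exists $opt->{type_suff_fallback_no_enc}{$t};
    my $keep = length $suff ? 'keep_suff' : 'keep_nosuff';
    return 'keep'        if $opt->{$keep}{$t};

    # Round 3: consult the database (existing suffix kept if nothing found).
    return 'database';
}

my %opt = (
    keepsuff_same_mediatype => 1,
    type_suff => { 'text/ftp-dir-listing' => '.dirl' },
    keep_suff => { 'text/plain' => 1, 'application/octet-stream' => 1 },
);
print choose_strategy_sketch(\%opt, 'text/plain', '.log', 0), "\n";
print choose_strategy_sketch(\%opt, 'image/png',  '.php', 0), "\n";
```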

fix_basename($f, $dirname, $suff, $url, $suggested, $type, $enc)

Returns a pair of basename and suffix for a file. $f is the last component of the name of the file, $dirname consists of other components. Unless overridden, this method replaces an empty basename by "index" and applies the fix_component() method to the basename; finally, if the '8+3' option is set, it converts the filename and suffix to a name suitable for 8+3 filesystems.

fix_dups($f, $dirname, $suff, $url, $suggested, $type, $enc)

Given a basename, extension, and the directory part of the filename, modifies the basename (if needed) to avoid duplicates; should return the complete file name (combining the dirname, basename, and suffix). Unless overridden, appends a number to the basename (shortening the basename if needed) so that the result is unique.

This is a prime candidate for overriding (e.g., to ask user for confirmation of overwrite).
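
A sketch of such an override, here one that never renames and so lets the caller overwrite an existing file. The subclass name and the URL are hypothetical, and the sketch assumes that the inherited constructor blesses the object into the invoking class.

```perl
package My::OverwritingNamer;    # hypothetical subclass name
use strict;
use warnings;
use parent 'Net::ChooseFName';

# Skip the search for a unique name: combine the parts as they are.
sub fix_dups {
    my ($self, $f, $dirname, $suff) = @_;
    # $dirname is assumed to be /-terminated; $suff carries its own dot.
    return "$dirname$f$suff";
}

package main;

# mkpath => 0 keeps this experiment from creating directories on disk.
my $namer = My::OverwritingNamer->new(mkpath => 0);
my $name  = $namer->find_name_by_url('http://www.example.com/index.html');
print defined $name ? "$name\n" : "no name chosen\n";
```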

name_found($url, $f, $dirname, $suff, $suggested, $type, $enc)

The callback method to register the found name. Unless overridden, behaves as follows: if the option cache_name is TRUE, stores the found name in the known_names hash. Otherwise just returns the found name.

Helper methods

fix_component($component, $isdir)

Returns a suitably modified value of a path component of a filename. The non-overridden method massages unescaped embedded SPACE characters: it removes leading and trailing ones, and converts the rest to _ unless the option keep_space is TRUE. It also removes trailing dots unless the option keep_dots is TRUE, translates to lowercase if the option tolower is TRUE, truncates to max_length if this option is set, and applies the eight_plus_three() method if the option '8+3' is set.
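
The steps other than the 8+3 conversion can be sketched as follows. The sketch is standalone; fix_component_sketch() is a hypothetical name, and it assumes that leading and trailing spaces are removed regardless of keep_space.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the default per-component clean-up.
sub fix_component_sketch {
    my ($c, %opt) = @_;
    $c =~ s/^ +| +$//g;                                # leading/trailing spaces
    $c =~ tr/ /_/        unless $opt{keep_space};      # remaining spaces
    $c =~ s/\.+$//       unless $opt{keep_dots};       # trailing dots
    $c = lc $c           if $opt{tolower};
    $c = substr($c, 0, $opt{max_length})
        if $opt{max_length} and length($c) > $opt{max_length};
    return $c;
}

print fix_component_sketch('  My File. ', tolower => 1, max_length => 255), "\n";
print fix_component_sketch('Annual Report 2004', max_length => 10), "\n";
```

The first call prints my_file, the second Annual_Rep.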

eight_plus_three($fname, $suffix)

Returns the value of filename modified for filesystems with 8+3 restriction on the filename (such as DOS). If $suffix is not given, calculates it from $fname; otherwise $suffix should include the leading dot, and $fname should have $suffix already removed. (Some parts of info may be moved between suffix and filename if judged appropriate.)

url_takes_query($url [, $type, $encoding])

This method returns TRUE if the query part of the URL selects a part of the resource (i.e., if it behaves as a fragment part, and it is the client which should process this part). Such URLs are detected by $type (which should be in the hash option queryless_types), or by the extension of the last path component (which should be in the hash option queryless_ext).
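
The detection can be sketched as two hash lookups. The sketch is standalone and works on plain strings; the hashes hold a subset of the documented defaults, the function name is hypothetical, and the URLs are made up.

```perl
use strict;
use warnings;

# A subset of the defaults of queryless_types and queryless_ext.
my %queryless_types = map { $_ => 1 } qw(image/djvu image/x-djvu image/vnd.djvu);
my %queryless_ext   = (djvu => 1, djv => 1);

# Hypothetical stand-in: does the query only select a part of the resource?
sub url_takes_query_sketch {
    my ($url, $type) = @_;
    return 1 if defined $type and $queryless_types{lc $type};
    (my $path = $url) =~ s/[?#].*//s;           # drop query and fragment
    my ($ext) = $path =~ m{\.([^./]+)\z};       # extension of last component
    return (defined $ext and $queryless_ext{lc $ext}) ? 1 : 0;
}

print url_takes_query_sketch('http://www.example.com/book.djvu?page=3'), "\n";
print url_takes_query_sketch('http://www.example.com/a.html?x=1', 'text/html'), "\n";
```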

Net::ChooseFName::Failer class

A class which behaves as Net::ChooseFName, but always returns undef. For convenience, the constructor is duplicated as a class method failer() in the class Net::ChooseFName.
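
One possible use is a run in which nothing should be written to disk: the calling code stays the same, and only the constructor differs. In this sketch the $dry_run switch and the URL are hypothetical.

```perl
use strict;
use warnings;
use Net::ChooseFName;

my $dry_run = 1;    # hypothetical switch of the calling application

# Same interface in both cases; the failer never chooses a name.
my $namer = $dry_run ? Net::ChooseFName->failer()
                     : Net::ChooseFName->new(max_length => 64);

my $name = $namer->find_name_by_url('http://www.example.com/a.html');
print defined $name ? "save to $name\n" : "nothing to save\n";
```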

EXPORT

None by default.

BUGS

Documentation keeps mentioning "unless overridden"... Of course it is a generic remark applicable to any method of any class; however, please remember that methods of this class are designed to be overridden.

There is no protection against a wanted directory name being already taken by a file.

There is no restriction on length of overall file name, only on length of a component name.

SEE ALSO

LWP=libwww-perl

AUTHOR

Ilya Zakharevich <ilyaz@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2005 by Ilya Zakharevich <ilyaz@cpan.org>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.2 or, at your option, any later version of Perl 5 you may have available.