NAME
Glynx - a download manager.
DESCRIPTION
Glynx makes a local image of a selected part of the internet.
It can also generate download lists for use with other download managers, enabling a distributed download process.
It currently supports resume/retry, referer, user-agent, frames, and distributed download (see --slave, --stop, --restart).
It partially supports: redirect (using file-copy), java, javascript, multimedia, authentication (basic only), mirror, translating links to the local computer (--makerel), correcting file extensions, ftp, renaming too-long filenames and too-deep directories, cookies, proxy, forms.
A very basic CGI user interface is included.
Untested so far: "https:".
Tested on Linux and NT.
SYNOPSIS
- Do everything at once:
    glynx.pl [options] <URL>
- Save work to finish later:
    glynx.pl [options] --dump="dump-file" <URL>
- Finish a saved download:
    glynx.pl [options] "download-list-file"
- Network mode (client/slave):
  - Clients:
      glynx.pl [options] --dump="dump-file" <URL>
  - Slaves (will wait until there is something to do):
      glynx.pl [options] --slave
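As a concrete sketch of network mode (the URL and directory here are hypothetical; this assumes client and slave share the same base directory, where the download-list files are written):

    glynx.pl --base-dir=/shared/glynx --slave
    glynx.pl --base-dir=/shared/glynx --dump="dump-file" http://www.site.com/index.htm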
HINTS
How to create a default configuration:
Start the program with all desired command-line options, plus --cfg-save
or:
1 - start the program with --cfg-save
2 - edit the glynx.ini file
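For example, this saves the given options as the new defaults (the values are arbitrary, for illustration only):

    glynx.pl --sleep=1 --depth=3 --retry=5 --cfg-save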
--subst, --exclude and --loop use regular expressions.
http://www.site.com/old.htm --subst=s/old/new/
downloads: http://www.site.com/new.htm
- Note: the substitution string MUST be made of "valid URL" characters
--exclude=/\.gif/
will not download ".gif" files
- Note: multiple --exclude options are allowed:
--exclude=/gif/ --exclude=/jpeg/
will not download ".gif" or ".jpeg" files
It can also be written as:
--exclude=/\.gif|\.jp.?g/i
matching .gif, .GIF, .jpg, .jpeg, .JPG, .JPEG
--exclude=/www\.site\.com/
will not download links containing the site name
http://www.site.com/bin/index.htm --prefix=http://www.site.com/bin/
won't download anything outside the directory "/bin". The prefix must end with a slash "/".
http://www.site.com/index%%%.htm --loop=%%%:0..3
will download:
http://www.site.com/index0.htm
http://www.site.com/index1.htm
http://www.site.com/index2.htm
http://www.site.com/index3.htm
- Note: the substitution string MUST be made of "valid URL" characters
- For multiple exclusions in a single pattern, use "|".
- Don't read directory-index sort links:
?D=D ?D=A ?S=D ?S=A ?M=D ?M=A ?N=D ?N=A => \?[DSMN]=[AD]
To change the default "exclude" pattern, put it in the configuration file.
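A sketch of the full command for this (hypothetical URL; the single quotes keep the shell from interpreting the pattern):

    glynx.pl --exclude='/\?[DSMN]=[AD]/' http://www.site.com/files/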
Note: "File:" item in dump file is ignored
You can filter the processing of a dump file using --prefix, --exclude, --subst
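For instance, to reprocess a saved list while skipping ".zip" files (file names are hypothetical):

    glynx.pl --exclude=/\.zip/i "download-list-file"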
If, after the download finishes, you still have ".PART._BUSY_" files in the base directory, rename them to ".PART" (the program should do this by itself).
Don't do this: --depth=1 --out-depth=3, because "out-depth" is an upper limit; it is tested after depth is generated. The right way is: --depth=4 --out-depth=3.
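For example (hypothetical URL and prefix):

    glynx.pl --depth=4 --out-depth=3 --prefix=http://www.site.com/ http://www.site.com/index.htm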
This will do nothing:
--dump=x graphic.gif
because when dumping, binary files are written to the download-list file instead of being downloaded.
Errors using https:
[ ERROR 501 Protocol scheme 'https' is not supported => LATER ] or
[ ERROR 501 Can't locate object method "new" via package "LWP::Protocol::https" => LATER ]
This means you need to install at least "openssl" (http://www.openssl.org), Net::SSLeay, and IO::Socket::SSL.
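Once openssl itself is installed, one common way to get the two Perl modules is the CPAN shell (standard Perl tooling, not specific to Glynx):

    perl -MCPAN -e 'install Net::SSLeay'
    perl -MCPAN -e 'install IO::Socket::SSL'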
COMMAND-LINE OPTIONS
Check --help for default values.
Very basic:
--version Print version number and quit
--verbose More output
--quiet No output
--help Help page
--cfg-save Save configuration to file
--base-dir=DIR Place to load/save files
Download options are:
--sleep=SECS Sleep between gets, i.e. go slowly
--prefix=PREFIX Limit URLs to those which begin with PREFIX
Multiple "--prefix" are allowed.
--depth=N Maximum depth to traverse
--out-depth=N Maximum depth to traverse outside of PREFIX
--referer=URI Set initial referer header
--limit=N A limit on the number of documents to get
--retry=N Maximum number of retries
--timeout=SECS Timeout value - increases on retries
--agent=AGENT User agent name
--mirror Checks all existing files for updates
--nomirror Do not check for updates -- if file exists, it's ok
--mediaext Creates a file link, guessing the media type extension (.jpg, .gif)
(Perl actually makes a file copy)
--nomediaext Do not try to change media type extension
--makerel Make relative links, so that links in saved pages
work on the local computer.
--nomakerel Keep links as they are; do not try to change links.
--auth=USER:PASS Set authentication credentials
--cookies=FILE Set up a cookies file (default is no cookies)
--name-len-max Limit filename size
--dir-depth-max Limit directory depth
Multi-process control:
--slave Wait until a download-list file is created (be a slave)
--stop Stop slave
--restart Stop and restart slave
Not implemented yet but won't generate fatal errors (compatibility with lwp-rget):
--hier Download into hierarchy (not all files into cwd)
--iis Work around an IIS 2.0 bug by sending the "Accept: */*" MIME
header; translates backslashes (\) to forward slashes (/)
--keepext=type Keep file extension for MIME types (comma-separated list)
--nospace Translate spaces in URLs (not #fragments) to underscores (_)
--tolower Translate all URLs to lowercase (useful with IIS servers)
Other options (to be better explained):
--indexfile=FILE Index file in a directory
--part-suffix=.SUFFIX Extension to use for partial downloads
(example: ".Getright" ".PART")
--dump=FILE Make a download-list file, to be used later
--dump-max=N Number of links per download-list file
--invalid-char=C Character to use in substitutions for invalid characters
--exclude=/REGEXP/i Don't download matching URLs
Multiple --exclude options are allowed
--loop=REGEXP:INITIAL..FINAL Expand a URL through substitutions
(example: xx:a,b,c xx:'01'..'10')
--subst=s/REGEXP/VALUE/i Substitute some string in the URLs.
--404-retry Retry on error "404 Not Found".
--no404-retry Create an empty file on error "404 Not Found".
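Putting several of these together, a sketch with a hypothetical URL and arbitrary values (not a recommended set of defaults):

    glynx.pl --base-dir=./mirror --depth=3 --sleep=2 --retry=5 \
        --exclude=/\.zip/i --prefix=http://www.site.com/ \
        http://www.site.com/index.htm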
COPYRIGHT
Copyright (c) 2000 Flavio Glock <fglock@pucrs.br>. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. This program was based on examples in the Perl distribution.
If you use it/like it, send a postcard to the author.