NAME
Glynx - a download manager.
Download from http://www.ipct.pucrs.br/flavio/glynx/glynx-latest.pl
DESCRIPTION
Glynx makes a local image of a selected part of the internet.
It can be used to make download lists to be used with other download managers, making a distributed download process.
It currently supports resume, retry, referer, user-agent, java, frames, and distributed download (see --slave, --stop, --restart).
It partially supports redirect, javascript, multimedia, and authentication.
It does not support mirroring (checking file dates) or forms.
It has not been tested with "https" yet.
It should be better tested with "ftp".
Tested on Linux and NT.
SYNOPSIS
- Do-everything at once:
    $progname.pl [options] <URL>
- Save work to finish later:
    $progname.pl [options] --dump="dump-file" <URL>
- Finish saved download:
    $progname.pl [options] "download-list-file"
- Network mode (client/slave):
  - Clients:
      $progname.pl [options] --dump="dump-file" <URL>
  - Slaves (will wait until there is something to do):
      $progname.pl [options] --slave
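For example (the URL and the dump-file name are hypothetical), a client could queue a job two levels deep while a slave started beforehand, possibly on another machine, does the actual downloading:
    glynx.pl --depth=2 --dump="job" http://www.site.com/index.htm
    glynx.pl --slave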
HINTS
How to make a default configuration:
Start the program with all the command-line options you want, plus --cfg-save
or:
1 - start the program with --cfg-save
2 - edit the glynx.ini file
--subst, --exclude and --loop use regular expressions.
http://www.site.com/old.htm --subst=s/old/new/
downloads: http://www.site.com/new.htm
- Note: the substitution string MUST be made of "valid URL" characters
--exclude=/\.gif/
will not download ".gif" files
- Note: Multiple --exclude are allowed:
--exclude=/gif/ --exclude=/jpeg/
will not download ".gif" or ".jpeg" files
It can also be written as:
--exclude=/\.gif|\.jp.?g/i
matching .gif, .GIF, .jpg, .jpeg, .JPG, .JPEG
--exclude=/www\.site\.com/
will not download links containing the site name
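A rough sketch of what these patterns amount to (plain Perl, not the program's actual code; the pattern and URL values are just the examples above):
    # apply a --subst and an --exclude pattern to one URL (illustrative only)
    my @exclude = ( qr/\.gif|\.jp.?g/i );
    my $subst   = 's/old/new/';
    my $url     = "http://www.site.com/old.htm";
    eval "\$url =~ $subst";                    # the result must still be a valid URL
    if ( grep { $url =~ $_ } @exclude ) {
        print "skip     $url\n";
    } else {
        print "download $url\n";
    }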
http://www.site.com/bin/index.htm --prefix=http://www.site.com/bin/
won't download outside of the "/bin" directory. A prefix must end with a slash "/".
http://www.site.com/index%%%.htm --loop=%%%:0..3
will download:
http://www.site.com/index0.htm
http://www.site.com/index1.htm
http://www.site.com/index2.htm
http://www.site.com/index3.htm
- Note: the substitution string MUST be made of "valid URL" characters
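The expansion is essentially Perl's own range operator applied to the marked part of the URL; a minimal sketch (names and values are illustrative, not the program's internals):
    # sketch of --loop=%%%:0..3 applied to http://www.site.com/index%%%.htm
    my $template = "http://www.site.com/index%%%.htm";
    my ($marker, $range) = ("%%%", "0..3");
    for my $value (eval $range) {              # 0..3 expands to 0, 1, 2, 3
        (my $url = $template) =~ s/\Q$marker\E/$value/;
        print "$url\n";
    }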
- For multiple exclusions in a single pattern, use "|".
- Don't read directory-index sort links:
    ?D=D ?D=A ?S=D ?S=A ?M=D ?M=A ?N=D ?N=A => \?[DSMN]=[AD]
To change the default "exclude" pattern, put it in the configuration file.
Note: "File:" item in dump file is ignored
You can filter the processing of a dump file using --prefix, --exclude, --subst
If you still have ".PART._BUSY_" files in the base directory after the download finishes, rename them to ".PART" (the program should normally do this by itself).
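If you have to do the rename by hand, a one-liner along these lines works (a sketch, assuming the default ".PART" part-suffix and Unix shell quoting; run it inside the base directory):
    perl -e 'for (glob "*.PART._BUSY_") { ($new = $_) =~ s/\._BUSY_$//; rename $_, $new }'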
Don't do this: --depth=1 --out-depth=3 because "out-depth" is an upper limit; it is tested after depth is generated. The right way is: --depth=4 --out-depth=3
This will do nothing:
    --dump=x graphic.gif
because when dumping, binary files are written to the download-list file instead of being downloaded.
Errors using https:
[ ERROR 501 Protocol scheme 'https' is not supported => LATER ] or
[ ERROR 501 Can't locate object method "new" via package "LWP::Protocol::https" => LATER ]
This means you need to install at least "openssl" (http://www.openssl.org), Net::SSLeay and IO::Socket::SSL
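A quick way to see whether these pieces are in place (a small sketch; it only checks that the support modules load, without touching the network):
    # check that https support is available
    if (eval { require Net::SSLeay; require IO::Socket::SSL; require LWP::Protocol::https; 1 }) {
        print "https support looks OK\n";
    } else {
        print "https NOT available: $@";
    }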
COMMAND-LINE OPTIONS
Very basic:
--version Print version number ($VERSION) and quit
--verbose More output
--quiet No output
--help Help page
--cfg-save Save configuration to file "$CFG_FILE"
--base-dir=DIR Place to load/save files (default is "$BASE_DIR")
Download options are:
--sleep=SECS Sleep between gets, i.e. go slowly (default is $SLEEP)
--prefix=PREFIX Limit URLs to those which begin with PREFIX (default is URL base)
Multiple "--prefix" are allowed.
--depth=N Maximum depth to traverse (default is $DEPTH)
--out-depth=N Maximum depth to traverse outside of PREFIX (default is $OUT_DEPTH)
--referer=URI Set initial referer header (default is "$REFERER")
--limit=N A limit on the number of documents to get (default is $MAX_DOCS)
--retry=N Maximum number of retries (default is $RETRY_MAX)
--timeout=SECS Timeout value - increases on retries (default is $TIMEOUT)
--agent=AGENT User agent name (default is "$AGENT")
Multi-process control:
--slave Wait until a download-list file is created (be a slave)
--stop Stop slave
--restart Stop and restart slave
Not implemented yet but won't generate fatal errors (compatibility with lwp-rget):
--auth=USER:PASS Set authentication credentials for web site
--hier Download into hierarchy (not all files into cwd)
--iis Workaround IIS 2.0 bug by sending "Accept: */*" MIME
header; translates backslashes (\) to forward slashes (/)
--keepext=type Keep file extension for MIME types (comma-separated list)
--nospace Translate spaces in URLs (not #fragments) to underscores (_)
--tolower Translate all URLs to lowercase (useful with IIS servers)
Other options (to be better explained):
--indexfile=FILE Index file in a directory (default is "$INDEXFILE")
--part-suffix=.SUFFIX (default is "$PART_SUFFIX") (e.g. ".Getright", ".PART")
--dump=FILE (default is "$DUMP") make download-list file,
to be used later
--dump-max=N (default is $DUMP_MAX) number of links per download-list file
--invalid-char=C (default is "$INVALID_CHAR")
--exclude=/REGEXP/i (default is "@EXCLUDE") Don't download matching URLs
Multiple --exclude are allowed
--loop=REGEXP:INITIAL..FINAL (default is "$LOOP") (e.g. xx:a,b,c or xx:'01'..'10')
--subst=s/REGEXP/VALUE/i (default is "$show_subst") (note: "\" must be written as "\\")
--404-retry will retry on error 404 Not Found (default).
--no404-retry creates an empty file on error 404 Not Found.
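For reference, --retry and --timeout behave roughly like the sketch below: the request is retried up to the retry limit and the timeout grows on each attempt (illustrative only; the URL and the numeric values are made up, and the real script's loop differs):
    # sketch of a retry loop with a growing timeout, in the spirit of --retry/--timeout
    use LWP::UserAgent;
    use HTTP::Request;
    my ($retry_max, $timeout) = (5, 30);
    my $ua = LWP::UserAgent->new;
    $ua->agent("Glynx");
    my $req = HTTP::Request->new(GET => "http://www.site.com/index.htm");
    my $res;
    for my $try (1 .. $retry_max) {
        $ua->timeout($timeout * 2 ** ($try - 1));   # timeout doubles on each retry
        $res = $ua->request($req);
        last if $res->is_success;
        sleep 1;                                    # cf. --sleep
    }
    print $res->status_line, "\n";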
TO-DO
More command-line compatibility with lwp-rget
Graphical user interface
README
Glynx - a download manager.
Installation:
Windows:
- unzip to a directory, such as c:\glynx or even c:\temp
- this is a command-line (DOS) script; it will not work properly if you double-click it.
However, you can put a shortcut with the --slave parameter in the startup
directory to run it in "slave mode", then open another DOS window
to operate it as a client.
- the latest ActivePerl has all the modules needed, except for https.
Unix/Linux:
make a subdirectory and cd to it
tar -xzf Glynx-[version].tar.gz
chmod +x glynx.pl (if necessary)
pod2html glynx.pl -o=glynx.htm (this is optional)
- under RedHat 6.2 I had to upgrade or install these modules:
HTML::Tagset MIME::Base64 URI HTML::Parser Digest::MD5 libnet libwww-perl
- to use https you will need:
openssl (www.openssl.org) Net::SSLeay IO::Socket::SSL
Please note that the software will create many files in
its work directory, so it is advisable to have a dedicated
sub-directory for it.
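A quick way to check whether the needed modules are installed (a sketch; libnet and libwww-perl are represented here by Net::FTP and LWP):
    # report which of the required modules are missing
    for my $m (qw(HTML::Tagset MIME::Base64 URI HTML::Parser Digest::MD5 Net::FTP LWP)) {
        print eval "require $m; 1" ? "ok      $m\n" : "MISSING $m\n";
    }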
Goals:
generalize
option to use (external) java and other script languages to extract links
configurable file names and suffixes
configurable dump file format
plugins
more protocols; download streams
language support
adhere to perl standards
pod documentation
distribution
difficult to understand, fun to write
parallelize things and multiple computer support
cpu and memory optimizations
accept hardware/internet failures
restartable
reduce internet traffic
minimize requests
cache everything
other (from perlhack.pod)
1. Keep it fast, simple, and useful.
2. Keep features/concepts as orthogonal as possible.
3. No arbitrary limits (platforms, data sizes, cultures).
4. Keep it open and exciting to use/patch/advocate Perl everywhere.
5. Either assimilate new technologies, or build bridges to them.
Problems (not bugs):
- It takes some time to start the program; not practical for small single file downloads.
- Command line only. It should have a graphical front-end; there exists a web front-end.
- Hard to install if you don't have Perl or have outdated Perl modules. It works fine
with Perl 5.6 modules.
- Slave mode uses "dump files" and doesn't delete them.
To-do (long list):
Bugs/debug/testing:
- put // around exclude patterns, etc., if they are missing
- arrays for $LOOP, $SUBST; accept multiple URLs
- Doesn't recreate unix links on "ftp".
Should do that instead of duplicating files (same on http redirects).
- uses Accept:text/html to ask for an html listing of the directory when
in "ftp" mode. This will have to be changed to "text/ftp-dir-listing" if
we implement unix links.
- install and test "https"
- accept --url=http://...
- accept --batch=...grx
- ignore/accept comments: <! a href="..."> - nested comments???
- http server to make distributed downloads across the internet
- use eval to avoid fatal errors; test for valid protocols
- rename "old" .grx._BUSY_ files to .grx (timeout = 1 day?)
option: touch busy file to show activity
- don't ignore "File:"
- unknown protocol is a fatal error
- test: counting MAX_DOCS with retry
- test: base-dir, out-depth, site leakage
- test: authentication
- test: redirect 3xx
use: www.ig.com.br ?
- change the retry loop to a "while"
- timeout changes after "slave"
- configuration reading order:
(1) read command-line options (may change the .ini file),
(2) read the .ini configuration,
(3) read command-line options again (may override the .ini),
(4) read the download-list-file,
(5) read command-line options again (may override the download-list-file)
- execute/override download-list-file "File:"
option: use --subst=/k:\\temp/c:\\download/
Generalization, user-interface:
- no-download option to reprocess the cache
- optional log file to store the headers.
Option: filename._HEADER_; --log-headers
- make it a Perl module (crawler, robot?), generic, re-usable
- option to understand robot-rules
- make .glynx the default suffix for everything
- try to support <form> through download-list-file
- support mirroring (checking file dates)
- internal small javascript interpreter
- perl/tk front-end; finish web front end
- config comment-string in download-list-file
- config comment/uncomment for directives
- default dump file name when no parameter is given - "dump-[n]-1"?
- more configuration parameters
- Portuguese/English option?
- plugins: for each chunk, page, link, new site, level change, dump file change,
max files, on errors, retry level change. Option: use callbacks.
- dump suffix option
- javascript interpreter option
- scripting option (execute sequentially instead of parallel)
- use environment
- accept configuration --nofollow="shtml" and --follow="xxx"
- time control, bytes per second
- pnm: protocol - RealVideo, .rpm files
- streams
- gnutella
- 401 Authentication Required, generalize abort-on-error list,
support --auth= (see rget)
- option to rewrite html pages with relative links
Standards/perl:
- packaging for distribution, include rfcs, etc?
- include a default ini file in package
- include web front-end in package?
- installation hints, package version problems (abs_url)
- more english writing
- include all lwp-rget options, or ignore without exiting
- create an object for the link lists - choose and specialize an existing one.
- check: 19.4.5 HTTP Header Fields in Multipart Body-Parts
Content-Encoding
Persistent connections: Connection-header
Accept: */*, *.*
- better document the use of "\" in exclude and subst
- read, send, and configure cookies
Network/parallel support:
- timed downloads - start/stop hours
- write a "to-do" file during processing,
so the download can be resumed after an interruption.
e.g. every 10 minutes
- integrate with "k:\download"
- receive / send restart / stop commands.
Speed optimizations:
- use an optional database connection
- Persistent connections;
- take a look at LWP::ParallelUserAgent
- take a look at LWPng for simultaneous file transfers
- take a look at LWP::Sitemapper
- use eval around things to speed up program loading
- option: different stacks depending on the file type or site, to speed up the search
Other:
- forms / PUT
- Rename the extension according to the MIME type (or copy to the other name).
configuration: --on-redirect=rename
--on-redirect=copy
--on-mime=rename
--on-mime=copy
- configure maximum URL length
- configure maximum subdirectory depth
- maximum size of a received file
- disk full / alternate dir
- "--proxy=http:"1.1.1.1",ftp:"1.1.1.1"
"--proxy="1.1.1.1"
acessar proxy: $ua->proxy(...) Set/retrieve proxy URL for a scheme:
$ua->proxy(['http', 'ftp'], 'http://proxy.sn.no:8001/');
$ua->proxy('gopher', 'http://proxy.sn.no:8001/');
- enable "--no-[option]"
- accept empty "--dump" or "--no-dump" / "--nodump"
--max-mb=100
limits the total download size
--auth=USER:PASS
not really necessary, it can go inside the URL
exists in lwp-rget
--nospace
allow links with spaces in the name (see lwp-rget)
--relative-links
option to rewrite the links as relative
--include=".exe" --nofollow=".shtml" --follow=".htm"
file-inclusion options (search for links inside them)
--full or --depth=full
whole-site option
--chunk=128000
--dump-all
writes all links, including those already present and already-processed pages
Version history:
1.022:
- multiple --prefix and --exclude seem to be working
- uses Accept:text/html to ask for an html listing of the directory when in "ftp" mode.
- corrected errors creating directory and copying file on linux
1.021:
- uses URI::Heuristic on command-line URL
- shows error response headers (if verbose)
- look at the 3rd parameter on 206 (when available -- otherwise it gives 500),
Content-Length: 637055 --> if "206" this is "chunk" size
Content-Range: bytes 1449076-2086130/2086131 --> THIS is file size
- prefix of: http://rd.yahoo.com/footer/?http://travel.yahoo.com/
should be: http://rd.yahoo.com/footer/
- included: "wav"
- sleep had 1 extra second
- sleep makes tests even when sleep==0
1.020: oct-02-2000
- optimization: accepts 200, when expecting 206
- don't keep retrying when there is nothing to do
- 404 Not Found error sometimes means "can't connect" - uses "--404-retry"
- file read = binmode
1.019: - restart if program was modified (-M $0)
- include "mov"
- stop, restart
1.018: - better copy, rename and unlink
- corrected binary dump when slave
- file-size comparison fixed
- span is a CSS element that works like "a" (a href == span href);
span class is not java
1.017: - sleep prints dots if verbose.
- daemon mode (--slave)
- url and input file are optional
1.016: sept-27-2000
- new name "glynx.pl"
- verbose/quiet
- exponential timeout on retry
- storage control is a bit more efficient
- you can filter the processing of a dump file using prefix, exclude, subst
- more things in english, lots of new "to-do"; "goals" section
- rename config file to glynx.ini
1.015: - first published version, under name "get.pl"
- single push/shift routine, without repetition
- partially translated to English; messages revised
1.014: - checks "inside" before including the link
- fixes dump file numbering
- "Location", "Content-Base" headers
- "Content-Location" revised
1.013: - to optimize: remove repeated links within the page
- included "png"
- creates/tests "not-found" file
- processes Content-Location - TEST - find a site that uses it
- included types "swf", "dcr" (shockwave) and "css" (style sheet)
- fixes http://host/../file saved as ./host/../file => ./file
- strips strange characters coming from javascript: ' ;
- pending retries are only written at the end.
- (1) read options, (2) read configuration, (3) read options again
1.012: - splits the dump file during processing, so that another
process/computer can start downloading in parallel before the task is
completely finished
- uses an index to write the dump; does not destroy the in-memory list.
- saves the full configuration together with the dump;
- saves/reads get.ini
1.011: fixes authentication (prefix)
fixes dump
reads dump
saves/reads $OUT_DEPTH, depth (individual), prefix in the dump file
1.010: resume
if the site doesn't support resume, tries again and picks the best result (Silvio's idea)
1.009: 404 not found is not sent to the dump
processes the file if the MIME type is text/html (doesn't work for the cache)
changes the referer of the links depending on the response base (redirect)
treats zero-size files as "not in the cache"
generates the name _INDEX_.HTM when the URL ends with "/".
1.008: works internally with absolute URLs
fixes leakage when out-depth=0
1.007: splits the dump file
speeds up the search in @processed
fixes the directory name in the dump file
Other problems - design decisions to make
- if '' is used in eval, is "\\" not needed?
- should redirected html pages get a <BASE> tag in the text?
- build links using java?
- does the perl library handle 3xx Redirection by itself?
- use File::Path to create directories?
- do applets always have .class at the end?
- excessively long file names - what to do?
- use: $ua->max_size([$bytes]) - doesn't work with callback
- change the filename if the response base is different?
- create a zero-size PART file on error 408 - timeout?
- what is go!zilla's dump file format?
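On the File::Path question above, a minimal sketch of what it offers (mkpath creates all intermediate directories in one call, like "mkdir -p"; the path is just an example):
    use File::Path;
    mkpath("www.site.com/bin/images", 0, 0755);   # verbose off, mode 0755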
COPYRIGHT
Copyright (c) 2000 Flavio Glock <fglock@pucrs.br>. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. This program was based on examples in the Perl distribution.
If you use it/like it, send a postcard to the author.