NAME
Glynx - a download manager.
Download from http://www.ipct.pucrs.br/flavio/glynx/glynx-latest.pl
DESCRIPTION
Glynx makes a local image of a selected part of the internet.
It can be used to make download lists to be used with other download managers, making a distributed download process.
It currently supports resume, retry, referer, user-agent, java, frames, and distributed download (see --slave, --stop, --restart).
It partially supports redirect, javascript, multimedia, and authentication.
It does not support mirroring (checking file dates) or forms.
It has not been tested with "https" yet.
It should be better tested with "ftp".
Tested on Linux and NT.
SYNOPSIS
- Do-everything at once:
    $progname.pl [options] <URL>
- Save work to finish later:
    $progname.pl [options] --dump="dump-file" <URL>
- Finish saved download:
    $progname.pl [options] "download-list-file"
- Network mode (client/slave):
  - Clients:
      $progname.pl [options] --dump="dump-file" <URL>
  - Slaves (will wait until there is something to do):
      $progname.pl [options] --slave
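For example (the URL and the dump-file name are hypothetical), a client could queue a job two levels deep while a slave started beforehand, possibly on another machine, does the actual downloading:
    glynx.pl --depth=2 --dump="job" http://www.site.com/index.htm
    glynx.pl --slave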
HINTS
How to make a default configuration:
Start the program with all the command-line options you want, plus --cfg-save
or:
1 - start the program with --cfg-save
2 - edit the glynx.ini file
--subst, --exclude and --loop use regular expressions.
http://www.site.com/old.htm --subst=s/old/new/
downloads: http://www.site.com/new.htm
- Note: the substitution string MUST be made of "valid URL" characters
--exclude=/\.gif/
will not download ".gif" files
- Note: Multiple --exclude are allowed:
--exclude=/gif/ --exclude=/jpeg/
will not download ".gif" or ".jpeg" files
It can also be written as:
--exclude=/\.gif|\.jp.?g/i
matching .gif, .GIF, .jpg, .jpeg, .JPG, .JPEG
--exclude=/www\.site\.com/
will not download links containing the site name
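A rough sketch of what these patterns amount to (plain Perl, not the program's actual code; the pattern and URL values are just the examples above):
    # apply a --subst and an --exclude pattern to one URL (illustrative only)
    my @exclude = ( qr/\.gif|\.jp.?g/i );
    my $subst   = 's/old/new/';
    my $url     = "http://www.site.com/old.htm";
    eval "\$url =~ $subst";                    # the result must still be a valid URL
    if ( grep { $url =~ $_ } @exclude ) {
        print "skip     $url\n";
    } else {
        print "download $url\n";
    }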
http://www.site.com/bin/index.htm --prefix=http://www.site.com/bin/
won't download outside of the "/bin" directory. A prefix must end with a slash "/".
http://www.site.com/index%%%.htm --loop=%%%:0..3
will download:
http://www.site.com/index0.htm
http://www.site.com/index1.htm
http://www.site.com/index2.htm
http://www.site.com/index3.htm
- Note: the substitution string MUST be made of "valid URL" characters
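The expansion is essentially Perl's own range operator applied to the marked part of the URL; a minimal sketch (names and values are illustrative, not the program's internals):
    # sketch of --loop=%%%:0..3 applied to http://www.site.com/index%%%.htm
    my $template = "http://www.site.com/index%%%.htm";
    my ($marker, $range) = ("%%%", "0..3");
    for my $value (eval $range) {              # 0..3 expands to 0, 1, 2, 3
        (my $url = $template) =~ s/\Q$marker\E/$value/;
        print "$url\n";
    }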
- For multiple exclusions in a single pattern, use "|".
- Don't read directory-index sort links:
    ?D=D ?D=A ?S=D ?S=A ?M=D ?M=A ?N=D ?N=A => \?[DSMN]=[AD]
To change the default "exclude" pattern, put it in the configuration file.
Note: "File:" item in dump file is ignored
You can filter the processing of a dump file using --prefix, --exclude, --subst
If you still have ".PART._BUSY_" files in the base directory after the download finishes, rename them to ".PART" (the program should normally do this by itself).
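If you have to do the rename by hand, a one-liner along these lines works (a sketch, assuming the default ".PART" part-suffix and Unix shell quoting; run it inside the base directory):
    perl -e 'for (glob "*.PART._BUSY_") { ($new = $_) =~ s/\._BUSY_$//; rename $_, $new }'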
Don't do this: --depth=1 --out-depth=3 because "out-depth" is an upper limit; it is tested after depth is generated. The right way is: --depth=4 --out-depth=3
This will do nothing:
    --dump=x graphic.gif
because when dumping, binary files are written to the download-list file instead of being downloaded.
Errors using https:
[ ERROR 501 Protocol scheme 'https' is not supported => LATER ] or
[ ERROR 501 Can't locate object method "new" via package "LWP::Protocol::https" => LATER ]
This means you need to install at least "openssl" (http://www.openssl.org), Net::SSLeay and IO::Socket::SSL
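A quick way to see whether these pieces are in place (a small sketch; it only checks that the support modules load, without touching the network):
    # check that https support is available
    if (eval { require Net::SSLeay; require IO::Socket::SSL; require LWP::Protocol::https; 1 }) {
        print "https support looks OK\n";
    } else {
        print "https NOT available: $@";
    }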
COMMAND-LINE OPTIONS
Very basic:
--version Print version number ($VERSION) and quit
--verbose More output
--quiet No output
--help Help page
--cfg-save Save configuration to file "$CFG_FILE"
--base-dir=DIR Place to load/save files (default is "$BASE_DIR")
Download options are:
--sleep=SECS Sleep between gets, i.e. go slowly (default is $SLEEP)
--prefix=PREFIX Limit URLs to those which begin with PREFIX (default is URL base)
Multiple "--prefix" are allowed.
--depth=N Maximum depth to traverse (default is $DEPTH)
--out-depth=N Maximum depth to traverse outside of PREFIX (default is $OUT_DEPTH)
--referer=URI Set initial referer header (default is "$REFERER")
--limit=N A limit on the number of documents to get (default is $MAX_DOCS)
--retry=N Maximum number of retries (default is $RETRY_MAX)
--timeout=SECS Timeout value - increases on retries (default is $TIMEOUT)
--agent=AGENT User agent name (default is "$AGENT")
Multi-process control:
--slave Wait until a download-list file is created (be a slave)
--stop Stop slave
--restart Stop and restart slave
Not implemented yet but won't generate fatal errors (compatibility with lwp-rget):
--auth=USER:PASS Set authentication credentials for web site
--hier Download into hierarchy (not all files into cwd)
--iis Workaround IIS 2.0 bug by sending "Accept: */*" MIME
header; translates backslashes (\) to forward slashes (/)
--keepext=type Keep file extension for MIME types (comma-separated list)
--nospace Translate spaces in URLs (not #fragments) to underscores (_)
--tolower Translate all URLs to lowercase (useful with IIS servers)
Other options (to be better explained):
--indexfile=FILE Index file in a directory (default is "$INDEXFILE")
--part-suffix=.SUFFIX (default is "$PART_SUFFIX") (e.g. ".Getright", ".PART")
--dump=FILE (default is "$DUMP") make download-list file,
to be used later
--dump-max=N (default is $DUMP_MAX) number of links per download-list file
--invalid-char=C (default is "$INVALID_CHAR")
--exclude=/REGEXP/i (default is "@EXCLUDE") Don't download matching URLs
Multiple --exclude are allowed
--loop=REGEXP:INITIAL..FINAL (default is "$LOOP") (e.g. xx:a,b,c or xx:'01'..'10')
--subst=s/REGEXP/VALUE/i (default is "$show_subst") (note: "\" must be written as "\\")
--404-retry will retry on error 404 Not Found (default).
--no404-retry creates an empty file on error 404 Not Found.
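For reference, --retry and --timeout behave roughly like the sketch below: the request is retried up to the retry limit and the timeout grows on each attempt (illustrative only; the URL and the numeric values are made up, and the real script's loop differs):
    # sketch of a retry loop with a growing timeout, in the spirit of --retry/--timeout
    use LWP::UserAgent;
    use HTTP::Request;
    my ($retry_max, $timeout) = (5, 30);
    my $ua = LWP::UserAgent->new;
    $ua->agent("Glynx");
    my $req = HTTP::Request->new(GET => "http://www.site.com/index.htm");
    my $res;
    for my $try (1 .. $retry_max) {
        $ua->timeout($timeout * 2 ** ($try - 1));   # timeout doubles on each retry
        $res = $ua->request($req);
        last if $res->is_success;
        sleep 1;                                    # cf. --sleep
    }
    print $res->status_line, "\n";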
TO-DO
More command-line compatibility with lwp-rget
Graphical user interface
README
Glynx - a download manager.
Installation:
Windows:
- unzip to a directory, such as c:\glynx or even c:\temp
- this is a command-line (DOS) script; it will not work properly if you double-click it.
However, you can put a shortcut with the --slave parameter in the startup
directory to run it in "slave mode", then open another DOS window
to operate it as a client.
- the latest ActivePerl has all the modules needed, except for https.
Unix/Linux:
make a subdirectory and cd to it
tar -xzf Glynx-[version].tar.gz
chmod +x glynx.pl (if necessary)
pod2html glynx.pl -o=glynx.htm (this is optional)
- under RedHat 6.2 I had to upgrade or install these modules:
HTML::Tagset MIME::Base64 URI HTML::Parser Digest::MD5 libnet libwww-perl
- to use https you will need:
openssl (www.openssl.org) Net::SSLeay IO::Socket::SSL
Please note that the software will create many files in
its work directory, so it is advisable to have a dedicated
sub-directory for it.
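A quick way to check whether the needed modules are installed (a sketch; libnet and libwww-perl are represented here by Net::FTP and LWP):
    # report which of the required modules are missing
    for my $m (qw(HTML::Tagset MIME::Base64 URI HTML::Parser Digest::MD5 Net::FTP LWP)) {
        print eval "require $m; 1" ? "ok      $m\n" : "MISSING $m\n";
    }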
Goals:
generalize
option to use (external) java and other script languages to extract links
configurable file names and suffixes
configurable dump file format
plugins
more protocols; download streams
language support
adhere to perl standards
pod documentation
distribution
difficult to understand, fun to write
parallelize things and multiple computer support
cpu and memory optimizations
accept hardware/internet failures
restartable
reduce internet traffic
minimize requests
cache everything
other (from perlhack.pod)
1. Keep it fast, simple, and useful.
2. Keep features/concepts as orthogonal as possible.
3. No arbitrary limits (platforms, data sizes, cultures).
4. Keep it open and exciting to use/patch/advocate Perl everywhere.
5. Either assimilate new technologies, or build bridges to them.
Problems (not bugs):
- It takes some time to start the program; not practical for small single file downloads.
- Command line only. It should have a graphical front-end; there exists a web front-end.
- Hard to install if you don't have Perl or have outdated Perl modules. It works fine
with Perl 5.6 modules.
- Slave mode uses "dump files" and doesn't delete them.
To-do (long list):
Bugs/debug/testing:
- put // around exclude patterns, etc., if they are missing
- arrays for $LOOP, $SUBST; accept multiple URLs
- Doesn't recreate unix links on "ftp".
Should do that instead of duplicating files (same on http redirects).
- uses Accept:text/html to ask for an html listing of the directory when
in "ftp" mode. This will have to be changed to "text/ftp-dir-listing" if
we implement unix links.
- install and test "https"
- accept --url=http://...
- accept --batch=...grx
- ignore/accept comments: <! a href="..."> - nested comments???
- http server to make distributed downloads across the internet
- use eval to avoid fatal errors; test for valid protocols
- rename "old" .grx._BUSY_ files to .grx (timeout = 1 day?)
option: touch busy file to show activity
- don't ignore "File:"
- unknown protocol is a fatal error
- test: counting MAX_DOCS with retry
- test: base-dir, out-depth, site leakage
- test: authentication
- test: redirect 3xx
use: www.ig.com.br ?
- change the retry loop to a "while"
- timeout changes after "slave"
- configuration reading order:
(1) read command-line options (may change the .ini file),
(2) read the .ini configuration,
(3) read command-line options again (may override the .ini),
(4) read the download-list-file,
(5) read command-line options again (may override the download-list-file)
- execute/override download-list-file "File:"
option: use --subst=/k:\\temp/c:\\download/
Generalization, user-interface:
- no-download option to reprocess the cache
- optional log file to store the headers.
Option: filename._HEADER_; --log-headers
- make it a Perl module (crawler, robot?), generic, re-usable
- option to understand robot-rules
- make .glynx the default suffix for everything
- try to support <form> through download-list-file
- support mirroring (checking file dates)
- internal small javascript interpreter
- perl/tk front-end; finish web front end
- config comment-string in download-list-file
- config comment/uncomment for directives
- default dump file name when no parameter is given - "dump-[n]-1"?
- more configuration parameters
- Portuguese/English option?
- plugins: for each chunk, page, link, new site, level change, dump file change,
max files, on errors, retry level change. Option: use callbacks.
- dump suffix option
- javascript interpreter option
- scripting option (execute sequentially instead of parallel)
- use environment
- accept configuration --nofollow="shtml" and --follow="xxx"
- time control, bytes per second
- pnm: protocol - RealVideo, .rpm files
- streams
- gnutella
- 401 Authentication Required, generalize abort-on-error list,
support --auth= (see rget)
- option to rewrite html pages with relative links
Standards/perl:
- packaging for distribution, include rfcs, etc?
- include a default ini file in package
- include web front-end in package?
- installation hints, package version problems (abs_url)
- more english writing
- include all lwp-rget options, or ignore without exiting
- create an object for the link lists - choose and specialize an existing one.
- check: 19.4.5 HTTP Header Fields in Multipart Body-Parts
Content-Encoding
Persistent connections: Connection-header
Accept: */*, *.*
- better document the use of "\" in exclude and subst
- read, send, and configure cookies
Network/parallel support:
- timed downloads - start/stop hours
- write a "to-do" file during processing,
so the download can be resumed after an interruption.
e.g. every 10 minutes
- integrate with "k:\download"
- receive / send restart / stop commands.
Speed optimizations:
- use an optional database connection
- Persistent connections;
- take a look at LWP::ParallelUserAgent
- take a look at LWPng for simultaneous file transfers
- take a look at LWP::Sitemapper
- use eval around things to speed up program loading
- option: different stacks depending on the file type or site, to speed up the search
Other:
- forms / PUT
- Rename the extension according to the MIME type (or copy to the other name).
configuration: --on-redirect=rename
--on-redirect=copy
--on-mime=rename
--on-mime=copy
- configure maximum URL length
- configure maximum subdirectory depth
- maximum size of a received file
- disk full / alternate dir
- "--proxy=http:"1.1.1.1",ftp:"1.1.1.1"
"--proxy="1.1.1.1"
acessar proxy: $ua->proxy(...) Set/retrieve proxy URL for a scheme:
$ua->proxy(['http', 'ftp'], 'http://proxy.sn.no:8001/');
$ua->proxy('gopher', 'http://proxy.sn.no:8001/');
- enable "--no-[option]"
- accept empty "--dump" or "--no-dump" / "--nodump"
--max-mb=100
limits the total download size
--auth=USER:PASS
not really necessary, it can go inside the URL
exists in lwp-rget
--nospace
allow links with spaces in the name (see lwp-rget)
--relative-links
option to rewrite the links as relative
--include=".exe" --nofollow=".shtml" --follow=".htm"
file-inclusion options (search for links inside them)
--full or --depth=full
whole-site option
--chunk=128000
--dump-all
writes all links, including those already present and already-processed pages
Version history:
1.022:
- multiple --prefix and --exclude seem to be working
- uses Accept:text/html to ask for an html listing of the directory when in "ftp" mode.
- corrected errors creating directory and copying file on linux
1.021:
- uses URI::Heuristic on command-line URL
- shows error response headers (if verbose)
- look at the 3rd parameter on 206 (when available -- otherwise it gives 500),
Content-Length: 637055 --> if "206" this is "chunk" size
Content-Range: bytes 1449076-2086130/2086131 --> THIS is file size
- prefix of: http://rd.yahoo.com/footer/?http://travel.yahoo.com/
should be: http://rd.yahoo.com/footer/
- included: "wav"
- sleep had 1 extra second
- sleep makes tests even when sleep==0
1.020: oct-02-2000
- optimization: accepts 200, when expecting 206
- don't keep retrying when there is nothing to do
- 404 Not Found error sometimes means "can't connect" - uses "--404-retry"
- file read = binmode
1.019: - restart if program was modified (-M $0)
- include "mov"
- stop, restart
1.018: - better copy, rename and unlink
- corrected binary dump when slave
- file-size comparison fixed
- span is a CSS element that works like "a" (a href == span href);
span class is not java
1.017: - sleep prints dots if verbose.
- daemon mode (--slave)
- url and input file are optional
1.016: sept-27-2000
- new name "glynx.pl"
- verbose/quiet
- exponential timeout on retry
- storage control is a bit more efficient
- you can filter the processing of a dump file using prefix, exclude, subst
- more things in english, lots of new "to-do"; "goals" section
- rename config file to glynx.ini
1.015: - first published version, under name "get.pl"
- single push/shift routine, without repetition
- partially translated to English; messages revised
1.014: - checks "inside" before including the link
- fixes dump file numbering
- "Location", "Content-Base" headers
- "Content-Location" revised
1.013: - to optimize: remove repeated links within the page
- included "png"
- creates/tests "not-found" file
- processes Content-Location - TEST - find a site that uses it
- included types "swf", "dcr" (shockwave) and "css" (style sheet)
- fixes http://host/../file saved as ./host/../file => ./file
- strips strange characters coming from javascript: ' ;
- pending retries are only written at the end.
- (1) read options, (2) read configuration, (3) read options again
1.012: - splits the dump file during processing, so that another
process/computer can start downloading in parallel before the task is
completely finished
- uses an index to write the dump; does not destroy the in-memory list.
- saves the full configuration together with the dump;
- saves/reads get.ini
1.011: fixes authentication (prefix)
fixes dump
reads dump
saves/reads $OUT_DEPTH, depth (individual), prefix in the dump file
1.010: resume
if the site doesn't support resume, tries again and picks the best result (Silvio's idea)
1.009: 404 not found is not sent to the dump
processes the file if the MIME type is text/html (doesn't work for the cache)
changes the referer of the links depending on the response base (redirect)
treats zero-size files as "not in the cache"
generates the name _INDEX_.HTM when the URL ends with "/".
1.008: works internally with absolute URLs
fixes leakage when out-depth=0
1.007: splits the dump file
speeds up the search in @processed
fixes the directory name in the dump file
Other problems - design decisions to make
- if '' is used in eval, is "\\" not needed?
- should redirected html pages get a <BASE> tag in the text?
- build links using java?
- does the perl library handle 3xx Redirection by itself?
- use File::Path to create directories?
- do applets always have .class at the end?
- excessively long file names - what to do?
- use: $ua->max_size([$bytes]) - doesn't work with callback
- change the filename if the response base is different?
- create a zero-size PART file on error 408 - timeout?
- what is go!zilla's dump file format?
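On the File::Path question above, a minimal sketch of what it offers (mkpath creates all intermediate directories in one call, like "mkdir -p"; the path is just an example):
    use File::Path;
    mkpath("www.site.com/bin/images", 0, 0755);   # verbose off, mode 0755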
COPYRIGHT
Copyright (c) 2000 Flavio Glock <fglock@pucrs.br>. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. This program was based on examples in the Perl distribution.
If you use it/like it, send a postcard to the author.