
NAME

bin/lc.pl - Perl based HTML/IAFA link checker

SYNOPSIS

  bin/lc.pl [-acdilPsvux] [-b base_url] [-g guts_dir]
    [-p proxyurl] [-r seconds] [-t templatedir]
    [-w when_changed] [file1 file2 ... fileN]

DESCRIPTION

This program will take a set of URLs on their own, in a set of IAFA templates, or in HTML documents and attempt to check their accessibility. It can be passed a list of file names to examine on the command line or via standard input, e.g.

  find . -print | lc.pl -i

or

  lc.pl -v *.html > logfile

Normal behaviour is to ignore directories, files whose names begin with a dot ".", and files which do not appear to contain HTML, judging by their suffix. This last restriction can be removed with a command line option which tells the program to assume the files are all IAFA templates.
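
As an illustration, the filtering described above might look roughly like the following sketch. The suffix list and the check_links routine are purely hypothetical, not lc.pl's actual code:

  # Illustrative sketch only - not lc.pl's actual filtering code.
  foreach my $file (@ARGV ? @ARGV : map { chomp; $_ } <STDIN>) {
      next if -d $file;                          # ignore directories
      next if (split m{/}, $file)[-1] =~ /^\./;  # ignore dot files
      next unless $file =~ /\.s?html?$/i;        # crude HTML test by suffix
      check_links($file);                        # hypothetical per-file check
  }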

Currently the only URL schemes which can be checked with lc.pl are "http:", "gopher:", "ftp:" and "wais:". A future version may try to check other URL schemes.

lc.pl will not follow links in HTML documents recursively!

PROXIES AND CACHING

It is recommended that a World-Wide Web cache server be used as a go-between in the link checking process. This can be enabled via environment variables, e.g. in the style of csh and tcsh:

  setenv http_proxy "http://wwwcache.lut.ac.uk:3128/"
  setenv gopher_proxy "http://wwwcache.lut.ac.uk:3128/"
  setenv ftp_proxy "http://wwwcache.lut.ac.uk:3128/"
  setenv wais_proxy "http://wwwcache.lut.ac.uk:8001/"
  setenv no_proxy "lut.ac.uk"

Or in the sh/bash/ksh/zsh style:

  http_proxy="http://wwwcache.lut.ac.uk:3128/"
  gopher_proxy="http://wwwcache.lut.ac.uk:3128/"
  ftp_proxy="http://wwwcache.lut.ac.uk:3128/"
  wais_proxy="http://wwwcache.lut.ac.uk:8001/"
  no_proxy="lut.ac.uk"
  export http_proxy gopher_proxy ftp_proxy wais_proxy no_proxy

The -p and -P options may also be used to affect proxying and hence caching behaviour. Note that if you use -p to specify a single proxy server for all your requests, this must be capable of handling any "wais:" URLs that may be passed to it. You can run lc.pl with the -l option to check for these in advance of doing the actual link check.
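
For illustration, this is roughly how the -p and -P handling might look in terms of the libwww-perl API. It is a sketch rather than lc.pl's actual code; the option letters are taken from the SYNOPSIS, and Getopt::Std is only an assumption about how they are parsed:

  use Getopt::Std;
  use LWP::UserAgent;

  getopts('acdilPsvuxb:g:p:r:t:w:');   # option letters as in the SYNOPSIS

  my $ua = LWP::UserAgent->new;
  if ($opt_p) {                        # -p proxyurl: one proxy for everything
      $ua->proxy([qw(http gopher ftp wais)], $opt_p);
  } elsif (!$opt_P) {                  # unless -P: import *_proxy / no_proxy
      $ua->env_proxy;                  # settings from the environment
  }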

In addition to cache support via the proxy HTTP mechanism, URLs which have already been visited during a link checking session will not be requested again in the same session, and the HTTP "HEAD" method is used whenever an "http" URL is requested. The time to sleep between requests is configurable, defaulting to two seconds.
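
In outline, the per-URL logic might look something like the following. The check_url routine, the %seen hash and the $ua object from the previous sketch are illustrative, not lc.pl's actual internals:

  use HTTP::Request;

  my %seen;                               # URLs already checked this session
  my $rest = defined $opt_r ? $opt_r : 2; # -r seconds, default 2

  sub check_url {
      my ($url) = @_;
      return $seen{$url} if exists $seen{$url};    # never re-fetch in one run
      my $req = HTTP::Request->new(HEAD => $url);  # HEAD rather than GET
      my $response = $ua->request($req);
      sleep $rest;                                 # be polite between lookups
      return $seen{$url} = $response->code;
  }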

OPTIONS

-a

check all IAFA templates. Uses the ROADS default template directory, or another directory specified with the -t option. Implies -i.

-b baseurl

specifies a base URL which will be used to make any relative links absolute, e.g.

  -b http://www.roads.lut.ac.uk/

-c

check HTTP URLs which appear to run a script, i.e. those containing the strings "/htbin/", "/cgi-bin/", or "?". Normally these will not be checked.

-d

generate debugging info

-g guts_dir

'guts' directory, used to hold DBM databases of Last-Modified times and Content-Length information on a per-URL basis (see the sketch following this list of options).

-i

specify that the source is IAFA templates; the default is HTML.

-l

don't actually check, just dump out URLs. This can be useful in finding out which URLs are cited, which documents make the citations, and so on.

-p proxyurl

proxy all requests through the URL which follows, e.g.

  -p http://wwwcache.lut.ac.uk:3128/

-P

don't import any proxy settings from the environment

-r seconds

rest time between URL lookups (default is 2 seconds). This feature is turned off if you enable the -l option, since there is not going to be any networking going on.

-s

strict checking mode. By default, links which look as though they might point to large objects, e.g. MPEG movies, are not followed; strict mode causes all links to be checked.

-t templatedir

look in this directory for IAFA templates when the -a option is enabled.

-u

list unchecked URLs to stderr, e.g.

  lc.pl -u *.html >successlog 2>failslog

-v

list OK URLs as well as stale URLs

-w when_changed

list only URLs which have changed in the last N days

-x

the input is a series of URLs, rather than IAFA or HTML files, e.g.

  lc.pl -x < my_list_of_urls
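
As a rough illustration of the -g option described above, the 'guts' databases might be maintained along these lines. The file name, $guts_dir, $url and the $response object are all placeholders rather than lc.pl's actual layout:

  use Fcntl;
  use SDBM_File;

  tie my %last_mod, 'SDBM_File', "$guts_dir/lastmod", O_RDWR|O_CREAT, 0644
      or die "can't open Last-Modified database: $!";

  my $previous = $last_mod{$url};                    # undef on first visit
  my $current  = $response->header('Last-Modified'); # from the HTTP response
  $last_mod{$url} = $current if defined $current;    # remember for next run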

OUTPUT FORMAT

The basic format for lc.pl output is

  <HTTP response code> <name of file containing URL> <URL>

e.g.

  404 SOSIG347 http://www.iss.u-tokyo.ac.jp/center/SSJ.html

Libwww-perl automatically translates the result codes of requests in protocols other than HTTP into their HTTP equivalents. If you use the -v option to get the results of successful requests too, these will be stamped with a 200 response code, e.g.

  200 SOSIG345 http://www.ssd.gu.se/enghome.html

The output generated by the -u and -l options takes the form

  <name of file containing URL> <URL>

e.g.

  SOSIG345 http://www.ssd.gu.se/enghome.html
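
Since the fields are whitespace separated, the logs are easy to post-process. For example, a one-liner along these lines (using the "logfile" produced in the earlier example) would list just the broken links and the files they were found in:

  perl -ane 'print "$F[1]: $F[2]\n" if $F[0] == 404' logfile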

DEPENDENCIES

The libwww-perl package is used to parse HTML documents, and to check the links themselves. At the time of writing, libwww-perl version 5 and Perl version 5.003 or above are recommended.

TODO

Add support for other protocol schemes? "finger:" should be easy to do via proxy HTTP, but the cache servers don't speak this protocol scheme yet (and neither do many WWW authors?). "mailto:" and "mailserver:" could be handled up to a point with code which checked for valid domain names, MX records and so on. An SMTP session to the remote server would be do-able, but then we wouldn't be able to take advantage of the current caching infrastructure... "telnet:" is another case in point: we could check that the machine has a working DNS entry, and perhaps try to ping it, or even connect to the listed port. How far to take this is a matter for debate!
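
By way of illustration, a "mailto:" check of the kind described above might start out something like this sketch, using the Net::DNS module (which lc.pl does not currently use):

  use Net::DNS;

  # Does the domain part of a mailto: URL have any MX records?
  sub mailto_domain_ok {
      my ($url) = @_;
      my ($domain) = $url =~ /^mailto:[^@]+\@(.+)$/ or return 0;
      my @mx = mx(Net::DNS::Resolver->new, $domain);
      return scalar @mx;
  }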

SEE ALSO

"lc.pl" in admin-cgi, "report.pl" in bin, "report.pl" in admin-cgi

COPYRIGHT

Copyright (c) 1988, Martin Hamilton <martinh@gnu.org> and Jon Knight <jon@net.lut.ac.uk>. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

It was developed by the Department of Computer Studies at Loughborough University of Technology, as part of the ROADS project. ROADS is funded under the UK Electronic Libraries Programme (eLib), the European Commission Telematics for Research Programme, and the TERENA development programme.

AUTHOR

Martin Hamilton <martinh@gnu.org>