
NAME

bin/freq.pl - term frequency counter for IAFA style templates

SYNOPSIS

  freq.pl [-ad] [-f maxhits] [-m min-count] [-s sourcedir]
    [-t tmpdir] [-A attrib1|attrib2|...|attribN]

DESCRIPTION

This Perl program will look at all the IAFA style templates in a given directory, and count the number of times each term found in the templates occurs. This has a number of uses - notably in determining an appropriate stop-list of words which should not be indexed, and in helping the user to devise an effective query.
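
As a rough illustration of the counting step, here is a minimal sketch in Python (not the Perl used by freq.pl itself). It assumes templates are plain text made of "Attribute-Name: value" lines, with continuation lines starting with whitespace, and that a term is any run of letters and digits. The name count_terms and the tokenising rule are hypothetical; freq.pl may well split and normalise terms differently.

```python
import re
from collections import Counter

# Assumption: a term is a run of ASCII letters/digits; freq.pl may tokenise differently.
TERM_RE = re.compile(r"[A-Za-z0-9]+")

def count_terms(template_text):
    """Count terms found in the values of 'Attribute: value' lines."""
    counts = Counter()
    for line in template_text.splitlines():
        if not line.strip():
            continue  # blank line, nothing to count
        if line[0].isspace():
            value = line  # continuation of the previous attribute's value
        else:
            _, sep, value = line.partition(":")
            if not sep:
                continue  # no colon, so not an attribute line
        counts.update(TERM_RE.findall(value))
    return counts

# Hypothetical template text, for illustration only.
sample = """Template-Type: SERVICE
Title: Research mailing lists
Description: Mailing lists for research groups,
 available to University staff
"""
counts = count_terms(sample)
```

Attribute names are not counted in this sketch, only their values, and case is preserved, so "Research" and "research" are treated as different terms here.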

Frequently appearing terms such as "a", "and" and "the" are likely to cause large numbers of spurious hits when people search your database. To reduce the likelihood of this, we have added a "stoplist" feature to the ROADS search back end. This lets you arrange for certain search terms to be removed automatically, and we ship a sample stop list with the ROADS distribution.
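
The effect of a stoplist on a query can be sketched as follows. This is an illustration in Python only, not the ROADS search back end: the function name remove_stop_terms, the whitespace splitting and the case-insensitive comparison are all assumptions.

```python
def remove_stop_terms(query, stoplist):
    """Drop query terms that appear in the stoplist (compared case-insensitively)."""
    stop = {word.lower() for word in stoplist}
    return [term for term in query.split() if term.lower() not in stop]

# Hypothetical query and stoplist, for illustration only.
kept = remove_stop_terms("the history of the university", ["a", "and", "the", "of"])
```

Here only "history" and "university" would be passed on to the search.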

The default behaviour is to sort the frequency count into order, and return the top fifty terms. This can be overridden by a set of command-line options.
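
The selection logic might look something like the Python sketch below. The parameters max_hits, min_count and everything are hypothetical stand-ins for the -f, -m and -a options; how freq.pl combines them when several are given is not documented here, so the precedence shown is an assumption.

```python
from collections import Counter

def top_terms(counts, max_hits=50, min_count=None, everything=False):
    """Return (count, term) pairs, most frequent first."""
    # Sort by descending count, then by term so that ties come out in a fixed order.
    ranked = sorted(((n, t) for t, n in counts.items()),
                    key=lambda pair: (-pair[0], pair[1]))
    if min_count is not None:
        # Assumed -m behaviour: keep terms whose count has not fallen below min_count.
        ranked = [pair for pair in ranked if pair[0] >= min_count]
    if not everything:
        # Assumed default and -f behaviour: cut the list off after max_hits entries.
        ranked = ranked[:max_hits]
    return ranked

counts = Counter({"research": 310, "mailing": 283, "available": 270, "University": 268})
top_two = top_terms(counts, max_hits=2)
```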

OPTIONS

-a

send back a complete frequency count, rather than just the most frequently used terms

-d

produce verbose debugging output

-f maxhits

send back at most maxhits of the most frequently used terms, e.g. to see the top 100 with debugging info

  freq.pl -df 100

-m min-count

stop once the frequency count falls below min-count, e.g. to get a list of all the terms which occur at least 999 times

  freq.pl -m 999 | cut -f2 -d' '

-s sourcedir

look for the templates in the directory sourcedir, e.g. to use the templates in the directory /work2/WWW/roads and return a complete frequency breakdown

  freq.pl -as /work2/WWW/roads

-t tmpdir

use tmpdir as the temporary directory. This defaults to /tmp, but you may need to change it if your machine does not have enough room in /tmp for the temporary files generated by freq.pl, e.g.

  freq.pl -t /var/tmp

-A attribute-list

only produce a frequency list for the attributes listed in attribute-list. attribute-list is a '|' (pipe) separated list of attribute names, e.g.

  freq.pl -A 'description|keywords'
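
To show what restricting the count to an attribute list could mean in practice, here is a small Python sketch. It is not taken from freq.pl: the function name select_attribute_values is hypothetical, and matching attribute names case-insensitively and exactly (rather than as a pattern) is an assumption.

```python
def select_attribute_values(template_text, attribute_list):
    """Return the values of the attributes named in a '|'-separated list."""
    wanted = {name.strip().lower() for name in attribute_list.split("|")}
    keep = False  # True while the attribute currently being read is wanted
    values = []
    for line in template_text.splitlines():
        if not line.strip():
            keep = False  # a blank line ends the current attribute
            continue
        if line[0].isspace():
            if keep:
                values.append(line.strip())  # continuation of a wanted value
            continue
        name, sep, value = line.partition(":")
        keep = bool(sep) and name.strip().lower() in wanted
        if keep:
            values.append(value.strip())
    return values

# Hypothetical template text, for illustration only.
sample = """Title: Research mailing lists
Description: Mailing lists for research groups,
 available to University staff
Keywords: research, lists
"""
selected = select_attribute_values(sample, "description|keywords")
```

Only the text returned by such a filter would then be split into terms and counted.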

OUTPUT FORMAT

The output of freq.pl consists of the frequency count for a term, followed by a single space character, followed by the term itself, e.g.

  310 research
  283 mailing
  270 available
  268 University
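
Because each line is simply a count, one space and a term, the output is easy to post-process. The Python sketch below reads lines in this format and picks candidate stop-list terms; the function name parse_freq_output and the threshold of 270 are illustrative assumptions, not part of ROADS.

```python
def parse_freq_output(text):
    """Turn 'count term' lines into a list of (count, term) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        count, term = line.split(" ", 1)  # split on the first space only
        pairs.append((int(count), term))
    return pairs

output = """310 research
283 mailing
270 available
268 University
"""
# Candidate stop-list: every term seen at least 270 times (arbitrary threshold).
candidates = [term for count, term in parse_freq_output(output) if count >= 270]
```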

DEPENDENCIES

An external program called "sort" is used to sort the frequency count into descending order. This is a standard feature of most (all?) implementations of Unix, but the command line options it takes may differ from version to version. Let us know if you find a version which does not understand -r, -n or -T!
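
One way to check whether a given machine's sort accepts these options is sketched below in Python. The exact command line that freq.pl passes to sort is not shown in this document, so the invocation here ("sort -r -n -T tmpdir" reading standard input) is an assumption, and the function name is hypothetical.

```python
import subprocess
import tempfile

def sort_supports_required_options(tmpdir=None):
    """Return True if the local sort accepts -r, -n and -T and sorts numerically."""
    tmpdir = tmpdir or tempfile.gettempdir()
    data = "9 the\n310 research\n28 of\n"
    try:
        result = subprocess.run(
            ["sort", "-r", "-n", "-T", tmpdir],
            input=data, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return False  # sort is missing or rejected one of the options
    # A numeric, reversed sort must put 310 before 28 before 9.
    return result.stdout.splitlines() == ["310 research", "28 of", "9 the"]

supported = sort_supports_required_options()
```

A False result suggests that the local sort either is not on the path or does not understand one of the three options.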

TODO

Nothing ? :-)

SEE ALSO

"freq.pl" in admin-cgi

COPYRIGHT

Copyright (c) 1988, Martin Hamilton <martinh@gnu.org> and Jon Knight <jon@net.lut.ac.uk>. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

It was developed by the Department of Computer Studies at Loughborough University of Technology, as part of the ROADS project. ROADS is funded under the UK Electronic Libraries Programme (eLib), the European Commission Telematics for Research Programme, and the TERENA development programme.

AUTHOR

Martin Hamilton <martinh@gnu.org>