NAME

ApacheLog::Compressor - convert Apache / CLF log files into a binary format for transfer

VERSION

version 0.005

SYNOPSIS

 use ApacheLog::Compressor;
 use IO::Compress::Bzip2;
 use Sys::Hostname qw(hostname);

 # Write all data to bzip2-compressed output file
 open my $out_fh, '>', 'compressed.log.bz2' or die "Failed to create output file: $!";
 binmode $out_fh;
 my $zip = IO::Compress::Bzip2->new($out_fh, BlockSize100K => 9)
 	or die "Failed to create bzip2 stream: $IO::Compress::Bzip2::Bzip2Error";

 # Provide a callback to send data through to the file
 my $alc = ApacheLog::Compressor->new(
	on_write	=> sub {
		my ($self, $pkt) = @_;
		$zip->write($pkt);
	}
 );

 # Input file - normally use whichever one's just been closed + rotated
 open my $fh, '<', '/var/log/apache2/access.log.1' or die "Failed to open log: $!";

 # Initial packet to identify which server this came from
 $alc->send_packet('server',
 	hostname	=> hostname(),
 );

 # Read and compress all the lines in the file
 while(my $line = <$fh>) {
	 $alc->compress($line);
 }
 close $fh or die $!;
 $zip->close;

 # Dump the stats in case anyone finds them useful
 $alc->stats;

DESCRIPTION

Converts data from standard Apache log format into a binary stream which is typically 20% - 60% of the size of the original file. Intended for cases where log data needs to be transferred from multiple high-volume servers for analysis (potentially in realtime via tail -f).

The compressed format uses a simple dictionary replacement scheme: each field that cannot be represented in a fixed-width datatype is replaced with an indexed value. This keeps the basic log line packet at a fixed size, with additional packets carrying the first instance of each variable-width data item.
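
The dictionary replacement idea can be illustrated with a small standalone sketch. It does not use the module's internals: the names %index_for, @packets and index_for_value are hypothetical, and the "packet" here is just an arrayref rather than the real binary layout.

```perl
use strict;
use warnings;

# Hypothetical illustration only: not the module's real packet layout.
my %index_for;   # type => { value => index }
my @packets;     # definition packets "sent" so far

sub index_for_value {
    my ($type, $value) = @_;
    my $seen = $index_for{$type} ||= {};
    return $seen->{$value} if exists $seen->{$value};

    # First sighting: assign the next free index and emit a definition packet
    my $idx = scalar keys %$seen;
    $seen->{$value} = $idx;
    push @packets, [ $type, $idx, $value ];
    return $idx;
}

my @indices;
for my $vhost (qw(api.example.com www.example.com api.example.com)) {
    push @indices, index_for_value(vhost => $vhost);
}
print "indices: @indices\n";                            # indices: 0 1 0
print "definition packets: ", scalar(@packets), "\n";   # definition packets: 2
```

The repeated vhost costs only its index the second time, which is where the saving comes from on logs with many repeated values.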

Example:

api.example.com 105327 123.15.16.108 - apiuser@example.com [19/Dec/2009:03:12:07 +0000] "POST /api/status.json HTTP/1.1" 200 80516 "-" "-" "-"

The duration, IP, timestamp, method, HTTP version, response and size can all be stored as 32-bit quantities (or smaller), without losing any information. The vhost, user and URL are extracted to separate packets, since we expect to see them at least twice on a typical server.
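
As a rough illustration of why these fields fit into fixed-width slots, the sketch below packs the IP address, response code and size from the example line using core Perl's pack and the Socket module. The choice of fields and the template are for illustration only and are not the module's actual packet layout.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Values taken from the example log line above
my ($ip, $response, $bytes) = ('123.15.16.108', 200, 80516);

# inet_aton returns the dotted-quad address as 4 bytes in network order
my $packed = inet_aton($ip) . pack('n N', $response, $bytes);
print length($packed), " bytes\n";    # 10 bytes (4 + 2 + 4)

# Round trip to show that no information is lost
my $ip_out = inet_ntoa(substr $packed, 0, 4);
my ($response_out, $bytes_out) = unpack 'n N', substr($packed, 4);
print "$ip_out $response_out $bytes_out\n";    # 123.15.16.108 200 80516
```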

This would be converted to:

  • vhost packet - api.example.com assigned index 0

  • user packet - apiuser@example.com assigned index 0

  • url packet - /api/status.json assigned index 0

  • timestamp packet - since a busy server is likely to have several requests a second, there's a tiny saving to be had by sending this only when the value changes, so we push this into a separate packet as well.

  • log packet - actual data, binary encoded.
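
The timestamp handling described in the list can be sketched as follows. This is a simplified standalone illustration: @packets, $last_ts and emit_entry are hypothetical names, and the epoch values are arbitrary.

```perl
use strict;
use warnings;

my @packets;     # hypothetical output queue of [ type, payload ]
my $last_ts;     # epoch seconds carried by the most recent timestamp packet

sub emit_entry {
    my ($epoch, $entry) = @_;
    # Send a timestamp packet only when the value changes
    if (!defined $last_ts || $epoch != $last_ts) {
        push @packets, [ 'timestamp', $epoch ];
        $last_ts = $epoch;
    }
    push @packets, [ 'log', $entry ];
}

# Three requests, the first two within the same second
emit_entry(1261192327, 'POST /api/status.json');
emit_entry(1261192327, 'GET /api/ping');
emit_entry(1261192328, 'GET /api/ping');

my $sequence = join ' ', map { $_->[0] } @packets;
print "$sequence\n";    # timestamp log log timestamp log
```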

The following packet types are available:

  • 00 - log entry

  • 01 - change server

  • 02 - timestamp

  • 03 - vhost

  • 04 - user

  • 05 - useragent

  • 06 - referer

  • 07 - url

  • 80 - reset

The log entry itself normally consists of the following fields:

 N vhost
 N time
 N IP
 N user
 N useragent
 N timestamp
 C method
 C version
 n response
 N bytes
 N url
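
Read as a pack template, the field list above gives a fixed-size record. The following sketch shows the resulting size and a round trip through pack and unpack. The sample values are made up, and the real log packet may carry extra framing (such as a packet type byte) that is not shown here.

```perl
use strict;
use warnings;

# Field order and pack types as listed above
my @fields = (
    [ vhost     => 'N' ], [ time      => 'N' ], [ IP       => 'N' ],
    [ user      => 'N' ], [ useragent => 'N' ], [ timestamp => 'N' ],
    [ method    => 'C' ], [ version   => 'C' ], [ response => 'n' ],
    [ bytes     => 'N' ], [ url       => 'N' ],
);
my $template = join ' ', map { $_->[1] } @fields;

# Made-up sample values, one per field, in the same order
my @values = (0, 105327, 2064584812, 0, 0, 1261192327, 1, 1, 200, 80516, 0);
my $record = pack $template, @values;
print "template: $template\n";                        # N N N N N N C C n N N
print "record size: ", length($record), " bytes\n";   # record size: 36 bytes

# Unpack into a hash keyed by field name
my %entry;
@entry{ map { $_->[0] } @fields } = unpack $template, $record;
print "response=$entry{response} bytes=$entry{bytes}\n";   # response=200 bytes=80516
```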

The format of the log file can be customised; see the next section for details.

FORMAT SPECIFICATION

A custom format can be provided as the format parameter when instantiating a new ApacheLog::Compressor object via ->new. This format consists of an arrayref of key/value pairs, each value holding the following information:

  • id - the ID to use when sending packets

  • type - pack format specifier used when storing and retrieving the data, such as N1 or n1. Without this there will be no entry for the item in the compressed log stream.

  • regex - the regular expression used for matching this part of the log file. The final regex will be the concatenation of all regex entries for the format, joined using \s+ as the delimiter.

  • process_in - coderef for converting incoming values from a plain text log source into compressed values. It receives $self (the current ApacheLog::Compressor instance) and $data (the current hashref containing the raw data).

  • process_out - coderef for converting values from a compressed source back to plain text. It receives $self (the current ApacheLog::Compressor instance) and $data (the current hashref containing the raw data).
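
As a rough sketch of how such a format might look and how the combined regex is built, the standalone example below defines two entries and joins their regex values with \s+. The entry names and IDs are invented, and it is an assumption here that each value is a hashref and that process_in modifies $data in place; consult default_format for the real definitions.

```perl
use strict;
use warnings;

# Hypothetical two-field format; keys and IDs are invented for this sketch
my $format = [
    response => {
        id    => 0x10,
        type  => 'n1',
        regex => qr/(\d{3})/,
    },
    bytes => {
        id         => 0x11,
        type       => 'N1',
        regex      => qr/(\d+|-)/,
        # Assumed calling convention: ($self, $data)
        process_in => sub {
            my ($self, $data) = @_;
            $data->{bytes} = 0 if $data->{bytes} eq '-';
        },
    },
];

# Walk the arrayref pairwise so that field order is preserved
my (@keys, @regexes, %spec);
for (my $i = 0; $i < @$format; $i += 2) {
    my ($key, $def) = @{$format}[ $i, $i + 1 ];
    push @keys, $key;
    push @regexes, $def->{regex};
    $spec{$key} = $def;
}
my $line_re = join '\s+', @regexes;

# Match a fragment of a log line and capture one value per key
my @captured = ('200 -' =~ /^$line_re$/) or die "no match";
my %data;
@data{@keys} = @captured;

# Apply any process_in handlers (no real instance exists in this sketch)
for my $key (@keys) {
    my $code = $spec{$key}{process_in} or next;
    $code->(undef, \%data);
}
print "response=$data{response} bytes=$data{bytes}\n";   # response=200 bytes=0
```

Keeping the format as an arrayref rather than a hash is what allows the regex fragments to be concatenated in a predictable order.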

METHODS

new

Instantiate the class.

Takes the following named parameters:

  • on_write - coderef to call with packet data for each outgoing packet

default_format

Returns the default format used for parsing log lines.

This is an arrayref containing key => value pairs; see "FORMAT SPECIFICATION" for more details.

update_mapping

Refresh the mapping from format keys and internal definitions.

cached

Returns the index for the given type and value, generating a packet if no previous value was found.

from_cache

Read a value from the cache, for expanding compressed log format entries.

set_key

Set a cache index key to a value when expanding a packet stream.
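
On the receiving side, the same idea runs in reverse: definition packets populate a lookup table and log packets refer to entries by index. The sketch below is a standalone illustration of that pattern, not the module's implementation; %cache, remember and recall are hypothetical names.

```perl
use strict;
use warnings;

my %cache;    # type => [ values by index ]; hypothetical receiver-side store

# Store a value announced by a definition packet
sub remember {
    my ($type, $idx, $value) = @_;
    $cache{$type}[$idx] = $value;
    return;
}

# Look up a value referenced by index in a log packet
sub recall {
    my ($type, $idx) = @_;
    my $value = $cache{$type}[$idx];
    die "No $type entry for index $idx\n" unless defined $value;
    return $value;
}

remember('url',   0, '/api/status.json');
remember('vhost', 0, 'api.example.com');

my $full = recall('vhost', 0) . recall('url', 0);
print "$full\n";    # api.example.com/api/status.json
```

An index that was never defined is treated as an error here, since a log packet can only be expanded once the matching definition has been seen.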

compress

General compression function. Given a line of data, sends packets as required to transmit that information.

send_packet

Generate and send a packet for the given type.

packet_reset

Generate a reset packet and clear internal caches in the process.

packet_server

Generate a server packet.

packet_timestamp

Generate the timestamp packet.

write_packet

Write a packet to the output handler.

expand

Expand incoming data.

handle_reset

Handle an incoming reset packet.

handle_log

Handle an incoming log packet.

data_hashref

Convert logline data to a hashref.

data_to_text

Internal method for converting the current log entry to a text string in something approaching the 'standard' Apache log format (almost, but not quite, CLF).

handle_server

Internal method for processing a server record (used to indicate the server name subsequent records apply to).

handle_timestamp

Internal method for processing a timestamp entry.

invoke_event

Internal method for invoking an event.

stats

Print current stats - not all that useful since we clear cached values regularly.

AUTHOR

Tom Molesworth <cpan@entitymodel.com>

LICENSE

Copyright Tom Molesworth 2009-2011. Licensed under the same terms as Perl itself.