NAME
ApacheLog::Compressor - convert Apache / CLF log files into a binary format for transfer
VERSION
version 0.005
SYNOPSIS
use strict;
use warnings;
use ApacheLog::Compressor;
use IO::Compress::Bzip2;
use Sys::Hostname qw(hostname);
# Write all data to bzip2-compressed output file
open my $out_fh, '>', 'compressed.log.bz2' or die "Failed to create output file: $!";
binmode $out_fh;
my $zip = IO::Compress::Bzip2->new($out_fh, BlockSize100K => 9)
    or die "Failed to set up bzip2 compression: $IO::Compress::Bzip2::Bzip2Error";
# Provide a callback to send data through to the file
my $alc = ApacheLog::Compressor->new(
    on_write => sub {
        my ($self, $pkt) = @_;
        $zip->write($pkt);
    }
);
# Input file - normally use whichever one's just been closed + rotated
open my $fh, '<', '/var/log/apache2/access.log.1' or die "Failed to open log: $!";
# Initial packet to identify which server this came from
$alc->send_packet('server',
    hostname => hostname(),
);
# Read and compress all the lines in the file
while (my $line = <$fh>) {
    $alc->compress($line);
}
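close $fh or die "Failed to close log: $!";
# Flush the compressed stream, then close the underlying output file
$zip->close;
close $out_fh or die "Failed to close output file: $!";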
# Dump the stats in case anyone finds them useful
$alc->stats;
DESCRIPTION
Converts data from the standard Apache log format into a binary stream that is typically 20%-60% of the size of the original file. Intended for cases where log data needs to be transferred from multiple high-volume servers for analysis (potentially in real time via tail -f).
The binary format is based on simple dictionary replacement: each field that cannot be represented in a fixed-width datatype is replaced with an index, so the basic log line packet has a fixed size, and additional packets carry the first instance of each variable-width data item.
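As a rough illustration of the dictionary replacement described above, the following standalone sketch assigns an index to each distinct string per field type and records a definition only the first time a string is seen. It does not use the module's internals; index_of, %cache and @definitions are hypothetical names chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-ins, not part of ApacheLog::Compressor's API.
my %cache;          # field type => { value => index }
my @definitions;    # plays the role of the extra variable-width packets

sub index_of {
    my ($type, $value) = @_;
    my $seen = $cache{$type} ||= {};
    unless (exists $seen->{$value}) {
        my $next = scalar keys %$seen;    # indices start at 0 for each type
        $seen->{$value} = $next;
        push @definitions, [ $type, $next, $value ];
    }
    return $seen->{$value};
}

my @url_ids;
push @url_ids, index_of(url => $_)
    for qw(/api/status.json /index.html /api/status.json);
my $vhost_id = index_of(vhost => 'api.example.com');

# One definition per distinct value, however often it is repeated
printf "%s #%d = %s\n", @$_ for @definitions;
print "url indices: @url_ids, vhost index: $vhost_id\n";
```

The repeated URL reuses its existing index, so only the fixed-width index needs to be sent the second time.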
Example:
api.example.com 105327 123.15.16.108 - apiuser@example.com [19/Dec/2009:03:12:07 +0000] "POST /api/status.json HTTP/1.1" 200 80516 "-" "-" "-"
The duration, IP, timestamp, method, HTTP version, response and size can all be stored as 32-bit quantities (or smaller), without losing any information. The vhost, user and URL are extracted to separate packets, since we expect to see them at least twice on a typical server.
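To make the "32-bit quantities" point concrete, here is a small sketch using only core Perl. It converts the IP address and timestamp from the example line into integers that fit an "N" (unsigned 32-bit, big-endian) pack field. The conversion approach is an assumption for illustration and may differ from what the module does internally.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# IPv4 address <-> unsigned 32-bit integer
my $ip     = '123.15.16.108';
my $ip_num = unpack 'N', pack('C4', split /\./, $ip);
my $ip_str = join '.', unpack('C4', pack('N', $ip_num));

# Timestamp (already UTC in the example line) <-> epoch seconds
my $epoch = timegm(7, 12, 3, 19, 11, 2009);    # 19/Dec/2009:03:12:07 +0000
my @utc   = gmtime $epoch;

# Both values fit in the 4 bytes of an "N" field
my $packed = pack 'N N', $ip_num, $epoch;
print "$ip_str round-trips; ", length($packed), " bytes for both fields\n";
```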
This would be converted to:
vhost packet - api.example.com assigned index 0
user packet - apiuser@example.com assigned index 0
url packet - /api/status.json assigned index 0
timestamp packet - since a busy server is likely to have several requests a second, there's a tiny saving to be had by sending this only when the value changes, so we push this into a separate packet as well.
log packet - actual data, binary encoded.
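The timestamp saving mentioned in the list above can be sketched as follows: a timestamp entry is written only when the second changes, and each log entry relies on the most recent one. The request times and the @stream representation are hypothetical and only illustrate the idea.

```perl
use strict;
use warnings;

# Hypothetical request times in epoch seconds; three share the same second.
my @request_times = (1261192327, 1261192327, 1261192327, 1261192328);

my @stream;       # stands in for the outgoing packet stream
my $last_sent;    # most recent timestamp already written to the stream
for my $time (@request_times) {
    if (!defined $last_sent || $time != $last_sent) {
        push @stream, "timestamp $time";
        $last_sent = $time;
    }
    push @stream, 'log';
}
print "$_\n" for @stream;
```

Four requests produce four log entries but only two timestamp entries.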
The following packet types are available:
00 - Log entry
01 - Change server
02 - timestamp
03 - vhost
04 - user
05 - useragent
06 - referer
07 - url
80 - reset
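For readers who prefer a lookup table, the packet types above could be written as a hash. This sketch assumes the codes are hexadecimal (so reset is 0x80) and uses short names of our own choosing apart from 'server', which appears in the synopsis. The one-byte type prefix shown here is a hypothetical framing for illustration, not a statement about the module's wire format.

```perl
use strict;
use warnings;

# Codes copied from the list above, read as hexadecimal (an assumption).
my %packet_type = (
    log       => 0x00,
    server    => 0x01,
    timestamp => 0x02,
    vhost     => 0x03,
    user      => 0x04,
    useragent => 0x05,
    referer   => 0x06,
    url       => 0x07,
    reset     => 0x80,
);
my %name_for = reverse %packet_type;

# Hypothetical framing for illustration only: one type byte, then the payload.
my $packet = pack('C', $packet_type{timestamp}) . pack('N', 1261192327);
my ($type, $payload) = unpack 'C a*', $packet;
print "type $name_for{$type}, payload of ", length($payload), " bytes\n";
```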
The log entry itself normally consists of the following fields:
N vhost
N time
N IP
N user
N useragent
N timestamp
C method
C version
n response
N bytes
N url
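The letters in the list above are pack format specifiers (N is an unsigned 32-bit integer, n an unsigned 16-bit integer, C an unsigned byte). A minimal sketch of packing and unpacking one entry with that layout follows. The field values are illustrative, and the numeric codes used for method and version are made up for this example.

```perl
use strict;
use warnings;

# Field order and pack types taken from the list above.
my @layout = (
    [ vhost => 'N' ], [ time => 'N' ], [ IP => 'N' ], [ user => 'N' ],
    [ useragent => 'N' ], [ timestamp => 'N' ], [ method => 'C' ],
    [ version => 'C' ], [ response => 'n' ], [ bytes => 'N' ], [ url => 'N' ],
);
my @fields   = map { $_->[0] } @layout;
my $template = join ' ', map { $_->[1] } @layout;

# Illustrative values only; the method and version codes are made up.
my %entry = (
    vhost => 0, time => 105327, IP => 2064584812, user => 0,
    useragent => 0, timestamp => 1261192327, method => 1,
    version => 1, response => 200, bytes => 80516, url => 0,
);

# Encode to a fixed-size binary record, then decode it again
my $packed = pack $template, @entry{@fields};
my %decoded;
@decoded{@fields} = unpack $template, $packed;
printf "%d bytes, response %d, %d bytes sent\n",
    length($packed), @decoded{qw(response bytes)};
```

With this layout every entry occupies 36 bytes regardless of how long the original vhost, user or URL strings were.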
The format of the log file can be customised; see the next section for details.
FORMAT SPECIFICATION
A custom format can be provided as the format parameter when instantiating a new ApacheLog::Compressor object via ->new. This format consists of an arrayref of key/value pairs, each value holding the following information:
id - the ID to use when sending packets
type - pack format specifier used when storing and retrieving the data, such as N1 or n1. Without this there will be no entry for the item in the compressed log stream.
regex - the regular expression used for matching this part of the log file. The final regex will be the concatenation of all regex entries for the format, joined using \s+ as the delimiter.
process_in - coderef for converting incoming values from a plain text log source into compressed values. It will receive $self (the current ApacheLog::Compressor instance) and $data (the current hashref containing the raw data).
process_out - coderef for converting values from a compressed source back to plain text. It will receive $self (the current ApacheLog::Compressor instance) and $data (the current hashref containing the raw data).
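The sketch below shows what a format definition with these keys might look like and how the per-field regexes combine with \s+. It is standalone code that does not call the module. The two fields, their id values and regexes are hypothetical, undef stands in for the compressor object, and updating $data in place is an assumption about how the callbacks are expected to behave.

```perl
use strict;
use warnings;

# A hypothetical two-field format; keys, ids and regexes are illustrative.
my $format = [
    duration => {
        id    => 0x10,
        type  => 'N1',
        regex => qr{(\d+)},
    },
    ip => {
        id          => 0x11,
        type        => 'N1',
        regex       => qr{(\d+\.\d+\.\d+\.\d+)},
        process_in  => sub {
            my ($self, $data) = @_;
            $data->{ip} = unpack 'N', pack('C4', split /\./, $data->{ip});
        },
        process_out => sub {
            my ($self, $data) = @_;
            $data->{ip} = join '.', unpack('C4', pack('N', $data->{ip}));
        },
    },
];

# Walk the key/value pairs in order, collecting keys and definitions.
my (@keys, %spec);
for (my $i = 0; $i < @$format; $i += 2) {
    my ($key, $def) = @{$format}[ $i, $i + 1 ];
    push @keys, $key;
    $spec{$key} = $def;
}

# Concatenate the per-field regexes using \s+ as the delimiter.
my $joined  = join '\s+', map { $spec{$_}{regex} } @keys;
my $line_re = qr{^$joined};

my %data;
@data{@keys} = '105327 123.15.16.108' =~ $line_re
    or die "line did not match";

# Apply process_in where present, then pack using the declared types.
for my $key (@keys) {
    my $convert = $spec{$key}{process_in} or next;
    $convert->(undef, \%data);
}
my $template = join ' ', map { $spec{$_}{type} } @keys;
my $packed   = pack $template, @data{@keys};

# Reverse the process for the decoding side.
my %out;
@out{@keys} = unpack $template, $packed;
for my $key (@keys) {
    my $convert = $spec{$key}{process_out} or next;
    $convert->(undef, \%out);
}
print join(' ', @out{@keys}), "\n";
```

Keeping the definition as an ordered list of pairs rather than a plain hash preserves field order, which matters both for the combined regex and for the pack template.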
METHODS
new
Instantiate the class.
Takes the following named parameters:
on_write - coderef to call with packet data for each outgoing packet
format - arrayref describing a custom log format, see "FORMAT SPECIFICATION"
default_format
Returns the default format used for parsing log lines.
This is an arrayref containing key => value pairs; see "FORMAT SPECIFICATION" for more details.
update_mapping
Refresh the mapping from format keys and internal definitions.
cached
Returns the index for the given type and value, generating a packet if no previous value was found.
from_cache
Read a value from the cache, for expanding compressed log format entries.
set_key
Set a cache index key to a value when expanding a packet stream.
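On the expanding side, the cache works in the opposite direction: definitions arriving in the stream are stored by index, and later log entries look them up. The following sketch illustrates that idea with standalone helpers. The names remember and lookup are hypothetical and are not the signatures of set_key or from_cache.

```perl
use strict;
use warnings;

# Hypothetical decoder-side cache: field type => [ value at index 0, 1, ... ]
my %expand_cache;

# Record a definition seen in the packet stream.
sub remember {
    my ($type, $index, $value) = @_;
    $expand_cache{$type}[$index] = $value;
    return $value;
}

# Turn an index from a log packet back into the original string.
sub lookup {
    my ($type, $index) = @_;
    my $value = $expand_cache{$type}[$index];
    die "no $type entry at index $index\n" unless defined $value;
    return $value;
}

remember(vhost => 0, 'api.example.com');
remember(url   => 0, '/api/status.json');
my $vhost = lookup(vhost => 0);
my $url   = lookup(url   => 0);
print "$vhost $url\n";

# After a reset, earlier indices are no longer valid (assumed behaviour).
%expand_cache = ();
```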
compress
General compression function. Given a line of data, sends packets as required to transmit that information.
send_packet
Generate and send a packet for the given type.
packet_reset
Generate a reset packet and clear internal caches in the process.
packet_server
Generate a server packet.
packet_timestamp
Generate the timestamp packet.
write_packet
Write a packet to the output handler.
expand
Expand incoming data.
handle_reset
Handle an incoming reset packet.
handle_log
Handle an incoming log packet.
data_hashref
Convert logline data to a hashref.
data_to_text
Internal method for converting the current log entry to a text string in something approaching the 'standard' Apache log format (almost, but not quite, CLF).
handle_server
Internal method for processing a server record (used to indicate the server name subsequent records apply to).
handle_timestamp
Internal method for processing a timestamp entry.
invoke_event
Internal method for invoking an event.
stats
Print current stats - not all that useful since we clear cached values regularly.
AUTHOR
Tom Molesworth <cpan@entitymodel.com>
LICENSE
Copyright Tom Molesworth 2009-2011. Licensed under the same terms as Perl itself.