NAME

HTTP::Proxy - A pure Perl HTTP proxy

SYNOPSIS

use HTTP::Proxy;

# initialisation
my $proxy = HTTP::Proxy->new( port => 3128 );

# alternate initialisation
my $proxy = HTTP::Proxy->new;
$proxy->port( 3128 ); # the classical accessors are here!

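# you can also use your own UserAgent
# (LWP::RobotUA requires an agent name and a contact address;
# the values below are placeholders)
use LWP::RobotUA;
my $agent = LWP::RobotUA->new( agent => 'my-robot/0.1', from => 'me@example.com' );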
$proxy->agent( $agent );

# this is a MainLoop-like method
$proxy->start;

DESCRIPTION

This module implements an HTTP proxy, using an HTTP::Daemon to accept client connections, and an LWP::UserAgent to ask for the requested pages.

The most interesting feature of this proxy object is its ability to filter the HTTP requests and responses through user-defined filters.

METHODS

Constructor

The new() method creates an HTTP::Proxy object. Any attribute can be passed as a parameter to replace its default value.

Accessors and mutators

The HTTP::Proxy object has several accessors and mutators.

Called without arguments, the accessor returns the current value. Called with a single argument, it sets the current value and returns the previous one, in case you want to keep it.

If you call a read-only accessor with a parameter, this parameter will be ignored.
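The get/set behaviour described above can be sketched as follows. This is a minimal illustration that assumes HTTP::Proxy is installed and relies only on the documented default port (8080); nothing is started or bound.

```perl
use strict;
use warnings;
use HTTP::Proxy;

my $proxy = HTTP::Proxy->new;    # port defaults to 8080

# no argument: read the current value
my $current = $proxy->port;

# one argument: set the new value, get the previous one back
my $previous = $proxy->port(3128);

print "port was $previous, is now ", $proxy->port, "\n";
```

Keeping the returned value makes it easy to restore a setting later, for example with $proxy->port($previous).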

The defined accessors are (in alphabetical order):

agent

The LWP::UserAgent object used internally to connect to remote sites.

conn (read-only)

The number of connections processed by this HTTP::Proxy instance.

control

The default hostname for controlling the proxy (see CONTROL). The default is "proxy", which corresponds to the URL http://proxy:port/, where port is the listening port of the proxy.

daemon

The HTTP::Daemon object used to accept incoming connections. (You usually never need this.)

hop_headers

This attribute holds a reference to the hop-by-hop headers (Connection, Keep-Alive, Proxy-Authenticate, Proxy-Authorization, TE, Trailers, Transfer-Encoding, Upgrade).

They are removed by the filter HTTP::Proxy::HeaderFilter::standard from the request and response objects received by the proxy.

If a filter (such as a proxy authorisation filter) needs to access them, it must do so through this accessor.

host

The proxy HTTP::Daemon host (default: 'localhost').

This means that by default, the proxy answers only to clients on the local machine. You can pass a specific interface address or ""/undef for any interface.

This default prevents your proxy from being used as an anonymous proxy by script kiddies.

logfh

A filehandle to a logfile (default: *STDERR).

logmask( [$mask] )

Be verbose in the logs (default: NONE).

Here are the various elements that can be added to the mask:

NONE    - Log only errors
STATUS  - Requested URL, response status and total number of connections processed
PROCESS - Subprocess information (fork, wait, etc.)
HEADERS - Full request and response headers
FILTER  - Filter information
ALL     - Log all of the above

If you only want status and process information, you can use:

$proxy->logmask( STATUS | PROCESS );

Note that none of the logging constants are exported by default; they are exported by the :log tag. They can also be exported one by one.
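As a short sketch of how the import and the mask fit together (assuming HTTP::Proxy is installed), the constants are requested with the :log tag and then combined or tested with the usual bitwise operators:

```perl
use strict;
use warnings;
use HTTP::Proxy qw( :log );    # imports the logging constants

my $proxy = HTTP::Proxy->new;

# log status and process information only
$proxy->logmask( STATUS | PROCESS );

# the mask is a plain bit field, so individual bits can be tested
my $mask = $proxy->logmask;
print "status logging is on\n"   if $mask & STATUS;
print "headers are not logged\n" if !( $mask & HEADERS );
```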

maxchild

The maximum number of child processes the HTTP::Proxy object will spawn to handle client requests (default: 16).

If set to 0, the proxy will not fork at all. This can be helpful for debugging purposes.

maxconn

The maximum number of TCP connections the proxy will accept before returning from start(). 0 (the default) means never stop accepting connections.

maxserve

The maximum number of requests the proxy will serve in a single connection. (same as MaxRequestsPerChild in Apache)

port

The proxy HTTP::Daemon port (default: 8080).

request

The request originally received by the proxy from the user-agent, which will be modified by the request filters.

response

The response received from the origin server by the proxy. It is normally undef until the proxy actually receives the beginning of a response from the origin server.

If one of the request filters sets this attribute, it "short-circuits" the request/response scheme, and the proxy will return this response (which is NOT filtered through the response filter stacks) instead of the expected origin server response. This is useful for caching (though Squid does it much better) and proxy authentication, for example.
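The short-circuit described above might look like the following sketch. It assumes that filters can reach their owning proxy through a proxy() accessor, as described in the HTTP::Proxy::HeaderFilter documentation, and that a simple header filter is called with the filter object, the headers and the message. The /private path and the 403 response are purely illustrative.

```perl
use strict;
use warnings;
use HTTP::Proxy;
use HTTP::Proxy::HeaderFilter::simple;
use HTTP::Response;

my $proxy = HTTP::Proxy->new;

# hypothetical policy: never forward requests for /private
$proxy->push_filter(
    path    => '^/private',
    request => HTTP::Proxy::HeaderFilter::simple->new(
        sub {
            my ( $self, $headers, $message ) = @_;

            # build the response the client will get instead
            my $res = HTTP::Response->new( 403, 'Forbidden' );
            $res->content_type('text/plain');
            $res->content("Blocked by the proxy\n");

            # setting the attribute short-circuits the origin request
            # (assumes the proxy() accessor is available on the filter)
            $self->proxy->response($res);
        }
    ),
);
```

Because such a response bypasses the response filter stacks, any header the client needs must be set on it directly.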

timeout

The timeout used by the internal LWP::UserAgent (default: 60).

url (read-only)

The URL where the proxy can be reached.

via (default: $hostname (HTTP::Proxy/$VERSION))

The content of the Via: header. Setting it to an empty string will prevent its addition.

The start() method

This method works like Tk's MainLoop: you hand over control to the HTTP::Proxy object you created and configured.

If maxconn is not zero, start() will return after accepting at most that many connections. It will return the total number of connections.

FILTERS

You can alter the way the default HTTP::Proxy works by plugging callbacks at different stages of the request/response handling.

When a request is received by the HTTP::Proxy object, it is filtered through a standard filter that transforms this request according to RFC 2616 (by adding the Via: header, and a few other transformations).

The response is also filtered in the same manner. There is a total of four filter chains: request-headers, request-body, response-headers and response-body.

You can add your own filters to the default ones with the push_filter() method. The method pushes a filter onto the appropriate filter stack.

$proxy->push_filter( response => $filter );

The headers/body category is determined by the type of the filter. There are two base classes for filters, which are HTTP::Proxy::HeaderFilter and HTTP::Proxy::BodyFilter (the names are self-explanatory). See the documentation of those two classes to find out how to write your own header or body filters.

The named parameter is used to determine the request/response part.

It is possible to push the same filter on the request and response stacks, as in the following example:

$proxy->push_filter( request => $filter, response => $filter );

If several filters match the message, they will be applied in the order they were pushed on their filter stack.

Named parameters can be used to create the match routine. They are:

mime   - the MIME type (for a response-body filter)
method - the request method
scheme - the URI scheme         
host   - the URI authority (host:port)
path   - the URI path
query  - the URI query string

The filters are applied only when all the parameters match the request or the response. All these named parameters have default values, which are:

mime   => 'text/*'
method => 'GET, POST, HEAD'
scheme => 'http'
host   => ''
path   => ''
query  => ''

The mime parameter is a glob-like string, with a required / character and a * as a wildcard. Thus, */* matches all responses with a Content-Type: header, and "" those with no Content-Type: header. To match any response (with or without a Content-Type: header), use undef.

The mime parameter is only meaningful with the response-body filter stack. It is ignored if passed to any other filter stack.
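To make the three cases (glob, empty string, undef) concrete, here is a standalone sketch of a matcher with the same semantics. The mime_matcher() helper is hypothetical and is not the routine HTTP::Proxy compiles internally; it only illustrates the rules stated above.

```perl
use strict;
use warnings;

# Hypothetical helper: build a matcher from a glob-like MIME pattern.
sub mime_matcher {
    my ($pattern) = @_;

    # undef matches every response, with or without a Content-Type
    return sub { 1 } if !defined $pattern;

    # the empty string matches only responses without a Content-Type
    return sub { !defined $_[0] || $_[0] eq '' } if $pattern eq '';

    # any other pattern must contain a /
    die "Invalid MIME pattern: $pattern\n" if $pattern !~ m{/};

    # quote the literal parts, let each * stand for a type or subtype
    my $re = join '[^/;\s]*', map { quotemeta($_) } split /\*/, $pattern, -1;

    return sub {
        my ($type) = @_;
        return 0 if !defined $type;
        $type =~ s/\s*;.*//s;    # drop parameters such as "; charset=utf-8"
        return $type =~ /\A$re\z/ ? 1 : 0;
    };
}

my $is_text = mime_matcher('text/*');
print $is_text->('text/html; charset=utf-8') ? "match\n" : "no match\n";
print $is_text->('image/png')                ? "match\n" : "no match\n";
```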

The method and scheme parameters are strings consisting of comma-separated values. The host and path parameters are regular expressions.

A match routine is compiled by the proxy and used to check if a particular request or response must be filtered through a particular filter.
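As an illustration, the match parameters can be combined to restrict a filter to a subset of requests. The host name, path and X-Proxy-Note header below are made up for the example, and the callback signature (filter object, headers, message) is assumed from the HTTP::Proxy::HeaderFilter documentation.

```perl
use strict;
use warnings;
use HTTP::Proxy;
use HTTP::Proxy::HeaderFilter::simple;

my $proxy = HTTP::Proxy->new;

$proxy->push_filter(
    method => 'GET, HEAD',                      # comma-separated values
    host   => '\.example\.com(?::\d+)?$',       # regex on host:port (hypothetical host)
    path   => '^/downloads/',                   # regex on the URI path
    request => HTTP::Proxy::HeaderFilter::simple->new(
        sub {
            my ( $self, $headers, $message ) = @_;

            # tag the matching requests with an illustrative header
            $headers->header( 'X-Proxy-Note' => 'seen by HTTP::Proxy' );
        }
    ),
);
```

Since host is matched against the URI authority, the pattern allows for an optional :port suffix.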

It is also possible to push several filters on the same stack with the same match subroutine:

# convert italics to bold
$proxy->push_filter(
    mime     => 'text/html',
    response => HTTP::Proxy::BodyFilter::tags->new(),
    response =>
    HTTP::Proxy::BodyFilter::simple->new( sub { ${ $_[1] } =~ s!(</?)i>!$1b>!ig } )
);

For more details regarding the creation of new filters, check the HTTP::Proxy::HeaderFilter and HTTP::Proxy::BodyFilter documentation.

Here's an example of subclassing a base filter class:

# fixes a common typo ;-)
# but chances are that this will modify a correct URL
{
    package FilterPerl;
    use base qw( HTTP::Proxy::BodyFilter );

    sub filter {
        my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
        $$dataref =~ s/PERL/Perl/g;
    }
}
$proxy->push_filter( response => FilterPerl->new() );

Other examples can be found in the documentation for HTTP::Proxy::HeaderFilter, HTTP::Proxy::BodyFilter, HTTP::Proxy::HeaderFilter::simple, HTTP::Proxy::BodyFilter::simple.

# a simple anonymiser
# see eg/anonymiser.pl for the complete code
$proxy->push_filter(
    mime    => undef,
    request => HTTP::Proxy::HeaderFilter::simple->new(
        sub { $_[1]->remove_header(qw( User-Agent From Referer Cookie )) },
    ),
    response => HTTP::Proxy::HeaderFilter::simple->new(
        sub { $_[1]->remove_header(qw( Set-Cookie )); },
    )
);

IMPORTANT: If you use your own LWP::UserAgent, you must install it before your calls to push_filter(), otherwise the match method will make wrong assumptions about the schemes your agent supports.

log( $level, $prefix, $message )

Adds $message at the end of logfh, if $level matches logmask. The log() method also prints a timestamp.

The output looks like:

[Thu Dec  5 12:30:12 2002] $prefix $message

If $message is a multiline string, several log lines will be output, each starting with $prefix.
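A minimal sketch of calling log() from your own code, assuming HTTP::Proxy is installed; the EXAMPLE prefix is arbitrary. With the default logfh the two lines go to STDERR.

```perl
use strict;
use warnings;
use HTTP::Proxy qw( :log );

my $proxy = HTTP::Proxy->new;
$proxy->logmask(STATUS);

# logged: STATUS is part of the mask; one log line per message line
$proxy->log( STATUS, 'EXAMPLE', "first line\nsecond line" );

# not logged: PROCESS is not part of the mask
$proxy->log( PROCESS, 'EXAMPLE', 'this message is dropped' );
```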

EXPORTED SYMBOLS

No symbols are exported by default. The :log tag exports all the logging constants.

BUGS

This module does not work under Windows, but I can't see why, and do not have a development platform under that system. Patches and explanations very welcome.

David Fishburn says:

    This did not work for me under WinXP - ActiveState Perl 5.6, but it DOES work on WinXP ActiveState Perl 5.8.

I guess it is because fork() is not well supported. You can try to use the following workaround to prevent forking:

$proxy->maxchild(0);

SEE ALSO

HTTP::Proxy::BodyFilter, HTTP::Proxy::HeaderFilter, the examples in eg/.

AUTHOR

Philippe "BooK" Bruhat, <book@cpan.org>.

The module has its own web page at http://http-proxy.mongueurs.net/ complete with older versions and repository snapshot.

There are also two mailing lists: http-proxy@mongueurs.net for general discussion about HTTP::Proxy and http-proxy-cvs@mongueurs.net for CVS commits.

THANKS

Many people helped me during the development of this module, either on mailing lists, IRC or over a beer in a pub...

So, in no particular order, thanks to the libwww-perl team for such a terrific suite of modules, Michael Schwern (tips for testing while forking), the Paris.pm folks (forking processes, chunked encoding) and my growing user base... ;-)

COPYRIGHT

This module is free software; you can redistribute it or modify it under the same terms as Perl itself.