LWPng
This note describe the redesign of the LWP perl modules in order to add full support for the HTTP/1.1 protocol. The main change is the adoption of an event driven framework. This allows us to support multiple connections within a single client program. It was also a prerequisite for supporting HTTP/1.1 features like persistent connections and pipelining.
HTTP/1.1
RFC 2068 is the proposed standard for the Hypertext Transfer Protocol version 1.1, usually denoted HTTP/1.1. The document is currently revised by the IETF and a draft standard document is expected soon?? The latest draft is currently <draft-ietf-http-v11-spec-rev-03.txt>
The HTTP/1.1 protocol use the same basic message format as earlier versions of the protocol and HTTP/1.1 clients/servers can easily adopt to peers which only know about the older versions of the protocol. HTTP/1.1 adds some new methods, some new status codes, and some new headers. One important change is that the Host header is now mandatory. Other changes is support for partial content and that the support for caching and proxies has been much improved on. There is also a standard mechanism of switching from HTTP/1.1 to some other (more suitable) protocol on the wire.
The most important change with HTTP/1.1 is the introduction of persistent connections. This means that more than one request/response exchange takes place on a single TCP connection between a client and a server. This improves performance and generally interacts much better with how TCP works underneath. This also means that the peers must be able to tell the extent of the messages on the wire. In HTTP/1.0 the only way to do this was by using the Content-Length header and by closing the connection (which was only an option for the server). Use of the Content-Length header is not appropriate when the length of the message can not be determined in advance. HTTP/1.1 introduce two new ways to delimit messages; the chunked transfer encoding and self delimiting multipart content types. The chunked transfer encoding means that the message is broken into chunks of arbitrary sizes and that each chunk is preceded by a line specifying the number of bytes in the chunk. The multipart types use a special boundary bytepattern as a delimiter for the messages.
With persistent connections one can improve performance even more by the use of a technique called "pipelining". This means that the client sends multiple requests down the connections without waiting for the response of the first request before sending the second. This can have a dramatic effect on the thoughput for high latency links. [NOTE-pipelining-970624]
Event driven programming model
Let's investigate what impact the event driven framework has on the programming model. The basic model for sending requests and receving respones used to be:
$res = $ua->request($req); # return when response is available
if ($res->is_success) {
#...
}
With the new event driven framework it becomes:
$ua->spool($req1); # returns immediately
$ua->spool($req2); # can send multiple request in parallel
#...
mainloop->run; # return when all connections are gone
Request objects are created and then handed off to the $ua which will queue them up for processing. As you can see, there is no longer any natural place to test the outcome of the requests. What happen is that the requests live their own lives and they will be notified (though a method call) when the corresponding responses are available. You, the application programmer, will have to set up event handlers (in the requests) that react to these events.
Luckily, this does not mean that all old programs must be rewritten. The following show one way to emulate something very close to the old behaviour:
my $res;
my $req = LWP::Request->new(GET => $url);
$req->{'done_cb'} = sub { $res = shift; }
$ua->spool($req);
mainloop->one_event until $res;
if ($res->is_success) {
#...
}
and this will in fact be used to emulate the old $ua->request() and $ua->simple_request() interfaces. The goal is to be able to completely backwards compatible with the current LWP modules.
LWP::Request
As you can see from the examples above we use the class name LWP::Request (as opposed to HTTP::Request) for the requests created. LWP::Request is a subclass of HTTP::Request, thus it have all the same methods and attributes as HTTP::Request and then some more. The most important of these additions are two callback methods that will be invoked as the response is received:
$req->response_data($data, $res);
$req->response_done($res);
The response_data() callback method is invoked repeatedly as parts of the content of the response becomes available. The first time it is invoked, then $res will be a reference to a HTTP::Response object with response code and headers initialized, and empty content. The default implementation of response_data just appends the data passed to the content of the $res object. It also supports a registered callback function ('data_cb') that will be invoked if defined.
The response_done() callback method is invoked when the whole response has been received. It is guaranteed that it will be invoked exactly once for each request spooled (even if it fails.) The default implementation will set up the $res->request and $res->previous links and will automatically handle redirects and unauthorized responses by respooling a slightly modified copy of the original requests. It also supports a registered callback function ('done_cb') that will invoked if defined, but only for the last response in case of redirect chains.
As an application programmer you can either subclass LWP::Request, to provide your own versions of response_data() and response_done(), or you can just register callback functions.
The LWP::Request object also provide a few more attributes that might be of interest. The $req->priority is a number that can be used to select which request goes first when multiple are spooled at the same time. Requests with the smallest numbers go first. The default priority happens to be 99.
The $req->proxy attribute tells us if we are going to pass the request to an proxy server instead of the server implied by the URL. If $req->proxy is TRUE, then it should be the URL of the proxy.
LWP::MainLoop
The event oriented framework is based on a single common object provided by the LWP::MainLoop module that will watch external IO descriptors (sockets) and timers. When events occur, then registered functions are called and these will call other event handling functions and so on.
In order for this to work, the mainloop object needs to be in control when nothing else happens and you especially when you expect protocol handling to take place. This is achieved by repeatedly calling the mainloop->one_event method until we are satisfied. Each call will wait until the next event is available, then invoke the corresponding callback function and then return. The one_event() interface is handy because it can be applied recursively and you can set up inner event loops in event handlers invoked by some outer event loop.
The call mainloop->run is a shorthand for a common form of this loop. It will call mainloop->one_event until there is no registered IO handles (sockets) and no timers left.
The following program shows how you can register your own callbacks. For instance as here the application might want to be able to read commands from the terminal.
use LWP::MainLoop qw(mainloop);
mainloop->readable(\*STDIN, \&read_and_do_cmd);
mainloop->run;
sub read_and_do_cmd
{
my $cmd;
my $n = sysread(STDIN, $cmd, 512);
chomp($cmd);
if ($cmd eq "q") {
exit;
} elsif ($cmd =~ /^(get|head|trace)\s+(\S+)/i) {
$ua->spool(LWP::Request->new(uc($1) => $2));
} ...
}
Currently LWPng use its own private event loop implementation. The plan is to adopt the event loop implementation used by the Tk extention. This should allow happies mixing of Tk and LWPng.
LWP::UA
$ua->spool
$ua->reschedule
$ua->stop
The following 'connection parameters' can be adjusted. You can set them both at the global level and for each individual server.
- ReqLimit
-
This is a number indicating how many requests a single connection will be used for. When the limit has reached, then the connection will close itself. You can get non-persistent connections by specifying this value to be 1.
- ReqPending
-
How many request can be send on a single connections before we have to wait for response from the server. This controls the degree of pipelining.
- Timeout
-
For how long will we wait with no activity on the line, before signaling an error (and closing the connection).
- IdleTimeout
-
When the request queue is empty, connections go to the idle state. This specify how long before the connection is closed if no new work arrives.
Internals overview
For each server that the $ua is going to talk to it maintain a LWP::Server object. This object holds a queue of requests not yet processes. The $ua->spool() method mainly move the request to the correct queue.
A LWP::Server can also create one or more LWP::Conn::HTTP objects that each represent a network connection to the server. The connection objects are were all the action takes place. They will fetch work (request) from the server queue, talk the network protocol and create response objects.
LWP::UA
LWP::Server
LWP::Conn:XXX
URI::Attr
LWP::StdSched
LWP::Conn interface
The LWP::Conn objects represent a connection to some server where one or more request/response exchanges can take place. There are different subclasses for various types of the underlying (network) protocols. (Talking about 'subclasses' is kind of a lie, since the base-class does not really manifest itself as any real code.)
LWP::Conn objects conform to the following interfaces when interacting with their manager object (passed in as parameter during creation). For the normal setup, then manager will be a LWP::Server object.
A LWP::Conn object is contructed with the new() method. It takes hash-style arguments and the 'ManagedBy' parameter is really the only mandatory one. It should be an reference to the manager object that will get method callbacks when various events happen.
$conn = LWP::Conn::XXX->new(MangagedBy => $mgr,
Host => $host,
Port => $port,
...);
The constructor will return a reference to the LWP::Conn object or undef
. If a connection object is returned, then the manager should wait for callbacks methods to be invoked on itself. A return of undef
will either indicate than we can't connect to the specified server or that all requests has already been processed. A manager can know the difference based on whether get_request() has been invoked on it or not.
The following methods are invoked by the created LWP::Conn object on their manager. The first two manage the request queue. The last three let the manager be made aware of the state of the connection.
$mgr->get_request($conn);
$mgr->pushback_request($conn, @requests);
$mgr->connection_active($conn);
$mgr->connection_idle($conn);
$mgr->connection_closed($conn);
The get_request() method should return a single LWP::Request
object or undef if there are no more requests to process. It is passed a reference to the connection object as argument. If the connection objects discover that it has been too greedy (calling get_request() too much), then it might want to return unprocessed request back to the mangager. It does so by calling the pushback_request() method with a reference to itself and one or more request objects as arguments. The first request obtained by get_request() should never be pushed back.
The following two methods can be invoked (usually by the manager) on a living $conn object. The activate() method can be invoked on a 'idle' connection to make it start calling get_request() again. The stop() kills the connection (whatever state it is in).
$conn->activate;
$conn->stop;
When a connection has received a response, then it will invoke the following two methods on the request object (obtained using get_request()).
$req->response_data($data, $res);
$req->response_done($res);
The response_data() method is invoked repeatedly as the body content of the response is received from the network. Invocation of this method is optional and depends on the kind of connection object this is. The response_done() method is always invoked once for each request obtained. It is called when the complete response has been received.
LWP::Conn::HTTP
Well, this is a very special module. You don't usually have designs where the objects change their class all the time.
You should just know about 3 basic states that a connection object can be in and then think of writable(), readable() and inactive() as the three kind of events that we should be prepared to handle at any time.
- 1) Connecting (a non-blocking connect() has been called)
-
We are waiting for the socket to become writable (which means that the connect was successful.) readable cant happen. inactive() means failure to connect.
- 2) Idle (No work to do)
-
Either work arrive from the application or the socket will become readable. When we read we would expect to get 0 bytes as a signal that the server has closed the connection. We don't ask for the writable event in this state, because we have nothing to write.
- 3) Active (sending request(s), receiving header of first request)
-
If the socket becomes writable we send more request data until we are done. If the socket becomes readable we read data until we have seen a whole HTTP header and then swith to a (sub-state) depending on the kind of response we are reading. When a response has been completely received we go back to idle if there is not more requests to send.
The following is an attempt on a picture of the state transitions going on.
START ------> Connecting
| |
| |
V |
V
Idle <-------------+---------<-----\
/\(work?) \
/ \ \
/ | \
/ V \
/ \
/ Active (sending request) <-----------+ (more work?)
| | \ \ ----- |
| | \ \ \ |
| | \ \ \ |
| | | | | |
| V V V V |
| |
| ConnLen Chunked Multipart UntilEOF | (reading response)
| | | | | |
| | | | | |
| +-------->+------->+--------- | ------>+
V |
|
Closed <------------------------------+
(THE END)
This design was inspired by the "State" pattern described in "Design Patterns: Elements of Reusable Object-Oriented Software" (Gamma et.al). The description of the "State" pattern in this book says:
- Intent
-
Allow an object to alter its behaviour when its internal state changes. The object will appear to change its class.
- Motivation
-
Consider a class TCPConnection that represents a network connection. A TCPConnection object can be in one of several different states: Established, Listening, Closed. When a TCPConnection object receives request from other objects, it responds differently depending on its current state. For example, the effect of an Open request depends on whether the connection is in its Colsed state or its Establised state. The state pattern describes how TCPConnection can exhibit different behaviour in each state.
Lucky for me Perl is a language that allows me to change the class of a living object. That became handy in this situation.
Using classes to describe states also allows a natural description (and implementation) of substates that behaves like it's base-state for some events but modify the behaviour for others.