NAME
WWW::phpBB - phpBB forum scraper
SYNOPSIS
use WWW::phpBB;
# scrape as guest
my $phpbb = WWW::phpBB->new(
base_url => 'http://localhost/~stefan/forum1',
db_host => 'localhost',
db_user => 'stefan',
db_passwd => 'somepass',
db_database => 'stefan',
db_prefix => 'phpbb2_',
);
$phpbb->empty_tables();
$phpbb->get_users();
$phpbb->scrape_forum_common();
# scrape a German forum, logging in just to get the memberlist
my $phpbb = WWW::phpBB->new(
base_url => 'http://localhost/~stefan/index.php?mforum=de',
db_host => 'localhost',
db_user => 'stefan',
db_passwd => 'somepass',
db_database => 'stefan',
db_prefix => 'phpbb3_',
post_date_format => qr/(\d+)\s+(\w+),\s+(\d+)\s+(\d+):(\d+)/,
post_date_pos => [qw(day_of_month month_name year hour minutes)],
forum_user => 'raDical',
forum_passwd => 'lfdiugyh',
);
$phpbb->empty_tables();
$phpbb->forum_login();
$phpbb->get_users();
$phpbb->forum_logout();
$phpbb->scrape_forum_common();
# update an already scraped forum, maybe as a daily cron job
# $phpbb->update_overwrite(1); # don't try to keep modified data
$phpbb->update_users();
$phpbb->update_forum_common();
DESCRIPTION
This module scrapes a phpBB installation through its web interface. It requires a local phpBB setup, which will be overwritten, and it can only access what is available to a web browser (no private messages or user settings). Scraping is possible as a guest or as a logged-in member. If used with an administrator name and password, it copies all member e-mail addresses (not just the public ones), so members can request a new random password from the new installation and continue using the forum. The current implementation lacks search support, but this problem will disappear if you convert the forum to SMF. The "mforum" script is supported.
REQUIRED MODULES
EXPORT
None.
CONSTRUCTOR
new()
Creates a new WWW::phpBB object.
Required parameters:
base_url => $forum_url
URL of the original forum.
db_host => $mysql_server
Host of the MySQL server that the forum will be copied to.
db_user => $mysql_user
db_passwd => $mysql_pass
db_database => $mysql_db
Database with an already installed phpBB forum.
db_prefix => $
Prefix used by the local installation.
Optional parameters:
db_compression => [0|1]
Compress MySQL traffic (only useful when using a remote server).
max_rows => $value
Maximum number of rows kept in memory. When the storage array reaches this size, the data is committed to the database.
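The buffering behaviour described for "max_rows" can be pictured with a minimal sketch. This is not the module's internal code: the names store_row, flush_rows and @flushed_batches are hypothetical, and an in-memory array stands in for the database.

```perl
use strict;
use warnings;

# Hypothetical illustration of buffer-and-flush; not WWW::phpBB internals.
my $max_rows = 3;       # plays the role of the max_rows parameter
my @buffer;             # rows waiting to be written
my @flushed_batches;    # stands in for the database

# Write out everything that is buffered, then empty the buffer.
sub flush_rows {
    return unless @buffer;
    push @flushed_batches, [@buffer];   # a real version would run INSERTs here
    @buffer = ();
}

# Queue one row and flush as soon as the buffer reaches $max_rows.
sub store_row {
    my ($row) = @_;
    push @buffer, $row;
    flush_rows() if @buffer >= $max_rows;
}

store_row({ post_id => $_ }) for 1 .. 7;
flush_rows();   # commit whatever is left at the end

printf "%d batches written\n", scalar @flushed_batches;
```

A larger value means fewer round trips to the database at the cost of more memory held by the scraper.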
months => [qw(jan feb mar apr may jun jul aug sep oct nov dec)]
Month names as used by the forum. They vary with the translation used. The default is for the English version.
post_date_format => regex
Date format used in posts. The default is qr/(\w+)\s+(\d+),\s+(\d+)\s+(\d+):(\d+)\s+(\w\w)/ and matches strings like "Tue May 30, 2006 5:17 pm" - note that the leading day of the week is ignored as it's not necessary to compute the timestamp.
post_date_pos => [qw(month_name day_of_month year hour minutes am_pm)]
Position of the elements in the date string. The number of items must match the number of capturing parentheses in "post_date_format". Valid field names are:
am_pm - [am|pm] - case insensitive
month_name - must be one of the values in "months"
month - number of the month, from 1 to 12
day_of_month
year
hour
minutes
seconds
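To show how "post_date_format", "post_date_pos" and "months" fit together, here is a minimal sketch that turns a date string into a timestamp. The function parse_forum_date is hypothetical and is not the module's own parser; it also assumes the forum displays times in UTC, which may not hold for a real forum.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper, not part of WWW::phpBB: pairs each regex capture
# with the field name at the same index and builds a UTC timestamp.
sub parse_forum_date {
    my ($string, $format, $pos, $months) = @_;

    my @captures = $string =~ $format or return;
    my %f;
    @f{@$pos} = @captures;    # first capture goes to the first field name, etc.

    # Translate a month name into a number by its position in "months".
    if (defined $f{month_name}) {
        my $name = lc $f{month_name};
        my ($idx) = grep { lc $months->[$_] eq $name } 0 .. $#$months;
        return unless defined $idx;
        $f{month} = $idx + 1;
    }

    # 12-hour clock: 12 am is hour 0, 1-11 pm become 13-23.
    if (defined $f{am_pm}) {
        $f{hour} %= 12;
        $f{hour} += 12 if lc $f{am_pm} eq 'pm';
    }

    return timegm($f{seconds} // 0, $f{minutes}, $f{hour},
                  $f{day_of_month}, $f{month} - 1, $f{year});
}

my $ts = parse_forum_date(
    'Tue May 30, 2006 5:17 pm',
    qr/(\w+)\s+(\d+),\s+(\d+)\s+(\d+):(\d+)\s+(\w\w)/,
    [qw(month_name day_of_month year hour minutes am_pm)],
    [qw(jan feb mar apr may jun jul aug sep oct nov dec)],
);
print scalar(gmtime $ts), "\n";
```

With the default regex, the leading "Tue" never ends up in a capture: the first group can only match a word that is followed by whitespace and digits, so it settles on "May".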
reg_date_format => regex
reg_date_pos => []
Same requirements as for the post date parameters, but applied to the registration date as it appears in the memberlist.
max_tries => $value
How many times to try fetching a forum page before giving up.
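The idea behind "max_tries" is a bounded retry loop. The sketch below illustrates it under stated assumptions: fetch_with_retries is a hypothetical helper, and the fetcher is any code reference that returns the page content or undef on failure. It does not reproduce how the module performs its requests.

```perl
use strict;
use warnings;

# Hypothetical helper, not WWW::phpBB code: call $fetch up to $max_tries
# times and return the first defined result.
sub fetch_with_retries {
    my ($fetch, $url, $max_tries) = @_;
    for my $try (1 .. $max_tries) {
        my $content = eval { $fetch->($url) };   # a dying fetcher counts as a failure
        return $content if defined $content;
        warn "attempt $try/$max_tries failed for $url\n";
    }
    return;   # all attempts failed
}

# Simulated fetcher that fails twice before succeeding.
my $calls = 0;
my $flaky = sub { return ++$calls < 3 ? undef : '<html>ok</html>' };

my $page = fetch_with_retries($flaky, 'http://localhost/forum/index.php', 5);
print defined $page ? "fetched after $calls calls\n" : "gave up\n";
```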
max_children => $value
How many parallel processes should be used for fetching. Defaults to 2.
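A limit such as "max_children" is commonly enforced by forking workers and waiting for a free slot before starting the next one. The following is a sketch of that general pattern, not the module's actual process handling; the URL list is made up and the child does no real fetching.

```perl
use strict;
use warnings;

my $max_children = 2;    # plays the role of the max_children parameter
# Hypothetical work list for illustration only.
my @urls = map { "http://localhost/forum/viewtopic.php?t=$_" } 1 .. 5;

my %running;             # pid => url of the workers still alive
my $finished = 0;

for my $url (@urls) {
    # Block until a slot is free.
    if (scalar(keys %running) >= $max_children) {
        my $pid = wait();
        delete $running{$pid};
        $finished++;
    }

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child process: a real worker would fetch and parse $url here.
        exit 0;
    }
    $running{$pid} = $url;   # parent records the new worker
}

# Reap the workers that are still running.
while (keys %running) {
    my $pid = wait();
    delete $running{$pid};
    $finished++;
}

print "$finished workers finished\n";
```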
db_empty => [qw(users categories forums topics posts posts_text vote_desc vote_results)]
Tables that will be emptied before scraping. The administrator of the local forum will be kept; everything else is deleted. This parameter is not used when updating.
db_insert => [0|1]
Insert scraped data into the database. Defaults to 1.
update_overwrite => [0|1]
Overwrite existing data when updating. Defaults to 0.
ACCESSORS
The accessors have the same names as the constructor parameters. Called without an argument, they return the current value. Called with an argument, they set the value.
$phpbb->max_rows(100);
print $phpbb->max_tries, "\n";
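For readers unfamiliar with combined getter/setter accessors, here is a minimal sketch of how such methods can be generated in Perl. The class My::Scraper and its default values are hypothetical; this is not how WWW::phpBB is necessarily implemented.

```perl
use strict;
use warnings;

package My::Scraper;    # hypothetical class, not WWW::phpBB itself

sub new {
    my ($class, %args) = @_;
    # Made-up defaults, overridden by constructor arguments.
    return bless { max_rows => 1000, max_tries => 3, %args }, $class;
}

# Generate one combined getter/setter per field name.
for my $field (qw(max_rows max_tries)) {
    no strict 'refs';
    *{"My::Scraper::$field"} = sub {
        my $self = shift;
        $self->{$field} = shift if @_;   # with an argument: set
        return $self->{$field};          # always return the current value
    };
}

package main;

my $s = My::Scraper->new(max_tries => 5);
$s->max_rows(100);
print $s->max_rows, " ", $s->max_tries, "\n";
```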
PUBLIC METHODS
$phpbb->empty_tables()
Empties the tables of the local phpBB installation. It leaves the admin account untouched.
$phpbb->forum_login()
Log in to the original forum. Useful when guest access is restricted.
$phpbb->forum_logout()
$phpbb->get_users()
Scrape user data from the memberlist and profile pages.
$phpbb->scrape_forum_common()
Scrape categories, forums, topics and posts.
$phpbb->update_users()
Update the users for an already scraped forum.
$phpbb->update_forum_common()
Update categories, forums, topics and posts for an already scraped forum.
AUTHOR
Stefan Talpalaru, <stefantalpalaru@yahoo.com>
COPYRIGHT AND LICENSE
Copyright (C) 2006 by Stefan Talpalaru
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.