NAME
DBD::google - Treat Google as a datasource for DBI
SYNOPSIS
use DBI;
my $dbh = DBI->connect("dbi:google:", $KEY);
my $sth = $dbh->prepare(qq[ SELECT title, URL FROM google WHERE q = "perl" ]);
while (my $r = $sth->fetchrow_hashref) {
...
DESCRIPTION
DBD::google allows you to use Google as a datasource; google can be queried using SQL SELECT statements, and iterated over using standard DBI conventions.
WARNING: This is still alpha-quality software. It works for me, but that doesn't really mean anything.
WHY?
For general queries, what better source of information is there than Google?
BASIC USAGE
For the most part, use DBD::google
like you use any other DBD, except instead of going through the trouble of building and installing (or buying!) database software, and employing a DBA to manage your data, you can take advantage of Google's ability to do this for you. Think of it as outsourcing your DBA, if you like.
Connection Information
The connection string should look like: dbi:google:
. DBI requires the trailing :
.
Your Google API key should be specified in the username portion (the password is currently ignored; do whatever you want with it, but be warned that I might put that field to use some day):
my $dbh = DBI->connect("dbi:google:", "my key", undef, \%opts);
Alternatively, you can specify a filename in the user portion; the first line of that file will be treated as the key:
my $dbh =DBI->connect("dbi:google:",
File::Spec->catfile($ENV{HOME}, ".googlekey"))
In addition to the standard DBI options, the fourth argument to connect can also include the following DBD::google
specific options, the full details of each of which can be found in Net::Google:
- ie
-
Input Encoding. String, e.g., "utf-8".
- oe
-
Output Encoding. String, e.g., "utf-8".
- safe
-
Should safe mode be on. Boolean.
- filter
-
Should results be filtered. Boolean.
- lr
-
Something to do with language. Arrayref.
- debug
-
Should
Net::Google
be put into debug mode or not. Boolean.
Supported SQL Syntax and Random Notes Thereon
The only supported SQL statement type is the SELECT statement. Since there is no real "table" involved, I've created a hypothetical table, called google; this table has one queryable field, q (just like the public web-based interface). The available columns are currently dictated by the data available from the underlying transport, which is the Google SOAP API (see http://www.google.com/apis), as implemented by Aaron Straup Cope's Net::Google
module.
The basic SQL syntax supported looks like:
SELECT @fields FROM google WHERE q = '$query'
There is also an optional LIMIT clause, the syntax of which is similar to that of MySQL's LIMIT clause; it takes a pair: offset from 0, number of results. In practice, Google returns 10 results at a time by default, so specifying a high LIMIT clause at the beginning might make sense for many queries.
The list of available fields in the google table includes:
- title
-
Returns the title of the result, as a string.
- URL
-
Returns the URL of the result, as a (non-HTML encoded!) string.
- snippet
-
Returns a snippet of the result.
- cachedSize
-
Returns a string indicating the size of the cached version of the document.
- directoryTitle
-
Returns a string.
- summary
-
Returns a summary of the result.
- hostName
-
Returns the hostname of the result.
- directoryCategory
-
Returns the directory category of the result.
The column specifications can include aliases:
SELECT directoryCategory as DC FROM google WHERE...
Finally, there are a few function that can be called on fields:
SELECT title, html_encode(url) FROM google WHERE q = '$stuff'
The available functions include:
- uri_escape
-
This comes from the
URI::Escape
module. - html_escape
-
This wraps around
HTML::Entities::encode_entities
. - html_strip
-
This removes HTML from a field. Some fields, such as title, summary, and snippet, have the query terms highlighted with <b> tags by Google; this function can be used to undo that damage.
Functions an aliases can be combined:
SELECT html_strip(snippet) as stripped_snippet FROM google...
Unsupported SQL includes ORDER BY clauses (Google does this, and provides no interface to modify it), HAVING clauses, JOINs of any type (there's only 1 "table" after all), sub-SELECTS (I can't even imagine of what use they would be here), and, actually, anything not explicitly mentioned above.
INSTALLATION
DBD::google
is pure perl, and has a few module requirements:
- Net::Google
-
This is the heart of the module;
DBD::google
is basically a DBI-compliant wrapper aroundNet::Google
. - HTML::Entities, URI::Escape
-
These two modules provide the uri_escape and html_escape functions.
- DBI
-
Duh.
To install:
$ perl Makefile.PL
$ make
$ make test
# make install
$ echo 'I love your module!' | mail darren@cpan.org -s "DBD::google"
The last step is optional; the others are not.
EXAMPLES
Here is a complete script that takes a query from the command line and formats the results nicely:
#!/usr/bin/perl -w
use strict;
use DBI;
use Text::TabularDisplay;
my $query = "@ARGV" || "perl";
# Set up SQL statement -- note the multiple lines
my $sql = qq~
SELECT
title, URL, hostName
FROM
google
WHERE
q = "$query"
~;
# DBI/DBD options:
my %opts = ( RaiseError => 1, # Standard DBI options
PrintError => 0,
lr => [ 'en' ], # DBD::google options
oe => "utf-8",
ie => "utf-8",
);
# Get API key
my $keyfile = glob "~/.googlekey";
# Get database handle
my $dbh = DBI->connect("dbi:google:", $keyfile, undef, \%opts);
# Create Text::TabularDisplay instance, and set the columns
my $table = Text::TabularDisplay->new;
$table->columns("Title", "URL, "Hostname");
# Do the query
my $sth = $dbh->prepare($sql);
$sth->execute;
while (my @row = $sth->fetchrow_array) {
$table->add(@row);
}
$sth->finish;
print $table->render;
TODO
These are listed in the order in which I'd like to implement them.
- More tests!
-
I'm particularly unimpressed with the test suite for the SQL parser; I think it is pretty pathetic. It needs much better testing, with more edge cases and more things I'm not expecting to find.
I've specifically avoided including tests that actually query Google, because the free API keys have a daily limit to the number of requests that will be answered. My original test suite did a few dozen queries each time it was run; if you run the test suite a few dozen times in a day (easy to do if you are actively developing the software or changing the feature set), your daily quota can be eaten up very easily.
- Integration of search metadata
-
There are several pieces of metadata that come back with searches; access to the via the statement handle ($sth) would be nice:
my $search_time = $sth->searchTime(); my $total = $sth->estimatedTotalResultsNumber();
The metadata includes:
- o
-
documentFiltering
- o
-
searchTime
- o
-
estimatedTotalResultsNumber
- o
-
estimateIsExact
- o
-
searchTips
- o
-
searchTime
These are described in Net::Google::Response.
- Extensible functions
-
Unknown functions that look like Perl package::function names should probably be treated as such, and AUTOLOADed:
SELECT Foo::frob(title) FROM google WHERE q = "perl"
Would do, effectively:
require Foo; $title = Foo::frob($title);
I'm slightly afraid of where this could lead, though:
SELECT title, LWP::Simple::get(url) as WholeDamnThing FROM google WHERE q = "perl apache" LIMIT 0, 100
- Elements return objects, instead of strings
-
It would be interesting for columns like URL and hostName to return
URI
andNet::hostent
objects, respectively.On the other hand, this is definitely related to the previous item; the parser could be extended to accept function names in method format:
SELECT title, URI->new(URL), Net::hostent->new(hostName) FROM google WHERE q = "perl"
- DESCRIBE statement on the
google
table -
It would be nice to provide a little introspection.
CAVEATS, BUGS, IMPROVEMENTS, SUGGESTIONS, FOIBLES, ETC
I've only tested this using my free, 1000-uses-per-day API key, so I don't know how well/if this software will work for those of you who have purchased real licenses for unlimited usage.
Placeholders are currently unsupported. They won't do any good, but would be nice to have for consistency with other DBDs. I'll get around to it someday.
There are many Interesting Things that can be done with this module, I think -- suggestions as to what those things might actually be are very welcome. Patches implementing said Interesting Things are also welcome, of course.
More specifically, queries that the SQL parser chokes on would be very useful, so I can refine the test suite (and the parser itself, of course).
There are probably a few bugs, though I don't know of any. Please report them via the DBD::google queue at <http://rt.cpan.org/>.
SEE ALSO
DBI, DBI::DBD, Net::Google, URI::Escape, HTML::Entities
AUTHOR
darren chamberlain <darren@cpan.org>