NAME

Geo::StreetAddress::US - Perl extension for parsing US street addresses

SYNOPSIS

  use Geo::StreetAddress::US;

  $hashref = Geo::StreetAddress::US->parse_location(
		"1005 Gravenstein Hwy N, Sebastopol CA 95472" );

  $hashref = Geo::StreetAddress::US->parse_location(
		"Hollywood & Vine, Los Angeles, CA" );

  $hashref = Geo::StreetAddress::US->parse_address(
		"1600 Pennsylvania Ave, Washington, DC" );

  $hashref = Geo::StreetAddress::US->parse_informal_address(
		"Lot 3 Pennsylvania Ave" );

  $hashref = Geo::StreetAddress::US->parse_intersection(
		"Mission Street at Valencia Street, San Francisco, CA" );

  $hashref = Geo::StreetAddress::US->normalize_address( \%spec );
      # the parse_* methods call this automatically...

DESCRIPTION

Geo::StreetAddress::US is a regex-based street address and street intersection parser for the United States. Its basic goal is to be as forgiving as possible when parsing user-provided address strings. Geo::StreetAddress::US knows about directional prefixes and suffixes, fractional building numbers, building units, grid-based addresses (such as those used in parts of Utah), 5 and 9 digit ZIP codes, and all of the official USPS abbreviations for street types, state names and secondary unit designators.

RETURN VALUES

Most Geo::StreetAddress::US methods return a reference to a hash containing address or intersection information. This "address specifier" hash may contain any of the following fields for a given address. If a given field is not present in the address, the corresponding key will normally be absent from the hash; see $Old_Undef_Fields_Behaviour below for the pre-1.00 behaviour, in which some unmatched fields were set to undef instead.

Future versions of this module may add extra fields.
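
Because a field that was not found may be either missing from the hash or present with an undef value, depending on the module version and settings, it can help to read a specifier defensively. The following is a minimal sketch of that idea. field_or_default is a hypothetical helper, and the two hashes are hand-written stand-ins rather than actual parser output.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Geo::StreetAddress::US: returns the
# field's value, or $default when the key is missing or set to undef.
sub field_or_default {
    my ($spec, $key, $default) = @_;
    return defined $spec->{$key} ? $spec->{$key} : $default;
}

# Hand-written stand-ins for a parse result (not real module output).
my $key_absent = { number => '1600', street => 'Pennsylvania', type => 'Ave' };
my $key_undef  = { %$key_absent, suffix => undef };

# Both cases are treated the same way by the helper.
for my $spec ($key_absent, $key_undef) {
    print "suffix: ", field_or_default($spec, 'suffix', '(none)'), "\n";
}
```

Testing with defined rather than exists keeps calling code indifferent to which of the two behaviours is in effect.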

ADDRESS SPECIFIER

number

House or street number.

prefix

Directional prefix for the street, such as N, NE, E, etc. A given prefix should be one to two characters long.

street

Name of the street, without directional or type qualifiers.

type

Abbreviated street type, such as Rd, St, or Ave. See the USPS official type abbreviations at http://pe.usps.com/text/pub28/pub28apc.html for a list of abbreviations used.

suffix

Directional suffix for the street, as above.

city

Name of the city, town, or other locale that the address is situated in.

state

The state which the address is situated in, given as its two-letter postal abbreviation. The %State_Code global described below lists the abbreviations used.

zip

Five digit ZIP postal code for the address, including leading zero, if needed.

sec_unit_type

If the address includes a Secondary Unit Designator, such as a room, suite or apartment, the sec_unit_type field will indicate the type of unit.

sec_unit_num

If the address includes a Secondary Unit Designator, such as a room, suite or apartment, the sec_unit_num field will indicate the number of the unit (which may not be numeric).
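
To illustrate how the fields above fit together, here is a small sketch that rebuilds a printable street line from an address specifier. format_street_line is a hypothetical helper, and the hash is a hand-written stand-in for a specifier, not output captured from the module.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: joins the street-level
# fields in reading order, skipping any that are missing or undef.
sub format_street_line {
    my ($spec) = @_;
    my @parts = grep { defined($_) && length($_) }
                map  { $spec->{$_} } qw(number prefix street type suffix);
    return join ' ', @parts;
}

# Hand-written stand-in for an address specifier (not real module output).
my $spec = {
    number => '1005', street => 'Gravenstein', type => 'Hwy', suffix => 'N',
    city   => 'Sebastopol', state => 'CA', zip => '95472',
};

print format_street_line($spec), "\n";                  # 1005 Gravenstein Hwy N
print "$spec->{city}, $spec->{state} $spec->{zip}\n";   # Sebastopol, CA 95472
```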

INTERSECTION SPECIFIER

prefix1, prefix2

Directional prefixes for the streets in question.

street1, street2

Names of the streets in question.

type1, type2

Street types for the streets in question.

suffix1, suffix2

Directional suffixes for the streets in question.

city

City or locale containing the intersection, as above.

state

State abbreviation, as above.

zip

Five digit ZIP code, as above.
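
The numbered keys lend themselves to a simple loop. As a sketch only, the hypothetical helper below turns an intersection specifier back into a readable string; the hash is a hand-written stand-in, not output captured from the module.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: formats each of the two
# streets from its prefixN/streetN/typeN/suffixN keys and joins them.
sub format_intersection {
    my ($spec) = @_;
    my @streets;
    for my $n (1, 2) {
        my @parts = grep { defined($_) && length($_) }
                    map  { $spec->{ $_ . $n } } qw(prefix street type suffix);
        push @streets, join ' ', @parts;
    }
    return join ' at ', @streets;
}

# Hand-written stand-in for an intersection specifier.
my $spec = {
    street1 => 'Mission',  type1 => 'St',
    street2 => 'Valencia', type2 => 'St',
    city    => 'San Francisco', state => 'CA',
};

print format_intersection($spec), "\n";   # Mission St at Valencia St
```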

GLOBAL VARIABLES

Geo::StreetAddress::US contains a number of global variables which it uses to recognize different bits of US street addresses. Although you will probably not need them, they are documented here for completeness's sake.

%Directional

Maps directional names (north, northeast, etc.) to abbreviations (N, NE, etc.).

%Direction_Code

Maps directional abbreviations to directional names.

%Street_Type

Maps lowercased USPS standard street types to their canonical postal abbreviations as found in TIGER/Line. See eg/get_street_abbrev.pl in the distribution for how this map was generated.

%State_Code

Maps lowercased US state and territory names to their canonical two-letter postal abbreviations. See eg/get_state_abbrev.pl in the distribution for how this map was generated.

%State_FIPS

Maps two-digit FIPS-55 US state and territory codes (including the leading zero!) as found in TIGER/Line to the state's canonical two-letter postal abbreviation. See eg/get_state_fips.pl in the distribution for how this map was generated. Yes, I know the FIPS data also has the state names. Oops.

%Addr_Match

A hash of compiled regular expressions corresponding to different types of address or address portions. Defined regexen include type, number, fraction, state, direct(ion), dircode, zip, corner, street, place, address, and intersection.

Direct use of these patterns is not recommended because they may change in subtle ways between releases.

$Old_Undef_Fields_Behaviour

Restores the pre-1.00 behaviour for unmatched fields. Normally, unmatched fields don't exist in the result hash. If this variable is set to a true value, some unmatched fields are returned with undef values instead of not existing in the hash at all. This mechanism is a temporary measure to aid migration and may be removed in a future version.
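
As an illustration of the lookup tables described above, the sketch below resolves either a state name or a two-digit FIPS code to a postal abbreviation. state_abbreviation is a hypothetical helper; it assumes the hashes are reachable by their fully qualified names and keyed as documented (lowercased names, zero-padded codes).

```perl
use strict;
use warnings;
use Geo::StreetAddress::US;

# Hypothetical helper, not part of the module: two digits are treated as
# a FIPS code, anything else as a state name (lowercased for the lookup).
sub state_abbreviation {
    my ($input) = @_;
    return $Geo::StreetAddress::US::State_FIPS{$input}
        if $input =~ /^\d{2}$/;
    return $Geo::StreetAddress::US::State_Code{ lc $input };
}

for my $input ('California', '06') {
    my $abbr = state_abbreviation($input);
    print "$input => ", (defined $abbr ? $abbr : '(unknown)'), "\n";
}
```

Read-only lookups like this leave the tables untouched, so there is no need to call init() afterwards.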

CLASS METHODS

init

  # Add another street type mapping:
  $Geo::StreetAddress::US::Street_Type{'cur'} = 'curv';
  # Re-initialize to pick up the change
  Geo::StreetAddress::US::init();

Runs the setup on globals. This is run automatically when the module is loaded, but if you subsequently change the globals, you should run it again.

parse_location

$spec = Geo::StreetAddress::US->parse_location( $string )

Parses any address or intersection string and returns the appropriate specifier. If $string matches the $Addr_Match{corner} pattern, parse_intersection() is used. Otherwise parse_address() is called, and if that returns false, parse_informal_address() is called.
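
Since parse_location() can return either kind of specifier, calling code usually needs to tell them apart. One way to do that, sketched here under the assumption that only intersection specifiers carry a defined street1 field, is the hypothetical helper describe_location:

```perl
use strict;
use warnings;
use Geo::StreetAddress::US;

# Hypothetical helper, not part of the module: classifies the result of
# parse_location() by looking for the intersection-only street1 field.
sub describe_location {
    my ($string) = @_;
    my $spec = Geo::StreetAddress::US->parse_location($string);
    return 'unparsed'     unless $spec;
    return 'intersection' if defined $spec->{street1};
    return 'address';
}

for my $input (
    "1005 Gravenstein Hwy N, Sebastopol CA 95472",
    "Hollywood & Vine, Los Angeles, CA",
) {
    print "$input: ", describe_location($input), "\n";
}
```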

parse_address

$spec = Geo::StreetAddress::US->parse_address( $address_string )

Parses a street address into an address specifier using the $Addr_Match{address} pattern. Returns undef if the address cannot be parsed as a complete formal address.

You may want to use parse_location() instead.

parse_informal_address

$spec = Geo::StreetAddress::US->parse_informal_address( $address_string )

Acts like parse_address() except that it handles a wider range of address formats because it uses the "informal_address" pattern. That means a unit can come first, a street number is optional, and the city and state aren't needed, so informal addresses like "#42 123 Main St" can be parsed.

Returns undef if the address cannot be parsed.

You may want to use parse_location() instead.
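
A quick way to see what the informal pattern extracts from a given string is to dump the returned specifier. This sketch makes no claim about which keys come back; it simply prints whatever the installed version returns for the example string mentioned above.

```perl
use strict;
use warnings;
use Geo::StreetAddress::US;

my $input = "#42 123 Main St";
my $spec  = Geo::StreetAddress::US->parse_informal_address($input);

if ($spec) {
    # Print every field, marking any that are present but undefined.
    for my $key (sort keys %$spec) {
        my $value = defined $spec->{$key} ? $spec->{$key} : '(undef)';
        print "$key: $value\n";
    }
}
else {
    print "could not parse '$input'\n";
}
```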

parse_intersection

$spec = Geo::StreetAddress::US->parse_intersection( $intersection_string )

Parses an intersection string into an intersection specifier, returning undef if the intersection cannot be parsed. You probably want to use parse_location() instead.

normalize_address

$spec = Geo::StreetAddress::US->normalize_address( $spec )

Takes an address or intersection specifier, and normalizes its components, stripping out all leading and trailing whitespace and punctuation, and substituting official abbreviations for prefix, suffix, type, and state values. Also, city names that are prefixed with a directional abbreviation (e.g. N, NE, etc.) have the abbreviation expanded. The original specifier ref is returned.

Typically, you won't need to use this method, as the parse_*() methods call it for you.

N.B., normalize_address() crops 9-digit ZIP codes to 5 digits. This is for the benefit of Geo::Coder::US and may not be what you want. E-mail me if this is a problem and I'll see what I can do to fix it.
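
If the full ZIP+4 matters to you, one option is to capture it from the input string yourself before parsing. The following is a sketch of that idea only. extract_zip_plus4 is a hypothetical helper, it assumes the hyphenated ZIP+4 sits at the end of the string, and the +4 digits in the example are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: returns a trailing
# hyphenated ZIP+4 ("12345-6789") from the string, or undef if none.
sub extract_zip_plus4 {
    my ($address) = @_;
    return $address =~ /\b(\d{5}-\d{4})\s*$/ ? $1 : undef;
}

# The "-2811" suffix is made up for this example.
my $input = "1005 Gravenstein Hwy N, Sebastopol CA 95472-2811";
my $zip9  = extract_zip_plus4($input);

print defined $zip9 ? $zip9 : '(no ZIP+4)', "\n";   # 95472-2811
```

The captured value can then be kept alongside the specifier returned by the parser, whose zip field holds only the five-digit form.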

BUGS, CAVEATS, MISCELLANY

Geo::StreetAddress::US might not correctly parse house numbers that contain hyphens, such as those used in parts of Queens, New York. Also, some addresses in rural Michigan and Illinois may contain letter prefixes to the building number that may cause problems. Fixing these edge cases is on the to-do list, to be sure. Patches welcome!

This software was originally part of Geo::Coder::US (q.v.) but was split apart into an independent module for your convenience. Therefore it has some behaviors which were designed for Geo::Coder::US, but which may not be right for your purposes. If this turns out to be the case, please let me know.

Geo::StreetAddress::US does NOT perform USPS-certified address normalization.

Grid based addresses, like those in Utah, where the direction comes before the number, e.g. W164N5108 instead of 164 W 5108 N, aren't handled at the moment. A workaround is to apply a case-insensitive substitution like this to the address string before parsing:

  s/([nsew])\s*(\d+)\s*([nsew])\s*(\d+)/$2 $1 $4 $3/i
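
The workaround can be wrapped as a small pre-processing step. In this sketch, reorder_grid_number is a hypothetical helper, and the /i flag is used so that uppercase directionals such as those in W164N5108 match. The pattern is unanchored, so it may also rewrite other letter-digit runs in unusual input.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: rewrites a direction-first
# grid number such as "W164N5108" as "164 W 5108 N".
sub reorder_grid_number {
    my ($address) = @_;
    # /i lets the pattern match upper- or lowercase directionals.
    $address =~ s/([nsew])\s*(\d+)\s*([nsew])\s*(\d+)/$2 $1 $4 $3/i;
    return $address;
}

print reorder_grid_number("W164N5108 Main St"), "\n";   # 164 W 5108 N Main St
```

The rewritten string can then be handed to parse_location() as usual.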

SEE ALSO

This software was originally part of Geo::Coder::US(3pm).

Lingua::EN::AddressParse(3pm) and Geo::PostalAddress(3pm) both do something very similar to Geo::StreetAddress::US, but are either too strict/limited in their address parsing, or not really specific enough in how they break down addresses (for my purposes). If you want USPS-style address standardization, try Scrape::USPS::ZipLookup(3pm). Be aware, however, that it scrapes a form on the USPS website in a way that may not be officially permitted and might break at any time. If this module does not do what you want, you might give the others a try. All three modules are available from the CPAN.

You can see Geo::StreetAddress::US in action at http://geocoder.us/.

USPS Postal Addressing Standards: http://pe.usps.com/text/pub28/welcome.htm

APPRECIATION

Many thanks to Dave Rolsky for submitting a very useful patch to fix fractional house numbers, dotted directionals, and other kinds of edge cases, e.g. South St. He even submitted additional tests!

AUTHOR

Schuyler D. Erle <schuyler@geocoder.us>

COPYRIGHT AND LICENSE

Copyright (C) 2005 by Schuyler D. Erle.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.