NAME

Lingua::JA::FindDates - find Japanese dates & convert them

SYNOPSIS

Find and replace Japanese dates in a string.

use Lingua::JA::FindDates 'subsjdate';

Given a string, find and substitute all the Japanese dates in it.

my $dates = '昭和41年三月16日';
print subsjdate ($dates);

prints

March 16, 1966

Find dates within a string

This module finds dates inside a string and substitutes them:

my $dates = 'blah blah blah 三月16日';
print subsjdate ($dates);

prints

blah blah blah March 16

It can call back a routine each time a date is found:

sub replace_callback
{
  my ($data, $before, $after) = @_;
  print "$before was replaced by $after\n";
}
my $dates = '三月16日';
my $data = 'xyz'; # something to send to replace_callback
subsjdate ($dates, \&replace_callback, $data);

prints

三月16日 was replaced by March 16

Supply a make_date_callback to format the date any way you like:

sub my_date
{
  return join '/', @_[1,2];
}
my $dates = '三月16日';
print subsjdate ($dates, undef, undef, \&my_date);

prints

3/16

DESCRIPTION

This module uses a set of regular expressions to detect Japanese-style dates in a string. Dates include year/month/day-style dates such as 平成20年七月十日 (Heisei nijūnen shichigatsu tōka), but may also include combinations such as years alone, years and months, month and day without a year, fiscal years, parts of the month like 中旬 (chūjun), and periods of time.

Matches 99.99% of Japanese dates

This module has been road-tested on hundreds of documents, and it can cope with virtually any kind of Japanese date. If you find any date which it can't cope with, please report that as a bug.

More examples

If you would like to see more examples of how this module works, look at the testing code in t/Lingua-JA-FindDates.t.

Exports

This module can export two functions, subsjdate and kanji2number, on request.

kanji2number

kanji2number ($knum)

kanji2number is a very simple kanji number converter. Its input is one string of kanji numbers only, like '三十一'. It can deal with kanji numbers written with or without the kanji for ten, hundred, and thousand.

The return value is the numerical value of the kanji number, like 31, or zero if it can't read the number.
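To make the description above concrete, here is a minimal sketch of a converter of this kind. It is not the module's own code: the name kanji_to_number_sketch, the lookup tables, and the decision to give up on two digits in a row are all assumptions made for illustration.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper, not the module's own code: converts a string made
# only of kanji digits and the multipliers 十, 百, 千 to an integer.
my %digit = (
    '〇' => 0, '一' => 1, '二' => 2, '三' => 3, '四' => 4,
    '五' => 5, '六' => 6, '七' => 7, '八' => 8, '九' => 9,
);
my %multiplier = ('十' => 10, '百' => 100, '千' => 1000);

sub kanji_to_number_sketch
{
    my ($knum) = @_;
    return 0 unless defined $knum && length $knum;
    my @chars = split //, $knum;
    # Positional style such as 二〇〇八: every character is a plain digit.
    if (! grep { ! exists $digit{$_} } @chars) {
        return 0 + join '', map { $digit{$_} } @chars;
    }
    my $total = 0;
    my $pending;    # digit waiting for its multiplier
    for my $char (@chars) {
        if (exists $digit{$char}) {
            # Two digits in a row inside a multiplier-style number: give up.
            return 0 if defined $pending;
            $pending = $digit{$char};
        }
        elsif (exists $multiplier{$char}) {
            # A bare multiplier such as 十 counts as 1 x 10.
            $total += (defined $pending ? $pending : 1) * $multiplier{$char};
            undef $pending;
        }
        else {
            # Not a kanji number at all.
            return 0;
        }
    }
    $total += $pending if defined $pending;
    return $total;
}
```

Like kanji2number as described here, the sketch returns zero for anything it cannot read, so a caller cannot tell a genuine zero from a failure.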

Bugs

kanji2number only goes up to thousands, because usually dates only go that far. If you need a comprehensive Japanese number converter, use Lingua::JA::Numbers instead of this. Also, it doesn't deal with mixed kanji and Arabic numbers.

Matching patterns

The module can be used without reading this section.

The Japanese date regular expressions are stored in an array, jdatere. Each entry pairs a regular expression which matches one kind of date with a string like "ymdw", whose letters say what to do with $1, $2, etc. from the regular expression. The array is ordered from longest match (like "year / month / day / weekday") to shortest (like "year" only). For example, if the first letter is "y", then $1 is a year in Western format like 2008, or if the third letter is "w", then $3 is the day of the week, from 1 to 7. The letters have the following meanings:

e

Japanese era (string).

j

Japanese year (string representing small number)

x

empty month and day

m

month number (from 1 to 12, 13 for a blank month, 0 for an invalid month)

d

day of month (from 1 to 31, 0 for an invalid day)

w

weekday (from Monday = 1 to Sunday = 7, zero or undefined for an invalid weekday)

z

jun (旬), a ten day period.

1

After another code, indicates the first of a pair of two things, with 2 indicating the second. For example, the matching code for

平成9年10月17日〜20日

is

ejmd1d2
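The pairing of a regular expression with a code string can be sketched as follows. The regular expression and the helper captures_by_code are made up for this example and are far simpler than the module's real jdatere entries. The sketch also only handles single-letter codes, not two-character ones like "d1".

```perl
use strict;
use warnings;
use utf8;

# Illustrative pair only: a toy regular expression and its code string.
my @pair = (qr/([0-9]{4})年([0-9]{1,2})月([0-9]{1,2})日/, 'ymd');

# Name each capture according to the letter at the same position in the
# code string, so that "y" labels $1, "m" labels $2, and so on.
sub captures_by_code
{
    my ($text, $regex, $codes) = @_;
    my @captures = $text =~ $regex or return;
    my %field;
    my @letters = split //, $codes;
    for my $i (0 .. $#letters) {
        $field{$letters[$i]} = $captures[$i];
    }
    return \%field;
}

my $found = captures_by_code ('2008年7月3日', @pair);
print "year $found->{y}, month $found->{m}, day $found->{d}\n";
```

This prints "year 2008, month 7, day 3". Keeping the meaning of each capture in a separate string means the same lookup code can serve every pattern in the array.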

make_date

make_date ($year, $month1, $date1, $wday1, $jun, $month2, $date2, $wday2)

This is the default date making routine. It's not exported.

subsjdate, given a date like 平成20年7月3日(木), passes this routine the values (2008, 7, 3, 4) for the year, month, date and day of the week respectively. make_date then builds the string 'Thursday, July 3, 2008' and returns it to subsjdate.

You can use any other format for the date by supplying your own make_date_callback routine to subsjdate.

$verbose

If you want to see what the module is doing, set

$Lingua::JA::FindDates::verbose = 1;

This makes subsjdate print out each regular expression and reports whether it matched, which looks like this:

Looking for y in ([0-9０-９]{4}|[十六七九五四千百二一八三]?千[十六七九五四千百二一八三]*)\s*年
Found '千九百六十六年': Arg 0: 1966 -> '1966'

subsjdate

subsjdate ($text, $replace_callback, $data, $make_date_callback)

subsjdate, given a string (argument one) containing some text like 平成20年7月3日(木), looks through the string using a set of regular expressions. For each date it finds, it calls make_date to make the equivalent date in English, and then substitutes it into $text:

$text =~ s/\Q平成20年7月3日(木)\E/Thursday, July 3, 2008/g;

Users can supply a different date making function. See make_date_callback.

text

Argument one is a string of Japanese, encoded in Perl's internal encoding.

replace_callback

If argument two, replace_callback, is defined, subsjdate calls it each time it replaces a date, passing it the data from argument three followed by the strings before and after the replacement. If you don't want to call anything, you can leave this blank. In the original script (see History), replace_callback was a function which called Microsoft Word via Win32::OLE.

data

Argument three is any data you want to pass to replace_callback. In my original version, this is a reference to a hash which contains the document object to pass to Word.

make_date_callback

Argument four is your replacement for the make_date function. If the default American-style dates are what you want, you can leave this blank. If, for example, you want dates in the form "Th 2008/7/3", you could write a routine like the following:

sub mymakedate
{
    my ($year, $month, $date, $wday) = @_;
    return qw{Bad Mo Tu We Th Fr Sa Su}[$wday]." $year/$month/$date";
}

In practice you need to check for $year, $month, $date, and $wday being zero, since subsjdate also matches dates which consist only of a month and day, or only of a year and month.
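A more defensive routine might look like the following sketch. The name careful_make_date is made up, and the sketch assumes that missing parts arrive as zero or as undefined values. It joins whichever parts are present.

```perl
use strict;
use warnings;

# Hypothetical replacement for make_date which copes with missing parts.
# Assumption: a missing year, month, date, or weekday is zero or undef.
my @wday_names = qw(Bad Mo Tu We Th Fr Sa Su);

sub careful_make_date
{
    my ($year, $month, $date, $wday) = @_;
    my @parts;
    # Join whichever of year, month, and date are present with slashes.
    my $ymd = join '/', grep { $_ } $year, $month, $date;
    # Only look up the weekday name when there is a weekday.
    push @parts, $wday_names[$wday] if $wday;
    push @parts, $ymd if length $ymd;
    return join ' ', @parts;
}

print careful_make_date (2008, 7, 3, 4), "\n";   # Th 2008/7/3
print careful_make_date (undef, 3, 16), "\n";    # 3/16
```

It would be passed to subsjdate as the fourth argument, as in subsjdate ($text, undef, undef, \&careful_make_date).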

Bugs

No sanity check of Japanese era dates

It does not detect that dates like 昭和百年 (Showa 100, an impossible year) are invalid.
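If you need such a check, you could make it yourself before trusting a date. The following is a sketch only: era_year_is_plausible is a hypothetical helper, not part of this module, and its table covers only the completed eras from Meiji onwards.

```perl
use strict;
use warnings;
use utf8;

# Final year of each completed era from Meiji onwards.
my %last_year = (
    '明治' => 45,    # Meiji
    '大正' => 15,    # Taisho
    '昭和' => 64,    # Showa
    '平成' => 31,    # Heisei
);

# Hypothetical check, not part of the module. Returns 1 for a possible
# year, 0 for an impossible one, and undef for an era not in the table.
sub era_year_is_plausible
{
    my ($era, $year) = @_;
    return undef unless exists $last_year{$era};
    return $year >= 1 && $year <= $last_year{$era} ? 1 : 0;
}

print era_year_is_plausible ('昭和', 41), "\n";     # 1
print era_year_is_plausible ('昭和', 100), "\n";    # 0
```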

Only goes back to Meiji

The date matching only goes back to the Meiji era. There is DateTime::Calendar::Japanese::Era if you need to go back further.

Doesn't find dates in order

The dates returned won't be in the order that they are in the text, but in the order that they are found by the regular expressions, which means that in a string with two dates, the callbacks might be called for the second date before they are called for the first one.
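One possible workaround is to record each replacement in replace_callback and sort the records afterwards by where the original date string sits in the text. In the sketch below the calls to the callback are simulated so that it runs without the module. The collect routine and the sample text are made up. Note that the approach breaks down if the same date string occurs twice, or if one date string is contained in another.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

my $original = '三月16日 and 平成20年7月3日(木)';
my @found;

# A replace_callback which only records what was replaced.
sub collect
{
    my ($data, $before, $after) = @_;
    push @$data, [$before, $after];
}

# In real use: subsjdate ($original, \&collect, \@found);
# Simulated calls, with the second date reported first:
collect (\@found, '平成20年7月3日(木)', 'Thursday, July 3, 2008');
collect (\@found, '三月16日', 'March 16');

# Sort by where each original date string starts in the text.
my @in_order = sort {
    index ($original, $a->[0]) <=> index ($original, $b->[0])
} @found;

print "$_->[0] -> $_->[1]\n" for @in_order;
```

After sorting, the record for 三月16日 comes first, matching the order of the dates in the text.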

UTF-8 version only

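This module only understands Japanese text held in Perl's internal form (character strings, stored internally as UTF-8). Text in any other form, such as raw bytes read from a file, has to be decoded before it is passed to subsjdate.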
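Text which arrives as bytes can be decoded with the core Encode module. The sketch below uses hard-coded UTF-8 bytes for 三月16日 so that it does not depend on the encoding of the source file. The variable names are arbitrary.

```perl
use strict;
use warnings;
use Encode 'decode';

# Raw UTF-8 bytes for 三月16日, as they might arrive from a file or socket.
my $bytes = "\xE4\xB8\x89\xE6\x9C\x88" . "16" . "\xE6\x97\xA5";

# Decode the bytes into Perl's internal character form.
my $text = decode ('UTF-8', $bytes);

# Eleven bytes have become five characters.
print length ($bytes), " bytes, ", length ($text), " characters\n";
```

The decoded $text is what should be handed to subsjdate. For text in another encoding such as Shift-JIS or EUC-JP, give decode that encoding's name instead.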

Author

Ben Bullock, benkasminbullock@gmail.com

History

This routine started life as a Visual Basic for Applications (VBA) script (a "Word Macro") to automatically convert Japanese dates in a Microsoft Word document into their equivalent English versions. See http://linuxtnt.wordpress.com/2008/04/16/visual-basic-date-translator-updated/ . Eventually, because I kept finding exceptions & I didn't know Visual Basic well enough to code that efficiently, I decided to rewrite it all in Perl, using Win32::OLE to automate the operation of Microsoft Word. (The Microsoft Word handlers are not included in this module.)

The basic idea is to ask Word to save a copy of the file as text via OLE, then read the text file in to Perl, look for dates in the text using subsjdate, and then call back into Microsoft Word using the replace_callback argument to subsjdate to substitute the Japanese dates with English ones.

See also

These other modules might be more suitable for some purposes:

DateTime::Locale::JA

This does the minimal stuff to make a Japanese date. One of those modules which has been made for completeness rather than for usefulness, it doesn't represent Japanese language usages very well, failing to contain Japanese eras, kanji numbers, wide numbers, etc.

DateTime::Format::Japanese

This parses Japanese dates. Unlike the present module it claims to also format them, so it can turn a DateTime object into a Japanese date, and it also does times. However, the module seems to be broken - it doesn't install on any system I've tried.

Lingua::JA::Numbers

This module has a very full set of kanji / numeral converters. It converts numbers including decimal points and numbers into the billions and trillions.

DateTime::Calendar::Japanese::Era

This module contains a full set of Japanese eras in romaji.

COPYRIGHT AND LICENCE

Copyright (C) 2008 Ben Kasmin Bullock. All rights reserved.

This module is distributed under the same terms as Perl itself, either Perl version 5.10.0 or, at your option, any later version of Perl 5 you may have available.
