NAME
Lingua::JA::FindDates - scan text to find dates in a Japanese format
SYNOPSIS
# Find and replace Japanese dates:
use utf8;
use Lingua::JA::FindDates 'subsjdate';
# Given a string, find and substitute all the Japanese dates in it.
my $with_era = '昭和41年三月16日';
print subsjdate ($with_era);
# prints "March 16, 1966"
# Find and substitute Japanese dates within a longer string:
my $in_text = 'blah blah blah 三月16日';
print subsjdate ($in_text);
# prints "blah blah blah March 16"
# subsjdate can also call back a user-supplied routine each time a
# date is found:
sub replace_callback
{
    my ($data, $before, $after) = @_;
    print "'$before' was replaced by '$after'.\n";
}
my $found = '三月16日';
my $data = 'xyz'; # something to send to replace_callback
subsjdate ($found, {replace => \&replace_callback, data => $data});
# prints "'三月16日' was replaced by 'March 16'."
# A routine can be used to format the date in any way, with subsjdate
# substituting whatever it returns:
sub my_date
{
    my ($data, $original, $date) = @_;
    return join '/', $date->{month}, $date->{date};
}
my $short = '三月16日';
print subsjdate ($short, {make_date => \&my_date});
# prints "3/16"
DESCRIPTION
This module offers pattern matching of dates in the Japanese language. Its main routine, "subsjdate", scans a text and finds things which appear to be Japanese dates.
The module recognizes the typical format of dates with the year first, followed by the month, then the day, such as 平成20年七月十日 (Heisei nijūnen shichigatsu tōka). It also recognizes combinations such as years alone, years and months, a month and day without a year, fiscal years (年度, "nendo"), parts of the month, like 中旬 (chūjun, the middle of the month), and periods between two dates.
It recognizes both the Japanese-style era-based year format, such as "平成24年" (Heisei) for the current era, and European-style Christian era year format, such as 2012年. It recognizes several forms of numerals, including the ordinary ASCII numerals, 1, 2, 3; the "wide" or "double width" numerals sometimes used in Japan, １, ２, ３; and the kanji-based numeral system, 一, 二, 三. It recognizes some special date formats such as 元年 for the first year of an era. It recognizes era names identified by their initial letters, such as S41年 for Shōwa 41 (1966). It recognizes dates regardless of any spacing which might be inserted between individual Japanese characters, such as "平 成 二 十 年 八 月".
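The relationship between era years and Western years mentioned above is simple arithmetic, which can be sketched as follows. This is only an illustration: era_to_western is a hypothetical helper, not a function of this module, and the table covers only the four eras from Meiji to Heisei.

```perl
use strict;
use warnings;
use utf8;

# Western year immediately before year 1 (元年) of each era.
my %era_offset = (
    '明治' => 1867,    # Meiji 1 = 1868
    '大正' => 1911,    # Taishō 1 = 1912
    '昭和' => 1925,    # Shōwa 1 = 1926
    '平成' => 1988,    # Heisei 1 = 1989
);

# Hypothetical helper, not part of Lingua::JA::FindDates.
# Returns the Western year, or nothing if the era is not in the table.
sub era_to_western
{
    my ($era, $era_year) = @_;
    my $offset = $era_offset{$era};
    return unless defined $offset;
    return $offset + $era_year;
}
```

With this table, era_to_western ('昭和', 41) gives 1966 and era_to_western ('平成', 24) gives 2012, matching the examples in the text.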
The input text must be marked as Unicode, in other words character data, not byte data.
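As an illustration of the difference between byte data and character data, the following sketch decodes UTF-8 bytes with the core Encode module before they would be handed to "subsjdate". The byte string here is written out by hand as an assumption about what might arrive from a file or socket; source code containing literal Japanese strings can use "use utf8;" instead.

```perl
use strict;
use warnings;
use Encode 'decode';

# Bytes as they might be read from a file or socket: the UTF-8
# encoding of 三月16日.
my $bytes = "\xE4\xB8\x89\xE6\x9C\x8816\xE6\x97\xA5";

# Turn the bytes into Perl character data.
my $text = decode ('UTF-8', $bytes);

# $bytes holds 11 bytes, but $text holds 5 characters.
my $byte_length = length ($bytes);
my $char_length = length ($text);
```

It is $text, not $bytes, which is suitable as input.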
The module has been tested on several hundred documents, and it should cope with all common Japanese dates. If you find that it cannot identify some kind of date within Japanese text, please report that as a bug.
FUNCTIONS
subsjdate
my $translation = subsjdate ($text);
Translate Japanese dates into American dates. The first argument to subsjdate is a string like "平成20年7月3日(木)". The routine looks through the string to see if there is anything which appears to be a Japanese date. If it finds one, it calls "make_date" to make the equivalent date in English (American-style), and then substitutes it into $text, as if performing the following type of operation:
$text =~ s/平成20年7月3日\(木\)/Thursday, July 3, 2008/g;
Users can supply a different date-making function using the second argument. The second argument is a hash reference which may have the following members:
- replace
-
subsjdate ($text, {replace => \&my_replace, data => $my_data});
# Now "my_replace" is called as
# my_replace ($my_data, $before, $after);
If there is a replace value in the callbacks, subsjdate calls it as a subroutine with the data in $callbacks->{data} and the before and after strings, in other words the matched date and the string with which it is to be replaced.
- data
-
Any data you want to pass to "replace", above.
- make_date
-
subsjdate ($text, {make_date => \&mymakedate});
This is a replacement for the default "make_date" function. The default function turns the Japanese dates into American-style dates, so, for example, "平成10年11月12日" is turned into "November 12, 1998". If you don't need to replace the default (if you want American-style dates), you can leave this blank. If, for example, you want dates in the form "Th 2008/7/3", you could write a routine like the following:
sub mymakedate
{
    my ($data, $original, $date) = @_;
    # Index 0 is a placeholder; weekdays are numbered from Monday = 1.
    my @weekdays = qw{Bad Mo Tu We Th Fr Sa Su};
    return $weekdays[$date->{wday}] . ' ' .
        $date->{year} . '/' . $date->{month} . '/' . $date->{date};
}
Your routine will be called in the same way as the default routine, "make_date". It needs to check whether the hash values for the fields year, month, date, and wday are zero, since "subsjdate" also matches dates which consist of a month and day only, or a year and month only. $data is any data which is passed in to "subsjdate". $original is the original text.
- make_date_interval
-
This is a replacement for the make_date_interval function.
subsjdate ($text, {make_date_interval => \&mymakedateinterval});
Your routine is called in the same way as the default routine, "make_date_interval". Its arguments are $data and $original, as for make_date, and the two dates in the form of hash references with the same keys as for make_date.
kanji2number
kanji2number ($knum)
kanji2number is a simple kanji number convertor for use with dates. Its input is one string of kanji numbers only, like '三十一'. It can deal with kanji numbers with or without the kanji for ten, hundred, and thousand. The return value is the numerical value of the kanji number, like 31, or zero if it can't read the number.
kanji2number only goes up to thousands, because usually dates only go that far. For a more comprehensive Japanese number convertor, see Lingua::JA::Numbers.
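To illustrate the kind of conversion described above, here is a minimal sketch of a kanji number reader which goes up to thousands. It is not the module's own code: kanji_to_number is a hypothetical stand-in, and its treatment of 〇 as zero and of strings without unit kanji as positional digits are assumptions made for this example.

```perl
use strict;
use warnings;
use utf8;

my %digit = (
    '〇' => 0, '一' => 1, '二' => 2, '三' => 3, '四' => 4,
    '五' => 5, '六' => 6, '七' => 7, '八' => 8, '九' => 9,
);
my %unit = ('十' => 10, '百' => 100, '千' => 1000);

# Hypothetical stand-in, not the module's kanji2number.
sub kanji_to_number
{
    my ($kanji) = @_;
    # Return zero for empty input or anything other than kanji numerals.
    return 0 unless $kanji =~ /\A[〇一二三四五六七八九十百千]+\z/;
    if ($kanji !~ /[十百千]/) {
        # No unit kanji: read the digits positionally (二〇一二 is 2012).
        return 0 + join '', map { $digit{$_} } split //, $kanji;
    }
    my $total = 0;
    my $pending = 0;    # a digit waiting for its unit
    for my $char (split //, $kanji) {
        if (exists $digit{$char}) {
            $pending = $digit{$char};
        }
        else {
            # A unit with no digit before it, like the 十 in 十五,
            # counts once.
            $total += ($pending || 1) * $unit{$char};
            $pending = 0;
        }
    }
    return $total + $pending;
}
```

Under these assumptions, kanji_to_number ('三十一') gives 31 and kanji_to_number ('二千十二') gives 2012.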
DEFAULT CALLBACKS
make_date
# Monday 19th March 2012.
make_date ({
    year => 2012,
    month => 3,
    date => 19,
    wday => 1,
})
make_date is the default date-string-making routine. It turns the date information supplied to it into a string representing the date. make_date is not exported.
subsjdate, given a date like 平成20年7月3日(木) (Heisei year 20, month 7, day 3, in other words "Thursday the third of July, 2008"), passes make_date a hash reference with values year => 2008, month => 7, date => 3, wday => 4 for the year, month, date and day of the week. make_date returns a string, 'Thursday, July 3, 2008'. If some fields of the date aren't defined, for example in the case of a date like 7月3日 (3rd July), the hash values for the keys of the unknown parts of the date, such as year or weekday, are undefined.
To replace the default routine make_date with a different format, supply a make_date callback to subsjdate:
sub my_date
{
    my ($data, $original, $date) = @_;
    return join '/', $date->{month}, $date->{date};
}
my $dates = '三月16日';
print subsjdate ($dates, {make_date => \&my_date});
This prints
3/16
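Since parts of the date may be missing, a replacement routine usually has to skip the fields which were not found. The following is a minimal sketch of one way to do that; partial_date is a hypothetical name, and the test for a true value is chosen so that both undefined and zero fields are skipped.

```perl
use strict;
use warnings;

# Hypothetical replacement for make_date which copes with partial dates.
sub partial_date
{
    my ($data, $original, $date) = @_;
    my @parts;
    # Keep only the fields which hold a true value, so that fields
    # which are undefined or zero are left out.
    for my $field (qw/year month date/) {
        push @parts, $date->{$field} if $date->{$field};
    }
    return join '/', @parts;
}
```

It would be supplied to subsjdate in the same way as my_date, as {make_date => \&partial_date}.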
make_date_interval
make_date_interval (
    # 19 February 2010
    {
        year => 2010,
        month => 2,
        date => 19,
    },
    # Monday 19th March 2012.
    {
        year => 2012,
        month => 3,
        date => 19,
        wday => 1,
    },
);
This function is called when an interval of two dates, such as 平成3年 7月2日〜9日, is detected. It makes a string to represent that interval in English. It takes two arguments, hash references to the first and second date. The hash references are in the same format as those passed to "make_date".
This function is not exported. It is the default used by "subsjdate". You can use another function instead of this default by supplying a make_date_interval value as a callback in "subsjdate".
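A user-supplied interval routine might look like the following sketch. It assumes the calling convention described under "subsjdate", namely $data and $original followed by the two date hash references; my_interval is a hypothetical name, and the output format is only an example.

```perl
use strict;
use warnings;

# Hypothetical replacement for make_date_interval.
sub my_interval
{
    my ($data, $original, $date1, $date2) = @_;
    my @ends;
    for my $date ($date1, $date2) {
        # Join whichever of year, month and date are present.
        push @ends, join '/', grep { $_ } @{$date}{qw/year month date/};
    }
    return join ' - ', @ends;
}
```

It would be supplied as {make_date_interval => \&my_interval}. Given the two example dates above, it returns "2010/2/19 - 2012/3/19".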
BUGS
The following special cases are not covered.
- Doesn't do 元日 (ganjitsu)
-
This date (another way to write "1st January") is a little difficult, since the characters which make it up could also occur in other contexts, like 元日本軍 gennihongun, "the former Japanese military". Correctly parsing it requires a linguistic analysis of the text, which this module isn't able to do.
- 10月第4月曜日
-
"10月第4月曜日", which means "the fourth Monday of October", comes out as "October第April曜日".
- 今年6月
-
The module does not handle things like 今年 (this year), 去年 (last year), or 来年 (next year).
- 末日
-
The module does not handle "末日" (matsujitsu) "the last day" (of a month).
- 土日祝日
-
The module does not handle "土日祝日" (weekends and holidays).
- 年末年始
-
The module does not handle "年末年始" (the new year period).
Please also note the following:
- No sanity check of Japanese era dates
-
It does not detect that dates like 昭和百年 (Shōwa 100) are invalid. Shōwa 100 is an impossible year, since the Shōwa era ended in Shōwa 64 (1989), which was also the first year of Heisei.
- Only goes back to Meiji
-
The date matching only goes back to the Meiji era. There is DateTime::Calendar::Japanese::Era if you need to go back further.
- Doesn't find dates in order
-
For those supplying their own callback routines, note that the dates returned won't be in the order that they are in the text, but in the order that they are found by the regular expressions, which means that in a string with two dates, the callbacks might be called for the second date before they are called for the first one. Basically the longer forms of dates are searched for before the shorter ones.
- UTF-8 version only
-
This module only understands Japanese encoded in Perl's internal form (UTF-8).
- Trips a bug in Perl 5.10
-
If you send subsjdate a string which is pure ASCII, you'll get a stream of warning messages about "uninitialized value". The error messages are wrong - this is actually a bug in Perl, reported as bug number 56902 (http://rt.perl.org/rt3/Public/Bug/Display.html?id=56902). But sending this routine a string which is pure ASCII doesn't make sense anyway, so don't worry too much about it.
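For the point above about dates not being found in order, one possible workaround is to record each match in a "replace" callback and sort the matches afterwards by their position in the original text. The following is a sketch only: collect_match and in_text_order are hypothetical names, and the sort relies on each matched string occurring once in the text and not being part of a longer match.

```perl
use strict;
use warnings;

# Hypothetical "replace" callback. The "data" value is assumed to be
# an array reference in which each match is recorded.
sub collect_match
{
    my ($found, $before, $after) = @_;
    push @$found, {before => $before, after => $after};
    return;
}

# Return the recorded matches in the order in which they occur in $text.
sub in_text_order
{
    my ($text, $found) = @_;
    return sort {
        index ($text, $a->{before}) <=> index ($text, $b->{before})
    } @$found;
}
```

The callback would be supplied as {replace => \&collect_match, data => \@found}, after which in_text_order ($text, \@found) gives the matches in reading order.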
EXPORTS
This module exports one function, "subsjdate", on request.
SEE ALSO
- DateTime::Locale::JA
-
Minimal selection of Japanese date functions. It's not complete enough to deal with the full range of dates in actual documents.
- DateTime::Format::Japanese
-
This parses Japanese dates. Unlike the present module it claims to also format them, so it can turn a DateTime object into a Japanese date, and it also does times.
- Lingua::JA::Numbers
-
Kanji / numeral convertors. It converts numbers including decimal points and numbers into the billions and trillions.
- DateTime::Calendar::Japanese::Era
-
A full set of Japanese eras.
AUTHOR
Ben Bullock, <bkb@cpan.org>
COPYRIGHT AND LICENCE
Copyright (C) 2008-2012 Ben Bullock.
You may use, copy, distribute, and modify this module under the same terms as the Perl programming language.