NAME
Lingua::JA::FindDates - scan text to find Japanese dates
SYNOPSIS
To find and replace Japanese dates in a string,
use Lingua::JA::FindDates 'subsjdate';
# Given a string, find and substitute all the Japanese dates in it.
my $dates = '昭和41年三月16日';
print subsjdate ($dates);
prints
March 16, 1966
Find and substitute Japanese dates within a string:
my $dates = 'blah blah blah 三月16日';
print subsjdate ($dates);
prints
blah blah blah March 16
subsjdate can also call back a user-supplied routine each time a date is found:
sub replace_callback
{
my ($data, $before, $after) = @_;
print "$before was replaced by $after\n";
}
my $dates = '三月16日';
my $data = 'xyz'; # something to send to replace_callback
subsjdate ($dates, {replace => \&replace_callback, data => $data});
prints
三月16日 was replaced by March 16
You can also use a routine to format the date any way, letting subsjdate
print it for you:
sub my_date
{
my ($date) = @_;
return join '/', $date->{month}, $date->{date};
}
my $dates = '三月16日';
print subsjdate ($dates, {make_date => \&my_date});
This prints
3/16
DESCRIPTION
This module uses a set of regular expressions to detect Japanese-style dates in a string. It recognizes typical Japanese year/month/day-style dates such as 平成20年七月十日 Heisei nijuunen shichigatsu tooka. It also recognizes combinations such as years alone, years and months, a month and day without a year, fiscal years, parts of the month like 中旬 (chuujun, the middle of the month), and periods between two dates.
- Matches 99.99% of Japanese dates
-
This module has been road-tested on hundreds of documents, and it can cope with virtually any kind of common Japanese date. If you find that it cannot identify some kind of date within Japanese text, please report that as a bug.
More examples
If you would like to see more examples of how this module works, look at the testing code in t/Lingua-JA-FindDates.t.
Exports
This module exports one function, subsjdate, on request.
kanji2number
- kanji2number ($knum)
-
kanji2number is a very simple kanji number converter. Its input is one string of kanji numbers only, like '三十一'. It can deal with kanji numbers with or without the kanji for ten, hundred, and thousand. The return value is the numerical value of the kanji number, like 31, or zero if it can't read the number. This function is not exported.
Bugs
kanji2number only goes up to thousands, because usually dates only go that far. If you need a comprehensive Japanese number converter, we recommend using Lingua::JA::Numbers instead. Also, kanji2number doesn't deal with mixed kanji and Arabic numbers.
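As an illustration of the kind of conversion described above, here is a minimal sketch of a kanji number reader that goes up to thousands. It is not the module's own kanji2number: the name kanji_to_number_sketch, the lookup tables, and the decision to read runs of digits positionally are all assumptions made for this example.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper for illustration; not part of Lingua::JA::FindDates.
my %digit = (
    '〇' => 0, '一' => 1, '二' => 2, '三' => 3, '四' => 4,
    '五' => 5, '六' => 6, '七' => 7, '八' => 8, '九' => 9,
);
my %multiplier = ('十' => 10, '百' => 100, '千' => 1000);

sub kanji_to_number_sketch
{
    my ($knum) = @_;
    return 0 unless defined $knum && length $knum;
    my $total = 0;
    my $pending = 0;    # digits seen since the last multiplier
    for my $char (split //, $knum) {
        if (exists $digit{$char}) {
            # Consecutive digits are read positionally, so 一九六六 is 1966.
            $pending = $pending * 10 + $digit{$char};
        }
        elsif (exists $multiplier{$char}) {
            # A bare 十, 百 or 千 counts as one of that unit.
            $total += ($pending || 1) * $multiplier{$char};
            $pending = 0;
        }
        else {
            # Anything else (Arabic digits included) is unreadable here.
            return 0;
        }
    }
    return $total + $pending;
}

print kanji_to_number_sketch ('三十一'), "\n";          # 31
print kanji_to_number_sketch ('千九百六十六'), "\n";    # 1966
```

Like the function documented above, this sketch returns zero for input it cannot read, including strings that mix kanji and Arabic digits.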
make_date
- make_date ($date)
-
make_date is the default date-string-making routine. It turns the date information supplied to it into a string representing the date. make_date is not exported.
subsjdate, given a date like 平成20年7月3日(木) (Heisei year 20, month 7, day 3, in other words "Thursday the third of July, 2008"), passes make_date a hash reference with values (year => 2008, month => 7, date => 3, wday => 4) for the year, month, date and day of the week. make_date returns a string, 'Thursday, July 3, 2008'. If some fields of the date aren't defined, for example in the case of a date like 7月3日 (3rd July), the hash values for the keys of the unknown parts of the date, such as year or weekday, will be undefined.
To replace the default routine make_date with a different format, supply a make_date callback to subsjdate:
sub my_date
{
my ($date) = @_;
return join '/', $date->{month}, $date->{date};
}
my $dates = '三月16日';
print subsjdate ($dates, {make_date => \&my_date});
This prints
3/16
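Because parts of the date may be missing, a replacement routine usually has to skip the fields it was not given. The following is a minimal sketch of such a callback. The name slash_date is hypothetical, and the callback is called by hand here with hash references in the format described above, so the example runs without the module installed.

```perl
use strict;
use warnings;

# Hypothetical callback; it would be passed as {make_date => \&slash_date}.
sub slash_date
{
    my ($date) = @_;
    # Keep only the parts that were found. No valid year, month or day
    # is zero, so filtering on truth also drops undefined fields.
    my @parts = grep { $_ } @{$date}{qw(year month date)};
    return join '/', @parts;
}

# A full date and a month/day-only date.
print slash_date ({year => 2008, month => 7, date => 3, wday => 4}), "\n";
print slash_date ({month => 7, date => 3}), "\n";
```

This prints 2008/7/3 and then 7/3.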
make_date_interval
This function is called when an interval of two dates, such as 平成3年 7月2日〜9日, is detected. It makes a string to represent that interval in English. It takes two arguments, hash references to the first and second date. The hash references are in the same format as make_date.
This function is not exported. It is the default used by subsjdate. You can use another function instead of this default by supplying a make_date_interval value as a callback in subsjdate.
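A replacement interval routine might look like the following sketch. The names format_ymd and slash_interval are hypothetical. The sketch also assumes that for an interval like 平成3年 7月2日〜9日 the second hash reference may lack the year and month, and borrows them from the first; this documentation does not say which fields the second date carries, so treat that as a guess.

```perl
use strict;
use warnings;

# Hypothetical helper: join whichever of year, month, date are set.
sub format_ymd
{
    my ($date) = @_;
    return join '/', grep { $_ } @{$date}{qw(year month date)};
}

# Hypothetical callback; it would be passed as
# {make_date_interval => \&slash_interval}.
sub slash_interval
{
    my ($first, $second) = @_;
    my %end = %$second;
    # Assumption: fields missing from the second date are inherited
    # from the first date.
    for my $key (qw(year month)) {
        $end{$key} ||= $first->{$key};
    }
    return format_ymd ($first) . ' to ' . format_ymd (\%end);
}

print slash_interval ({year => 1991, month => 7, date => 2}, {date => 9}), "\n";
```

Called by hand as above, this prints 1991/7/2 to 1991/7/9.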
$verbose
If you want to see what the module is doing, set
$Lingua::JA::FindDates::verbose = 1;
This makes subsjdate print out each regular expression and reports whether it matched, which looks like this:
Looking for y in ([0-9０-９]{4}|[十六七九五四千百二一八三]?千[十六七九五四千百二一八三]*)\h*年
Found '千九百六十六年': Arg 0: 1966 -> '1966'
subsjdate
- subsjdate ($text, $callbacks)
-
subsjdate, given a string (argument 1) containing some text like 平成20年7月3日(木), looks through the string using a set of regular expressions, and if it finds anything, it calls make_date to make the equivalent date in English, and then substitutes it into $text:
$text =~ s/平成20年7月3日\(木\)/Thursday, July 3, 2008/g;
Users can supply a different date-making function. See below.
- text
-
A string, encoded in Perl's internal encoding.
- callbacks
-
The hash reference $callbacks can take the following items:
- replace
-
If there is a replace value in the callbacks, subsjdate calls it as a subroutine with the data in $callbacks->{data} and the before and after strings.
- data
-
Any data you want to pass to the replace callback.
- make_date
-
This is a replacement for the make_date function. If you don't need to replace the default (if you want American-style dates), you can leave this blank. If, for example, you want dates in the form "Th 2008/7/3", you could write a routine like the following:
sub mymakedate
{
my ($date) = @_;
# Index 0 is a placeholder so that wday 1 is Monday.
my @wday = qw{Bad Mo Tu We Th Fr Sa Su};
return $wday[$date->{wday}] . ' ' . $date->{year} . '/' . $date->{month} . '/' . $date->{date};
}
Note that you need to check whether the hash values for year, month, date, and wday are set before using them, since subsjdate also matches dates consisting only of a month and day or only of a year and month.
- make_date_interval
-
This is a replacement for the make_date_interval function. Its arguments are two dates.
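The replace and data items work together: whatever is supplied as data comes back as the first argument of the replace callback. As a minimal sketch, the hypothetical callback below records every replacement in an array passed through data. It is called by hand here to show its arguments; what subsjdate does with the callback's return value is not stated in this documentation, so the sketch returns nothing.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical callback: store each replacement in the array reference
# that arrives as the first argument.
sub record_replacement
{
    my ($data, $before, $after) = @_;
    push @$data, {before => $before, after => $after};
    return;
}

my @log;
# With the module loaded this would be written as
# subsjdate ($text, {replace => \&record_replacement, data => \@log});
record_replacement (\@log, '三月16日', 'March 16');

binmode STDOUT, ':encoding(UTF-8)';
print "$_->{before} => $_->{after}\n" for @log;
```

Keeping the log in a caller-owned array avoids global variables, and the same callback can be reused across several calls to subsjdate.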
Bugs
- No sanity check of Japanese era dates
-
It does not detect that dates like 昭和百年 (Showa 100, an impossible year) are invalid.
- Only goes back to Meiji
-
The date matching only goes back to the Meiji era. There is DateTime::Calendar::Japanese::Era if you need to go back further.
- Doesn't find dates in order
-
The dates returned won't be in the order that they are in the text, but in the order that they are found by the regular expressions, which means that in a string with two dates, the callbacks might be called for the second date before they are called for the first one. Basically the longer forms of dates are searched for before the shorter ones.
- UTF-8 version only
-
This module only understands Japanese encoded in Perl's internal form (UTF-8).
- Trips a bug in Perl 5.10
-
If you send subsjdate a string which is pure ASCII, you'll get a stream of warning messages about "uninitialized value". The error messages are wrong - this is actually a bug in Perl, reported as bug number 56902 (http://rt.perl.org/rt3/Public/Bug/Display.html?id=56902). But sending this routine a string which is pure ASCII doesn't make sense anyway, so don't worry too much about it.
- Doesn't do 元日 (ganjitsu)
-
This date (another way to write "1st January") is a little difficult, since the characters which make it up could also occur in other contexts, like 元日本軍 gennihongun, "the former Japanese military". Correctly parsing it requires a linguistic analysis of the text, which this module isn't able to do.
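Regarding the missing sanity check of era dates noted in the list above, a caller could screen era years before trusting a result. The following is a rough sketch only: the table, the function name era_year_to_gregorian, and the choice to cover just the eras from Meiji to Heisei are assumptions for this example, not part of the module.

```perl
use strict;
use warnings;
use utf8;

# First Gregorian year and length in years of each era (Meiji to Heisei).
my %era = (
    '明治' => {start => 1868, years => 45},
    '大正' => {start => 1912, years => 15},
    '昭和' => {start => 1926, years => 64},
    '平成' => {start => 1989, years => 31},
);

# Hypothetical helper: returns the Gregorian year, or nothing if the
# era is unknown or the year lies outside the era.
sub era_year_to_gregorian
{
    my ($name, $year) = @_;
    my $info = $era{$name} or return;
    return if $year < 1 || $year > $info->{years};
    return $info->{start} + $year - 1;
}

print era_year_to_gregorian ('昭和', 41), "\n";    # 1966
print defined (era_year_to_gregorian ('昭和', 100)) ? "valid\n" : "invalid\n";
```

The second call prints "invalid", since Showa lasted only 64 years.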
Author
Ben Bullock, benkasminbullock@gmail.com
Motivation
The motivation for creating this module was as a form of assistance for translation of documents from Japanese into English, especially documents containing a large number of dates.
See also
These other modules might be more suitable for some purposes:
- DateTime::Locale::JA
-
This does the minimal stuff to make a Japanese date. One of those modules which has been made for completeness rather than for usefulness, it doesn't represent Japanese language usages very well, failing to contain Japanese eras, kanji numbers, wide numbers, etc.
- DateTime::Format::Japanese
-
This parses Japanese dates. Unlike the present module it claims to also format them, so it can turn a DateTime object into a Japanese date, and it also does times. However, the module seems to be broken - it doesn't install on any system I've tried.
- Lingua::JA::Numbers
-
This module has a very full set of kanji / numeral convertors. It converts numbers including decimal points and numbers into the billions and trillions.
- DateTime::Calendar::Japanese::Era
-
This module contains a full set of Japanese eras.
COPYRIGHT AND LICENCE
Copyright (C) 2008 Ben Kasmin Bullock.
This module is distributed under the same terms as Perl itself, either Perl version 5.10.0 or, at your option, any later version of Perl 5 you may have available.