NAME

Lingua::JA::FindDates - scan text to find dates in a Japanese format

SYNOPSIS

use utf8;

# Find and replace Japanese dates:

use Lingua::JA::FindDates 'subsjdate';

# Given a string, find and substitute all the Japanese dates in it.

my $dates = '昭和41年三月16日';
print subsjdate ($dates), "\n";

# prints "March 16, 1966"

# Find and substitute Japanese dates within a string:

$dates = 'blah blah blah 三月16日';
print subsjdate ($dates), "\n";

# prints "blah blah blah March 16"

# subsjdate can also call back a user-supplied routine each time a
# date is found:

sub replace_callback
{
    my ($data, $before, $after) = @_;
    print "'$before' was replaced by '$after'.\n";
}
$dates = '三月16日';
my $data = 'xyz';               # something to send to replace_callback
subsjdate ($dates, {replace => \&replace_callback, data => $data});

# prints "'三月16日' was replaced by 'March 16'."

# A routine can be used to format the date any way, letting subsjdate
# print it:

sub my_date
{
    my ($data, $original, $date) = @_;
    return join '/', $date->{month}, $date->{date};
}
$dates = '三月16日';
print subsjdate ($dates, {make_date => \&my_date}), "\n";

# This prints "3/16"

# Convert Western to Japanese dates
use Lingua::JA::FindDates 'seireki_to_nengo';
print seireki_to_nengo ('1989年1月1日'), "\n";
# This prints "昭和64年1月1日".

produces output

March 16, 1966
blah blah blah March 16
'三月16日' was replaced by 'March 16'.
3/16
昭和64年1月1日

(This example is included as synopsis.pl in the distribution.)

VERSION

This documents version 0.023 of Lingua::JA::FindDates corresponding to git commit 9ac73904449b16be3215c23dcdf58accbdf18345 released on Tue Aug 30 20:17:53 2016 +0900.

DESCRIPTION

This module's main routine, "subsjdate", scans a text and finds things which appear to be Japanese dates.

The module recognizes a variety of date formats. It recognizes the typical format of dates with the year first, followed by the month, then the day, such as 平成20年七月十日 (Heisei nijūnen shichigatsu tōka). It also recognizes combinations such as years alone, years and months, a month and day without a year, fiscal years (年度, nendo), parts of the month, like 中旬 (chūjun, the middle of the month), and periods between two dates.

The module recognizes both Japanese years, such as "平成24年" (Heisei), and European years, such as 2012年. It recognizes ASCII numerals, 1, 2, 3; the "wide" or "double width" numerals sometimes used in Japan, １, ２, ３ (see What is "wide ASCII"?); and the kanji-based numeral system, 一, 二, 三. It recognizes some special date formats such as 元年 for the first year of an era. It recognizes era names identified by their initial letters, such as S41年 for Shōwa 41 (1966). It recognizes dates regardless of spacing between characters, such as "平 成 二 十 年 八 月".

The input text must be marked as Unicode, in other words character data, not byte data.
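Text read from a file or socket without an encoding layer arrives as bytes, so it usually has to be decoded before it is passed to "subsjdate". The following is a minimal sketch, assuming the input is encoded as UTF-8. The byte string is produced with Encode's encode purely to simulate undecoded input.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);
use Lingua::JA::FindDates 'subsjdate';

binmode STDOUT, ':encoding(UTF-8)';

# Simulate undecoded input (an assumption for this sketch). In practice
# the bytes would come from a file or socket read without an encoding layer.
my $bytes = encode ('UTF-8', '平成20年7月3日');

# Decode the bytes into character data before looking for dates.
my $text = decode ('UTF-8', $bytes);

# The byte string is longer than the character string it encodes.
printf "%d bytes, %d characters\n", length ($bytes), length ($text);
print subsjdate ($text), "\n";
```

Opening the input with an ":encoding(UTF-8)" layer gives the same result without the explicit call to decode.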

The module has been tested on several hundred documents, and it should cope with all common Japanese dates. If you find that it cannot identify some kind of date within Japanese text, please report a bug.

FUNCTIONS

subsjdate

my $translation = subsjdate ($text);

Translate Japanese dates into American dates. The first argument to subsjdate is a string like "平成20年7月3日(木)". The routine looks through the string to see if there is anything which appears to be a Japanese date. If it finds one, it makes an equivalent date in English, and then substitutes it into $text, as if performing the following type of operation:

$text =~ s/平成20年7月3日\(木\)/Thursday, July 3, 2008/g;

If the text contains the interval between two dates, subsjdate attempts to convert that into an English-language interval.

The default dates are American-style, with the month first. Users can supply a different date-making function using the second argument:

my $translation = subsjdate ($text, {make_date => \&mymakedate,
                             make_date_interval => \&myinterval});

The second argument is a hash reference which may have the following members:

replace

subsjdate ($text, {replace => \&my_replace, data => $my_data});
# Now "my_replace" is called as
# my_replace ($my_data, $before, $after);

If a code reference is supplied as replace, subsjdate calls it with three arguments: the user-defined data supplied as "data"; $before, the matched date; and $after, the string which would replace it.

If replace is not supplied, subsjdate substitutes the dates itself.

data

Any data you want to pass to "replace", above. If nothing is supplied, subsjdate simply passes the undefined value.
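As an illustration of how "replace" and "data" work together, the following sketch passes an array reference as "data" and has the callback record each date it is given. The routine name collect_date and the sample sentence are invented for this example.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::FindDates 'subsjdate';

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical callback: $data is whatever was supplied as "data",
# here a reference to the array @found.
sub collect_date
{
    my ($data, $before, $after) = @_;
    push @$data, [$before, $after];
}

my @found;
my $text = '昭和41年三月16日に生まれ、平成20年7月3日に引っ越した。';
subsjdate ($text, {replace => \&collect_date, data => \@found});

# Print each matched date next to its proposed replacement.
for my $pair (@found) {
    print "$pair->[0] => $pair->[1]\n";
}
```

Passing a reference as "data" lets the callback accumulate results without using a global variable.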

make_date

subsjdate ($text, {make_date => \&mymakedate});

This is a replacement for the default "default_make_date" function. The default function turns "平成10年11月12日" into "November 12, 1998". To change this to dates in the form "Th 2008/7/3", use a routine like the following:

use utf8;
use Lingua::JA::FindDates 'subsjdate';
sub mymakedate
{
    my ($data, $original, $date) = @_;
    return qw{Bad Mo Tu We Th Fr Sa Su}[$date->{wday}]. " " .
    $date->{year}.'/'.$date->{month}.'/'.$date->{date};
}
my $input = '山口百恵の誕生日は昭和34年1月17日(土)。中元すず香の誕生日は平成9年12月20日(土)。';
my $output = subsjdate ($input, {make_date => \&mymakedate});
print "$output\n";

produces output

山口百恵の誕生日はSa 1959/1/17。中元すず香の誕生日はSa 1997/12/20。

(This example is included as subsjdate-make-date.pl in the distribution.)

The first two arguments passed to the user-defined routine are $data, user-defined data as described in "data", and $original, the original Japanese-language date. The following argument is the date as a hash reference, with the fields year (Western-style year), month (1-12), date (1-31), and wday (1-7 for Monday to Sunday). Your routine must check whether the fields year, month, date, and wday are defined, since "subsjdate" matches all kinds of dates including year only, month/day only, and year/month only dates.
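The check for missing fields might look like the following sketch. The routine name partial_iso_date and its output format are invented for this example. It tests each field for a true value, as the crazy_date example in this document does for the year, so that a field set to zero is also treated as missing.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::FindDates 'subsjdate';

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical formatter: join whichever of year, month and day are
# present with hyphens, e.g. "2008-07-03" for a complete date.
sub partial_iso_date
{
    my ($data, $original, $date) = @_;
    my @parts;
    if ($date->{year}) {
        push @parts, sprintf ('%04d', $date->{year});
    }
    if ($date->{month}) {
        push @parts, sprintf ('%02d', $date->{month});
    }
    if ($date->{date}) {
        push @parts, sprintf ('%02d', $date->{date});
    }
    # Nothing usable was supplied, so keep the original Japanese text.
    return $original unless @parts;
    return join '-', @parts;
}

print subsjdate ('平成20年7月3日', {make_date => \&partial_iso_date}), "\n";
print subsjdate ('三月16日', {make_date => \&partial_iso_date}), "\n";
```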

make_date_interval

This is a replacement for the "default_make_date_interval" function.

subsjdate ($text, {make_date_interval => \&mymakedateinterval});

Your routine is called in the same way as the default routine, "default_make_date_interval". Its arguments are $data and $original as for make_date, and the two dates in the form of hash references with the same keys as for make_date.

use utf8;
use Lingua::JA::FindDates 'subsjdate';
sub crazy_date
{
    my ($date) = @_;
    my $out = "$date->{month}/$date->{date}";
    if ($date->{year}) {
        $out = "$date->{year}/$out";
    }
    return $out;
}
sub myinterval
{
    my ($data, $original, $date1, $date2) = @_;
    # Ignore $data and $original.
    return crazy_date ($date1) . " until " . crazy_date ($date2);
} 
my $input = '昭和34年1月17日〜12月20日。';
#$Lingua::JA::FindDates::verbose = 1;
my $output = subsjdate ($input, {make_date_interval => \&myinterval});
print "$output\n";

produces output

1959/1/17 until 12/20。

(This example is included as subsjdate-make-interval.pl in the distribution.)

kanji2number

kanji2number ($knum)

kanji2number is a simple kanji number convertor for use with dates. Its input is one string of kanji numbers only, like '三十一'. It can deal with kanji numbers with or without the kanji for ten, hundred, and thousand. The return value is the numerical value of the kanji number, like 31, or zero if it can't read the number.

kanji2number only goes up to thousands, because usually dates only go that far. For a more comprehensive Japanese number convertor, see Lingua::JA::Numbers.
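A short sketch of calling kanji2number and checking for the zero return value follows. The function is called by its full package name here, on the assumption that this works whether or not it can be imported. The second input string is an invented example.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::FindDates;

binmode STDOUT, ':encoding(UTF-8)';

for my $kanji ('三十一', '十二') {
    my $number = Lingua::JA::FindDates::kanji2number ($kanji);
    if ($number == 0) {
        # Zero means the string could not be read as a number.
        print "Could not read '$kanji'.\n";
    }
    else {
        print "$kanji is $number\n";
    }
}
```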

seireki_to_nengo

use utf8;
use Lingua::JA::FindDates 'seireki_to_nengo';
print seireki_to_nengo ('1989年1月1日');

produces output

昭和64年1月1日

(This example is included as seireki-to-nengo.pl in the distribution.)

This function substitutes Western-style dates with Japanese-style "nengo" dates (年号). The "nengo" dates go back to the Meiji period (1868). See "BUGS".

nengo_to_seireki

use utf8;
use Lingua::JA::FindDates 'nengo_to_seireki';
print nengo_to_seireki ('昭和64年1月1日');

produces output

1989年1月1日

(This example is included as nengo-to-seireki.pl in the distribution.)

This function substitutes Japanese-style "nengo" dates (年号) with Western-style dates. The "nengo" dates go back to the Meiji period (1868). See "BUGS".

DEFAULT CALLBACKS

This section discusses the default subroutines which are called as dates are found to convert the Japanese dates into another string format. These callbacks are not exported. In versions of this module prior to 0.022, these functions were called make_date and make_date_interval respectively. The previous names still work.

default_make_date

"subsjdate", given a date like 平成20年7月3日(木) (Heisei year 20, month 7, day 3, in other words "Thursday the third of July, 2008"), passes default_make_date a hash reference with values year => 2008, month => 7, date => 3, wday => 4 for the year, month, date and day of the week. default_make_date turns the date information supplied to it into a string representing the date. default_make_date is not exported.

Here is an example of how it operates:

use Lingua::JA::FindDates;
my $outdate = Lingua::JA::FindDates::default_make_date ({
    year => 2012,
    month => 3,
    date => 19,
    wday => 1,
});
print "$outdate\n";

produces output

Monday, March 19, 2012

(This example is included as make-date.pl in the distribution.)

To replace the default routine default_make_date with a different format, supply a make_date callback to "subsjdate":

use utf8;
use Lingua::JA::FindDates 'subsjdate';
sub my_date
{
    my ($data, $original, $date) = @_;
    return join '/', $date->{month}, $date->{date};
}
my $dates = '三月16日';
print subsjdate ($dates, {make_date => \&my_date});

produces output

3/16

(This example is included as my-date.pl in the distribution.)

Note that, depending on what dates are in your document, some of the hash values may not be available, so the user routine needs to handle the cases when the year or the month or the day of the month are missing.

default_make_date_interval

use Lingua::JA::FindDates;
print Lingua::JA::FindDates::default_make_date_interval (
{
    # 19 February 2010
    year => 2010,
    month => 2,
    date => 19,
    wday => 5,
},
# Monday 19th March 2012.
{
    year => 2012,
    month => 3,
    date => 19,
    wday => 1,
},), "\n";

produces output

Friday 19 February, 2010-Monday 19 March, 2012

(This example is included as default-make-date-interval.pl in the distribution.)

This function is called when an interval of two dates, such as 平成3年 7月2日〜9日, is detected. It makes a string to represent that interval in English. It takes two arguments, hash references to the first and second date. The hash references are in the same format as "default_make_date".

This function is not exported. It is the default used by "subsjdate". You can use another function instead of this default by supplying a value make_date_interval as a callback in "subsjdate".

BUGS

The following special cases are not covered.

Doesn't do 元日 (ganjitsu)

This date (another way to write "1st January") is a little difficult, since the characters which make it up could also occur in other contexts, like 元日本軍 gennihongun, "the former Japanese military". Correctly parsing it requires a linguistic analysis of the text, which this module isn't able to do.

10月第4月曜日

"10月第4月曜日", which means "the fourth Monday of October", comes out as "October第April曜日".

今年6月

The module does not handle things like 今年 (this year), 去年 (last year), or 来年 (next year).

末日

The module does not handle "末日" (matsujitsu) "the last day" (of a month).

土日祝日

The module does not handle "土日祝日" (weekends and holidays).

年末年始

The module does not handle "年末年始" (the new year period).

Please also note the following:

Minimal sanity check of Japanese era dates

It does not detect that dates like 昭和99年 (Showa 99, an impossible year, since the Showa era ended in Showa 64 (1989), which was succeeded by Heisei 1 in the same year) are invalid. It does, however, only allow two digits for these named-era dates.
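If impossible era years matter for your documents, one option is to check the text yourself before calling "subsjdate". The following is a rough sketch and not part of this module. The routine name and the table of final era years are written for this example, and it only looks at era names written in kanji followed directly by ASCII digits, so it misses kanji numerals, wide numerals, spaced-out dates, and initials such as S41年.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Final year of each era from Meiji onwards (supplied for this sketch).
my %last_year = (
    '明治' => 45,
    '大正' => 15,
    '昭和' => 64,
    '平成' => 31,
);

# Hypothetical helper: return every era date in $text whose year is
# outside the range 1 to the final year of its era.
sub impossible_era_years
{
    my ($text) = @_;
    my @bad;
    while ($text =~ /(明治|大正|昭和|平成)([0-9]{1,2})年/g) {
        my ($era, $year) = ($1, $2);
        if ($year < 1 || $year > $last_year{$era}) {
            push @bad, "$era${year}年";
        }
    }
    return @bad;
}

print "$_\n" for impossible_era_years ('昭和99年と平成20年');
# prints "昭和99年"
```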

Only goes back to Meiji

The date matching only goes back to the Meiji era. There is DateTime::Calendar::Japanese::Era if you need to go back further.

Doesn't find dates in order

For those supplying their own callback routines, note that the dates returned won't be in the order that they are in the text, but in the order that they are found by the regular expressions, which means that in a string with two dates, the callbacks might be called for the second date before they are called for the first one. Basically the longer forms of dates are searched for before the shorter ones.
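If you need the dates in the order they appear in the text, one possible workaround is to record each match in a "replace" callback and sort the matches afterwards by their position in the input. This is only a sketch. It uses index to find the position, which returns the first occurrence only, so a date which is repeated in the text, or which also occurs inside a longer date, gets a misleading position.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::FindDates 'subsjdate';

binmode STDOUT, ':encoding(UTF-8)';

my $text = '三月16日に出発し、平成20年7月3日に帰国した。';
my @found;

subsjdate ($text, {
    replace => sub {
        my ($data, $before, $after) = @_;
        # Record where the matched date first occurs in the input.
        push @found, {
            before => $before,
            after => $after,
            position => index ($text, $before),
        };
    },
});

# Report the dates in text order rather than in the order found.
for my $hit (sort {$a->{position} <=> $b->{position}} @found) {
    print "$hit->{position}: $hit->{before} => $hit->{after}\n";
}
```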

UTF-8 version only

This module only understands Japanese encoded in Perl's internal form (UTF-8).

Trips a bug in Perl 5.10

If you send subsjdate a string which is pure ASCII, you'll get a stream of warning messages about "uninitialized value". The error messages are wrong - this is actually a bug in Perl, reported as bug number 56902 (http://rt.perl.org/rt3/Public/Bug/Display.html?id=56902). But sending this routine a string which is pure ASCII doesn't make sense anyway, so don't worry too much about it.
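If the warnings are a nuisance, a simple guard can skip strings which contain only ASCII characters. The following is a sketch, with an invented wrapper name, which assumes that every date the module can find contains at least one non-ASCII character such as 年, 月, or 日.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::FindDates 'subsjdate';

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical wrapper: only call subsjdate when the text contains at
# least one character outside the ASCII range.
sub subsjdate_if_japanese
{
    my ($text) = @_;
    if ($text !~ /[^\x00-\x7F]/) {
        # Pure ASCII, so return the text unchanged.
        return $text;
    }
    return subsjdate ($text);
}

print subsjdate_if_japanese ('No Japanese dates here.'), "\n";
print subsjdate_if_japanese ('平成20年7月3日'), "\n";
```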

EXPORTS

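This module exports nothing by default. The functions "subsjdate", "seireki_to_nengo", and "nengo_to_seireki" are exported on request, as shown in the examples above.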

SEE ALSO

DateTime::Locale::JA

Minimal selection of Japanese date functions. It's not complete enough to deal with the full range of dates in actual documents.

DateTime::Format::Japanese

This parses Japanese dates. Unlike the present module it claims to also format them, so it can turn a DateTime object into a Japanese date, and it also does times.

Lingua::JA::Numbers

Kanji / numeral convertors. It converts numbers including decimal points and numbers into the billions and trillions.

DateTime::Calendar::Japanese::Era

A full set of Japanese eras.

AUTHOR

Ben Bullock, <bkb@cpan.org>

Request

If you'd like to see this module continued, let me know that you're using it. For example, send an email, write a bug report, star the project's github repository, add a patch, add a ++ on Metacpan.org, or write a rating at CPAN ratings. It really does make a difference. Thanks.

COPYRIGHT & LICENCE

This package and associated files are copyright (C) 2008-2016 Ben Bullock.

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.