NAME

ShiftJIS::Collate - collation of Shift_JIS strings

SYNOPSIS

use ShiftJIS::Collate;

@sorted = ShiftJIS::Collate->new(%tailoring)->sort(@source);

ABOUT THIS POD

This POD is written in Shift_JIS.

Do you see 'あ' as HIRAGANA LETTER A, and '\' as YEN SIGN rather than REVERSE SOLIDUS? If not, change your font to an appropriate one (or the POD may have been badly converted).

DESCRIPTION

This module provides methods to compare and sort Shift_JIS strings according to the collation rules for Japanese character strings.

This module is an implementation of JIS X 4061:1996 and the collation rules are based on that standard. See "Conformance to the Standard".

Constructor and Tailoring

The new method returns a collator object.

$Collator = ShiftJIS::Collate->new(
   ignoreChar => $regexIgnoredChar,
   kanji => $kanji_class,
   katakana_before_hiragana => $bool,
   level => $collationLevel,
   position_in_bytes => $bool,
   tounicode  => \&sjis_to_unicode,
   preprocess => \&preprocess,
   upper_before_lower => $bool,
);
# if %tailoring is false (empty),
# $Collator should do the default collation.

ignoreChar

If specified as a regular expression, any characters that match it are ignored during collation.

e.g. To ignore KATAKANA-HIRAGANA PROLONGED SOUND MARK and its halfwidth form, say

ignoreChar => '^(?:\x81\x5B|\xB0)',

katakana_before_hiragana

By default, hiragana is before katakana.

If the parameter is true, this is reversed.

kanji

Set the kanji class. See "Kanji Classes".

Class 1: 'saisho' (minimal)
Class 2: 'kihon' (basic)
Class 3: 'kakucho' (extended)

The kanji class is specified as 1, 2, or 3. If omitted, class 2 is applied.

This module does not provide collation of the 'kakucho' kanji class, since the Shift_JIS repertoire does not include all the Unicode CJK unified ideographs.

However, if kanji class 3 is specified, kanji are collated in Unicode order. In this case you must provide a tounicode coderef that returns the Unicode codepoint of a Shift_JIS character.

level

Set the maximum level. See "Collation Levels". Any higher levels than the specified one are ignored.

Level 1: alphabetic ordering
Level 2: diacritic ordering
Level 3: case ordering
Level 4: script ordering
Level 5: width ordering

The collation level is specified as a number between 1 and 5. If omitted, level 4 is applied.

tounicode

If you want to collate kanji in Unicode order, specify a coderef that returns the Unicode codepoint of a Shift_JIS character.

Such a subroutine should map a string consisting of a single JIS level 1 or level 2 kanji in Shift_JIS to a codepoint in the range between 0x4E00 and 0x9FFF.
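
A minimal sketch of such a coderef, assuming the core Encode module and its cp932 encoding are available. The name sjis_to_unicode is borrowed from the constructor synopsis above; the body is only an illustration and is not code shipped with this module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Map one Shift_JIS character (a byte string) to its Unicode codepoint.
# Returns undef when the bytes do not decode to exactly one character.
sub sjis_to_unicode {
    my ($sjis_char) = @_;
    my $unicode = decode('cp932', $sjis_char);
    return length($unicode) == 1 ? ord($unicode) : undef;
}

# 0x93FA is the Shift_JIS code of the first kanji of "Nihongo".
printf "U+%04X\n", sjis_to_unicode("\x93\xFA");   # prints U+65E5
```

Such a subroutine would then be passed as tounicode => \&sjis_to_unicode together with kanji => 3.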

position_in_bytes

By default, the index method returns its results in characters.

If this parameter is true, it returns the results in bytes.

preprocess

If specified, the coderef is used to preprocess each string before its sort key is formed.

upper_before_lower

By default, lowercase is before uppercase.

If the parameter is true, this is reversed.

Comparison

$result = $Collator->cmp($a, $b)

Returns 1 (when $a is greater than $b), 0 (when $a is equal to $b), or -1 (when $a is less than $b).

$result = $Collator->eq($a, $b)
$result = $Collator->ne($a, $b)
$result = $Collator->gt($a, $b)
$result = $Collator->ge($a, $b)
$result = $Collator->lt($a, $b)
$result = $Collator->le($a, $b)

They work like the Perl operators of the same names.

Sorting

$sortKey = $Collator->getSortKey($string)

Returns a sort key.

Sort keys can be compared with a binary comparison; the result is the same as comparing the original strings with the collator.

$Collator->getSortKey($a) cmp $Collator->getSortKey($b)

   is equivalent to

$Collator->cmp($a, $b)

@sorted = $Collator->sort(@source)

Sorts a list of strings by tanjun shogo: 'the simple collation'.
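
Because sort keys compare correctly with plain cmp, a key can be computed once per string and reused throughout a sort. The sketch below shows that pattern. Only getSortKey is taken from this document; sort_by_cached_keys and the stand-in class Local::DemoCollator are hypothetical names, used so that the code runs without the module installed.

```perl
use strict;
use warnings;

# Sort strings by precomputed sort keys: each key is built once,
# then the keys are compared with the binary operator cmp.
sub sort_by_cached_keys {
    my ($collator, @strings) = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ $collator->getSortKey($_), $_ ] } @strings;
}

# Hypothetical stand-in collator for this sketch only;
# its "sort key" is simply the lowercased string.
{
    package Local::DemoCollator;
    sub new        { return bless {}, shift }
    sub getSortKey { my ($self, $string) = @_; return lc $string }
}

my @sorted = sort_by_cached_keys(Local::DemoCollator->new, qw(perl Ada c Basic));
print "@sorted\n";   # prints: Ada Basic c perl
```

With the real module, a collator returned by ShiftJIS::Collate->new(%tailoring) would be passed in place of the stand-in object.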

@sorted = $Collator->sortYomi(@source)

Sorts a list of references to arrays of (spelling, reading) by yomi-hyoki shogo: 'the collation using readings and spellings'.

E.g., an element of @source is typically ['日本語', 'にほんご']. The reverse, ['にほんご', '日本語'], is also allowed.

Yomi-hyoki shogo is carried out through two comparison stages.

E.g., sort these strings by 'Yomi-hyoki shogo'.

['永田', 'ながた'], ['小山', 'おやま'], ['長田', 'おさだ'], ['長田', 'ながた'], ['小山', 'こやま'].

First, order by reading ('おさだ' < 'おやま' < 'こやま' < 'ながた'); next, order by spelling among strings having the same reading ('永田' < '長田' where both are read as 'ながた').

The result should be ['長田', 'おさだ'] < ['小山', 'おやま'] < ['小山', 'こやま'] < ['永田', 'ながた'] < ['長田', 'ながた'].

See also sample/yomi.txt.
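
The two comparison stages can be sketched in plain Perl as follows. This is not the module's implementation: sort_by_reading_then_spelling is a hypothetical helper, the strings are Unicode rather than Shift_JIS, and the comparator is a stand-in that uses plain codepoint order (which happens to give the same result for this small data set).

```perl
use strict;
use warnings;
use utf8;

# Two-stage comparison: readings first, spellings only to break ties.
# $compare is any two-argument comparator returning -1, 0, or 1.
sub sort_by_reading_then_spelling {
    my ($compare, @pairs) = @_;
    return sort {
        $compare->($a->[1], $b->[1]) || $compare->($a->[0], $b->[0])
    } @pairs;
}

# Stand-in comparator (plain codepoint order), used only for this sketch.
my $by_codepoint = sub { $_[0] cmp $_[1] };

my @source = (
    ['永田', 'ながた'], ['小山', 'おやま'], ['長田', 'おさだ'],
    ['長田', 'ながた'], ['小山', 'こやま'],
);
my @sorted = sort_by_reading_then_spelling($by_codepoint, @source);

binmode STDOUT, ':encoding(UTF-8)';
print join(' < ', map { "$_->[0]/$_->[1]" } @sorted), "\n";
```

With the module itself, the same ordering would be requested with $Collator->sortYomi(@source) on Shift_JIS strings.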

@sorted = $Collator->sortDaihyo(@source)

Sorts a list of references to arrays of (spelling, reading) by kan'i-daihyo-yomi shogo: 'the simplified representative reading collation'.

Kan'i-daihyo-yomi shogo is carried out through five comparison stages. The following ordered list is an example of the result of "kan'i-daihyo-yomi shogo".

['４面体', 'しめんたい'],
['２色性', 'にしょくせい'],
['４次元', 'よじげん'],
['６面体', 'ろくめんたい'],
['α崩壊', 'アルファほうかい'],
['Γ関数', 'ガンマかんすう'],
['β線',   'ベータせん'],
['Ｑ値',   'キューち'],
['ＪＩＳ', 'じす'],
['Perl',   'パール'],
['河西',   'かさい'],
['河合',   'かわい'],
['河田',   'かわだ'],
['河内',   'かわち'],
['河辺',   'かわべ'],
['角田',   'かくた'],
['角田',   'かどた'],
['関東',   'かんとう'],
['河内',   'こうち'],
['沢島',   'さわしま'],
['沢嶋',   'さわしま'],
['沢田',   'さわだ'],
['澤島',   'さわしま'],
['澤嶋',   'さわしま'],
['澤田',   'さわだ'],
['角田',   'つのだ'],
['土井',   'つちい'],
['土居',   'つちい'],
['土井',   'どい'],
['土居',   'どい'],

(1) Compare the character class of the first character of the spelling.

Digit class ('４面体') < Greek class ('α崩壊') < Latin class ('ＪＩＳ') < Kanji class ('関東').

(2) Compare the first character of the reading.

e.g. 'しめんたい' < 'にしょくせい' < 'よじげん' < 'ろくめんたい'.

(3) Compare the first character of the spelling.

e.g. ('河西','河田',etc.) < ('角田','角田') < ('関東');

     ('沢島','沢嶋','沢田') < ('澤島','澤嶋','澤田').

(4) Compare the whole string of the reading.

e.g. ['河西', 'かさい'] < ['河合', 'かわい'] < ['河田', 'かわだ'];

     ['角田', 'かくた'] < ['角田', 'かどた'].

(5) Compare the whole string of the spelling.

e.g. ['沢島', 'さわしま'] < ['沢嶋', 'さわしま'] < ['沢田', 'さわだ'].

See also sample/daihyo.txt.

Searching

$position = $Collator->index($string, $substring)
($position, $length) = $Collator->index($string, $substring)

If $substring matches a part of $string, returns the position of the first occurrence of the matching part in scalar context; in list context, returns a two-element list of the position and the length of the matching part.

Notice that the length of the matching part may differ from the length of $substring.

If $substring does not match any part of $string, returns -1 in scalar context and an empty list in list context.

e.g. you say

use ShiftJIS::Collate;
use ShiftJIS::String qw(substr);

my $Col = ShiftJIS::Collate->new( level => $level );
# The string literals must be Shift_JIS, so save this script in Shift_JIS.
my $str = "* ひらがなとカタカナはレベル３では等しいかな。";
my $sub = "かな";
my $match;
if (my @tmp = $Col->index($str, $sub)) {
  $match = substr($str, $tmp[0], $tmp[1]);
}

If $level is 1, you get "がな"; if $level is 2 or 3, you get "カナ"; if $level is 4 or 5, you get "かな".

If your substr function is not oriented to Shift_JIS, specify true as position_in_bytes. See "Constructor and Tailoring".
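
As a rough illustration of the difference between character positions and byte positions, the sketch below converts a character offset into a byte offset for a Shift_JIS string. It assumes the core Encode module with its cp932 encoding; char_to_byte_offset is a hypothetical helper and not part of this module.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

# Convert a character offset within a Shift_JIS byte string
# into the corresponding byte offset.
sub char_to_byte_offset {
    my ($sjis_bytes, $char_offset) = @_;
    my $text   = decode('cp932', $sjis_bytes);
    my $prefix = substr($text, 0, $char_offset);
    return length(encode('cp932', $prefix));
}

# "* " is two single-byte characters; each kana takes two bytes.
my $sjis = encode('cp932', '* ひらがな');
print char_to_byte_offset($sjis, 4), "\n";   # prints 6
```

Setting position_in_bytes to true makes the index method report byte positions directly, so a conversion like this is not needed in that case.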

NOTE

Collation Levels

The following criteria are considered in order until the collation order is determined. By default, Levels 1 to 4 are applied and Level 5 is ignored (as JIS does NOT specify level 5).

Level 1: alphabetic ordering.

A character class that appears earlier in the following list sorts earlier.

Space characters, Symbols and Punctuations, Digits, Greek Letters,
Cyrillic Letters, Latin letters, Kana letters, Kanji ideographs,
and Geta mark.

Within a class, alphabets are collated alphabetically; kana letters are collated AIUEO-betically (in the Gozyuon order, '五十音順'); kanji are collated in the JIS X 0208 order.

Characters that do not belong to any character class are ignored and skipped for collation.

Geta mark ('〓', 0x81AC, U+3013) is the greatest character (ordered last).

Any character for which the order is not defined, such as control characters, box drawings, and unassigned characters, is regarded as a completely ignorable character.

Level 2: diacritic ordering.

In kana, the order is as shown in the following list.

A voiceless kana, the voiced, then the semi-voiced (if it exists).
 (e.g. 'か' < 'が'; 'は' < 'ば' < 'ぱ')

Level 3: case ordering.

A small Latin letter is less than the corresponding capital.

In kana, the order is as shown in the following list. See "Replacement of PROLONGED SOUND MARKs and ITERATION MARKs".

Replaced PROLONGED SOUND MARKs (U+30FC and U+FF70);
Small Kana;
Replaced ITERATION MARKs (U+309D, U+309E, U+30FD, and U+30FE);
then, Normal Kana.

Then, e.g., 'あー' < 'あぁ' < 'あゝ' < 'ああ'.

Level 4: script ordering.

Any hiragana is less than the corresponding katakana.

Then, e.g., 'あ' < 'ア'.

Level 5: width ordering.

A character that belongs to the block Halfwidth and Fullwidth Forms is greater than the corresponding normal character.

NB: JIS does not mention this level. Level 5 is an extension by this module.
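
How the levels are considered in order, and how a maximum level cuts off the higher ones, can be sketched with a toy sort key. The code below is an illustration only and not the module's implementation: toy_sort_key is a hypothetical helper, it handles Latin letters only, and the weights it uses are assumptions chosen to mirror the rules above (lowercase before uppercase at Level 3, fullwidth after normal at Level 5).

```perl
use strict;
use warnings;
use utf8;

# Toy sort key for Latin letters only. Assumed weights for illustration:
# level 1 = the letter itself, level 3 = case, level 5 = width.
# Levels 2 and 4 carry no information for Latin letters and stay empty.
sub toy_sort_key {
    my ($string, $max_level) = @_;
    my %level = map { $_ => '' } 1 .. 5;
    for my $char (split //, $string) {
        # Fullwidth Latin letters are U+FF21..U+FF3A and U+FF41..U+FF5A.
        my $is_wide = $char =~ /[\x{FF21}-\x{FF3A}\x{FF41}-\x{FF5A}]/ ? 1 : 0;
        my $narrow  = $is_wide ? chr(ord($char) - 0xFEE0) : $char;
        next unless $narrow =~ /\A[A-Za-z]\z/;     # anything else is ignored
        $level{1} .= lc $narrow;
        $level{3} .= $narrow =~ /[A-Z]/ ? 1 : 0;   # lowercase before uppercase
        $level{5} .= $is_wide;                     # fullwidth after normal
    }
    # Concatenate the levels in order; levels above $max_level are dropped.
    return join "\0", @level{1 .. $max_level};
}

for my $max (2, 3) {
    my $order = toy_sort_key('perl', $max) cmp toy_sort_key('Perl', $max);
    print "level $max: $order\n";   # prints "level 2: 0", then "level 3: -1"
}
```

A difference at a lower level always decides the result before any higher level is looked at, which is why the levels are concatenated in order.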

Kanji Classes

There are three kanji classes. This module provides Classes 1 and 2.

Class 1: the 'saisho' (minimal) kanji class

It comprises five kanji-like characters, i.e. '〃' (0x8156, U+3003), '々' (0x8158, U+3005), '仝' (0x8157, U+4EDD), '〆' (0x8159, U+3006), and '〇' (0x815A, U+3007).

Any kanji except '仝' are ignored on collation.

Class 2: the 'kihon' (basic) kanji class

It comprises JIS level 1 and 2 kanji in addition to the minimal kanji class. They are sorted in JIS codepoint order. Any kanji other than those defined by JIS X 0208 are ignored during collation.

Class 3: the 'kakucho' (extended) kanji class

It comprises all the CJK Unified Ideographs in addition to the minimal kanji class. They are sorted in Unicode codepoint order.

Replacement of PROLONGED SOUND MARKs and ITERATION MARKs

Character  SJIS    UCS     Name

  'ー'     0x815B  U+30FC  KATAKANA-HIRAGANA PROLONGED SOUND MARK
  'ｰ'      0xB0    U+FF70  HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
  'ゝ'     0x8154  U+309D  HIRAGANA ITERATION MARK
  'ゞ'     0x8155  U+309E  HIRAGANA VOICED ITERATION MARK
  'ヽ'     0x8152  U+30FD  KATAKANA ITERATION MARK
  'ヾ'     0x8153  U+30FE  KATAKANA VOICED ITERATION MARK

These characters, if replaced, are equal to the replacing kana at the secondary level, but not at the tertiary level.

KATAKANA-HIRAGANA PROLONGED SOUND MARKs

The PROLONGED MARKs (including the halfwidth equivalent) are replaced by a normal vowel or nasal katakana corresponding to the preceding kana, if one exists.

e.g.,

'カー'   to 'カア'
'びー'   to 'びイ'
'きゃー' to 'きゃア'
'ピュー' to 'ピュウ'
'ンー'   to 'ンン'
'んーー' to 'んンン'

HIRAGANA- and KATAKANA ITERATION MARKs

The ITERATION MARKs (VOICELESS) are replaced by a normal kana corresponding to the preceding kana, if one exists.

e.g.,

'かゝ'   to 'かか'
'ドゝ'   to 'ドと'
'んゝ'   to 'んん'
'カヽ'   to 'カカ'
'ばヽ'   to 'ばハ'
'プヽ'   to 'プフ'
'ヴィヽ' to 'ヴィイ'
'ピュゝ' to 'ピュゆ'

HIRAGANA- and KATAKANA VOICED ITERATION MARKs

The VOICED ITERATION MARKs are replaced by a voiced kana corresponding to the preceding kana, if one exists.

e.g.,

'はゞ' to 'はば'
'プゞ' to 'プぶ'
'プヾ' to 'プブ'
'こヾ' to 'こゴ'
'ウヾ' to 'ウヴ'

Cases of no replacement

Otherwise, no replacement occurs, in particular when these marks follow any character other than kana.

The unreplaced characters are greater than any kana at the primary level.

e.g.  CJK Ideograph followed by PROLONGED SOUND MARK
      Digit followed by ITERATION MARK
      'アヾ' ('ア' has no voiced variant)

Example

For example, the Japanese string 'パール' (Perl in kana) has three collation elements: KATAKANA PA, PROLONGED SOUND MARK replaced by KATAKANA A, and KATAKANA RU.

e.g.,

 'パール' is converted to 'パアル' by replacement;
   primary equal to 'はある';
   secondary equal to 'ぱある', greater than 'はある';
   tertiary equal to 'ぱーる', less than 'パアル';
   and quaternary greater than 'ぱーる'.
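
The replacement rules for prolonged sound marks and iteration marks can be sketched in plain Perl on Unicode strings. This is an illustration under stated assumptions, not the module's implementation: replace_marks and its two helpers are hypothetical names, the sketch relies on the core charnames and Unicode::Normalize modules, and its results may differ from the module for kana that Shift_JIS does not contain.

```perl
use strict;
use warnings;
use utf8;
use charnames ();
use Unicode::Normalize qw(NFC NFD);

my %VOWEL_KANA = (A => 'ア', I => 'イ', U => 'ウ', E => 'エ', O => 'オ', N => 'ン');
my $KANA  = qr/[\x{3041}-\x{3093}\x{30A1}-\x{30F4}]/;   # hiragana and katakana
my $SMALL = qr/[ぁぃぅぇぉっゃゅょゎァィゥェォッャュョヮ]/;

# Prolonged sound mark: the vowel (or nasal) katakana of the preceding kana,
# read off the last letter of its Unicode name (e.g. "... LETTER KA" gives A).
sub kana_for_prolonged_mark {
    my ($previous) = @_;
    my $name = charnames::viacode(ord $previous) // '';
    return $name =~ /([AIUEON])\z/ ? $VOWEL_KANA{$1} : undef;
}

# Iteration mark: the preceding kana made voiceless and normal-sized, moved
# to the script of the mark, then voiced again if the mark is a voiced one.
sub kana_for_iteration_mark {
    my ($previous, $mark) = @_;
    my $code = ord(substr(NFD($previous), 0, 1));     # drop (semi-)voicing
    my $mark_is_hiragana = $mark =~ /[ゝゞ]/;
    $code += 0x60 if $code <  0x30A0 && !$mark_is_hiragana;
    $code -= 0x60 if $code >= 0x30A0 &&  $mark_is_hiragana;
    $code++ if chr($code) =~ $SMALL;                  # small kana to normal kana
    my $kana = chr $code;
    return $kana unless $mark =~ /[ゞヾ]/;
    my $voiced = NFC($kana . "\x{3099}");             # add a voiced sound mark
    return length($voiced) == 1 ? $voiced : undef;    # undef: no voiced variant
}

# Replace marks from left to right; a mark is kept when it does not follow
# a kana or when no replacement exists.
sub replace_marks {
    my ($text) = @_;
    my $result = '';
    for my $char (split //, $text) {
        my $previous = length $result ? substr($result, -1) : '';
        if ($previous =~ /\A$KANA\z/) {
            my $kana = $char =~ /[ーｰ]/     ? kana_for_prolonged_mark($previous)
                     : $char =~ /[ゝゞヽヾ]/ ? kana_for_iteration_mark($previous, $char)
                     :                         undef;
            $char = $kana if defined $kana;
        }
        $result .= $char;
    }
    return $result;
}

binmode STDOUT, ':encoding(UTF-8)';
print join(' ', map { replace_marks($_) } 'パール', 'ドゝ', 'ウヾ', 'アヾ'), "\n";
```

The last input is left unchanged because the kana before the voiced iteration mark has no voiced variant, which matches the cases of no replacement listed above.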

Conformance to the Standard

  [the clause 6.2, JIS X 4061]

(1) charset: Shift_JIS.

(2) No limit on the number of characters in a string considered
    for collation.

(3) No character class is added.

(4) The following characters are added as collation elements.

    IDEOGRAPHIC SPACE in the space class.

    ACUTE ACCENT, GRAVE ACCENT, DIAERESIS, CIRCUMFLEX ACCENT
    in the class of descriptive symbols.

    APOSTROPHE, QUOTATION MARK in the class of parentheses.

    HYPHEN-MINUS in the class of mathematical symbols.

(5) Collation of Latin alphabets with macron and with circumflex
    is not supported.

(6) Kanji Classes:
     i) the minimal kanji class (Five kanji-like chars).
     ii) the basic kanji class (Levels 1 and 2 kanji of JIS).

AUTHOR

Tomoyuki SADAHIRO

Copyright (C) 2001-2002. All rights reserved.

<SADAHIRO@cpan.org>

http://homepage1.nifty.com/nomenclator/perl/

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

  • JIS X 4061 [Collation of Japanese character strings]

  • JIS X 0201 [7-bit and 8-bit coded character sets for information interchange]

  • JIS X 0208 [7-bit and 8-bit double byte coded KANJI sets for information interchange]

  • JIS X 0221 [Information technology - Universal Multiple-Octet Coded Character Set (UCS) - part 1 : Architecture and Basic Multilingual Plane]. This is a translation of ISO/IEC 10646-1.

  • ShiftJIS::String

  • ShiftJIS::Regexp
