NAME

Encode::HanConvert - Traditional and Simplified Chinese mappings

SYNOPSIS

use Encode::HanConvert; # needs perl 5.7.2 or better

# Conversion between Chinese encodings
$euc_cn = big5_to_gb($big5); # Big5 to GB2312 (EUC-CN encoding)
$big5 = gb_to_big5($euc_cn); # GB2312 to Big5

# Conversion between Perl's Unicode strings
$simp = trad_to_simp($trad); # Traditional to Simplified
$trad = simp_to_trad($simp); # Simplified to Traditional

# Conversion between Chinese encoding and Unicode strings
$simp = big5_to_simp($big5); # Big5 to Simplified
$big5 = simp_to_big5($simp); # Simplified to Big5
$trad = gb_to_trad($euc_cn); # GB2312 to Traditional
$euc_cn = trad_to_gb($trad); # Traditional to GB2312

# For completeness' sake... (no conversion, just encode/decode)
$simp = gb_to_simp($euc_cn); # GB2312 to Simplified
$euc_cn = simp_to_gb($simp); # Simplified to GB2312
$trad = big5_to_trad($big5); # Big5 to Traditional
$big5 = trad_to_big5($trad); # Traditional to Big5

# All functions may be used in void context to transform $_[0]
big5_to_gb($string); # transform $string from big5 to gb

# Drop-in replacement functions for Lingua::ZH::HanConvert
use Encode::HanConvert qw(trad simple); # not exported by default

$simp = simple($trad); # Traditional to Simplified
$trad = trad($simp);   # Simplified to Traditional

DESCRIPTION

This module attempts to solve the most common problems that occur in Traditional vs. Simplified Chinese conversion in an efficient, flexible way, without resorting to external tools or modules.

After installing this module, you'll have two additional encoding formats: 'big5-simp' maps Big5 into Unicode's Simplified Chinese (and vice versa), and 'euc-cn-trad' maps EUC-CN (better known as GB2312) into Unicode's Traditional Chinese and back.
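Because these are ordinary Encode encodings, they can presumably also be used through Encode's decode and encode functions. The following is a minimal sketch, assuming that loading Encode::HanConvert is enough to make 'big5-simp' and 'euc-cn-trad' visible to Encode; the byte strings are the Big5 and EUC-CN spellings of one two-character word, chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Encode::HanConvert;    # assumed to register big5-simp and euc-cn-trad

my $big5   = "\xA4\xA4\xB0\xEA";    # a Traditional word as Big5 bytes
my $euc_cn = "\xD6\xD0\xB9\xFA";    # the same word, Simplified, as EUC-CN bytes

# Decode Big5 bytes straight into a Simplified Unicode string,
# and EUC-CN bytes straight into a Traditional Unicode string.
my $simp = decode('big5-simp',   $big5);
my $trad = decode('euc-cn-trad', $euc_cn);

# Show the resulting code points rather than raw characters.
printf "simp: %s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $simp;
printf "trad: %s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $trad;

# Going the other way: encode the Simplified string back into Big5 bytes.
my $back = encode('big5-simp', $simp);
print "round trip ok\n" if $back eq $big5;
```

The xxx_to_yyy functions cover the same ground with less typing; the explicit Encode calls are mainly useful when the encoding name is chosen at run time.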

The module exports various xxx_to_yyy functions by default, where xxx and yyy are each one of big5, gb (euc-cn), simp (Simplified Chinese Unicode), or trad (Traditional Chinese Unicode).

You may also import simple and trad, which are aliases for trad_to_simp and simp_to_trad respectively; they are provided as drop-in replacements for programs using Lingua::ZH::HanConvert.

Since this is built on Encode's architecture, you may also use the line discipline syntax to perform the conversion implicitly:

require Encode::CN;
open BIG5, '<:encoding(big5-simp)', 'big5.txt';     # read as simplified
open EUC,  '>:encoding(euc-cn)',    'euc-cn.txt';   # write as euc-cn
print EUC <BIG5>;

require Encode::TW;
open EUC,  '<:encoding(euc-cn-trad)', 'euc-cn.txt'; # read as traditional
open BIG5, '>:encoding(big5)',        'big5.txt';   # write as big-5
print BIG5 <EUC>;

Or, more interestingly:

use encoding 'big5-simp';
print "中文"; # source saved as Big5; prints simplified Chinese in Unicode

COMPARISON

Although the Lingua::ZH::HanConvert module already provides mappings between Simplified and Traditional Unicode characters, it depends on other modules (Text::Iconv or Encode) to provide the necessary mappings to and from the Big5 and GB2312 (euc-cn) encodings.

Also, Encode::HanConvert loads up much faster:

0.148u 0.046s 0:00.19 94.7% # Encode::HanConvert
7.096u 0.015s 0:07.23 98.2% # Lingua::ZH::HanConvert (v0.12)

The difference in actual conversion is even larger. Taking 32k of text converted trad=>simp as an example:

 0.082u 0.031s 0:00.12 91.6% # iconv | b2g | iconv
 0.324u 0.015s 0:00.35 94.2% # Encode::HanConvert
23.715u 0.054s 0:24.51 96.9% # Lingua::ZH::HanConvert (v0.12)

The b2g above refers to Yeung and Lee's HanZi Converter, an external utility that maps Big5 to GB2312 and back; iconv refers to GNU libiconv. If you don't mind the overhead of calling external processes, their results are nearly identical to this module's.

CAVEATS

This module does not preserve one-to-many mappings; it blindly chooses the most frequently used substitution instead of presenting the user with multiple choices. This can be remedied by a dictionary-based post-processor that restores the correct character.
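A dictionary-based post-processor of the kind mentioned above could look roughly like the following sketch. It assumes a simp-to-trad pass has already produced a Unicode string; the %fixups table, its two entries, and the restore_phrases helper are hypothetical illustrations and are not part of this module.

```perl
use strict;
use warnings;
use utf8;    # the phrase table below contains literal Chinese characters

# Hypothetical phrase table: the left side is what a one-to-one mapping
# might emit, the right side is the intended Traditional phrase.
my %fixups = (
    '頭發' => '頭髮',    # "hair": the second character should be 髮, not 發
    '皇後' => '皇后',    # "empress": the second character should be 后, not 後
);

# Try longer phrases first so that they win over any shorter overlaps.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %fixups;
my $fixup_re = qr/($alternation)/;

# Takes a Unicode (character) string and returns a corrected copy.
sub restore_phrases {
    my ($text) = @_;
    $text =~ s/$fixup_re/$fixups{$1}/g;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print restore_phrases('她的頭發很長'), "\n";
```

A real table would need many more entries and probably word segmentation, since a plain substring match can misfire across word boundaries.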

SEE ALSO

Encode, Lingua::ZH::HanConvert, Text::Iconv

The b2g.pl and g2b.pl utilities installed with this module.

ACKNOWLEDGEMENTS

The conversion table used in this module comes from various sources, including Lingua::ZH::HanConvert by David Chan, and hc by Ricky Yeung & Fung F. Lee.

The *.enc files are checked against test files generated by GNU libiconv with kind permission from Bruno Haible. The compile and encode.h are lifted from the Encode distribution, which is part of the standard perl distribution.

Kudos to Nick Ing-Simmons, Dan Kogai and Jarkko Hietaniemi for showing me how to use Encode and PerlIO. Thanks!

AUTHORS

Autrijus Tang <autrijus@autrijus.org>

COPYRIGHT

Copyright 2002 by Autrijus Tang <autrijus@autrijus.org>.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html
