NAME

Encode::Supports -- Encodings supported by Encode

DESCRIPTION

Encoding Names

Encoding names are case insensitive. White space in names is ignored. In addition, an encoding may have aliases. Each encoding has one "canonical" name. The "canonical" name is chosen from the names of the encoding by picking the first in the following sequence:

o The MIME name as defined in IETF RFCs.
o The name in the IANA registry.
o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case encodings have state, "Encode" uses the encoding object internally once an operation is in progress.
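
As a minimal sketch of how aliases and canonical names relate, the following looks up one encoding under two spellings. It assumes a stock Encode installation; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up the same encoding through a mixed-case alias and through
# its canonical name. find_encoding() returns an encoding object.
my $by_alias     = find_encoding("Latin1");
my $by_canonical = find_encoding("iso-8859-1");

# Both objects report the same canonical name.
print $by_alias->name,     "\n";
print $by_canonical->name, "\n";

# resolve_alias() returns the canonical name, or false for a name
# that no encoding answers to.
my $canon = Encode::resolve_alias("latin1");
print "latin1 resolves to $canon\n";
print "iso-8859-12 is not known\n"
    unless Encode::resolve_alias("iso-8859-12");
```

Holding on to the object returned by find_encoding() avoids repeating the alias lookup on every call.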

Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized. Note that unless otherwise specified, they are all case insensitive (via alias) and all occurrences of spaces are replaced with '-'. In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules, but in most cases you don't have to use Encode::XX to make them available. Encode.pm will automatically load those modules on demand.
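
The on-demand loading can be observed with a short sketch like the one below. It loads only Encode itself and then asks for a Japanese encoding; the counts printed depend on the installation, so none are shown here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Only Encode is loaded explicitly; Encode::JP is never mentioned.
my @loaded_before = Encode->encodings();

# Requesting euc-jp makes Encode load the implementing module itself.
my $char   = "\x{3042}";                 # HIRAGANA LETTER A
my $octets = encode("euc-jp", $char);

my @loaded_after = Encode->encodings();
my @all          = Encode->encodings(":all");

printf "loaded before: %d, loaded after: %d, available: %d\n",
    scalar @loaded_before, scalar @loaded_after, scalar @all;
printf "euc-jp octets: %s\n", unpack("H*", $octets);
```

Encode->encodings() lists what is already loaded, while Encode->encodings(":all") also lists encodings that can be loaded on request.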

Built-in Encodings

The following encodings are always available.

Canonical	Aliases                      Comments & References
----------------------------------------------------------------
iso-8859-1	latin1					     [ISO]
US-ascii	ascii					    [ECMA]
UCS-2		ucs2, iso-10646-1	             [IANA, et al]
UCS-2l
UTF-8		utf8					 [RFC2279]
----------------------------------------------------------------

Encode::Byte

The following encodings are single-byte encodings implemented as extensions of ASCII. In most cases they use \x80-\xff (the upper half) to map non-ASCII characters.

----------------------------------------------------------------
# ISO 8859 series
(iso-8859-1 is built in)
iso-8859-2	latin2					     [ISO]
iso-8859-3	latin3					     [ISO]
iso-8859-4	latin4					     [ISO]
iso-8859-5						     [ISO]
iso-8859-6						     [ISO]
iso-8859-7						     [ISO]
iso-8859-8						     [ISO]
iso-8859-9	latin5					     [ISO]
iso-8859-10	latin6					     [ISO]
iso-8859-11
(iso-8859-12 is nonexistent)
iso-8859-13	latin7					     [ISO]
iso-8859-14	latin8					     [ISO]
iso-8859-15	latin9					     [ISO]
iso-8859-16	latin10					     [ISO]

# Cyrillic
koi8-f					
koi8-r						 [RFC1489]
koi8-u						 [RFC2319]

# Vietnamese
viscii

# all cp* are also available as ibm-*, ms-*, and windows-*
# also see L<http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset4.asp>
cp1250	WinLatin2
cp1251	WinCyrillic
cp1252	WinLatin1
cp1253	WinGreek
cp1254	WinTurkish
cp1255	WinHebrew
cp1256	WinArabic
cp1257	WinBaltic
cp1258	WinVietnamese

# Macintosh
# Also see L<http://developer.apple.com/technotes/tn/tn1150.html>
MacCentralEurRoman
MacCroatian
MacRoman
MacCyrillic
MacRomanian
MacSami
MacGreek 
MacThai
MacIcelandic    
MacTurkish
MacUkrainian

# More vendor encodings
nextstep
gsm0338	# used in GSM handsets
hp-roman8
----------------------------------------------------------------
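
To illustrate how the upper half differs between single-byte encodings, this sketch encodes and decodes one character with several of the encodings listed above. The choice of the euro sign is just a convenient example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $euro = "\x{20AC}";   # EURO SIGN

# The same character lands on a different upper-half byte
# depending on the encoding.
for my $name (qw(iso-8859-15 cp1252)) {
    my $octets = encode($name, $euro);
    printf "%-12s 0x%s\n", $name, uc unpack("H*", $octets);
}

# Conversely, one byte decodes to different characters.
my $byte = "\xA4";
for my $name (qw(iso-8859-1 iso-8859-15)) {
    my $string = decode($name, $byte);
    printf "%-12s U+%04X\n", $name, ord($string);
}
```

Bytes below \x80 decode to the same ASCII characters in all of these encodings; only the upper half varies.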

The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs. Charset" below. These encodings are implemented in a distinct module per language because of size concerns. Please also refer to their respective documentation pages.

Encode::CN -- Continental China
----------------------------------------------------------------
cp936      gbk		    
euc-cn     gb2312
gb12345-raw
gb2312-raw
hz
iso-ir-165
----------------------------------------------------------------
Encode::JP -- Japan
----------------------------------------------------------------
7bit-jis	  jis
cp932		  ms_Kanji
euc-jp	  ujis
iso-2022-jp						 [RFC1468]
iso-2022-jp-1						 [RFC2237]
macJapan
shiftjis	  Shift_JIS, sjis
----------------------------------------------------------------
Encode::KR -- Korea
----------------------------------------------------------------
euc-kr
cp949		ks_c_5601-1987 x-windows-949 uhc
iso-2022-kr					         [RFC1557]
johab
ksc5601-raw
----------------------------------------------------------------
Encode::TW -- Taiwan
----------------------------------------------------------------
big5
big5-hkscs
cp950
----------------------------------------------------------------
Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, additional Chinese encodings below are distributed separately on CPAN, under the name Encode::HanExtra.

----------------------------------------------------------------
gb18030
euc-tw
big5plus
----------------------------------------------------------------
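
Because these encodings are optional, a program may want to check for them before use. The following is a sketch only: it assumes that find_encoding() returns undef when no installed module provides the name, and the fallback to euc-cn is a hypothetical policy chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding encode);

my $name = "gb18030";
my $char = "\x{4E2D}";   # CJK ideograph meaning "middle"

# find_encoding() returns undef when the name cannot be resolved.
my $enc = find_encoding($name);

if ($enc) {
    my $octets = $enc->encode($char);
    printf "%s: %s\n", $enc->name, unpack("H*", $octets);
}
else {
    # Hypothetical fallback: use an encoding shipped with Encode.
    warn "$name is unavailable; Encode::HanExtra may not be installed\n";
    printf "euc-cn: %s\n", unpack("H*", encode("euc-cn", $char));
}
```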

Miscellaneous encodings

Encode::EBCDIC

See perlebcdic for details.

----------------------------------------------------------------
cp1047
cp37
posix-bc
----------------------------------------------------------------
Encode::Symbol

For symbols and dingbats.

----------------------------------------------------------------
symbol
dingbats
macDingbats
----------------------------------------------------------------

Unsupported encodings

The following are not supported as yet: some because they are rarely used, some because of technical difficulty. They may, however, be supported by external modules via CPAN in the future.

ISO-2022-JP-2 [RFC1554]

Not very popular yet. Needs the Unicode Database or an equivalent to implement encode(), because it includes JIS X 0208/0212, KSC5601, and GB2312 simultaneously, and their code points overlap in Unicode. You therefore need to look up the database to determine which character set a given Unicode character should belong to.

ISO-2022-CN [RFC1922]

Not very popular. Needs CNS 11643-1 and 2, which are not available in this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra. Autrijus may add support for this encoding in his module in the future.

various HP-UX encodings

The following are unsupported due to the lack of mapping data.

'8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
'15' - japanese15, korean15, and roi15

Cyrillic encoding ISO-IR-111

Anton doubts its usefulness.

ISO-8859-8-I [Hebrew]

None of the Encode team knows Hebrew well enough. Contributions welcome.

Thai encoding TCVN

Ditto.

Vietnamese encodings VPS

Ditto.

various Mac encodings

The following are unsupported due to the lack of mapping data. The "Mac" prefix of each encoding name is omitted.

Arabic, Armenian, Bengali, Burmese
ChineseSimp, ChineseTrad, Devanagari, Ethiopic, ExtArabic
Farsi, Georgian, Gujarati, Gurmukhi, Hebrew
Kannada, Khmer, Korean, Laotian, Malayalam, Mongolian
Oriya, Sinhalese, Symbol, Tamil, Telugu, Tibetan, Vietnamese

The Mac encodings that are already available are based on the vendor mappings available at http://www.unicode.org/

Encoding vs. Charset

Character encoding (or just "encoding") and character set (or just "charset") are often used interchangeably, but they are different concepts.

Character Set (charset for short)

A collection of characters in which each character is distinguished by a unique ID (in most cases, the ID is a number).

Character Encoding

A way to represent one or more character sets in a stream of bits.

A character encoding may contain a single character set (e.g. US-ascii) or multiple character sets (e.g. EUC-JP, which contains US-ascii, JIS X 0201 Kana, JIS X 0208 and JIS X 0212).

A character encoding may also encode a character set as-is (also called a raw encoding, e.g. US-ascii) or processed. In EUC-JP, for example, US-ascii is encoded as-is, JIS X 0201 Kana is prefixed with \x8E, JIS X 0208 has 0x8080 added to each code, and JIS X 0212 has 0x8080 added and is then prefixed with \x8F.
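
The EUC-JP processing described here can be observed with a short sketch that encodes one character from each of three of the character sets. The sample characters are chosen for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample character from three of the character sets in EUC-JP.
my %sample = (
    "US-ascii"        => "A",          # LATIN CAPITAL LETTER A
    "JIS X 0201 Kana" => "\x{FF71}",   # HALFWIDTH KATAKANA LETTER A
    "JIS X 0208"      => "\x{3042}",   # HIRAGANA LETTER A
);

for my $charset (sort keys %sample) {
    my $octets = encode("euc-jp", $sample{$charset});
    printf "%-16s %s\n", $charset, unpack("H*", $octets);
}
```

The ASCII letter comes out unchanged as 41, the half-width katakana (0xB1 in JIS X 0201) comes out as 8eb1, and the hiragana (0x2422 in JIS X 0208) comes out as a4a2, which is 0x2422 plus 0x8080.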

As the name suggests, the Encode module supports encodings, not individual charsets.

However, the word charset is casually used, even by the Internet Assigned Numbers Authority (IANA), to actually mean encoding. Encode tries to accommodate this usage via aliases. For instance, gb2312 is aliased to euc-cn, while the "raw" encoded version is available as gb2312-raw.
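
A brief sketch of the gb2312 alias next to its raw counterpart follows. The sample character is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{4E2D}";   # CJK ideograph meaning "middle"

# The name gb2312 resolves to the EUC-coded form.
my $canon  = Encode::resolve_alias("gb2312");
my $cooked = encode("gb2312", $char);

# The raw character set codes are available under a separate name.
my $raw = encode("gb2312-raw", $char);

print  "gb2312 resolves to $canon\n";
printf "gb2312:     %s\n", unpack("H*", $cooked);
printf "gb2312-raw: %s\n", unpack("H*", $raw);
```

The two results differ by 0x8080: the raw code 5650 becomes d6d0 in the EUC-coded form.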

Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their applicability for information exchange over the Internet and to choose the most suitable aliases to name them in the context of such communication.

  • To (en|de)code encodings marked with (*), you need Encode::HanExtra, available from CPAN.

Encoding names

US-ASCII    UTF-8     ISO-8859-*  KOI8-R
Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
EUC-KR      Big5

are registered with IANA as preferred MIME names and may be used over the Internet.

Shift_JIS is no longer Microsoft proprietary since it was made official by JIS X 0208-1997.

EUC-CN

has not been registered with IANA (as of March 2002) but seems to be supported by major web browsers. In Encode, GB2312 is aliased to EUC-CN, with the "uncooked" version of GB2312 canonicalized as gb2312-raw. See Encode::CN for details.

KS_C_5601-1987

has been registered with IANA, but in practice data labeled with this name is EUC-coded. The Internet community in Korea is not happy with this, so KS_C_5601-1987 is aliased to cp949, an enhanced version of euc-kr, with ksc5601-raw available for the "uncooked" version.

UTF-16 
KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)

are IANA-registered (UTF-16 even as a preferred MIME name) but should probably be avoided as encodings for web pages due to the lack of browser support.

ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
GBK
VISCII
GB 12345
GB 18030 (*)  (see links below)
EUC-TW   (*)

are perfectly valid encodings but are not registered with IANA. The names under which they are listed here are probably the most widely known names for these encodings and are the recommended names.

BIG5PLUS (*)

is a somewhat proprietary name.

Bookmarks

Assigned Charset Names by IANA

http://www.iana.org/assignments/character-sets

Most of the canonical names in Encode derive from this list, so you can directly apply the string you have extracted from the MIME headers of mail and web pages.
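
As a rough sketch of applying a MIME charset string directly, the following pulls the name out of a header line and hands it to Encode. The header, the body octets and the regular expression are all made up for illustration; real mail should be parsed with a proper MIME parser.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical header line and body octets standing in for real mail.
my $header = 'Content-Type: text/plain; charset="KOI8-R"';
my $body   = "\xF0\xD2\xC9\xD7\xC5\xD4";

# Simplified extraction of the charset parameter.
my ($charset) = $header =~ /charset="?([^";\s]+)"?/i
    or die "no charset parameter found\n";

# The extracted string is used as the encoding name without rewriting.
my $enc = find_encoding($charset)
    or die "unknown charset: $charset\n";
my $text = $enc->decode($body);

binmode STDOUT, ":encoding(UTF-8)";
printf "%s (%s): %s\n", $charset, $enc->name, $text;
```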

CJK.inf

http://www.oreilly.com/people/authors/lunde/cjk_inf.html

Somewhat obsolete (last update in 1996), but still useful. Also try

ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf

There you will find brief information on EUC-CN and GBK, and more on GB 18030.

ECMA-035 (eq ISO-2022)

http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM

The very specification of ISO-2022 is available from the link above.

See Also

Encode, Encode::Byte, Encode::CN, Encode::JP, Encode::KR, Encode::TW, Encode::EBCDIC, Encode::Symbol