NAME

Encode::Supports -- Encodings supported by Encode

DESCRIPTION

Encoding Names

Encoding names are case insensitive. White space in names is ignored. In addition, an encoding may have aliases. Each encoding has one "canonical" name. The "canonical" name is chosen from the names of the encoding by picking the first in the following sequence:

o The MIME name as defined in IETF RFCs.
o The name in the IANA registry.
o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case encodings have state, "Encode" uses the encoding object internally once an operation is in progress.
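
The name resolution described above can be observed with find_encoding(), which returns the encoding object that "Encode" works with internally. This is a minimal sketch; the spellings in the list are only illustrative, and the canonical names reported may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Several spellings of one encoding; each should resolve to the same
# canonical name.
my @spellings = ('latin1', 'Latin1', 'ISO-8859-1', 'iso-8859-1');

for my $name (@spellings) {
    my $enc = find_encoding($name)
        or die "Unknown encoding: $name";
    printf "%-12s => %s\n", $name, $enc->name;
}
```

find_encoding() returns undef for a name it cannot resolve, so checking the return value is the simplest way to validate a user-supplied encoding name.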

Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized. Note that unless otherwise specified, they are all case insensitive (via alias) and all occurrences of spaces are replaced with '-'. In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules, but in most cases you don't have to use Encode::XX to make them available. Encode.pm will automatically load those modules as needed.
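
As a sketch of the on-demand loading described above, the following script uses an Encode::Byte encoding without ever loading Encode::Byte itself. The sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);    # note: no "use Encode::Byte"

my $text  = "\x{0141}\x{00F3}d\x{017A}";    # four characters, three non-ASCII
my $bytes = encode('iso-8859-2', $text);    # implementing module loaded on demand
my $back  = decode('iso-8859-2', $bytes);

printf "%d characters -> %d bytes\n", length($text), length($bytes);
print  "round trip ok\n" if $back eq $text;
```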

Built-in Encodings

The following encodings are always available.

Canonical	Aliases
-----------------------
iso-8859-1	latin1
US-ascii	ascii
UCS-2		ucs2, iso-10646-1
UCS-2le
UTF-8		utf8
-----------------------
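
To check which encodings are already loaded in a given installation, the encodings() class method can be called. This is a rough sketch; the exact names it reports depend on the installed Encode version and may not match the table above letter for letter.

```perl
use strict;
use warnings;
use Encode;

# With no argument, encodings() reports only what is already loaded;
# ":all" also includes encodings from modules loaded on demand.
my @loaded = Encode->encodings();
my @all    = Encode->encodings(':all');

print  "Loaded now: @loaded\n";
printf "Available in total: %d\n", scalar @all;
```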

Encode::Byte

The following encodings are single-byte encodings implemented as extensions of ASCII. In most cases they use \x80-\xff (the upper half) to map non-ASCII characters.

-----------------------
(iso-8859-1	is in built-in)
iso-8859-2	latin2
iso-8859-3	latin3
iso-8859-4	latin4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9	latin5
iso-8859-10	latin6
iso-8859-11
(iso-8859-12 is nonexistent)
iso-8859-13   latin7
iso-8859-14	latin8
iso-8859-15	latin9
iso-8859-16	latin10

koi8-f
koi8-r
koi8-u

viscii	# ASCII + Vietnamese

cp1250	WinLatin2
cp1251	WinCyrillic
cp1252	WinLatin1
cp1253	WinGreek
cp1254	WinTurkish
cp1255	WinHebrew
cp1256	WinArabic
cp1257	WinBaltic
cp1258	WinVietnamese
# all cp* are also available as ibm-* and ms-*

maccentraleuropean  
maccroatian
macroman
maccyrillic
macromanian
macsami
macgreek 
macthai
macicelandic    
macturkish
macukraine

nextstep
gsm0338	# used in GSM handsets
roman8	# HP Roman-8
-----------------------
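
Because each of these encodings assigns the upper half differently, the same character can map to different bytes. The sketch below encodes the euro sign in two of the listed encodings; the choice of character is only an example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro = "\x{20AC}";    # EURO SIGN

for my $name ('cp1252', 'iso-8859-15') {
    my $bytes = encode($name, $euro);
    printf "%-12s 0x%s\n", $name, uc(unpack('H*', $bytes));
}
# prints:
# cp1252       0x80
# iso-8859-15  0xA4
```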

The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs. Charset" below. These encodings are implemented in a distinct module for each language because of size concerns. Please also refer to their respective documentation pages.

Encode::CN -- Continental China
-----------------------
cp936      gbk		    
euc-cn
gb12345
gb2312
hz
iso-ir-165
-----------------------

Encode::JP -- Japan
-----------------------
7bit-jis	  jis
cp932
euc-jp	  ujis
iso-2022-jp
iso-2022-jp-1
macjapan
shiftjis	  Shift_JIS, sjis
-----------------------

Encode::KR -- Korea
-----------------------
euc-kr
ksc5601
cp949
-----------------------

Encode::TW -- Taiwan
-----------------------
big5
big5-hkscs
cp950
-----------------------

Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, additional Chinese encodings below are distributed separately on CPAN, under the name Encode::HanExtra.

-----------------------
gb18030
euc-tw
big5plus
-----------------------
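
The multibyte encodings in the core CJK modules are used through the same interface as the single-byte ones. As an illustration, this sketch encodes one hiragana character in two of the Encode::JP encodings and prints the resulting bytes; the sample character is an arbitrary choice.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A

for my $name ('euc-jp', 'shiftjis') {
    my $bytes = encode($name, $hiragana_a);
    printf "%-10s %d bytes: %s\n", $name, length($bytes),
        join ' ', map { sprintf '%02X', ord } split //, $bytes;
}
# prints:
# euc-jp     2 bytes: A4 A2
# shiftjis   2 bytes: 82 A0
```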

Miscellaneous encodings

Encode::EBCDIC

See perlebcdic for details.

-----------------------
cp1047
cp37
posix-bc
-----------------------

Encode::Symbol

For symbols and dingbats.

-----------------------
symbol
dingbats
macdingbats
-----------------------

Encoding vs. Charset

Character encoding (or just "encoding") and Character Set (or just "charset") are often used interchangeably but they are different concepts.

A charset determines which characters may be included in a given text.

An encoding maps the characters of one or more charsets to a stream of bits.

Note that a given encoding may contain multiple charsets; complex CJK encodings are usually implemented that way.

For instance, euc-jp contains ASCII, JIS X 0201-1978 (Hankaku Kana), JIS X 0208-1997 (Zenkaku Kana and Kanji) and JIS X 0212-1990 (Extended Kanji) in a single encoding.

As the name suggests, the Encode module supports encodings, not individual charsets.
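
The euc-jp case can be made concrete with a small sketch that encodes one character from each of three of the charsets named above and shows how the byte sequences differ. The sample characters are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample character per charset carried by euc-jp.
my @samples = (
    [ 'ASCII',      'A'        ],
    [ 'JIS X 0201', "\x{FF71}" ],    # HALFWIDTH KATAKANA LETTER A
    [ 'JIS X 0208', "\x{6F22}" ],    # a kanji
);

for my $sample (@samples) {
    my ($charset, $char) = @$sample;
    my $bytes = encode('euc-jp', $char);
    printf "%-11s %s\n", $charset,
        join ' ', map { sprintf '%02X', ord } split //, $bytes;
}
# prints:
# ASCII       41
# JIS X 0201  8E B1
# JIS X 0208  B4 C1
```

The single encoding decides from the lead byte which charset the following bytes belong to, which is why Encode exposes euc-jp as a whole rather than its component charsets.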

Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their applicability for information exchange over the Internet and to choose the most suitable aliases to name them in the context of such communication.

Encoding names

US-ASCII    UTF-8       
ISO-8859-*  KOI8-R
Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
EUC-KR 
Big5

are IANA-registered (http://www.iana.org/assignments/character-sets) as preferred MIME names and may probably be used over the Internet.

Shift_JIS is no longer Microsoft proprietary since it has been standardized by JIS X 0208-1997. It is probably the most widespread encoding for Japanese on the Internet.
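
For code that has to emit a charset name in a mail or HTTP header, the preferred MIME name can probably be obtained from the encoding object. The sketch below assumes an Encode version that provides the mime_name method, which was added after this document was written; it checks for the method and falls back gracefully if it is absent.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

for my $name ('shiftjis', 'euc-jp', 'iso-8859-1', 'koi8-r') {
    my $enc = find_encoding($name) or next;

    # mime_name is an assumption about the installed Encode version.
    my $mime = $enc->can('mime_name') ? $enc->mime_name : undef;
    printf "%-12s => %s\n", $enc->name, defined $mime ? $mime : '(none)';
}
```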

EUC-CN

has not been registered with IANA (as of March 2002) but seems to be supported by major web browsers. (IANA has registered this encoding as GB2312, but gb2312 currently has a different meaning to the Encode module. It will probably become an alias to EUC-CN in the future; until then it is safer to avoid using gb2312 as an encoding name within Perl.)
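
Following that advice, a script can name EUC-CN explicitly rather than relying on what gb2312 means in a particular Encode version. A minimal sketch, with an arbitrary sample character:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\x{4E2D}";                  # a common Chinese ideograph
my $bytes = encode('euc-cn', $char);     # name EUC-CN explicitly, not gb2312

printf "euc-cn: %s\n", uc(unpack('H*', $bytes));
print  "round trip ok\n" if decode('euc-cn', $bytes) eq $char;
```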

UTF-16 
KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)

are IANA-registered (UTF-16 even as a preferred MIME name) but probably should be avoided as encodings for web pages due to lack of browser support.

ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
GBK
VISCII
GB 12345
GB 18030 (*)  (see links below)
EUC-TW   (*)

are totally valid encodings but not registered at IANA. The names under which they are listed here are probably the most widely-known names for these encodings and are recommended names.

BIG5PLUS (*)

is a somewhat proprietary name. (*)-marked encodings belong to Encode::HanExtra, available from CPAN.

You may probably get some information on CJK encodings at:

o A brief description of most of the CJK encodings mentioned here: http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html
o Several years old, but still useful: http://www.oreilly.com/people/authors/lunde/cjk_inf.html
o Some in-depth reading for the heroes :-) http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM (equivalent to ISO-2022)
o Brief information on EUC-CN and GBK, and mostly on GB 18030: ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf

The information in this section is by nature fragile and error-prone, which is why "probably" is its most popular adverb :) Please feel free to send your comments, disagreements and additions to .... (Note, however, that the mission of this document is to cover the Encode-supported encodings only.)

See Also

Encode, Encode::Byte, Encode::CN, Encode::JP, Encode::KR, Encode::TW, Encode::EBCDIC, Encode::Symbol