NAME
Search::Tools::UTF8 - UTF8 string wrangling
SYNOPSIS
use Search::Tools::UTF8;
my $str = 'foo bar baz';
print "bad UTF-8 sequence: " . find_bad_utf8($str)
    unless is_valid_utf8($str);
print "bad ascii byte at position " . find_bad_ascii($str)
    unless is_ascii($str);
print "bad latin1 byte at position " . find_bad_latin1($str)
    unless is_latin1($str);
DESCRIPTION
Search::Tools::UTF8 supplies common UTF-8-related functions for validating, inspecting, and converting strings.
FUNCTIONS
is_valid_utf8( text )
Returns true if text is a valid sequence of UTF-8 bytes, regardless of how Perl has it flagged (is_utf8 or not).
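The check described above can be approximated with the core Encode module. This is only a sketch, not the module's implementation: looks_like_valid_utf8() is a hypothetical name, and treating an already-flagged string by re-encoding it to octets first is an assumption.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Search::Tools::UTF8): returns 1 if the
# octets of $text decode cleanly as strict UTF-8, 0 otherwise.
sub looks_like_valid_utf8 {
    my ($text) = @_;
    my $copy = $text;    # decode() may modify its input when CHECK is set

    # Assumption: a flagged string is judged by its UTF-8 octets.
    utf8::encode($copy) if utf8::is_utf8($copy);

    return eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK ); 1 } ? 1 : 0;
}

print looks_like_valid_utf8("\xd9\xa6"), "\n";    # well-formed two-byte sequence
print looks_like_valid_utf8("caf\xe9!"), "\n";    # lone \xe9 is malformed
```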
is_ascii( text )
Returns true (1) if text contains no bytes above 127; otherwise returns false (0). Used by convert() internally to check text prior to transliterating.
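The "no bytes above 127" rule can be expressed as a single regular expression. The following is a minimal sketch of that rule only; only_ascii() is a hypothetical name and not necessarily how the module implements is_ascii().

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if every byte is in the range 0..127, else 0.
sub only_ascii {
    my ($text) = @_;
    return $text =~ /[^\x00-\x7f]/ ? 0 : 1;
}

print only_ascii('foo bar baz'), "\n";    # prints 1
print only_ascii("caf\xe9"),     "\n";    # prints 0
```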
is_latin1( text )
Returns true if text lies within the Latin1 charset.
NOTE: Only Latin1 octets with a valid representable character are checked. Octets in the range \x80 - \x9f are not considered valid Latin1 and if found in text, is_latin1() will return false.
CAUTION: A string of bytes can be both valid Latin1 and valid UTF-8, even though the two interpretations represent different Unicode codepoints. Example:
my $str = "\x{d9}\x{a6}"; # same as \x{666}
is_valid_utf8($str); # returns true
is_latin1($str); # returns true
Thus is_latin1() and find_bad_latin1() are not foolproof. Use them in combination with is_flagged_utf8() to get a better test.
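To see why the same two bytes pass both tests, it may help to decode them both ways with the core Encode module. This sketch does not use Search::Tools::UTF8 at all; it only illustrates the ambiguity.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xd9\xa6";    # two octets, no UTF-8 flag

# Read as UTF-8, the octets form a single character, U+0666.
my $as_utf8 = Encode::decode( 'UTF-8', $bytes );

# Read as Latin1, the same octets form two characters, U+00D9 and U+00A6.
my $as_latin1 = Encode::decode( 'ISO-8859-1', $bytes );

printf "UTF-8:  %d char(s), first is U+%04X\n", length($as_utf8),   ord($as_utf8);
printf "Latin1: %d char(s), first is U+%04X\n", length($as_latin1), ord($as_latin1);
```

The bytes alone cannot say which reading was intended, which is why the flag check is suggested as a second signal.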
is_flagged_utf8( text )
Returns true if Perl thinks text is UTF-8. Same as Encode::is_utf8().
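The flag says how Perl stores the string internally, not whether the content is valid. A small sketch using the core utf8::is_utf8() and Encode::is_utf8() functions (rather than this module) shows the flag appearing after decoding:

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xd9\xa6";                            # raw bytes from a file or socket
my $chars  = Encode::decode( 'UTF-8', $octets );    # one character, U+0666

print utf8::is_utf8($octets)  ? "flagged" : "not flagged", "\n";    # prints "not flagged"
print Encode::is_utf8($chars) ? "flagged" : "not flagged", "\n";    # prints "flagged"
```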
is_perl_utf8_string( text )
Wrapper around the native Perl is_utf8_string() function. Called by is_valid_utf8().
is_sane_utf8( text [,warnings] )
Tests for doubly encoded text. Returns true if text looks ok. From the Test::utf8 docs:
Strings that are not utf8 always automatically pass.
Pass a second true param to get diagnostics on stderr.
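Double encoding typically arises when UTF-8 octets are mistaken for Latin1 and encoded a second time. The sketch below reproduces that accident with the core Encode module and applies a crude heuristic to spot it. looks_double_encoded() is a hypothetical helper; it is not the algorithm used by Test::utf8 or by is_sane_utf8(), and it can misfire on legitimate text.

```perl
use strict;
use warnings;
use Encode ();

my $char = "\x{e9}";                            # e-acute as one character
my $once = Encode::encode( 'UTF-8', $char );    # octets C3 A9

# Misreading those octets as Latin1 and encoding again yields C3 83 C2 A9.
my $twice = Encode::encode( 'UTF-8', Encode::decode( 'ISO-8859-1', $once ) );

# Hypothetical heuristic: decode once; if the result fits in Latin1 and its
# Latin1 bytes are themselves well-formed UTF-8, suspect double encoding.
sub looks_double_encoded {
    my ($octets) = @_;
    my $copy  = $octets;
    my $chars = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK ) };
    return 0 unless defined $chars;
    return 0 if $chars =~ /[^\x00-\xff]/;        # has chars outside Latin1
    return 0 unless $chars =~ /[\x80-\xff]/;     # pure ASCII, nothing to judge
    my $inner = Encode::encode( 'ISO-8859-1', $chars );
    return eval { Encode::decode( 'UTF-8', $inner, Encode::FB_CROAK ); 1 } ? 1 : 0;
}

printf "once  => %d\n", looks_double_encoded($once);     # prints "once  => 0"
printf "twice => %d\n", looks_double_encoded($twice);    # prints "twice => 1"
```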
find_bad_utf8( text )
Returns a string of the bad bytes found in text. This assumes that text is not valid UTF-8, so use it like:
croak "bad bytes: " . find_bad_utf8($str)
    unless is_valid_utf8($str);
If text is a valid UTF-8 string, returns undef.
find_bad_ascii( text )
Returns the position of the first non-ASCII byte, or -1 if text is all ASCII.
find_bad_latin1( text )
Returns the position of the first non-Latin1 byte, or -1 if text is valid Latin1.
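The "position or -1" convention used by find_bad_ascii() and find_bad_latin1() can be sketched with a regex match and the @- offset array. first_bad_pos() and the two patterns are hypothetical, and the 0-based offset is an assumption: the documentation does not say whether the module counts from 0 or 1.

```perl
use strict;
use warnings;

# Hypothetical helper: 0-based offset of the first match of $bad, or -1.
sub first_bad_pos {
    my ( $text, $bad ) = @_;
    return $text =~ $bad ? $-[0] : -1;
}

my $non_ascii  = qr/[^\x00-\x7f]/;            # anything above 127
my $non_latin1 = qr/[^\x00-\x7f\xa0-\xff]/;   # \x80-\x9f, or beyond \xff

print first_bad_pos( 'foo bar baz', $non_ascii ),  "\n";    # prints -1
print first_bad_pos( "caf\xe9",     $non_ascii ),  "\n";    # prints 3
print first_bad_pos( "caf\xe9",     $non_latin1 ), "\n";    # prints -1
print first_bad_pos( "ab\x85cd",    $non_latin1 ), "\n";    # prints 2
```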
find_bad_latin1_report( text )
Returns the position of the first non-Latin1 byte (like find_bad_latin1()) and also carps the decimal and hex values of the bad byte.
to_utf8( text, charset )
Shorthand for running text through the appropriate is_*() checks and then converting it to UTF-8 if necessary. Returns text encoded and flagged as UTF-8.
Returns undef if the encoding failed or the result did not pass is_sane_utf8().
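The general shape of such a conversion, with undef signalling failure, can be sketched with the core Encode module. sketch_to_utf8() is a hypothetical stand-in: it skips the is_*() checks and the is_sane_utf8() test that to_utf8() is described as performing, and it simply decodes from the named charset.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for the conversion step: decode $text from $charset
# into Perl characters, returning undef if the bytes do not fit the charset
# or the charset name is unknown.
sub sketch_to_utf8 {
    my ( $text, $charset ) = @_;
    return $text if utf8::is_utf8($text);    # already characters
    my $copy = $text;                        # decode() may modify its input
    return eval { Encode::decode( $charset, $copy, Encode::FB_CROAK ) };
}

my $from_latin1 = sketch_to_utf8( "caf\xe9",  'ISO-8859-1' );    # 4 characters
my $from_utf8   = sketch_to_utf8( "\xd9\xa6", 'UTF-8' );         # 1 character
my $failed      = sketch_to_utf8( "caf\xe9!", 'UTF-8' );         # undef

printf "%d %d %s\n", length($from_latin1), length($from_utf8),
    defined $failed ? 'ok' : 'undef';
```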
BUGS
AUTHOR
Peter Karman perl@peknet.com
Thanks to Atomic Learning www.atomiclearning.com
for sponsoring the development of this module.
Many of the UTF-8 tests come directly from Test::utf8.
COPYRIGHT
Copyright 2007 by Peter Karman. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
Search::Tools, Encode, Test::utf8