NAME
uniquote - escape special characters using various quoting conventions
SYNOPSIS
uniquote [options] [ textfile ... ]
Standard options:
--version print version information and exit
--help this message
--man full manpage
--debug add some debugging output
Character mode options:
Without a specified encoding, utf8 is assumed
unless file has encoding extension.
--verbose -v show full character names like \N{EN DASH}
--hex -x use singleton \x{...} escapes instead of \N{U+XXX}
--encoding -E specify encoding for all input files
--html -H show HTML entities (add --verbose for names)
--xml -X show XML entities
Binary mode options:
--bytes -b binary file in hex
--octal -0 binary file in octal
Other options:
--endings -e place $ at EOL so trailing spaces visible
--backslash -t use backslash escapes for unprintable ASCII
--fix-newlines -l consider any Unicode linebreak sequence as EOL
--unbuffer -u flush each output line
DESCRIPTION
The uniquote program is meant as a Unicode-aware replacement for programs like od(1) and cat -v. It converts ASCII control codes and all non-ASCII code points into a quoted form such as one might use in a Perl literal.
Use --endings or -e to act like cat -e and add a dollar sign at the end of each line so trailing spaces become apparent.
Use --backslash or -t to show tabs and other ASCII control codes as backslash escapes.
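As a rough illustration of what these two options do to a single line, here is a minimal Perl sketch. The helper name show_line and the escape table are assumptions made for this example; the set of escapes the program actually emits may differ.

```perl
use strict;
use warnings;

# Assumed escape table, for illustration only.
my %ESCAPE = (
    "\0" => '\0', "\a" => '\a', "\b" => '\b', "\t" => '\t',
    "\f" => '\f', "\r" => '\r', "\e" => '\e',
);

# Hypothetical helper: takes one line with its newline already removed.
sub show_line {
    my ($line, %opt) = @_;
    # Replace selected ASCII control codes with their backslash escapes.
    $line =~ s/([\0\a\b\t\f\r\e])/$ESCAPE{$1}/g if $opt{backslash};
    # Mark the end of the line so trailing spaces become visible.
    $line .= '$' if $opt{endings};
    return $line;
}

print show_line("ascii:\tnayeeve fassodd  ", backslash => 1, endings => 1), "\n";
# ascii:\tnayeeve fassodd  $
```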
By default, uniquote converts each control or non-ASCII code point into the form \N{U+hex}, making code point 962 appear as \N{U+3C2}. The --hex option instead shows eligible code points in backslash-x notation, so code point 962 would be displayed as \x{3C2}.
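The two numeric forms can be sketched in a few lines of Perl. This is not the program's own code: quote_chars is a hypothetical helper, and the two-digit minimum width is inferred from sample output such as \N{U+09}.

```perl
use strict;
use warnings;

# Hypothetical helper: quote every character outside printable ASCII.
# $style is "named" for \N{U+XXX} or "hex" for \x{XXX}.
sub quote_chars {
    my ($text, $style) = @_;
    my $fmt = ($style // 'named') eq 'hex' ? '\\x{%02X}' : '\\N{U+%02X}';
    $text =~ s/([^\x20-\x7E])/sprintf($fmt, ord $1)/ge;
    return $text;
}

print quote_chars("na\x{EF}ve \x{3C2}"), "\n";         # na\N{U+EF}ve \N{U+3C2}
print quote_chars("na\x{EF}ve \x{3C2}", 'hex'), "\n";  # na\x{EF}ve \x{3C2}
```

Because the substitution works on code points rather than bytes, the text must already be decoded before it is passed in.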
The --verbose option instead displays eligible code points by name. Code point 962 would then be shown as \N{GREEK SMALL LETTER FINAL SIGMA}.
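Name lookup of this kind is available in core Perl through charnames::viacode. The sketch below assumes a fallback to the numeric form when a code point has no name; the helper names are hypothetical and the program may handle unnamed code points differently.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: the name form if Perl knows a name, else the number.
sub name_or_number {
    my ($cp) = @_;
    my $name = charnames::viacode($cp);
    return defined $name ? "\\N{$name}" : sprintf('\\N{U+%02X}', $cp);
}

# Quote every character outside printable ASCII by name.
sub quote_named {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7E])/name_or_number(ord $1)/ge;
    return $text;
}

print quote_named("\x{3C2}"), "\n";   # \N{GREEK SMALL LETTER FINAL SIGMA}
```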
The --xml and --html options show code points using numeric entities. Adding --verbose to --html will use named HTML entities where available.
Character Modes vs Binary Mode
To treat the file as a sequence of bytes, use --binary. This displays all bytes escaped in the form \xXX. The other way to specify binary input uses the --octal option.
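A minimal sketch of byte-level quoting follows. The helper quote_bytes is hypothetical, and the three-digit \ooo octal form is an assumption, since this page does not show octal output. A file would be read with the :raw layer so that no decoding takes place.

```perl
use strict;
use warnings;

# Hypothetical helper: quote every byte outside printable ASCII.
# Hex uses \xXX; the octal form shown here is assumed, not documented.
sub quote_bytes {
    my ($bytes, $octal) = @_;
    my $fmt = $octal ? '\\%03o' : '\\x%02X';
    $bytes =~ s/([^\x20-\x7E])/sprintf($fmt, ord $1)/ge;
    return $bytes;
}

# The UTF-8 encoding of "naïve" is the byte string n a C3 AF v e.
print quote_bytes("na\xC3\xAFve"), "\n";      # na\xC3\xAFve
print quote_bytes("na\xC3\xAFve", 1), "\n";   # na\303\257ve
```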
If you have not specified binary mode, then you are in character mode. The default encoding in character mode is not ASCII but UTF-8. If you have not specified an optional encoding with --encoding, but the filename ends with the name of an encoding that Perl recognizes, that encoding will be assumed.
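The selection rule described above might be sketched as follows, using Encode::find_encoding to test whether a filename suffix names a known encoding. The helper encoding_for and its precedence order are assumptions based on this paragraph, not the program's actual logic.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: explicit encoding first, then filename suffix,
# then the documented default of utf8.
sub encoding_for {
    my ($filename, $explicit) = @_;
    return $explicit if defined $explicit;
    if ($filename =~ /\.([^.\/]+)\z/) {
        my $enc = Encode::find_encoding($1);
        return $enc->name if $enc;     # canonical name as Encode reports it
    }
    return 'utf8';
}

for my $file ('/tmp/nf.latin1', '/tmp/nf.utf8', 'notes.txt') {
    printf "%s => %s\n", $file, encoding_for($file);
}
```

Encode reports canonical names, so a suffix such as latin1 comes back as iso-8859-1.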
Note that no matter the actual input character encoding, code points reflect the Unicode number of that code point. You can use this property to normalize input, or to check that you actually know a file's encoding. For example, you can test the same file with various 8-bit encodings like Latin1, MacRoman, and CP1252.
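To see why trying several 8-bit encodings is informative, the sketch below decodes the single byte 0xEF under three of them with Encode::decode and prints the resulting code points. The helper code_points is a name chosen for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode bytes and list the resulting code points.
sub code_points {
    my ($encoding, $bytes) = @_;
    my $text = decode($encoding, $bytes);
    return map { sprintf('U+%04X', ord $_) } split //, $text;
}

for my $enc (qw(latin1 cp1252 macroman)) {
    printf "%-8s %s\n", $enc, join(' ', code_points($enc, "\xEF"));
}
# latin1   U+00EF
# cp1252   U+00EF
# macroman U+00D4
```

The same byte yields LATIN SMALL LETTER I WITH DIAERESIS under Latin1 and CP1252 but LATIN CAPITAL LETTER O WITH CIRCUMFLEX under MacRoman, which is the kind of mismatch a wrong guess about a file's encoding produces.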
The default input encoding is actually utf8; that is, Perl's permissive version of UTF-8. If you want strict UTF-8, override it.
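Perl's Encode module distinguishes the two spellings by name, as this short sketch shows. The helper decodes_strictly is hypothetical; it only checks whether a byte string is acceptable to the strict decoder.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding FB_CROAK);

print find_encoding('utf8')->name,  "\n";   # utf8
print find_encoding('UTF-8')->name, "\n";   # utf-8-strict

# Hypothetical helper: true if $bytes decodes under strict UTF-8.
sub decodes_strictly {
    my ($bytes) = @_;      # copy, since decode may modify its source
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK) };
    return defined $text ? 1 : 0;
}

print decodes_strictly("na\xC3\xAFve"), "\n";   # 1
print decodes_strictly("na\xEFve"), "\n";       # 0
```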
EXAMPLES
$ perl -E 'say "ascii:\tnayeeve fassodd"' > /tmp/nf.ascii
$ perl -E 'binmode(STDOUT, "encoding(macroman)")||die; say "macroman:\tna\xEFve fa\xE7ade"' > /tmp/nf.macroman
$ perl -E 'binmode(STDOUT, "encoding(utf8)")||die; say "utf8:\tna\xEFve fa\xE7ade"' > /tmp/nf.utf8
$ perl -E 'binmode(STDOUT, "encoding(utf16)")||die; say "utf16:\tna\xEFve fa\xE7ade"' > /tmp/nf.utf16
$ perl -E 'binmode(STDOUT, "encoding(utf32)")||die; say "utf32:\tna\xEFve fa\xE7ade"' > /tmp/nf.utf32
$ perl -E 'binmode(STDOUT, "encoding(latin1)")||die; say "latin1:\tna\xEFve fa\xE7ade"' > /tmp/nf.latin1
$ perl -E 'binmode(STDOUT, "encoding(cp1252)")||die; say "cp1252:\tna\xEFve fa\xE7ade"' > /tmp/nf.cp1252
$ wc -c /tmp/nf*
23 /tmp/nf.ascii
21 /tmp/nf.cp1252
21 /tmp/nf.latin1
23 /tmp/nf.macroman
42 /tmp/nf.utf16
84 /tmp/nf.utf32
21 /tmp/nf.utf8
235 total
$ uniquote /tmp/nf.*
ascii:\N{U+09}nayeeve fassodd
cp1252:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade
latin1:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade
macroman:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade
utf16:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade
utf32:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade
utf8:\N{U+09}na\N{U+EF}ve fa\N{U+E7}ade
$ uniquote --backslash --endings /tmp/nf.*
ascii:\tnayeeve fassodd$
cp1252:\tna\N{U+EF}ve fa\N{U+E7}ade$
latin1:\tna\N{U+EF}ve fa\N{U+E7}ade$
macroman:\tna\N{U+EF}ve fa\N{U+E7}ade$
utf16:\tna\N{U+EF}ve fa\N{U+E7}ade$
utf32:\tna\N{U+EF}ve fa\N{U+E7}ade$
utf8:\tna\N{U+EF}ve fa\N{U+E7}ade$
$ uniquote --verbose /tmp/nf.*
ascii:\N{CHARACTER TABULATION}nayeeve fassodd
cp1252:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
latin1:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
macroman:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
utf16:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
utf32:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
utf8:\N{CHARACTER TABULATION}na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
$ uniquote --binary /tmp/nf.*
ascii:\x09nayeeve fassodd
cp1252:\x09na\xEFve fa\xE7ade
latin1:\x09na\xEFve fa\xE7ade
macroman:\x09na\x95ve fa\x8Dade
\xFE\xFF\x00u\x00t\x00f\x001\x006\x00:\x00\x09\x00n\x00a\x00\xEF\x00v\x00e\x00 \x00f\x00a\x00\xE7\x00a\x00d\x00e\x00
\x00\x00\xFE\xFF\x00\x00\x00u\x00\x00\x00t\x00\x00\x00f\x00\x00\x003\x00\x00\x002\x00\x00\x00:\x00\x00\x00\x09\x00\x00\x00n\x00\x00\x00a\x00\x00\x00\xEF\x00\x00\x00v\x00\x00\x00e\x00\x00\x00 \x00\x00\x00f\x00\x00\x00a\x00\x00\x00\xE7\x00\x00\x00a\x00\x00\x00d\x00\x00\x00e\x00\x00\x00
utf8:\x09na\xC3\xAFve fa\xC3\xA7ade
$ uniquote --xml /tmp/nf.*
ascii:&#x9;nayeeve fassodd
cp1252:&#x9;na&#xEF;ve fa&#xE7;ade
latin1:&#x9;na&#xEF;ve fa&#xE7;ade
macroman:&#x9;na&#xEF;ve fa&#xE7;ade
utf16:&#x9;na&#xEF;ve fa&#xE7;ade
utf32:&#x9;na&#xEF;ve fa&#xE7;ade
utf8:&#x9;na&#xEF;ve fa&#xE7;ade
$ uniquote --html /tmp/nf.*
ascii:&#9;nayeeve fassodd
cp1252:&#9;na&#239;ve fa&#231;ade
latin1:&#9;na&#239;ve fa&#231;ade
macroman:&#9;na&#239;ve fa&#231;ade
utf16:&#9;na&#239;ve fa&#231;ade
utf32:&#9;na&#239;ve fa&#231;ade
utf8:&#9;na&#239;ve fa&#231;ade
$ uniquote --html --verbose /tmp/nf.*
ascii:&#9;nayeeve fassodd
cp1252:&#9;na&iuml;ve fa&ccedil;ade
latin1:&#9;na&iuml;ve fa&ccedil;ade
macroman:&#9;na&iuml;ve fa&ccedil;ade
utf16:&#9;na&iuml;ve fa&ccedil;ade
utf32:&#9;na&iuml;ve fa&ccedil;ade
utf8:&#9;na&iuml;ve fa&ccedil;ade
$ uniquote --backslash --encoding latin1 --verbose /tmp/nf.*
ascii:\tnayeeve fassodd
cp1252:\tna\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
latin1:\tna\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
macroman:\tna\N{MESSAGE WAITING}ve fa\N{REVERSE LINE FEED}ade
\N{LATIN SMALL LETTER THORN}\N{LATIN SMALL LETTER Y WITH DIAERESIS}\0u\0t\0f\01\06\0:\0\t\0n\0a\0\N{LATIN SMALL LETTER I WITH DIAERESIS}\0v\0e\0 \0f\0a\0\N{LATIN SMALL LETTER C WITH CEDILLA}\0a\0d\0e\0
\0\0\N{LATIN SMALL LETTER THORN}\N{LATIN SMALL LETTER Y WITH DIAERESIS}\0\0\0u\0\0\0t\0\0\0f\0\0\03\0\0\02\0\0\0:\0\0\0\t\0\0\0n\0\0\0a\0\0\0\N{LATIN SMALL LETTER I WITH DIAERESIS}\0\0\0v\0\0\0e\0\0\0 \0\0\0f\0\0\0a\0\0\0\N{LATIN SMALL LETTER C WITH CEDILLA}\0\0\0a\0\0\0d\0\0\0e\0\0\0
utf8:\tna\N{LATIN CAPITAL LETTER A WITH TILDE}\N{MACRON}ve fa\N{LATIN CAPITAL LETTER A WITH TILDE}\N{SECTION SIGN}ade
$ uniquote --backslash --encoding cp1252 --verbose /tmp/nf.*
ascii:\tnayeeve fassodd
uniquote: cp1252 "\x8D" does not map to Unicode at /tmp/nf.macroman line 0
cp1252:\tna\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
latin1:\tna\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
\N{LATIN SMALL LETTER THORN}\N{LATIN SMALL LETTER Y WITH DIAERESIS}\0u\0t\0f\01\06\0:\0\t\0n\0a\0\N{LATIN SMALL LETTER I WITH DIAERESIS}\0v\0e\0 \0f\0a\0\N{LATIN SMALL LETTER C WITH CEDILLA}\0a\0d\0e\0
\0\0\N{LATIN SMALL LETTER THORN}\N{LATIN SMALL LETTER Y WITH DIAERESIS}\0\0\0u\0\0\0t\0\0\0f\0\0\03\0\0\02\0\0\0:\0\0\0\t\0\0\0n\0\0\0a\0\0\0\N{LATIN SMALL LETTER I WITH DIAERESIS}\0\0\0v\0\0\0e\0\0\0 \0\0\0f\0\0\0a\0\0\0\N{LATIN SMALL LETTER C WITH CEDILLA}\0\0\0a\0\0\0d\0\0\0e\0\0\0
utf8:\tna\N{LATIN CAPITAL LETTER A WITH TILDE}\N{MACRON}ve fa\N{LATIN CAPITAL LETTER A WITH TILDE}\N{SECTION SIGN}ade
$ uniquote --backslash --encoding macroman --verbose /tmp/nf.*
ascii:\tnayeeve fassodd
cp1252:\tna\N{LATIN CAPITAL LETTER O WITH CIRCUMFLEX}ve fa\N{LATIN CAPITAL LETTER A WITH ACUTE}ade
latin1:\tna\N{LATIN CAPITAL LETTER O WITH CIRCUMFLEX}ve fa\N{LATIN CAPITAL LETTER A WITH ACUTE}ade
macroman:\tna\N{LATIN SMALL LETTER I WITH DIAERESIS}ve fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade
\N{OGONEK}\N{CARON}\0u\0t\0f\01\06\0:\0\t\0n\0a\0\N{LATIN CAPITAL LETTER O WITH CIRCUMFLEX}\0v\0e\0 \0f\0a\0\N{LATIN CAPITAL LETTER A WITH ACUTE}\0a\0d\0e\0
\0\0\N{OGONEK}\N{CARON}\0\0\0u\0\0\0t\0\0\0f\0\0\03\0\0\02\0\0\0:\0\0\0\t\0\0\0n\0\0\0a\0\0\0\N{LATIN CAPITAL LETTER O WITH CIRCUMFLEX}\0\0\0v\0\0\0e\0\0\0 \0\0\0f\0\0\0a\0\0\0\N{LATIN CAPITAL LETTER A WITH ACUTE}\0\0\0a\0\0\0d\0\0\0e\0\0\0
utf8:\tna\N{SQUARE ROOT}\N{LATIN CAPITAL LETTER O WITH STROKE}ve fa\N{SQUARE ROOT}\N{LATIN SMALL LETTER SHARP S}ade
ERRORS
Exits 0 if all is well, 1 otherwise.
Errors include inaccessible files, bogus encodings, and contents that do not match a specified encoding.
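The last two error classes can be reproduced with Encode directly. The sketch below is an assumption about how such failures could be mapped to an exit status, not the program's own code; try_decode and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns (1, text) on success or (0, message) on failure.
sub try_decode {
    my ($encoding, $bytes) = @_;   # copy, since decode may modify its source
    my $text = eval { decode($encoding, $bytes, FB_CROAK) };
    return defined $text ? (1, $text) : (0, $@);
}

my $status = 0;
for my $case (['cp1252', "fa\x8Dade"], ['nonesuch', "abc"], ['latin1', "fa\xE7ade"]) {
    my ($ok, $result) = try_decode(@$case);
    next if $ok;
    print "error: $result";
    $status = 1;                   # any failure makes the final status 1
}
print "would exit with status $status\n";   # would exit with status 1
```

Byte 0x8D has no mapping in CP1252, so the first case fails with a "does not map to Unicode" message like the one shown in EXAMPLES; the second fails because the encoding name is unknown.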
BUGS
Good question.
SEE ALSO
od(1), cat(1), Encode(3)
HISTORY
First public release February 27, 2011.
AUTHOR
Tom Christiansen <tchrist@perl.com>
COPYRIGHT AND LICENCE
Copyright 2010 Tom Christiansen.
This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.