209
database/perl/lib/pods/perlunitut.pod
Normal file
@@ -0,0 +1,209 @@
=head1 NAME

perlunitut - Perl Unicode Tutorial

=head1 DESCRIPTION

The days of just flinging strings around are over. It's well established that
modern programs need to be capable of communicating funny accented letters, and
things like euro symbols. This means that programmers need new habits. It's
easy to program Unicode-capable software, but it does require discipline to do
it right.

There's a lot to know about character sets and text encodings. It's probably
best to spend a full day learning all this, but the basics can be learned in
minutes.

These are not the very basics, though. It is assumed that you already
know the difference between bytes and characters, and realise (and accept!)
that there are many different character sets and encodings, and that your
program has to be explicit about them. Recommended reading is "The Absolute
Minimum Every Software Developer Absolutely, Positively Must Know About Unicode
and Character Sets (No Excuses!)" by Joel Spolsky, at
L<http://joelonsoftware.com/articles/Unicode.html>.

This tutorial speaks in rather absolute terms, and provides only a limited view
of the wealth of character string related features that Perl has to offer. For
most projects, this information will probably suffice.

=head2 Definitions

It's important to set a few things straight first. This is the most important
part of this tutorial. This view may conflict with other information that you
may have found on the web, but that's mostly because many sources are wrong.

You may have to re-read this entire section a few times...

=head3 Unicode

B<Unicode> is a character set with room for lots of characters. The ordinal
value of a character is called a B<code point>. (But in practice, the
distinction between code point and character is blurred, so the terms are
often used interchangeably.)

There are many, many code points, but computers work with bytes, and a byte has
room for only 256 values. Unicode has many more characters than that,
so you need a method to make these accessible.

Unicode is encoded using several competing encodings, of which UTF-8 is the
most used. In a Unicode encoding, multiple consecutive bytes can be used to
store a single code point, or simply: character.

=head3 UTF-8

B<UTF-8> is a Unicode encoding. Many people think that Unicode and UTF-8 are
the same thing, but they're not. There are more Unicode encodings, but much of
the world has standardized on UTF-8.

UTF-8 treats the first 128 code points, 0..127, the same as ASCII. They take
only one byte per character. All other characters are encoded as two to
four bytes using a complex scheme. Fortunately, Perl handles this for
us, so we don't have to worry about this.
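
The byte counts described above can be checked with a minimal sketch like the
following. The four sample characters are arbitrary picks for illustration,
not part of the original tutorial, and C<encode> is the function from the
L<Encode> module introduced further down.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One-character text strings, written with \x{...} escapes so that this
# example file itself can stay plain ASCII. The samples are illustrative.
my @samples = (
    [ 'LATIN CAPITAL LETTER A'          => "\x{41}"    ],
    [ 'LATIN SMALL LETTER E WITH ACUTE' => "\x{E9}"    ],
    [ 'EURO SIGN'                       => "\x{20AC}"  ],
    [ 'GRINNING FACE'                   => "\x{1F600}" ],
);

my @byte_counts;
for my $sample (@samples) {
    my ($name, $char) = @$sample;

    # Encoding turns the single character into one to four bytes.
    my $bytes = encode('UTF-8', $char);
    push @byte_counts, length $bytes;

    printf "U+%04X %-31s -> %d byte(s)\n", ord($char), $name, length $bytes;
}
```

Only the first sample falls inside 0..127, so only it fits in a single byte.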

=head3 Text strings (character strings)

B<Text strings>, or B<character strings>, are made of characters. Bytes are
irrelevant here, and so are encodings. Each character is just that: the
character.

On a text string, you would do things like:

    $text =~ s/foo/bar/;
    if ($text =~ /^\d+$/) { ... }
    $text = ucfirst $text;
    my $character_count = length $text;

The value of a character (C<ord>, C<chr>) is the corresponding Unicode code
point.
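
As a small sketch of what "made of characters" means in practice, the example
below builds a text string from escapes and inspects it with C<length>,
C<ord> and C<chr>. The sample text is an assumption chosen for illustration.

```perl
use strict;
use warnings;

# The text "cafe" with an acute accent on the e, a space, a euro sign and
# the digit 5: seven characters, two of them outside ASCII.
my $text = "caf\x{E9} \x{20AC}5";

# On a text string, length counts characters.
my $character_count = length $text;

# ord gives the code point of a character; chr goes the other way.
my @code_points = map { sprintf 'U+%04X', ord($_) } split //, $text;
my $euro        = chr 0x20AC;

print "$character_count characters: @code_points\n";
print "chr(0x20AC) matches the sixth character\n"
    if $euro eq substr($text, 5, 1);
```

Nothing here says how many bytes the string would occupy on disk or on the
wire; that question only has an answer after encoding.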

=head3 Binary strings (byte strings)

B<Binary strings>, or B<byte strings> are made of bytes. Here, you don't have
characters, just bytes. All communication with the outside world (anything
outside of your current Perl process) is done in binary.

On a binary string, you would do things like:

    my (@length_content) = unpack "(V/a)*", $binary;
    $binary =~ s/\x00\x0F/\xFF\xF0/;  # for the brave :)
    print {$fh} $binary;
    my $byte_count = length $binary;
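
To make the C<unpack> line above concrete, here is a hedged sketch that first
builds a matching byte string with C<pack>. The record contents (C<"abc"> and
C<"de">) are made up for the example; the template reads as "a 32-bit
little-endian length, followed by that many bytes, repeated".

```perl
use strict;
use warnings;

# Build a byte string of length-prefixed records: each record is a
# 32-bit little-endian length (V) followed by that many bytes (a*).
my $binary = pack "(V/a*)*", "abc", "de";

# The template from the text splits the byte string back into the
# record contents; the length prefixes are consumed, not returned.
my (@length_content) = unpack "(V/a)*", $binary;

# On a byte string, length counts bytes: 4 + 3 + 4 + 2.
my $byte_count = length $binary;

printf "%d bytes, %d records: %s\n",
    $byte_count, scalar @length_content, join ", ", @length_content;
```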

=head3 Encoding

B<Encoding> (as a verb) is the conversion from I<text> to I<binary>. To encode,
you have to supply the target encoding, for example C<iso-8859-1> or C<UTF-8>.
Some encodings, like the C<iso-8859> ("latin") range, do not support the full
Unicode standard; characters that can't be represented are lost in the
conversion.
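
A short sketch of that loss, assuming the default behaviour of C<encode> from
the L<Encode> module, which substitutes a C<?> for each character the target
encoding cannot represent. The sample text is made up; the optional third
argument C<Encode::FB_CROAK> asks C<encode> to die instead of substituting.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The euro sign (U+20AC) has no place in iso-8859-1.
my $text = "price: \x{20AC}5";

# Default mode: the unrepresentable character is replaced, silently.
my $lossy = encode('iso-8859-1', $text);
print "lossy result: $lossy\n";

# Strict mode: work on a copy, because encode may modify its argument
# when a check value is given, and trap the exception with eval.
my $copy   = $text;
my $strict = eval { encode('iso-8859-1', $copy, Encode::FB_CROAK()) };
print "strict mode refused to encode\n" unless defined $strict;
```

Encoding the same text as UTF-8 would lose nothing, which is one reason to
prefer it when you get to choose.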

=head3 Decoding

B<Decoding> is the conversion from I<binary> to I<text>. To decode, you have to
know what encoding was used during the encoding phase. And most of all, it must
be something decodable. It doesn't make much sense to decode a PNG image into a
text string.
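
The sketch below shows why the encoding has to be known: the same two bytes
decode to different text depending on which encoding is named. The byte
values are an illustrative choice (the UTF-8 form of U+00E9).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two bytes that are the UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "\xC3\xA9";

# Decoded with the encoding that was actually used: one character.
my $right = decode('UTF-8', $bytes);

# Decoded with the wrong encoding: no error, just two wrong characters.
my $wrong = decode('ISO-8859-1', $bytes);

printf "as UTF-8:      %d character(s), first is U+%04X\n",
    length $right, ord $right;
printf "as ISO-8859-1: %d character(s), first is U+%04X\n",
    length $wrong, ord $wrong;
```

Note that the wrong guess does not fail; it quietly produces different text,
which is how mojibake ends up in databases.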

=head3 Internal format

Perl has an B<internal format>, an encoding that it uses to encode text strings
so it can store them in memory. All text strings are in this internal format.
In fact, text strings are never in any other format!

You shouldn't worry about what this format is, because conversion is
automatically done when you decode or encode.

=head2 Your new toolkit

Add to your standard heading the following line:

    use Encode qw(encode decode);

Or, if you're lazy, just:

    use Encode;

=head2 I/O flow (the actual 5 minute tutorial)

The typical input/output flow of a program is:

    1. Receive and decode
    2. Process
    3. Encode and output

If your input is binary, and is supposed to remain binary, you shouldn't decode
it to a text string, of course. But in all other cases, you should decode it.

Decoding can't happen reliably if you don't know how the data was encoded. If
you get to choose, it's a good idea to standardize on UTF-8.

    my $foo   = decode('UTF-8', get 'http://example.com/');
    my $bar   = decode('ISO-8859-1', readline STDIN);
    my $xyzzy = decode('Windows-1251', $cgi->param('foo'));

Processing happens as you knew before. The only difference is that you're now
using characters instead of bytes. That's very useful if you use things like
C<substr>, or C<length>.

It's important to realize that there are no bytes in a text string. Of course,
Perl has its internal encoding to store the string in memory, but ignore that.
If you have to do anything with the number of bytes, it's probably best to move
that part to step 3, just after you've encoded the string. Then you know
exactly how many bytes it will be in the destination string.

The syntax for encoding text strings to binary strings is as simple as decoding:

    $body = encode('UTF-8', $body);

If you needed to know the length of the string in bytes, now's the perfect time
for that. Because C<$body> is now a byte string, C<length> will report the
number of bytes, instead of the number of characters. The number of
characters is no longer known, because characters only exist in text strings.

    my $byte_count = length $body;

And if the protocol you're using supports a way of letting the recipient know
which character encoding you used, please help the receiving end by using that
feature! For example, E-mail and HTTP support MIME headers, so you can use the
C<Content-Type> header. They can also have C<Content-Length> to indicate the
number of I<bytes>, which is always a good idea to supply if the number is
known.

    "Content-Type: text/plain; charset=UTF-8",
    "Content-Length: $byte_count"
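
Putting the three steps together, a minimal end-to-end sketch might look like
this. The C<build_response> helper and its sample input are hypothetical and
exist only for illustration; the input is assumed to arrive as UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: takes UTF-8 bytes, returns headers and body bytes.
sub build_response {
    my ($input_bytes) = @_;

    # 1. Receive and decode: from here on we have a text string.
    my $text = decode('UTF-8', $input_bytes);

    # 2. Process: operations work on characters, not bytes.
    $text = ucfirst $text;
    my $character_count = length $text;

    # 3. Encode and output: only now does a byte count exist.
    my $body       = encode('UTF-8', $text);
    my $byte_count = length $body;

    my @headers = (
        "Content-Type: text/plain; charset=UTF-8",
        "Content-Length: $byte_count",
    );
    return (\@headers, $body, $character_count);
}

# Sample input: the French word for "summer", three characters in five bytes.
my ($headers, $body, $character_count) = build_response("\xC3\xA9t\xC3\xA9");

binmode STDOUT;    # the output is a byte string, so write it as bytes
print join("\r\n", @$headers), "\r\n\r\n", $body, "\n";
```

The character count and the C<Content-Length> differ here, which is exactly
why the byte count is taken after encoding rather than before.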

=head1 SUMMARY

Decode everything you receive, encode everything you send out. (If it's text
data.)

=head1 Q and A (or FAQ)

After reading this document, you ought to read L<perlunifaq> too, then
L<perluniintro>.

=head1 ACKNOWLEDGEMENTS

Thanks to Johan Vromans from Squirrel Consultancy. His UTF-8 rants during the
Amsterdam Perl Mongers meetings got me interested and determined to find out
how to use character encodings in Perl in ways that don't break easily.

Thanks to Gerard Goossen from TTY. His presentation "UTF-8 in the wild" (Dutch
Perl Workshop 2006) inspired me to publish my thoughts and write this tutorial.

Thanks to the people who asked about this kind of stuff in several Perl IRC
channels, and have constantly reminded me that a simpler explanation was
needed.

Thanks to the people who reviewed this document for me, before it went public.
They are: Benjamin Smith, Jan-Pieter Cornet, Johan Vromans, Lukas Mai, Nathan
Gray.

=head1 AUTHOR

Juerd Waalboer <#####@juerd.nl>

=head1 SEE ALSO

L<perlunifaq>, L<perlunicode>, L<perluniintro>, L<Encode>