208 lines
6.3 KiB
Plaintext
208 lines
6.3 KiB
Plaintext
=head1 NAME
|
|
|
|
Unicode::UTF8 - Encoding and decoding of UTF-8 encoding form
|
|
|
|
=head1 SYNOPSIS
|
|
|
|
use Unicode::UTF8 qw[decode_utf8 encode_utf8];
|
|
|
|
use warnings FATAL => 'utf8'; # fatalize encoding glitches
|
|
$string = decode_utf8($octets);
|
|
$octets = encode_utf8($string);
|
|
|
|
=head1 DESCRIPTION
|
|
|
|
This module provides functions to encode and decode UTF-8 encoding form as
|
|
specified by Unicode and ISO/IEC 10646:2011.
|
|
|
|
=head1 FUNCTIONS
|
|
|
|
=head2 decode_utf8
|
|
|
|
$string = decode_utf8($octets);
|
|
$string = decode_utf8($octets, $fallback);
|
|
|
|
Returns an decoded representation of C<$octets> in UTF-8 encoding as a character
|
|
string.
|
|
|
|
C<$fallback> is an optional C<CODE> reference which provides a error-handling
|
|
mechanism, allowing customization of error handling. The default error-handling
|
|
mechanism is to replace any ill-formed UTF-8 sequences or encoded code points
|
|
which can't be interchanged with REPLACEMENT CHARACTER (U+FFFD).
|
|
|
|
$string = $fallback->($octets, $is_usv, $position);
|
|
|
|
C<$fallback> is invoked with three arguments: C<$octets>, C<$is_usv> and
|
|
C<$position>. C<$octets> is a sequence of one or more octets containing the
|
|
maximal subpart of the ill-formed subsequence or encoded code point which
|
|
can't be interchanged. C<$is_usv> is a boolean indicating whether or not
|
|
C<$octets> represent a encoded Unicode scalar value. C<$position> is a
|
|
unsigned integer containing the zero based octet position at which the error
|
|
occurred within the octets provided to C<decode_utf8()>. C<$fallback> must
|
|
return a character string consisting of zero or more Unicode scalar values.
|
|
Unicode scalar values consist of code points in the range U+0000..U+D7FF and
|
|
U+E000..U+10FFFF.
|
|
|
|
=head2 encode_utf8
|
|
|
|
$octets = encode_utf8($string);
|
|
$octets = encode_utf8($string, $fallback);
|
|
|
|
Returns an encoded representation of C<$string> in UTF-8 encoding as an octet
|
|
string.
|
|
|
|
C<$fallback> is an optional C<CODE> reference which provides a error-handling
|
|
mechanism, allowing customization of error handling. The default error-handling
|
|
mechanism is to replace any code points which can't be interchanged or represented
|
|
in UTF-8 encoding form with REPLACEMENT CHARACTER (U+FFFD).
|
|
|
|
$string = $fallback->($codepoint, $is_usv, $position);
|
|
|
|
C<$fallback> is invoked with three arguments: C<$codepoint>, C<$is_usv> and
|
|
C<$position>. C<$codepoint> is a unsigned integer containing the code point
|
|
which can't be interchanged or represented in UTF-8 encoding form. C<$is_usv>
|
|
is a boolean indicating whether or not C<$codepoint> is a Unicode scalar value.
|
|
C<$position> is a unsigned integer containing the zero based character position
|
|
at which the error occurred within the string provided to C<encode_utf8()>.
|
|
C<$fallback> must return a character string consisting of zero or more Unicode
|
|
scalar values.Unicode scalar values consist of code points in the range
|
|
U+0000..U+D7FF and U+E000..U+10FFFF.
|
|
|
|
=head2 valid_utf8
|
|
|
|
$boolean = valid_utf8($octets);
|
|
|
|
Returns a boolean indicating whether or not the given C<$octets> consist of
|
|
well-formed UTF-8 sequences.
|
|
|
|
=head1 EXPORTS
|
|
|
|
None by default. All functions can be exported using the C<:all> tag or individually.
|
|
|
|
=head1 DIAGNOSTICS
|
|
|
|
=over 4
|
|
|
|
=item Can't decode a wide character string
|
|
|
|
(F) Wide character in octets.
|
|
|
|
=item Can't validate a wide character string
|
|
|
|
(F) Wide character in octets.
|
|
|
|
=item Can't decode ill-formed UTF-8 octet sequence <%s> in position %u
|
|
|
|
(W utf8) Encountered an ill-formed UTF-8 octet sequence. <%s> contains a
|
|
hexadecimal representation of the maximal subpart of the ill-formed subsequence.
|
|
|
|
=item Can't interchange noncharacter code point U+%X in position %u
|
|
|
|
(W utf8, nonchar) Noncharacters are code points that are permanently reserved
|
|
in the Unicode Standard for internal use. They are forbidden for use in open
|
|
interchange of Unicode text data. Noncharacters consist of the values U+nFFFE
|
|
and U+nFFFF (where n is from 0 to 10^16) and the values U+FDD0..U+FDEF.
|
|
|
|
=item Can't represent surrogate code point U+%X in position %u
|
|
|
|
(W utf8, surrogate) Surrogate code points are designated only for surrogate code
|
|
units in the UTF-16 character encoding form. Surrogates consist of code points
|
|
in the range U+D800 to U+DFFF.
|
|
|
|
=item Can't represent super code point \x{%X} in position %u
|
|
|
|
(W utf8, non_unicode) Code points greater than U+10FFFF. Perl's extended codespace.
|
|
|
|
=item Can't decode ill-formed UTF-X octet sequence <%s> in position %u
|
|
|
|
(F) Encountered an ill-formed octet sequence in Perl's internal representation
|
|
of wide characters.
|
|
|
|
=back
|
|
|
|
The sub-categories: C<nonchar>, C<surrogate> and C<non_unicode> is only available
|
|
on Perl 5.14 or greater. See L<perllexwarn> for available categories and hierarchies.
|
|
|
|
=head1 COMPARISON
|
|
|
|
Here is a summary of features for comparison with L<Encode>'s UTF-8 implementation:
|
|
|
|
=over 4
|
|
|
|
=item *
|
|
|
|
Simple API which makes use of Perl's standard warning categories.
|
|
|
|
=item *
|
|
|
|
Recognizes all noncharacters regardless of Perl version
|
|
|
|
=item *
|
|
|
|
Implements Unicode's recommended practice for using U+FFFD.
|
|
|
|
=item *
|
|
|
|
Better diagnostics in warning messages
|
|
|
|
=item *
|
|
|
|
Detects and reports inconsistency in Perl's internal representation of
|
|
wide characters (UTF-X)
|
|
|
|
=item *
|
|
|
|
Preserves taintedness of decoded C<$octets> or encoded C<$string>
|
|
|
|
=item *
|
|
|
|
Better performance ~ 600% - 1200% (JA: 600%, AR: 700%, SV: 900%, EN: 1200%,
|
|
see benchmarks directory in git repository)
|
|
|
|
=back
|
|
|
|
=head1 CONFORMANCE
|
|
|
|
It's the author's belief that this UTF-8 implementation is conformant with
|
|
the Unicode Standard Version 6.0. Any deviations from the Unicode Standard
|
|
is to be considered a bug.
|
|
|
|
=head1 SEE ALSO
|
|
|
|
=over 4
|
|
|
|
=item L<Encode>
|
|
|
|
=item L<http://www.unicode.org/>
|
|
|
|
=back
|
|
|
|
=head1 SUPPORT
|
|
|
|
=head2 BUGS
|
|
|
|
Please report any bugs by email to C<bug-unicode-utf8 at rt.cpan.org>, or
|
|
through the web interface at L<http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-UTF8>.
|
|
You will be automatically notified of any progress on the request by the system.
|
|
|
|
=head2 SOURCE CODE
|
|
|
|
This is open source software. The code repository is available for public
|
|
review and contribution under the terms of the license.
|
|
|
|
L<http://github.com/chansen/p5-unicode-utf8>
|
|
|
|
git clone http://github.com/chansen/p5-unicode-utf8
|
|
|
|
=head1 AUTHOR
|
|
|
|
Christian Hansen C<chansen@cpan.org>
|
|
|
|
=head1 COPYRIGHT
|
|
|
|
Copyright 2011-2017 by Christian Hansen.
|
|
|
|
This is free software; you can redistribute it and/or modify it under
|
|
the same terms as the Perl 5 programming language system itself.
|
|
|