305 lines
6.2 KiB
Plaintext
305 lines
6.2 KiB
Plaintext
=encoding utf-8
|
|
|
|
=head1 NAME
|
|
|
|
Unicode::GCString - String as Sequence of UAX #29 Grapheme Clusters
|
|
|
|
=head1 SYNOPSIS
|
|
|
|
use Unicode::GCString;
|
|
$gcstring = Unicode::GCString->new($string);
|
|
|
|
=head1 DESCRIPTION
|
|
|
|
Unicode::GCString treats Unicode string as a sequence of
|
|
I<extended grapheme clusters> defined by Unicode Standard Annex #29 [UAX #29].
|
|
|
|
B<Grapheme cluster> is a sequence of Unicode character(s) that consists of one
|
|
B<grapheme base> and optional B<grapheme extender> and/or
|
|
B<“prepend” character>. It is close in that people consider as I<character>.
|
|
|
|
=head2 Public Interface
|
|
|
|
=head3 Constructors
|
|
|
|
=over 4
|
|
|
|
=item new (STRING, [KEY => VALUE, ...])
|
|
|
|
=item new (STRING, [LINEBREAK])
|
|
|
|
I<Constructor>.
|
|
Create new grapheme cluster string (Unicode::GCString object) from
|
|
Unicode string STRING.
|
|
|
|
About optional KEY => VALUE pairs see L<Unicode::LineBreak/Options>.
|
|
On second form, L<Unicode::LineBreak> object LINEBREAK controls
|
|
breaking features.
|
|
|
|
B<Note>:
|
|
The first form was introduced by release 2012.10.
|
|
|
|
=item copy
|
|
|
|
I<Copy constructor>.
|
|
Create a copy of grapheme cluster string.
|
|
Next position of new string is set at beginning.
|
|
|
|
=back
|
|
|
|
=head3 Sizes
|
|
|
|
=over 4
|
|
|
|
=item chars
|
|
|
|
I<Instance method>.
|
|
Returns number of Unicode characters grapheme cluster string includes,
|
|
i.e. length as Unicode string.
|
|
|
|
=item columns
|
|
|
|
I<Instance method>.
|
|
Returns total number of columns of grapheme clusters
|
|
defined by built-in character database.
|
|
For more details see L<Unicode::LineBreak/DESCRIPTION>.
|
|
|
|
=item length
|
|
|
|
I<Instance method>.
|
|
Returns number of grapheme clusters contained in grapheme cluster string.
|
|
|
|
=back
|
|
|
|
=head3 Operations as String
|
|
|
|
=over 4
|
|
|
|
=item as_string
|
|
|
|
=item C<">OBJECTC<">
|
|
|
|
I<Instance method>.
|
|
Convert grapheme cluster string to Unicode string explicitly.
|
|
|
|
=item cmp (STRING)
|
|
|
|
=item STRING C<cmp> STRING
|
|
|
|
I<Instance method>.
|
|
Compare strings. There are no oddities.
|
|
One of each STRING may be Unicode string.
|
|
|
|
=item concat (STRING)
|
|
|
|
=item STRING C<.> STRING
|
|
|
|
I<Instance method>.
|
|
Concatenate STRINGs. One of each STRING may be Unicode string.
|
|
Note that number of columns (see columns()) or grapheme clusters
|
|
(see length()) of resulting string is not always equal to sum of both
|
|
strings.
|
|
Next position of new string is that set on the left value.
|
|
|
|
=item join ([STRING, ...])
|
|
|
|
I<Instance method>.
|
|
Join STRINGs inserting grapheme cluster string.
|
|
Any of STRINGs may be Unicode string.
|
|
|
|
=item substr (OFFSET, [LENGTH, [REPLACEMENT]])
|
|
|
|
I<Instance method>.
|
|
Returns substring of grapheme cluster string.
|
|
OFFSET and LENGTH are based on grapheme clusters.
|
|
If REPLACEMENT is specified, substring is replaced by it.
|
|
REPLACEMENT may be Unicode string.
|
|
|
|
Note:
|
|
This method cannot return the lvalue, unlike built-in substr().
|
|
|
|
=back
|
|
|
|
=head3 Operations as Sequence of Grapheme Clusters
|
|
|
|
=over 4
|
|
|
|
=item as_array
|
|
|
|
=item C<@{>OBJECTC<}>
|
|
|
|
=item as_arrayref
|
|
|
|
I<Instance method>.
|
|
Convert grapheme cluster string to an array of grapheme clusters.
|
|
|
|
=item eos
|
|
|
|
I<Instance method>.
|
|
Test if current position is at end of grapheme cluster string.
|
|
|
|
=item item ([OFFSET])
|
|
|
|
I<Instance method>.
|
|
Returns OFFSET-th grapheme cluster.
|
|
If OFFSET was not specified, returns next grapheme cluster.
|
|
|
|
=item next
|
|
|
|
=item C<E<lt>>OBJECTC<E<gt>>
|
|
|
|
I<Instance method>, iterative.
|
|
Returns next grapheme cluster and increment next position.
|
|
|
|
=item pos ([OFFSET])
|
|
|
|
I<Instance method>.
|
|
If optional OFFSET is specified, set next position by it.
|
|
Returns next position of grapheme cluster string.
|
|
|
|
=back
|
|
|
|
=begin comment
|
|
|
|
=head4 Methods planned to be deprecated
|
|
|
|
=over 4
|
|
|
|
=item flag ([OFFSET, [VALUE]])
|
|
|
|
I<Instance method>.
|
|
Get or set flag value of OFFEST-th grapheme cluster.
|
|
If OFFSET was not specified, returns flag value of next grapheme cluster.
|
|
Flag value is an non-zero integer not greater than 255 and initially is 0.
|
|
|
|
Predefined flags are:
|
|
|
|
=over 4
|
|
|
|
=item Unicode::LineBreak::ALLOW_BEFORE
|
|
|
|
Allow line breaking just before this grapheme cluster.
|
|
|
|
=item Unicode::LineBreak::PROHIBIT_BEFORE
|
|
|
|
Prohibit line breaking just before this grapheme cluster.
|
|
|
|
=back
|
|
|
|
=item lbclass ([OFFSET])
|
|
|
|
I<Instance method>.
|
|
Returns Line Breaking Class (See L<Unicode::LineBreak>) of the first
|
|
character of OFFSET-th grapheme cluster.
|
|
If OFFSET was not specified, returns class of next grapheme cluster.
|
|
|
|
B<Note>:
|
|
Use lbc().
|
|
|
|
=item lbclass_ext ([OFFSET])
|
|
|
|
I<Instance method>.
|
|
Returns Line Breaking Class (See L<Unicode::LineBreak>) of the last
|
|
grapheme extender of OFFSET-th grapheme cluster. If there are no
|
|
grapheme extenders or its class is CM, value of lbclass() is returned.
|
|
|
|
B<Note>:
|
|
Use lbcext().
|
|
|
|
=back
|
|
|
|
=end comment
|
|
|
|
=head3 Miscelaneous
|
|
|
|
=over 4
|
|
|
|
=item lbc
|
|
|
|
I<Instance method>.
|
|
Returns Line Breaking Class (See L<Unicode::LineBreak>) of the first
|
|
character of first grapheme cluster.
|
|
|
|
=item lbcext
|
|
|
|
I<Instance method>.
|
|
Returns Line Breaking Class (See L<Unicode::LineBreak>) of the last
|
|
grapheme extender of last grapheme cluster.
|
|
If there are no grapheme extenders or its class is CM, value of last
|
|
grapheme base will be returned.
|
|
|
|
=back
|
|
|
|
=head1 CAVEATS
|
|
|
|
=over 4
|
|
|
|
=item *
|
|
|
|
The grapheme cluster should not be referred to as "grapheme"
|
|
even though Larry does.
|
|
|
|
=item *
|
|
|
|
On Perl around 5.10.1, implicit conversion from Unicode::GCString object to
|
|
Unicode string sometimes let C<"utf8_mg_pos_cache_update"> cache be confused.
|
|
|
|
For example, instead of doing
|
|
|
|
$sub = substr($gcstring, $i, $j);
|
|
|
|
do
|
|
|
|
$sub = substr("$gcstring", $i, $j);
|
|
|
|
$sub = substr($gcstring->as_string, $i, $j);
|
|
|
|
=item *
|
|
|
|
This module implements I<default> algorithm for determining grapheme cluster
|
|
boundaries. Tailoring mechanism has not been supported yet.
|
|
|
|
=back
|
|
|
|
=head1 VERSION
|
|
|
|
Consult $VERSION variable.
|
|
|
|
=head2 Incompatible Changes
|
|
|
|
=over 4
|
|
|
|
=item Release 2013.10
|
|
|
|
=over 4
|
|
|
|
=item *
|
|
|
|
The new() method can take non-Unicode string argument.
|
|
In this case it will be decoded by iso-8859-1 (Latin 1) character set.
|
|
That method of former releases would die with non-Unicode inputs.
|
|
|
|
=back
|
|
|
|
=back
|
|
|
|
=head1 SEE ALSO
|
|
|
|
[UAX #29]
|
|
Mark Davis (ed.) (2009-2013).
|
|
I<Unicode Standard Annex #29: Unicode Text Segmentation>, Revisions 15-23.
|
|
L<http://www.unicode.org/reports/tr29/>.
|
|
|
|
=head1 AUTHOR
|
|
|
|
Hatuka*nezumi - IKEDA Soji <hatuka(at)nezumi.nu>
|
|
|
|
=head1 COPYRIGHT
|
|
|
|
Copyright (C) 2009-2013 Hatuka*nezumi - IKEDA Soji.
|
|
|
|
This program is free software; you can redistribute it and/or modify it
|
|
under the same terms as Perl itself.
|
|
|
|
=cut
|