Initial Commit
This commit is contained in:
203
database/perl/vendor/lib/HTML/AsSubs.pm
vendored
Normal file
203
database/perl/vendor/lib/HTML/AsSubs.pm
vendored
Normal file
@@ -0,0 +1,203 @@
|
||||
package HTML::AsSubs;
|
||||
|
||||
# ABSTRACT: functions that construct a HTML syntax tree
|
||||
|
||||
|
||||
use warnings;
|
||||
use strict;
|
||||
use vars qw(@ISA @EXPORT);
|
||||
|
||||
our $VERSION = '5.07'; # VERSION from OurPkgVersion
|
||||
|
||||
require HTML::Element;
|
||||
require Exporter;
|
||||
@ISA = qw(Exporter);
|
||||
|
||||
# Problem: exports so damned much. Has no concept of "export only HTML4
|
||||
# elements". TODO:?? make something that make functions that just
|
||||
# wrap XML::Generator calls?
|
||||
|
||||
|
||||
use vars qw(@TAGS);
|
||||
@TAGS = qw(html
|
||||
head title base link meta isindex nextid script style
|
||||
body h1 h2 h3 h4 h5 h6 p pre div blockquote
|
||||
a img br hr
|
||||
ol ul dir menu li
|
||||
dl dt dd
|
||||
dfn cite code em kbd samp strong var address span
|
||||
b i u tt
|
||||
center font big small strike
|
||||
sub sup
|
||||
table tr td th caption
|
||||
form input select option textarea
|
||||
object applet param
|
||||
map area
|
||||
frame frameset noframe
|
||||
);
|
||||
|
||||
for (@TAGS) {
|
||||
my $code;
|
||||
$code = "sub $_ { _elem('$_', \@_); }\n";
|
||||
push( @EXPORT, $_ );
|
||||
## no critic
|
||||
eval $code;
|
||||
## use critic
|
||||
if ($@) {
|
||||
die $@;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
sub _elem {
|
||||
my $tag = shift;
|
||||
my $attributes;
|
||||
if ( @_ and defined $_[0] and ref( $_[0] ) eq "HASH" ) {
|
||||
$attributes = shift;
|
||||
}
|
||||
my $elem = HTML::Element->new( $tag, %$attributes );
|
||||
$elem->push_content(@_);
|
||||
$elem;
|
||||
}
|
||||
|
||||
1;
|
||||
|
||||
__END__
|
||||
|
||||
=pod
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::AsSubs - functions that construct a HTML syntax tree
|
||||
|
||||
=head1 VERSION
|
||||
|
||||
This document describes version 5.07 of
|
||||
HTML::AsSubs, released August 31, 2017
|
||||
as part of L<HTML-Tree|HTML::Tree>.
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
use HTML::AsSubs;
|
||||
$h = body(
|
||||
h1("This is the heading"),
|
||||
p("This is the first paragraph which contains a ",
|
||||
a({href=>'link.html'}, "link"),
|
||||
" and an ",
|
||||
img({src=>'img.gif', alt=>'image'}),
|
||||
"."
|
||||
),
|
||||
);
|
||||
print $h->as_HTML;
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
This module exports functions that can be used to construct various
|
||||
HTML elements. The functions are named after the tags of the
|
||||
corresponding HTML element and are all written in lower case. If the
|
||||
first argument is a hash reference then it will be used to initialize the
|
||||
attributes of this element. The remaining arguments are regarded as
|
||||
content.
|
||||
|
||||
For a similar idea (i.e., it's another case where the syntax tree
|
||||
of the Perl source mirrors the syntax tree of the HTML produced),
|
||||
see HTML::Element's C<new_from_lol> method.
|
||||
|
||||
For what I now think is a cleaner implementation of this same idea,
|
||||
see the excellent module C<XML::Generator>, which is what I suggest
|
||||
for actual real-life use. (I suggest this over C<HTML::AsSubs> and
|
||||
over C<CGI.pm>'s HTML-making functions.)
|
||||
|
||||
=head1 ACKNOWLEDGEMENT
|
||||
|
||||
This module was inspired by the following message:
|
||||
|
||||
Date: Tue, 4 Oct 1994 16:11:30 +0100
|
||||
Subject: Wow! I have a large lightbulb above my head!
|
||||
|
||||
Take a moment to consider these lines:
|
||||
|
||||
%OVERLOAD=( '""' => sub { join("", @{$_[0]}) } );
|
||||
|
||||
sub html { my($type)=shift; bless ["<$type>", @_, "</$type>"]; }
|
||||
|
||||
:-) I *love* Perl 5! Thankyou Larry and Ilya.
|
||||
|
||||
Regards,
|
||||
Tim Bunce.
|
||||
|
||||
p.s. If you didn't get it, think about recursive data types: html(html())
|
||||
p.p.s. I'll turn this into a much more practical example in a day or two.
|
||||
p.p.p.s. It's a pity that overloads are not inherited. Is this a bug?
|
||||
|
||||
=head1 BUGS
|
||||
|
||||
The exported link() function overrides the builtin link() function.
|
||||
The exported tr() function must be called using &tr(...) syntax
|
||||
because it clashes with the builtin tr/../../ operator.
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
L<HTML::Element>, L<XML::Generator>
|
||||
|
||||
=head2 html head title base link meta isindex nextid script style body h1 h2 h3 h4 h5 h6 p pre div blockquote a img br hr ol ul dir menu li dl dt dd dfn cite code em kbd samp strong var address span b i u tt center font big small strike sub sup table tr td th caption form input select option textarea object applet param map area frame frameset noframe
|
||||
|
||||
A bunch of methods for creating tags.
|
||||
|
||||
=head1 Private Functions
|
||||
|
||||
=head2 _elem()
|
||||
|
||||
The _elem() function is wrapped by all the html 'tag' functions. It
|
||||
takes a tag-name, optional hashref of attributes and a list of content
|
||||
as parameters.
|
||||
|
||||
=head1 AUTHOR
|
||||
|
||||
Current maintainers:
|
||||
|
||||
=over
|
||||
|
||||
=item * Christopher J. Madsen S<C<< <perl AT cjmweb.net> >>>
|
||||
|
||||
=item * Jeff Fearn S<C<< <jfearn AT cpan.org> >>>
|
||||
|
||||
=back
|
||||
|
||||
Original HTML-Tree author:
|
||||
|
||||
=over
|
||||
|
||||
=item * Gisle Aas
|
||||
|
||||
=back
|
||||
|
||||
Former maintainers:
|
||||
|
||||
=over
|
||||
|
||||
=item * Sean M. Burke
|
||||
|
||||
=item * Andy Lester
|
||||
|
||||
=item * Pete Krawczyk S<C<< <petek AT cpan.org> >>>
|
||||
|
||||
=back
|
||||
|
||||
You can follow or contribute to HTML-Tree's development at
|
||||
L<< https://github.com/kentfredric/HTML-Tree >>.
|
||||
|
||||
=head1 COPYRIGHT AND LICENSE
|
||||
|
||||
Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke,
|
||||
2005 Andy Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn,
|
||||
2012 Christopher J. Madsen.
|
||||
|
||||
This library is free software; you can redistribute it and/or
|
||||
modify it under the same terms as Perl itself.
|
||||
|
||||
The programs in this library are distributed in the hope that they
|
||||
will be useful, but without any warranty; without even the implied
|
||||
warranty of merchantability or fitness for a particular purpose.
|
||||
|
||||
=cut
|
||||
4486
database/perl/vendor/lib/HTML/Element.pm
vendored
Normal file
4486
database/perl/vendor/lib/HTML/Element.pm
vendored
Normal file
File diff suppressed because it is too large
Load Diff
374
database/perl/vendor/lib/HTML/Element/traverse.pm
vendored
Normal file
374
database/perl/vendor/lib/HTML/Element/traverse.pm
vendored
Normal file
@@ -0,0 +1,374 @@
|
||||
|
||||
# This is a .pm just to (try to) make some CPAN document converters
|
||||
# convert it happily as part of the dist's documentation tree.
|
||||
package HTML::Element::traverse;
|
||||
|
||||
# ABSTRACT: discussion of HTML::Element's traverse method
|
||||
|
||||
use warnings;
|
||||
use strict;
|
||||
|
||||
our $VERSION = '5.07'; # VERSION from OurPkgVersion
|
||||
|
||||
use HTML::Element ();
|
||||
1;
|
||||
|
||||
__END__
|
||||
|
||||
=pod
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::Element::traverse - discussion of HTML::Element's traverse method
|
||||
|
||||
=head1 VERSION
|
||||
|
||||
This document describes version 5.07 of
|
||||
HTML::Element::traverse, released August 31, 2017
|
||||
as part of L<HTML-Tree|HTML::Tree>.
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
# $element->traverse is unnecessary and obscure.
|
||||
# Don't use it in new code.
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
C<HTML::Element> provides a method C<traverse> that traverses the tree
|
||||
and calls user-specified callbacks for each node, in pre- or
|
||||
post-order. However, use of the method is quite superfluous: if you
|
||||
want to recursively visit every node in the tree, it's almost always
|
||||
simpler to write a subroutine does just that, than it is to bundle up
|
||||
the pre- and/or post-order code in callbacks for the C<traverse>
|
||||
method.
|
||||
|
||||
=head1 EXAMPLES
|
||||
|
||||
Suppose you want to traverse at/under a node $tree and give elements
|
||||
an 'id' attribute unless they already have one.
|
||||
|
||||
You can use the C<traverse> method:
|
||||
|
||||
{
|
||||
my $counter = 'x0000';
|
||||
$start_node->traverse(
|
||||
[ # Callbacks;
|
||||
# pre-order callback:
|
||||
sub {
|
||||
my $x = $_[0];
|
||||
$x->attr('id', $counter++) unless defined $x->attr('id');
|
||||
return HTML::Element::OK; # keep traversing
|
||||
},
|
||||
# post-order callback:
|
||||
undef
|
||||
],
|
||||
1, # don't call the callbacks for text nodes
|
||||
);
|
||||
}
|
||||
|
||||
or you can just be simple and clear (and not have to understand the
|
||||
calling format for C<traverse>) by writing a sub that traverses the
|
||||
tree by just calling itself:
|
||||
|
||||
{
|
||||
my $counter = 'x0000';
|
||||
sub give_id {
|
||||
my $x = $_[0];
|
||||
$x->attr('id', $counter++) unless defined $x->attr('id');
|
||||
foreach my $c ($x->content_list) {
|
||||
give_id($c) if ref $c; # ignore text nodes
|
||||
}
|
||||
};
|
||||
give_id($start_node);
|
||||
}
|
||||
|
||||
See, isn't that nice and clear?
|
||||
|
||||
But, if you really need to know:
|
||||
|
||||
=head1 THE TRAVERSE METHOD
|
||||
|
||||
The C<traverse()> method is a general object-method for traversing a
|
||||
tree or subtree and calling user-specified callbacks. It accepts the
|
||||
following syntaxes:
|
||||
|
||||
=over
|
||||
|
||||
=item $h->traverse(\&callback)
|
||||
|
||||
=item or $h->traverse(\&callback, $ignore_text)
|
||||
|
||||
=item or $h->traverse( [\&pre_callback,\&post_callback] , $ignore_text)
|
||||
|
||||
=back
|
||||
|
||||
These all mean to traverse the element and all of its children. That
|
||||
is, this method starts at node $h, "pre-order visits" $h, traverses its
|
||||
children, and then will "post-order visit" $h. "Visiting" means that
|
||||
the callback routine is called, with these arguments:
|
||||
|
||||
$_[0] : the node (element or text segment),
|
||||
$_[1] : a startflag, and
|
||||
$_[2] : the depth
|
||||
|
||||
If the $ignore_text parameter is given and true, then the pre-order
|
||||
call I<will not> be happen for text content.
|
||||
|
||||
The startflag is 1 when we enter a node (i.e., in pre-order calls) and
|
||||
0 when we leave the node (in post-order calls).
|
||||
|
||||
Note, however, that post-order calls don't happen for nodes that are
|
||||
text segments or are elements that are prototypically empty (like "br",
|
||||
"hr", etc.).
|
||||
|
||||
If we visit text nodes (i.e., unless $ignore_text is given and true),
|
||||
then when text nodes are visited, we will also pass two extra
|
||||
arguments to the callback:
|
||||
|
||||
$_[3] : the element that's the parent
|
||||
of this text node
|
||||
$_[4] : the index of this text node
|
||||
in its parent's content list
|
||||
|
||||
Note that you can specify that the pre-order routine can
|
||||
be a different routine from the post-order one:
|
||||
|
||||
$h->traverse( [\&pre_callback,\&post_callback], ...);
|
||||
|
||||
You can also specify that no post-order calls are to be made,
|
||||
by providing a false value as the post-order routine:
|
||||
|
||||
$h->traverse([ \&pre_callback,0 ], ...);
|
||||
|
||||
And similarly for suppressing pre-order callbacks:
|
||||
|
||||
$h->traverse([ 0,\&post_callback ], ...);
|
||||
|
||||
Note that these two syntaxes specify the same operation:
|
||||
|
||||
$h->traverse([\&foo,\&foo], ...);
|
||||
$h->traverse( \&foo , ...);
|
||||
|
||||
The return values from calls to your pre- or post-order
|
||||
routines are significant, and are used to control recursion
|
||||
into the tree.
|
||||
|
||||
These are the values you can return, listed in descending order
|
||||
of my estimation of their usefulness:
|
||||
|
||||
=over
|
||||
|
||||
=item HTML::Element::OK, 1, or any other true value
|
||||
|
||||
...to keep on traversing.
|
||||
|
||||
Note that C<HTML::Element::OK> et
|
||||
al are constants. So if you're running under C<use strict>
|
||||
(as I hope you are), and you say:
|
||||
C<return HTML::Element::PRUEN>
|
||||
the compiler will flag this as an error (an unallowable
|
||||
bareword, specifically), whereas if you spell PRUNE correctly,
|
||||
the compiler will not complain.
|
||||
|
||||
=item undef, 0, '0', '', or HTML::Element::PRUNE
|
||||
|
||||
...to block traversing under the current element's content.
|
||||
(This is ignored if received from a post-order callback,
|
||||
since by then the recursion has already happened.)
|
||||
If this is returned by a pre-order callback, no
|
||||
post-order callback for the current node will happen.
|
||||
(Recall that if your callback exits with just C<return;>,
|
||||
it is returning undef -- at least in scalar context, and
|
||||
C<traverse> always calls your callbacks in scalar context.)
|
||||
|
||||
=item HTML::Element::ABORT
|
||||
|
||||
...to abort the whole traversal immediately.
|
||||
This is often useful when you're looking for just the first
|
||||
node in the tree that meets some criterion of yours.
|
||||
|
||||
=item HTML::Element::PRUNE_UP
|
||||
|
||||
...to abort continued traversal into this node and its parent
|
||||
node. No post-order callback for the current or parent
|
||||
node will happen.
|
||||
|
||||
=item HTML::Element::PRUNE_SOFTLY
|
||||
|
||||
Like PRUNE, except that the post-order call for the current
|
||||
node is not blocked.
|
||||
|
||||
=back
|
||||
|
||||
Almost every task to do with extracting information from a tree can be
|
||||
expressed in terms of traverse operations (usually in only one pass,
|
||||
and usually paying attention to only pre-order, or to only
|
||||
post-order), or operations based on traversing. (In fact, many of the
|
||||
other methods in this class are basically calls to traverse() with
|
||||
particular arguments.)
|
||||
|
||||
The source code for HTML::Element and HTML::TreeBuilder contain
|
||||
several examples of the use of the "traverse" method to gather
|
||||
information about the content of trees and subtrees.
|
||||
|
||||
(Note: you should not change the structure of a tree I<while> you are
|
||||
traversing it.)
|
||||
|
||||
[End of documentation for the C<traverse()> method]
|
||||
|
||||
=head2 Traversing with Recursive Anonymous Routines
|
||||
|
||||
Now, if you've been reading
|
||||
I<Structure and Interpretation of Computer Programs> too much, maybe
|
||||
you even want a recursive lambda. Go ahead:
|
||||
|
||||
{
|
||||
my $counter = 'x0000';
|
||||
my $give_id;
|
||||
$give_id = sub {
|
||||
my $x = $_[0];
|
||||
$x->attr('id', $counter++) unless defined $x->attr('id');
|
||||
foreach my $c ($x->content_list) {
|
||||
$give_id->($c) if ref $c; # ignore text nodes
|
||||
}
|
||||
};
|
||||
$give_id->($start_node);
|
||||
undef $give_id;
|
||||
}
|
||||
|
||||
It's a bit nutty, and it's I<still> more concise than a call to the
|
||||
C<traverse> method!
|
||||
|
||||
It is left as an exercise to the reader to figure out how to do the
|
||||
same thing without using a C<$give_id> symbol at all.
|
||||
|
||||
It is also left as an exercise to the reader to figure out why I
|
||||
undefine C<$give_id>, above; and why I could achieved the same effect
|
||||
with any of:
|
||||
|
||||
$give_id = 'I like pie!';
|
||||
# or...
|
||||
$give_id = [];
|
||||
# or even;
|
||||
$give_id = sub { print "Mmmm pie!\n" };
|
||||
|
||||
But not:
|
||||
|
||||
$give_id = sub { print "I'm $give_id and I like pie!\n" };
|
||||
# nor...
|
||||
$give_id = \$give_id;
|
||||
# nor...
|
||||
$give_id = { 'pie' => \$give_id, 'mode' => 'a la' };
|
||||
|
||||
=head2 Doing Recursive Things Iteratively
|
||||
|
||||
Note that you may at times see an iterative implementation of
|
||||
pre-order traversal, like so:
|
||||
|
||||
{
|
||||
my @to_do = ($tree); # start-node
|
||||
while(@to_do) {
|
||||
my $this = shift @to_do;
|
||||
|
||||
# "Visit" the node:
|
||||
$this->attr('id', $counter++)
|
||||
unless defined $this->attr('id');
|
||||
|
||||
unshift @to_do, grep ref $_, $this->content_list;
|
||||
# Put children on the stack -- they'll be visited next
|
||||
}
|
||||
}
|
||||
|
||||
This can I<under certain circumstances> be more efficient than just a
|
||||
normal recursive routine, but at the cost of being rather obscure. It
|
||||
gains efficiency by avoiding the overhead of function-calling, but
|
||||
since there are several method dispatches however you do it (to
|
||||
C<attr> and C<content_list>), the overhead for a simple function call
|
||||
is insignificant.
|
||||
|
||||
=head2 Pruning and Whatnot
|
||||
|
||||
The C<traverse> method does have the fairly neat features of
|
||||
the C<ABORT>, C<PRUNE_UP> and C<PRUNE_SOFTLY> signals. None of these
|
||||
can be implemented I<totally> straightforwardly with recursive
|
||||
routines, but it is quite possible. C<ABORT>-like behavior can be
|
||||
implemented either with using non-local returning with C<eval>/C<die>:
|
||||
|
||||
my $died_on; # if you need to know where...
|
||||
sub thing {
|
||||
... visits $_[0]...
|
||||
... maybe set $died_on to $_[0] and die "ABORT_TRAV" ...
|
||||
... else call thing($child) for each child...
|
||||
...any post-order visiting $_[0]...
|
||||
}
|
||||
eval { thing($node) };
|
||||
if($@) {
|
||||
if($@ =~ m<^ABORT_TRAV>) {
|
||||
...it died (aborted) on $died_on...
|
||||
} else {
|
||||
die $@; # some REAL error happened
|
||||
}
|
||||
}
|
||||
|
||||
or you can just do it with flags:
|
||||
|
||||
my($abort_flag, $died_on);
|
||||
sub thing {
|
||||
... visits $_[0]...
|
||||
... maybe set $abort_flag = 1; $died_on = $_[0]; return;
|
||||
foreach my $c ($_[0]->content_list) {
|
||||
thing($c);
|
||||
return if $abort_flag;
|
||||
}
|
||||
...any post-order visiting $_[0]...
|
||||
return;
|
||||
}
|
||||
|
||||
$abort_flag = $died_on = undef;
|
||||
thing($node);
|
||||
...if defined $abort_flag, it died on $died_on
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
L<HTML::Element>
|
||||
|
||||
=head1 AUTHOR
|
||||
|
||||
Current maintainers:
|
||||
|
||||
=over
|
||||
|
||||
=item * Christopher J. Madsen S<C<< <perl AT cjmweb.net> >>>
|
||||
|
||||
=item * Jeff Fearn S<C<< <jfearn AT cpan.org> >>>
|
||||
|
||||
=back
|
||||
|
||||
Original HTML-Tree author:
|
||||
|
||||
=over
|
||||
|
||||
=item * Gisle Aas
|
||||
|
||||
=back
|
||||
|
||||
Former maintainers:
|
||||
|
||||
=over
|
||||
|
||||
=item * Sean M. Burke
|
||||
|
||||
=item * Andy Lester
|
||||
|
||||
=item * Pete Krawczyk S<C<< <petek AT cpan.org> >>>
|
||||
|
||||
=back
|
||||
|
||||
You can follow or contribute to HTML-Tree's development at
|
||||
L<< https://github.com/kentfredric/HTML-Tree >>.
|
||||
|
||||
=head1 COPYRIGHT
|
||||
|
||||
Copyright 2000,2001 Sean M. Burke
|
||||
|
||||
=cut
|
||||
482
database/perl/vendor/lib/HTML/Entities.pm
vendored
Normal file
482
database/perl/vendor/lib/HTML/Entities.pm
vendored
Normal file
@@ -0,0 +1,482 @@
|
||||
package HTML::Entities;
|
||||
|
||||
=encoding utf8
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::Entities - Encode or decode strings with HTML entities
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
use HTML::Entities;
|
||||
|
||||
$a = "Våre norske tegn bør æres";
|
||||
decode_entities($a);
|
||||
encode_entities($a, "\200-\377");
|
||||
|
||||
For example, this:
|
||||
|
||||
$input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
|
||||
print encode_entities($input), "\n"
|
||||
|
||||
Prints this out:
|
||||
|
||||
vis-à-vis Beyoncé's naïve
|
||||
papier-mâché résumé
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
This module deals with encoding and decoding of strings with HTML
|
||||
character entities. The module provides the following functions:
|
||||
|
||||
=over 4
|
||||
|
||||
=item decode_entities( $string, ... )
|
||||
|
||||
This routine replaces HTML entities found in the $string with the
|
||||
corresponding Unicode character. Unrecognized entities are left alone.
|
||||
|
||||
If multiple strings are provided as argument they are each decoded
|
||||
separately and the same number of strings are returned.
|
||||
|
||||
If called in void context the arguments are decoded in-place.
|
||||
|
||||
This routine is exported by default.
|
||||
|
||||
=item _decode_entities( $string, \%entity2char )
|
||||
|
||||
=item _decode_entities( $string, \%entity2char, $expand_prefix )
|
||||
|
||||
This will in-place replace HTML entities in $string. The %entity2char
|
||||
hash must be provided. Named entities not found in the %entity2char
|
||||
hash are left alone. Numeric entities are expanded unless their value
|
||||
overflow.
|
||||
|
||||
The keys in %entity2char are the entity names to be expanded and their
|
||||
values are what they should expand into. The values do not have to be
|
||||
single character strings. If a key has ";" as suffix,
|
||||
then occurrences in $string are only expanded if properly terminated
|
||||
with ";". Entities without ";" will be expanded regardless of how
|
||||
they are terminated for compatibility with how common browsers treat
|
||||
entities in the Latin-1 range.
|
||||
|
||||
If $expand_prefix is TRUE then entities without trailing ";" in
|
||||
%entity2char will even be expanded as a prefix of a longer
|
||||
unrecognized name. The longest matching name in %entity2char will be
|
||||
used. This is mainly present for compatibility with an MSIE
|
||||
misfeature.
|
||||
|
||||
$string = "foo bar";
|
||||
_decode_entities($string, { nb => "@", nbsp => "\xA0" }, 1);
|
||||
print $string; # will print "foo bar"
|
||||
|
||||
This routine is exported by default.
|
||||
|
||||
=item encode_entities( $string )
|
||||
|
||||
=item encode_entities( $string, $unsafe_chars )
|
||||
|
||||
This routine replaces unsafe characters in $string with their entity
|
||||
representation. A second argument can be given to specify which characters to
|
||||
consider unsafe. The unsafe characters is specified using the regular
|
||||
expression character class syntax (what you find within brackets in regular
|
||||
expressions).
|
||||
|
||||
The default set of characters to encode are control chars, high-bit chars, and
|
||||
the C<< < >>, C<< & >>, C<< > >>, C<< ' >> and C<< " >> characters. But this,
|
||||
for example, would encode I<just> the C<< < >>, C<< & >>, C<< > >>, and C<< "
|
||||
>> characters:
|
||||
|
||||
$encoded = encode_entities($input, '<>&"');
|
||||
|
||||
and this would only encode non-plain ASCII:
|
||||
|
||||
$encoded = encode_entities($input, '^\n\x20-\x25\x27-\x7e');
|
||||
|
||||
This routine is exported by default.
|
||||
|
||||
=item encode_entities_numeric( $string )
|
||||
|
||||
=item encode_entities_numeric( $string, $unsafe_chars )
|
||||
|
||||
This routine works just like encode_entities, except that the replacement
|
||||
entities are always C<&#xI<hexnum>;> and never C<&I<entname>;>. For
|
||||
example, C<encode_entities("r\xF4le")> returns "rôle", but
|
||||
C<encode_entities_numeric("r\xF4le")> returns "rôle".
|
||||
|
||||
This routine is I<not> exported by default. But you can always
|
||||
export it with C<use HTML::Entities qw(encode_entities_numeric);>
|
||||
or even C<use HTML::Entities qw(:DEFAULT encode_entities_numeric);>
|
||||
|
||||
=back
|
||||
|
||||
All these routines modify the string passed as the first argument, if
|
||||
called in a void context. In scalar and array contexts, the encoded or
|
||||
decoded string is returned (without changing the input string).
|
||||
|
||||
If you prefer not to import these routines into your namespace, you can
|
||||
call them as:
|
||||
|
||||
use HTML::Entities ();
|
||||
$decoded = HTML::Entities::decode($a);
|
||||
$encoded = HTML::Entities::encode($a);
|
||||
$encoded = HTML::Entities::encode_numeric($a);
|
||||
|
||||
The module can also export the %char2entity and the %entity2char
|
||||
hashes, which contain the mapping from all characters to the
|
||||
corresponding entities (and vice versa, respectively).
|
||||
|
||||
=head1 COPYRIGHT
|
||||
|
||||
Copyright 1995-2006 Gisle Aas. All rights reserved.
|
||||
|
||||
This library is free software; you can redistribute it and/or
|
||||
modify it under the same terms as Perl itself.
|
||||
|
||||
=cut
|
||||
|
||||
use strict;
|
||||
our $VERSION = '3.75';
|
||||
use vars qw(%entity2char %char2entity);
|
||||
|
||||
require 5.004;
|
||||
require Exporter;
|
||||
our @ISA = qw(Exporter);
|
||||
|
||||
our @EXPORT = qw(encode_entities decode_entities _decode_entities);
|
||||
our @EXPORT_OK = qw(%entity2char %char2entity encode_entities_numeric);
|
||||
|
||||
sub Version { $VERSION; }
|
||||
|
||||
require HTML::Parser; # for fast XS implemented decode_entities
|
||||
|
||||
|
||||
%entity2char = (
|
||||
# Some normal chars that have special meaning in SGML context
|
||||
amp => '&', # ampersand
|
||||
'gt' => '>', # greater than
|
||||
'lt' => '<', # less than
|
||||
quot => '"', # double quote
|
||||
apos => "'", # single quote
|
||||
|
||||
# PUBLIC ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML
|
||||
AElig => chr(198), # capital AE diphthong (ligature)
|
||||
Aacute => chr(193), # capital A, acute accent
|
||||
Acirc => chr(194), # capital A, circumflex accent
|
||||
Agrave => chr(192), # capital A, grave accent
|
||||
Aring => chr(197), # capital A, ring
|
||||
Atilde => chr(195), # capital A, tilde
|
||||
Auml => chr(196), # capital A, dieresis or umlaut mark
|
||||
Ccedil => chr(199), # capital C, cedilla
|
||||
ETH => chr(208), # capital Eth, Icelandic
|
||||
Eacute => chr(201), # capital E, acute accent
|
||||
Ecirc => chr(202), # capital E, circumflex accent
|
||||
Egrave => chr(200), # capital E, grave accent
|
||||
Euml => chr(203), # capital E, dieresis or umlaut mark
|
||||
Iacute => chr(205), # capital I, acute accent
|
||||
Icirc => chr(206), # capital I, circumflex accent
|
||||
Igrave => chr(204), # capital I, grave accent
|
||||
Iuml => chr(207), # capital I, dieresis or umlaut mark
|
||||
Ntilde => chr(209), # capital N, tilde
|
||||
Oacute => chr(211), # capital O, acute accent
|
||||
Ocirc => chr(212), # capital O, circumflex accent
|
||||
Ograve => chr(210), # capital O, grave accent
|
||||
Oslash => chr(216), # capital O, slash
|
||||
Otilde => chr(213), # capital O, tilde
|
||||
Ouml => chr(214), # capital O, dieresis or umlaut mark
|
||||
THORN => chr(222), # capital THORN, Icelandic
|
||||
Uacute => chr(218), # capital U, acute accent
|
||||
Ucirc => chr(219), # capital U, circumflex accent
|
||||
Ugrave => chr(217), # capital U, grave accent
|
||||
Uuml => chr(220), # capital U, dieresis or umlaut mark
|
||||
Yacute => chr(221), # capital Y, acute accent
|
||||
aacute => chr(225), # small a, acute accent
|
||||
acirc => chr(226), # small a, circumflex accent
|
||||
aelig => chr(230), # small ae diphthong (ligature)
|
||||
agrave => chr(224), # small a, grave accent
|
||||
aring => chr(229), # small a, ring
|
||||
atilde => chr(227), # small a, tilde
|
||||
auml => chr(228), # small a, dieresis or umlaut mark
|
||||
ccedil => chr(231), # small c, cedilla
|
||||
eacute => chr(233), # small e, acute accent
|
||||
ecirc => chr(234), # small e, circumflex accent
|
||||
egrave => chr(232), # small e, grave accent
|
||||
eth => chr(240), # small eth, Icelandic
|
||||
euml => chr(235), # small e, dieresis or umlaut mark
|
||||
iacute => chr(237), # small i, acute accent
|
||||
icirc => chr(238), # small i, circumflex accent
|
||||
igrave => chr(236), # small i, grave accent
|
||||
iuml => chr(239), # small i, dieresis or umlaut mark
|
||||
ntilde => chr(241), # small n, tilde
|
||||
oacute => chr(243), # small o, acute accent
|
||||
ocirc => chr(244), # small o, circumflex accent
|
||||
ograve => chr(242), # small o, grave accent
|
||||
oslash => chr(248), # small o, slash
|
||||
otilde => chr(245), # small o, tilde
|
||||
ouml => chr(246), # small o, dieresis or umlaut mark
|
||||
szlig => chr(223), # small sharp s, German (sz ligature)
|
||||
thorn => chr(254), # small thorn, Icelandic
|
||||
uacute => chr(250), # small u, acute accent
|
||||
ucirc => chr(251), # small u, circumflex accent
|
||||
ugrave => chr(249), # small u, grave accent
|
||||
uuml => chr(252), # small u, dieresis or umlaut mark
|
||||
yacute => chr(253), # small y, acute accent
|
||||
yuml => chr(255), # small y, dieresis or umlaut mark
|
||||
|
||||
# Some extra Latin 1 chars that are listed in the HTML3.2 draft (21-May-96)
|
||||
copy => chr(169), # copyright sign
|
||||
reg => chr(174), # registered sign
|
||||
nbsp => chr(160), # non breaking space
|
||||
|
||||
# Additional ISO-8859/1 entities listed in rfc1866 (section 14)
|
||||
iexcl => chr(161),
|
||||
cent => chr(162),
|
||||
pound => chr(163),
|
||||
curren => chr(164),
|
||||
yen => chr(165),
|
||||
brvbar => chr(166),
|
||||
sect => chr(167),
|
||||
uml => chr(168),
|
||||
ordf => chr(170),
|
||||
laquo => chr(171),
|
||||
'not' => chr(172), # not is a keyword in perl
|
||||
shy => chr(173),
|
||||
macr => chr(175),
|
||||
deg => chr(176),
|
||||
plusmn => chr(177),
|
||||
sup1 => chr(185),
|
||||
sup2 => chr(178),
|
||||
sup3 => chr(179),
|
||||
acute => chr(180),
|
||||
micro => chr(181),
|
||||
para => chr(182),
|
||||
middot => chr(183),
|
||||
cedil => chr(184),
|
||||
ordm => chr(186),
|
||||
raquo => chr(187),
|
||||
frac14 => chr(188),
|
||||
frac12 => chr(189),
|
||||
frac34 => chr(190),
|
||||
iquest => chr(191),
|
||||
'times' => chr(215), # times is a keyword in perl
|
||||
divide => chr(247),
|
||||
|
||||
( $] > 5.007 ? (
|
||||
'OElig;' => chr(338),
|
||||
'oelig;' => chr(339),
|
||||
'Scaron;' => chr(352),
|
||||
'scaron;' => chr(353),
|
||||
'Yuml;' => chr(376),
|
||||
'fnof;' => chr(402),
|
||||
'circ;' => chr(710),
|
||||
'tilde;' => chr(732),
|
||||
'Alpha;' => chr(913),
|
||||
'Beta;' => chr(914),
|
||||
'Gamma;' => chr(915),
|
||||
'Delta;' => chr(916),
|
||||
'Epsilon;' => chr(917),
|
||||
'Zeta;' => chr(918),
|
||||
'Eta;' => chr(919),
|
||||
'Theta;' => chr(920),
|
||||
'Iota;' => chr(921),
|
||||
'Kappa;' => chr(922),
|
||||
'Lambda;' => chr(923),
|
||||
'Mu;' => chr(924),
|
||||
'Nu;' => chr(925),
|
||||
'Xi;' => chr(926),
|
||||
'Omicron;' => chr(927),
|
||||
'Pi;' => chr(928),
|
||||
'Rho;' => chr(929),
|
||||
'Sigma;' => chr(931),
|
||||
'Tau;' => chr(932),
|
||||
'Upsilon;' => chr(933),
|
||||
'Phi;' => chr(934),
|
||||
'Chi;' => chr(935),
|
||||
'Psi;' => chr(936),
|
||||
'Omega;' => chr(937),
|
||||
'alpha;' => chr(945),
|
||||
'beta;' => chr(946),
|
||||
'gamma;' => chr(947),
|
||||
'delta;' => chr(948),
|
||||
'epsilon;' => chr(949),
|
||||
'zeta;' => chr(950),
|
||||
'eta;' => chr(951),
|
||||
'theta;' => chr(952),
|
||||
'iota;' => chr(953),
|
||||
'kappa;' => chr(954),
|
||||
'lambda;' => chr(955),
|
||||
'mu;' => chr(956),
|
||||
'nu;' => chr(957),
|
||||
'xi;' => chr(958),
|
||||
'omicron;' => chr(959),
|
||||
'pi;' => chr(960),
|
||||
'rho;' => chr(961),
|
||||
'sigmaf;' => chr(962),
|
||||
'sigma;' => chr(963),
|
||||
'tau;' => chr(964),
|
||||
'upsilon;' => chr(965),
|
||||
'phi;' => chr(966),
|
||||
'chi;' => chr(967),
|
||||
'psi;' => chr(968),
|
||||
'omega;' => chr(969),
|
||||
'thetasym;' => chr(977),
|
||||
'upsih;' => chr(978),
|
||||
'piv;' => chr(982),
|
||||
'ensp;' => chr(8194),
|
||||
'emsp;' => chr(8195),
|
||||
'thinsp;' => chr(8201),
|
||||
'zwnj;' => chr(8204),
|
||||
'zwj;' => chr(8205),
|
||||
'lrm;' => chr(8206),
|
||||
'rlm;' => chr(8207),
|
||||
'ndash;' => chr(8211),
|
||||
'mdash;' => chr(8212),
|
||||
'lsquo;' => chr(8216),
|
||||
'rsquo;' => chr(8217),
|
||||
'sbquo;' => chr(8218),
|
||||
'ldquo;' => chr(8220),
|
||||
'rdquo;' => chr(8221),
|
||||
'bdquo;' => chr(8222),
|
||||
'dagger;' => chr(8224),
|
||||
'Dagger;' => chr(8225),
|
||||
'bull;' => chr(8226),
|
||||
'hellip;' => chr(8230),
|
||||
'permil;' => chr(8240),
|
||||
'prime;' => chr(8242),
|
||||
'Prime;' => chr(8243),
|
||||
'lsaquo;' => chr(8249),
|
||||
'rsaquo;' => chr(8250),
|
||||
'oline;' => chr(8254),
|
||||
'frasl;' => chr(8260),
|
||||
'euro;' => chr(8364),
|
||||
'image;' => chr(8465),
|
||||
'weierp;' => chr(8472),
|
||||
'real;' => chr(8476),
|
||||
'trade;' => chr(8482),
|
||||
'alefsym;' => chr(8501),
|
||||
'larr;' => chr(8592),
|
||||
'uarr;' => chr(8593),
|
||||
'rarr;' => chr(8594),
|
||||
'darr;' => chr(8595),
|
||||
'harr;' => chr(8596),
|
||||
'crarr;' => chr(8629),
|
||||
'lArr;' => chr(8656),
|
||||
'uArr;' => chr(8657),
|
||||
'rArr;' => chr(8658),
|
||||
'dArr;' => chr(8659),
|
||||
'hArr;' => chr(8660),
|
||||
'forall;' => chr(8704),
|
||||
'part;' => chr(8706),
|
||||
'exist;' => chr(8707),
|
||||
'empty;' => chr(8709),
|
||||
'nabla;' => chr(8711),
|
||||
'isin;' => chr(8712),
|
||||
'notin;' => chr(8713),
|
||||
'ni;' => chr(8715),
|
||||
'prod;' => chr(8719),
|
||||
'sum;' => chr(8721),
|
||||
'minus;' => chr(8722),
|
||||
'lowast;' => chr(8727),
|
||||
'radic;' => chr(8730),
|
||||
'prop;' => chr(8733),
|
||||
'infin;' => chr(8734),
|
||||
'ang;' => chr(8736),
|
||||
'and;' => chr(8743),
|
||||
'or;' => chr(8744),
|
||||
'cap;' => chr(8745),
|
||||
'cup;' => chr(8746),
|
||||
'int;' => chr(8747),
|
||||
'there4;' => chr(8756),
|
||||
'sim;' => chr(8764),
|
||||
'cong;' => chr(8773),
|
||||
'asymp;' => chr(8776),
|
||||
'ne;' => chr(8800),
|
||||
'equiv;' => chr(8801),
|
||||
'le;' => chr(8804),
|
||||
'ge;' => chr(8805),
|
||||
'sub;' => chr(8834),
|
||||
'sup;' => chr(8835),
|
||||
'nsub;' => chr(8836),
|
||||
'sube;' => chr(8838),
|
||||
'supe;' => chr(8839),
|
||||
'oplus;' => chr(8853),
|
||||
'otimes;' => chr(8855),
|
||||
'perp;' => chr(8869),
|
||||
'sdot;' => chr(8901),
|
||||
'lceil;' => chr(8968),
|
||||
'rceil;' => chr(8969),
|
||||
'lfloor;' => chr(8970),
|
||||
'rfloor;' => chr(8971),
|
||||
'lang;' => chr(9001),
|
||||
'rang;' => chr(9002),
|
||||
'loz;' => chr(9674),
|
||||
'spades;' => chr(9824),
|
||||
'clubs;' => chr(9827),
|
||||
'hearts;' => chr(9829),
|
||||
'diams;' => chr(9830),
|
||||
) : ())
|
||||
);
|
||||
|
||||
|
||||
# Make the opposite mapping
|
||||
while (my($entity, $char) = each(%entity2char)) {
|
||||
$entity =~ s/;\z//;
|
||||
$char2entity{$char} = "&$entity;";
|
||||
}
|
||||
delete $char2entity{"'"}; # only one-way decoding
|
||||
|
||||
# Fill in missing entities
|
||||
for (0 .. 255) {
|
||||
next if exists $char2entity{chr($_)};
|
||||
$char2entity{chr($_)} = "&#$_;";
|
||||
}
|
||||
|
||||
my %subst; # compiled encoding regexps
|
||||
|
||||
sub encode_entities
|
||||
{
|
||||
return undef unless defined $_[0];
|
||||
my $ref;
|
||||
if (defined wantarray) {
|
||||
my $x = $_[0];
|
||||
$ref = \$x; # copy
|
||||
} else {
|
||||
$ref = \$_[0]; # modify in-place
|
||||
}
|
||||
if (defined $_[1] and length $_[1]) {
|
||||
unless (exists $subst{$_[1]}) {
|
||||
# Because we can't compile regex we fake it with a cached sub
|
||||
my $chars = $_[1];
|
||||
$chars =~ s,(?<!\\)([]/]),\\$1,g;
|
||||
$chars =~ s,(?<!\\)\\\z,\\\\,;
|
||||
my $code = "sub {\$_[0] =~ s/([$chars])/\$char2entity{\$1} || num_entity(\$1)/ge; }";
|
||||
$subst{$_[1]} = eval $code;
|
||||
die( $@ . " while trying to turn range: \"$_[1]\"\n "
|
||||
. "into code: $code\n "
|
||||
) if $@;
|
||||
}
|
||||
&{$subst{$_[1]}}($$ref);
|
||||
} else {
|
||||
# Encode control chars, high bit chars and '<', '&', '>', ''' and '"'
|
||||
$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;
|
||||
}
|
||||
$$ref;
|
||||
}
|
||||
|
||||
sub encode_entities_numeric {
|
||||
local %char2entity;
|
||||
return &encode_entities; # a goto &encode_entities wouldn't work
|
||||
}
|
||||
|
||||
|
||||
sub num_entity {
|
||||
sprintf "&#x%X;", ord($_[0]);
|
||||
}
|
||||
|
||||
# Set up aliases
|
||||
*encode = \&encode_entities;
|
||||
*encode_numeric = \&encode_entities_numeric;
|
||||
*encode_numerically = \&encode_entities_numeric;
|
||||
*decode = \&decode_entities;
|
||||
|
||||
1;
|
||||
110
database/perl/vendor/lib/HTML/Filter.pm
vendored
Normal file
110
database/perl/vendor/lib/HTML/Filter.pm
vendored
Normal file
@@ -0,0 +1,110 @@
|
||||
package HTML::Filter;
|
||||
|
||||
use strict;
|
||||
|
||||
require HTML::Parser;
|
||||
our @ISA = qw(HTML::Parser);
|
||||
our $VERSION = '3.75';
|
||||
|
||||
sub declaration { $_[0]->output("<!$_[1]>") }
|
||||
sub process { $_[0]->output($_[2]) }
|
||||
sub comment { $_[0]->output("<!--$_[1]-->") }
|
||||
sub start { $_[0]->output($_[4]) }
|
||||
sub end { $_[0]->output($_[2]) }
|
||||
sub text { $_[0]->output($_[1]) }
|
||||
|
||||
sub output { print $_[1] }
|
||||
|
||||
1;
|
||||
|
||||
__END__
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::Filter - Filter HTML text through the parser
|
||||
|
||||
=head1 NOTE
|
||||
|
||||
B<This module is deprecated.> The C<HTML::Parser> now provides the
|
||||
functionally of C<HTML::Filter> much more efficiently with the
|
||||
C<default> handler.
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
require HTML::Filter;
|
||||
$p = HTML::Filter->new->parse_file("index.html");
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
C<HTML::Filter> is an HTML parser that by default prints the
|
||||
original text of each HTML element (a slow version of cat(1) basically).
|
||||
The callback methods may be overridden to modify the filtering for some
|
||||
HTML elements and you can override output() method which is called to
|
||||
print the HTML text.
|
||||
|
||||
C<HTML::Filter> is a subclass of C<HTML::Parser>. This means that
|
||||
the document should be given to the parser by calling the $p->parse()
|
||||
or $p->parse_file() methods.
|
||||
|
||||
=head1 EXAMPLES
|
||||
|
||||
The first example is a filter that will remove all comments from an
|
||||
HTML file. This is achieved by simply overriding the comment method
|
||||
to do nothing.
|
||||
|
||||
package CommentStripper;
|
||||
require HTML::Filter;
|
||||
@ISA=qw(HTML::Filter);
|
||||
sub comment { } # ignore comments
|
||||
|
||||
The second example shows a filter that will remove any E<lt>TABLE>s
|
||||
found in the HTML file. We specialize the start() and end() methods
|
||||
to count table tags and then make output not happen when inside a
|
||||
table.
|
||||
|
||||
package TableStripper;
|
||||
require HTML::Filter;
|
||||
@ISA=qw(HTML::Filter);
|
||||
sub start
|
||||
{
|
||||
my $self = shift;
|
||||
$self->{table_seen}++ if $_[0] eq "table";
|
||||
$self->SUPER::start(@_);
|
||||
}
|
||||
|
||||
sub end
|
||||
{
|
||||
my $self = shift;
|
||||
$self->SUPER::end(@_);
|
||||
$self->{table_seen}-- if $_[0] eq "table";
|
||||
}
|
||||
|
||||
sub output
|
||||
{
|
||||
my $self = shift;
|
||||
unless ($self->{table_seen}) {
|
||||
$self->SUPER::output(@_);
|
||||
}
|
||||
}
|
||||
|
||||
If you want to collect the parsed text internally you might want to do
|
||||
something like this:
|
||||
|
||||
package FilterIntoString;
|
||||
require HTML::Filter;
|
||||
@ISA=qw(HTML::Filter);
|
||||
sub output { push(@{$_[0]->{fhtml}}, $_[1]) }
|
||||
sub filtered_html { join("", @{$_[0]->{fhtml}}) }
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
L<HTML::Parser>
|
||||
|
||||
=head1 COPYRIGHT
|
||||
|
||||
Copyright 1997-1999 Gisle Aas.
|
||||
|
||||
This library is free software; you can redistribute it and/or
|
||||
modify it under the same terms as Perl itself.
|
||||
|
||||
=cut
|
||||
1580
database/perl/vendor/lib/HTML/Form.pm
vendored
Normal file
1580
database/perl/vendor/lib/HTML/Form.pm
vendored
Normal file
File diff suppressed because it is too large
Load Diff
314
database/perl/vendor/lib/HTML/HeadParser.pm
vendored
Normal file
314
database/perl/vendor/lib/HTML/HeadParser.pm
vendored
Normal file
@@ -0,0 +1,314 @@
|
||||
package HTML::HeadParser;
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::HeadParser - Parse <HEAD> section of a HTML document
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
require HTML::HeadParser;
|
||||
$p = HTML::HeadParser->new;
|
||||
$p->parse($text) and print "not finished";
|
||||
|
||||
$p->header('Title') # to access <title>....</title>
|
||||
$p->header('Content-Base') # to access <base href="http://...">
|
||||
$p->header('Foo') # to access <meta http-equiv="Foo" content="...">
|
||||
$p->header('X-Meta-Author') # to access <meta name="author" content="...">
|
||||
$p->header('X-Meta-Charset') # to access <meta charset="...">
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
The C<HTML::HeadParser> is a specialized (and lightweight)
|
||||
C<HTML::Parser> that will only parse the E<lt>HEAD>...E<lt>/HEAD>
|
||||
section of an HTML document. The parse() method
|
||||
will return a FALSE value as soon as some E<lt>BODY> element or body
|
||||
text are found, and should not be called again after this.
|
||||
|
||||
Note that the C<HTML::HeadParser> might get confused if raw undecoded
|
||||
UTF-8 is passed to the parse() method. Make sure the strings are
|
||||
properly decoded before passing them on.
|
||||
|
||||
The C<HTML::HeadParser> keeps a reference to a header object, and the
|
||||
parser will update this header object as the various elements of the
|
||||
E<lt>HEAD> section of the HTML document are recognized. The following
|
||||
header fields are affected:
|
||||
|
||||
=over 4
|
||||
|
||||
=item Content-Base:
|
||||
|
||||
The I<Content-Base> header is initialized from the E<lt>base
|
||||
href="..."> element.
|
||||
|
||||
=item Title:
|
||||
|
||||
The I<Title> header is initialized from the E<lt>title>...E<lt>/title>
|
||||
element.
|
||||
|
||||
=item Isindex:
|
||||
|
||||
The I<Isindex> header will be added if there is a E<lt>isindex>
|
||||
element in the E<lt>head>. The header value is initialized from the
|
||||
I<prompt> attribute if it is present. If no I<prompt> attribute is
|
||||
given it will have '?' as the value.
|
||||
|
||||
=item X-Meta-Foo:
|
||||
|
||||
All E<lt>meta> elements containing a C<name> attribute will result in
|
||||
headers using the prefix C<X-Meta-> appended with the value of the
|
||||
C<name> attribute as the name of the header, and the value of the
|
||||
C<content> attribute as the pushed header value.
|
||||
|
||||
E<lt>meta> elements containing a C<http-equiv> attribute will result
|
||||
in headers as in above, but without the C<X-Meta-> prefix in the
|
||||
header name.
|
||||
|
||||
E<lt>meta> elements containing a C<charset> attribute will result in
|
||||
an C<X-Meta-Charset> header, using the value of the C<charset>
|
||||
attribute as the pushed header value.
|
||||
|
||||
The ':' character can't be represented in header field names, so
|
||||
if the meta element contains this char it's substituted with '-'
|
||||
before forming the field name.
|
||||
|
||||
=back
|
||||
|
||||
=head1 METHODS
|
||||
|
||||
The following methods (in addition to those provided by the
|
||||
superclass) are available:
|
||||
|
||||
=over 4
|
||||
|
||||
=cut
|
||||
|
||||
|
||||
require HTML::Parser;
|
||||
our @ISA = qw(HTML::Parser);
|
||||
|
||||
use HTML::Entities ();
|
||||
|
||||
use strict;
|
||||
use vars qw($DEBUG);
|
||||
#$DEBUG = 1;
|
||||
our $VERSION = '3.75';
|
||||
|
||||
=item $hp = HTML::HeadParser->new
|
||||
|
||||
=item $hp = HTML::HeadParser->new( $header )
|
||||
|
||||
The object constructor. The optional $header argument should be a
|
||||
reference to an object that implement the header() and push_header()
|
||||
methods as defined by the C<HTTP::Headers> class. Normally it will be
|
||||
of some class that is a or delegates to the C<HTTP::Headers> class.
|
||||
|
||||
If no $header is given C<HTML::HeadParser> will create an
|
||||
C<HTTP::Headers> object by itself (initially empty).
|
||||
|
||||
=cut
|
||||
|
||||
sub new
|
||||
{
|
||||
my($class, $header) = @_;
|
||||
unless ($header) {
|
||||
require HTTP::Headers;
|
||||
$header = HTTP::Headers->new;
|
||||
}
|
||||
|
||||
my $self = $class->SUPER::new(api_version => 3,
|
||||
start_h => ["start", "self,tagname,attr"],
|
||||
end_h => ["end", "self,tagname"],
|
||||
text_h => ["text", "self,text"],
|
||||
ignore_elements => [qw(script style)],
|
||||
);
|
||||
$self->{'header'} = $header;
|
||||
$self->{'tag'} = ''; # name of active element that takes textual content
|
||||
$self->{'text'} = ''; # the accumulated text associated with the element
|
||||
$self;
|
||||
}
|
||||
|
||||
=item $hp->header;
|
||||
|
||||
Returns a reference to the header object.
|
||||
|
||||
=item $hp->header( $key )
|
||||
|
||||
Returns a header value. It is just a shorter way to write
|
||||
C<$hp-E<gt>header-E<gt>header($key)>.
|
||||
|
||||
=cut
|
||||
|
||||
sub header
|
||||
{
|
||||
my $self = shift;
|
||||
return $self->{'header'} unless @_;
|
||||
$self->{'header'}->header(@_);
|
||||
}
|
||||
|
||||
sub as_string # legacy
|
||||
{
|
||||
my $self = shift;
|
||||
$self->{'header'}->as_string;
|
||||
}
|
||||
|
||||
sub flush_text # internal
|
||||
{
|
||||
my $self = shift;
|
||||
my $tag = $self->{'tag'};
|
||||
my $text = $self->{'text'};
|
||||
$text =~ s/^\s+//;
|
||||
$text =~ s/\s+$//;
|
||||
$text =~ s/\s+/ /g;
|
||||
print "FLUSH $tag => '$text'\n" if $DEBUG;
|
||||
if ($tag eq 'title') {
|
||||
my $decoded;
|
||||
$decoded = utf8::decode($text) if $self->utf8_mode && defined &utf8::decode;
|
||||
HTML::Entities::decode($text);
|
||||
utf8::encode($text) if $decoded;
|
||||
$self->{'header'}->push_header(Title => $text);
|
||||
}
|
||||
$self->{'tag'} = $self->{'text'} = '';
|
||||
}
|
||||
|
||||
# This is an quote from the HTML3.2 DTD which shows which elements
|
||||
# that might be present in a <HEAD>...</HEAD>. Also note that the
|
||||
# <HEAD> tags themselves might be missing:
|
||||
#
|
||||
# <!ENTITY % head.content "TITLE & ISINDEX? & BASE? & STYLE? &
|
||||
# SCRIPT* & META* & LINK*">
|
||||
#
|
||||
# <!ELEMENT HEAD O O (%head.content)>
|
||||
#
|
||||
# From HTML 4.01:
|
||||
#
|
||||
# <!ENTITY % head.misc "SCRIPT|STYLE|META|LINK|OBJECT">
|
||||
# <!ENTITY % head.content "TITLE & BASE?">
|
||||
# <!ELEMENT HEAD O O (%head.content;) +(%head.misc;)>
|
||||
#
|
||||
# From HTML 5 as of WD-html5-20090825:
|
||||
#
|
||||
# One or more elements of metadata content, [...]
|
||||
# => base, command, link, meta, noscript, script, style, title
|
||||
|
||||
sub start
|
||||
{
|
||||
my($self, $tag, $attr) = @_; # $attr is reference to a HASH
|
||||
print "START[$tag]\n" if $DEBUG;
|
||||
$self->flush_text if $self->{'tag'};
|
||||
if ($tag eq 'meta') {
|
||||
my $key = $attr->{'http-equiv'};
|
||||
if (!defined($key) || !length($key)) {
|
||||
if ($attr->{name}) {
|
||||
$key = "X-Meta-\u$attr->{name}";
|
||||
} elsif ($attr->{charset}) { # HTML 5 <meta charset="...">
|
||||
$key = "X-Meta-Charset";
|
||||
$self->{header}->push_header($key => $attr->{charset});
|
||||
return;
|
||||
} else {
|
||||
return;
|
||||
}
|
||||
}
|
||||
$key =~ s/:/-/g;
|
||||
$self->{'header'}->push_header($key => $attr->{content});
|
||||
} elsif ($tag eq 'base') {
|
||||
return unless exists $attr->{href};
|
||||
(my $base = $attr->{href}) =~ s/^\s+//; $base =~ s/\s+$//; # HTML5
|
||||
$self->{'header'}->push_header('Content-Base' => $base);
|
||||
} elsif ($tag eq 'isindex') {
|
||||
# This is a non-standard header. Perhaps we should just ignore
|
||||
# this element
|
||||
$self->{'header'}->push_header(Isindex => $attr->{prompt} || '?');
|
||||
} elsif ($tag =~ /^(?:title|noscript|object|command)$/) {
|
||||
# Just remember tag. Initialize header when we see the end tag.
|
||||
$self->{'tag'} = $tag;
|
||||
} elsif ($tag eq 'link') {
|
||||
return unless exists $attr->{href};
|
||||
# <link href="http:..." rel="xxx" rev="xxx" title="xxx">
|
||||
my $href = delete($attr->{href});
|
||||
$href =~ s/^\s+//; $href =~ s/\s+$//; # HTML5
|
||||
my $h_val = "<$href>";
|
||||
for (sort keys %{$attr}) {
|
||||
next if $_ eq "/"; # XHTML junk
|
||||
$h_val .= qq(; $_="$attr->{$_}");
|
||||
}
|
||||
$self->{'header'}->push_header(Link => $h_val);
|
||||
} elsif ($tag eq 'head' || $tag eq 'html') {
|
||||
# ignore
|
||||
} else {
|
||||
# stop parsing
|
||||
$self->eof;
|
||||
}
|
||||
}
|
||||
|
||||
sub end
|
||||
{
|
||||
my($self, $tag) = @_;
|
||||
print "END[$tag]\n" if $DEBUG;
|
||||
$self->flush_text if $self->{'tag'};
|
||||
$self->eof if $tag eq 'head';
|
||||
}
|
||||
|
||||
sub text
|
||||
{
|
||||
my($self, $text) = @_;
|
||||
print "TEXT[$text]\n" if $DEBUG;
|
||||
unless ($self->{first_chunk}) {
|
||||
# drop Unicode BOM if found
|
||||
if ($self->utf8_mode) {
|
||||
$text =~ s/^\xEF\xBB\xBF//;
|
||||
}
|
||||
else {
|
||||
$text =~ s/^\x{FEFF}//;
|
||||
}
|
||||
$self->{first_chunk}++;
|
||||
}
|
||||
my $tag = $self->{tag};
|
||||
if (!$tag && $text =~ /\S/) {
|
||||
# Normal text means start of body
|
||||
$self->eof;
|
||||
return;
|
||||
}
|
||||
return if $tag ne 'title';
|
||||
$self->{'text'} .= $text;
|
||||
}
|
||||
|
||||
BEGIN {
|
||||
*utf8_mode = sub { 1 } unless HTML::Entities::UNICODE_SUPPORT;
|
||||
}
|
||||
|
||||
1;
|
||||
|
||||
__END__
|
||||
|
||||
=back
|
||||
|
||||
=head1 EXAMPLE
|
||||
|
||||
$h = HTTP::Headers->new;
|
||||
$p = HTML::HeadParser->new($h);
|
||||
$p->parse(<<EOT);
|
||||
<title>Stupid example</title>
|
||||
<base href="http://www.linpro.no/lwp/">
|
||||
Normal text starts here.
|
||||
EOT
|
||||
undef $p;
|
||||
print $h->title; # should print "Stupid example"
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
L<HTML::Parser>, L<HTTP::Headers>
|
||||
|
||||
The C<HTTP::Headers> class is distributed as part of the
|
||||
I<libwww-perl> package. If you don't have that distribution installed
|
||||
you need to provide the $header argument to the C<HTML::HeadParser>
|
||||
constructor with your own object that implements the documented
|
||||
protocol.
|
||||
|
||||
=head1 COPYRIGHT
|
||||
|
||||
Copyright 1996-2001 Gisle Aas. All rights reserved.
|
||||
|
||||
This library is free software; you can redistribute it and/or
|
||||
modify it under the same terms as Perl itself.
|
||||
|
||||
=cut
|
||||
185
database/perl/vendor/lib/HTML/LinkExtor.pm
vendored
Normal file
185
database/perl/vendor/lib/HTML/LinkExtor.pm
vendored
Normal file
@@ -0,0 +1,185 @@
|
||||
package HTML::LinkExtor;
|
||||
|
||||
require HTML::Parser;
|
||||
our @ISA = qw(HTML::Parser);
|
||||
our $VERSION = '3.75';
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::LinkExtor - Extract links from an HTML document
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
require HTML::LinkExtor;
|
||||
$p = HTML::LinkExtor->new(\&cb, "http://www.perl.org/");
|
||||
sub cb {
|
||||
my($tag, %links) = @_;
|
||||
print "$tag @{[%links]}\n";
|
||||
}
|
||||
$p->parse_file("index.html");
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
I<HTML::LinkExtor> is an HTML parser that extracts links from an
|
||||
HTML document. The I<HTML::LinkExtor> is a subclass of
|
||||
I<HTML::Parser>. This means that the document should be given to the
|
||||
parser by calling the $p->parse() or $p->parse_file() methods.
|
||||
|
||||
=cut
|
||||
|
||||
use strict;
|
||||
use HTML::Tagset ();
|
||||
|
||||
# legacy (some applications grabs this hash directly)
|
||||
use vars qw(%LINK_ELEMENT);
|
||||
*LINK_ELEMENT = \%HTML::Tagset::linkElements;
|
||||
|
||||
=over 4
|
||||
|
||||
=item $p = HTML::LinkExtor->new
|
||||
|
||||
=item $p = HTML::LinkExtor->new( $callback )
|
||||
|
||||
=item $p = HTML::LinkExtor->new( $callback, $base )
|
||||
|
||||
The constructor takes two optional arguments. The first is a reference
|
||||
to a callback routine. It will be called as links are found. If a
|
||||
callback is not provided, then links are just accumulated internally
|
||||
and can be retrieved by calling the $p->links() method.
|
||||
|
||||
The $base argument is an optional base URL used to absolutize all URLs found.
|
||||
You need to have the I<URI> module installed if you provide $base.
|
||||
|
||||
The callback is called with the lowercase tag name as first argument,
|
||||
and then all link attributes as separate key/value pairs. All
|
||||
non-link attributes are removed.
|
||||
|
||||
=cut
|
||||
|
||||
sub new
|
||||
{
|
||||
my($class, $cb, $base) = @_;
|
||||
my $self = $class->SUPER::new(
|
||||
start_h => ["_start_tag", "self,tagname,attr"],
|
||||
report_tags => [keys %HTML::Tagset::linkElements],
|
||||
);
|
||||
$self->{extractlink_cb} = $cb;
|
||||
if ($base) {
|
||||
require URI;
|
||||
$self->{extractlink_base} = URI->new($base);
|
||||
}
|
||||
$self;
|
||||
}
|
||||
|
||||
sub _start_tag
|
||||
{
|
||||
my($self, $tag, $attr) = @_;
|
||||
|
||||
my $base = $self->{extractlink_base};
|
||||
my $links = $HTML::Tagset::linkElements{$tag};
|
||||
$links = [$links] unless ref $links;
|
||||
|
||||
my @links;
|
||||
my $a;
|
||||
for $a (@$links) {
|
||||
next unless exists $attr->{$a};
|
||||
(my $link = $attr->{$a}) =~ s/^\s+//; $link =~ s/\s+$//; # HTML5
|
||||
push(@links, $a, $base ? URI->new($link, $base)->abs($base) : $link);
|
||||
}
|
||||
return unless @links;
|
||||
$self->_found_link($tag, @links);
|
||||
}
|
||||
|
||||
sub _found_link
|
||||
{
|
||||
my $self = shift;
|
||||
my $cb = $self->{extractlink_cb};
|
||||
if ($cb) {
|
||||
&$cb(@_);
|
||||
} else {
|
||||
push(@{$self->{'links'}}, [@_]);
|
||||
}
|
||||
}
|
||||
|
||||
=item $p->links
|
||||
|
||||
Returns a list of all links found in the document. The returned
|
||||
values will be anonymous arrays with the following elements:
|
||||
|
||||
[$tag, $attr => $url1, $attr2 => $url2,...]
|
||||
|
||||
The $p->links method will also truncate the internal link list. This
|
||||
means that if the method is called twice without any parsing
|
||||
between them the second call will return an empty list.
|
||||
|
||||
Also note that $p->links will always be empty if a callback routine
|
||||
was provided when the I<HTML::LinkExtor> was created.
|
||||
|
||||
=cut
|
||||
|
||||
sub links
|
||||
{
|
||||
my $self = shift;
|
||||
exists($self->{'links'}) ? @{delete $self->{'links'}} : ();
|
||||
}
|
||||
|
||||
# We override the parse_file() method so that we can clear the links
|
||||
# before we start a new file.
|
||||
sub parse_file
|
||||
{
|
||||
my $self = shift;
|
||||
delete $self->{'links'};
|
||||
$self->SUPER::parse_file(@_);
|
||||
}
|
||||
|
||||
=back
|
||||
|
||||
=head1 EXAMPLE
|
||||
|
||||
This is an example showing how you can extract links from a document
|
||||
received using LWP:
|
||||
|
||||
use LWP::UserAgent;
|
||||
use HTML::LinkExtor;
|
||||
use URI::URL;
|
||||
|
||||
$url = "http://www.perl.org/"; # for instance
|
||||
$ua = LWP::UserAgent->new;
|
||||
|
||||
# Set up a callback that collect image links
|
||||
my @imgs = ();
|
||||
sub callback {
|
||||
my($tag, %attr) = @_;
|
||||
return if $tag ne 'img'; # we only look closer at <img ...>
|
||||
push(@imgs, values %attr);
|
||||
}
|
||||
|
||||
# Make the parser. Unfortunately, we don't know the base yet
|
||||
# (it might be different from $url)
|
||||
$p = HTML::LinkExtor->new(\&callback);
|
||||
|
||||
# Request document and parse it as it arrives
|
||||
$res = $ua->request(HTTP::Request->new(GET => $url),
|
||||
sub {$p->parse($_[0])});
|
||||
|
||||
# Expand all image URLs to absolute ones
|
||||
my $base = $res->base;
|
||||
@imgs = map { $_ = url($_, $base)->abs; } @imgs;
|
||||
|
||||
# Print them out
|
||||
print join("\n", @imgs), "\n";
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
L<HTML::Parser>, L<HTML::Tagset>, L<LWP>, L<URI::URL>
|
||||
|
||||
=head1 COPYRIGHT
|
||||
|
||||
Copyright 1996-2001 Gisle Aas.
|
||||
|
||||
This library is free software; you can redistribute it and/or
|
||||
modify it under the same terms as Perl itself.
|
||||
|
||||
=cut
|
||||
|
||||
1;
|
||||
196
database/perl/vendor/lib/HTML/Parse.pm
vendored
Normal file
196
database/perl/vendor/lib/HTML/Parse.pm
vendored
Normal file
@@ -0,0 +1,196 @@
|
||||
package HTML::Parse;
|
||||
|
||||
use 5.008;
|
||||
|
||||
#ABSTRACT: Deprecated, a wrapper around HTML::TreeBuilder
|
||||
|
||||
|
||||
use warnings;
|
||||
use strict;
|
||||
|
||||
our $VERSION = '5.07'; # VERSION from OurPkgVersion
|
||||
|
||||
use vars qw(@ISA @EXPORT
|
||||
$IMPLICIT_TAGS $IGNORE_UNKNOWN $IGNORE_TEXT $WARN
|
||||
);
|
||||
|
||||
require Exporter;
|
||||
@ISA = qw(Exporter);
|
||||
@EXPORT = qw(parse_html parse_htmlfile);
|
||||
|
||||
# Backwards compatability
|
||||
$IMPLICIT_TAGS = 1;
|
||||
$IGNORE_UNKNOWN = 1;
|
||||
$IGNORE_TEXT = 0;
|
||||
$WARN = 0;
|
||||
|
||||
require HTML::TreeBuilder;
|
||||
|
||||
sub parse_html {
|
||||
my $p = $_[1];
|
||||
$p = _new_tree_maker() unless $p;
|
||||
$p->parse( $_[0] );
|
||||
}
|
||||
|
||||
sub parse_htmlfile {
|
||||
my ( $file, $p ) = @_;
|
||||
my ($HTML);
|
||||
open( $HTML, "<", $file ) or return;
|
||||
$p = _new_tree_maker() unless $p;
|
||||
$p->parse_file($HTML);
|
||||
}
|
||||
|
||||
sub _new_tree_maker {
|
||||
my $p = HTML::TreeBuilder->new(
|
||||
implicit_tags => $IMPLICIT_TAGS,
|
||||
ignore_unknown => $IGNORE_UNKNOWN,
|
||||
ignore_text => $IGNORE_TEXT,
|
||||
'warn' => $WARN,
|
||||
);
|
||||
$p->strict_comment(1);
|
||||
$p;
|
||||
}
|
||||
|
||||
1;
|
||||
|
||||
__END__
|
||||
|
||||
=pod
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::Parse - Deprecated, a wrapper around HTML::TreeBuilder
|
||||
|
||||
=head1 VERSION
|
||||
|
||||
This document describes version 5.07 of
|
||||
HTML::Parse, released August 31, 2017
|
||||
as part of L<HTML-Tree|HTML::Tree>.
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
See the documentation for HTML::TreeBuilder
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
Disclaimer: This module is provided only for backwards compatibility
|
||||
with earlier versions of this library. New code should I<not> use
|
||||
this module, and should really use the HTML::Parser and
|
||||
HTML::TreeBuilder modules directly, instead.
|
||||
|
||||
The C<HTML::Parse> module provides functions to parse HTML documents.
|
||||
There are two functions exported by this module:
|
||||
|
||||
=over 4
|
||||
|
||||
=item parse_html($html) or parse_html($html, $obj)
|
||||
|
||||
This function is really just a synonym for $obj->parse($html) and $obj
|
||||
is assumed to be a subclass of C<HTML::Parser>. Refer to
|
||||
L<HTML::Parser> for more documentation.
|
||||
|
||||
If $obj is not specified, the $obj will default to an internally
|
||||
created new C<HTML::TreeBuilder> object configured with strict_comment()
|
||||
turned on. That class implements a parser that builds (and is) a HTML
|
||||
syntax tree with HTML::Element objects as nodes.
|
||||
|
||||
The return value from parse_html() is $obj.
|
||||
|
||||
=item parse_htmlfile($file, [$obj])
|
||||
|
||||
Same as parse_html(), but pulls the HTML to parse, from the named file.
|
||||
|
||||
Returns C<undef> if the file could not be opened, or $obj otherwise.
|
||||
|
||||
=back
|
||||
|
||||
When a C<HTML::TreeBuilder> object is created, the following variables
|
||||
control how parsing takes place:
|
||||
|
||||
=over 4
|
||||
|
||||
=item $HTML::Parse::IMPLICIT_TAGS
|
||||
|
||||
Setting this variable to true will instruct the parser to try to
|
||||
deduce implicit elements and implicit end tags. If this variable is
|
||||
false you get a parse tree that just reflects the text as it stands.
|
||||
Might be useful for quick & dirty parsing. Default is true.
|
||||
|
||||
Implicit elements have the implicit() attribute set.
|
||||
|
||||
=item $HTML::Parse::IGNORE_UNKNOWN
|
||||
|
||||
This variable contols whether unknow tags should be represented as
|
||||
elements in the parse tree. Default is true.
|
||||
|
||||
=item $HTML::Parse::IGNORE_TEXT
|
||||
|
||||
Do not represent the text content of elements. This saves space if
|
||||
all you want is to examine the structure of the document. Default is
|
||||
false.
|
||||
|
||||
=item $HTML::Parse::WARN
|
||||
|
||||
Call warn() with an appropriate message for syntax errors. Default is
|
||||
false.
|
||||
|
||||
=back
|
||||
|
||||
=head1 REMEMBER!
|
||||
|
||||
HTML::TreeBuilder objects should be explicitly destroyed when you're
|
||||
finished with them. See L<HTML::TreeBuilder>.
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
L<HTML::Parser>, L<HTML::TreeBuilder>, L<HTML::Element>
|
||||
|
||||
=head1 AUTHOR
|
||||
|
||||
Current maintainers:
|
||||
|
||||
=over
|
||||
|
||||
=item * Christopher J. Madsen S<C<< <perl AT cjmweb.net> >>>
|
||||
|
||||
=item * Jeff Fearn S<C<< <jfearn AT cpan.org> >>>
|
||||
|
||||
=back
|
||||
|
||||
Original HTML-Tree author:
|
||||
|
||||
=over
|
||||
|
||||
=item * Gisle Aas
|
||||
|
||||
=back
|
||||
|
||||
Former maintainers:
|
||||
|
||||
=over
|
||||
|
||||
=item * Sean M. Burke
|
||||
|
||||
=item * Andy Lester
|
||||
|
||||
=item * Pete Krawczyk S<C<< <petek AT cpan.org> >>>
|
||||
|
||||
=back
|
||||
|
||||
You can follow or contribute to HTML-Tree's development at
|
||||
L<< https://github.com/kentfredric/HTML-Tree >>.
|
||||
|
||||
=head1 COPYRIGHT AND LICENSE
|
||||
|
||||
Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke,
|
||||
2005 Andy Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn,
|
||||
2012 Christopher J. Madsen.
|
||||
|
||||
This library is free software; you can redistribute it and/or
|
||||
modify it under the same terms as Perl itself.
|
||||
|
||||
The programs in this library are distributed in the hope that they
|
||||
will be useful, but without any warranty; without even the implied
|
||||
warranty of merchantability or fitness for a particular purpose.
|
||||
|
||||
=cut
|
||||
1252
database/perl/vendor/lib/HTML/Parser.pm
vendored
Normal file
1252
database/perl/vendor/lib/HTML/Parser.pm
vendored
Normal file
File diff suppressed because it is too large
Load Diff
210
database/perl/vendor/lib/HTML/PullParser.pm
vendored
Normal file
210
database/perl/vendor/lib/HTML/PullParser.pm
vendored
Normal file
@@ -0,0 +1,210 @@
|
||||
package HTML::PullParser;
|
||||
|
||||
use strict;
|
||||
|
||||
require HTML::Parser;
|
||||
our @ISA = qw(HTML::Parser);
|
||||
our $VERSION = '3.75';
|
||||
|
||||
use Carp ();
|
||||
|
||||
sub new
|
||||
{
|
||||
my($class, %cnf) = @_;
|
||||
|
||||
# Construct argspecs for the various events
|
||||
my %argspec;
|
||||
for (qw(start end text declaration comment process default)) {
|
||||
my $tmp = delete $cnf{$_};
|
||||
next unless defined $tmp;
|
||||
$argspec{$_} = $tmp;
|
||||
}
|
||||
Carp::croak("Info not collected for any events")
|
||||
unless %argspec;
|
||||
|
||||
my $file = delete $cnf{file};
|
||||
my $doc = delete $cnf{doc};
|
||||
Carp::croak("Can't parse from both 'doc' and 'file' at the same time")
|
||||
if defined($file) && defined($doc);
|
||||
Carp::croak("No 'doc' or 'file' given to parse from")
|
||||
unless defined($file) || defined($doc);
|
||||
|
||||
# Create object
|
||||
$cnf{api_version} = 3;
|
||||
my $self = $class->SUPER::new(%cnf);
|
||||
|
||||
my $accum = $self->{pullparser_accum} = [];
|
||||
while (my($event, $argspec) = each %argspec) {
|
||||
$self->SUPER::handler($event => $accum, $argspec);
|
||||
}
|
||||
|
||||
if (defined $doc) {
|
||||
$self->{pullparser_str_ref} = ref($doc) ? $doc : \$doc;
|
||||
$self->{pullparser_str_pos} = 0;
|
||||
}
|
||||
else {
|
||||
if (!ref($file) && ref(\$file) ne "GLOB") {
|
||||
require IO::File;
|
||||
$file = IO::File->new($file, "r") || return;
|
||||
}
|
||||
|
||||
$self->{pullparser_file} = $file;
|
||||
}
|
||||
$self;
|
||||
}
|
||||
|
||||
|
||||
sub handler
|
||||
{
|
||||
Carp::croak("Can't set handlers for HTML::PullParser");
|
||||
}
|
||||
|
||||
|
||||
sub get_token
|
||||
{
|
||||
my $self = shift;
|
||||
while (!@{$self->{pullparser_accum}} && !$self->{pullparser_eof}) {
|
||||
if (my $f = $self->{pullparser_file}) {
|
||||
# must try to parse more from the file
|
||||
my $buf;
|
||||
if (read($f, $buf, 512)) {
|
||||
$self->parse($buf);
|
||||
} else {
|
||||
$self->eof;
|
||||
$self->{pullparser_eof}++;
|
||||
delete $self->{pullparser_file};
|
||||
}
|
||||
}
|
||||
elsif (my $sref = $self->{pullparser_str_ref}) {
|
||||
# must try to parse more from the scalar
|
||||
my $pos = $self->{pullparser_str_pos};
|
||||
my $chunk = substr($$sref, $pos, 512);
|
||||
$self->parse($chunk);
|
||||
$pos += length($chunk);
|
||||
if ($pos < length($$sref)) {
|
||||
$self->{pullparser_str_pos} = $pos;
|
||||
}
|
||||
else {
|
||||
$self->eof;
|
||||
$self->{pullparser_eof}++;
|
||||
delete $self->{pullparser_str_ref};
|
||||
delete $self->{pullparser_str_pos};
|
||||
}
|
||||
}
|
||||
else {
|
||||
die;
|
||||
}
|
||||
}
|
||||
shift @{$self->{pullparser_accum}};
|
||||
}
|
||||
|
||||
|
||||
sub unget_token
|
||||
{
|
||||
my $self = shift;
|
||||
unshift @{$self->{pullparser_accum}}, @_;
|
||||
$self;
|
||||
}
|
||||
|
||||
1;
|
||||
|
||||
|
||||
__END__
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::PullParser - Alternative HTML::Parser interface
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
use HTML::PullParser;
|
||||
|
||||
$p = HTML::PullParser->new(file => "index.html",
|
||||
start => 'event, tagname, @attr',
|
||||
end => 'event, tagname',
|
||||
ignore_elements => [qw(script style)],
|
||||
) || die "Can't open: $!";
|
||||
while (my $token = $p->get_token) {
|
||||
#...do something with $token
|
||||
}
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
The HTML::PullParser is an alternative interface to the HTML::Parser class.
|
||||
It basically turns the HTML::Parser inside out. You associate a file
|
||||
(or any IO::Handle object or string) with the parser at construction time and
|
||||
then repeatedly call $parser->get_token to obtain the tags and text
|
||||
found in the parsed document.
|
||||
|
||||
The following methods are provided:
|
||||
|
||||
=over 4
|
||||
|
||||
=item $p = HTML::PullParser->new( file => $file, %options )
|
||||
|
||||
=item $p = HTML::PullParser->new( doc => \$doc, %options )
|
||||
|
||||
A C<HTML::PullParser> can be made to parse from either a file or a
|
||||
literal document based on whether the C<file> or C<doc> option is
|
||||
passed to the parser's constructor.
|
||||
|
||||
The C<file> passed in can either be a file name or a file handle
|
||||
object. If a file name is passed, and it can't be opened for reading,
|
||||
then the constructor will return an undefined value and $! will tell
|
||||
you why it failed. Otherwise the argument is taken to be some object
|
||||
that the C<HTML::PullParser> can read() from when it needs more data.
|
||||
The stream will be read() until EOF, but not closed.
|
||||
|
||||
A C<doc> can be passed plain or as a reference
|
||||
to a scalar. If a reference is passed then the value of this scalar
|
||||
should not be changed before all tokens have been extracted.
|
||||
|
||||
Next the information to be returned for the different token types must
|
||||
be set up. This is done by simply associating an argspec (as defined
|
||||
in L<HTML::Parser>) with the events you have an interest in. For
|
||||
instance, if you want C<start> tokens to be reported as the string
|
||||
C<'S'> followed by the tagname and the attributes you might pass an
|
||||
C<start>-option like this:
|
||||
|
||||
$p = HTML::PullParser->new(
|
||||
doc => $document_to_parse,
|
||||
start => '"S", tagname, @attr',
|
||||
end => '"E", tagname',
|
||||
);
|
||||
|
||||
At last other C<HTML::Parser> options, like C<ignore_tags>, and
|
||||
C<unbroken_text>, can be passed in. Note that you should not use the
|
||||
I<event>_h options to set up parser handlers. That would confuse the
|
||||
inner logic of C<HTML::PullParser>.
|
||||
|
||||
=item $token = $p->get_token
|
||||
|
||||
This method will return the next I<token> found in the HTML document,
|
||||
or C<undef> at the end of the document. The token is returned as an
|
||||
array reference. The content of this array match the argspec set up
|
||||
during C<HTML::PullParser> construction.
|
||||
|
||||
=item $p->unget_token( @tokens )
|
||||
|
||||
If you find out you have read too many tokens you can push them back,
|
||||
so that they are returned again the next time $p->get_token is called.
|
||||
|
||||
=back
|
||||
|
||||
=head1 EXAMPLES
|
||||
|
||||
The 'eg/hform' script shows how we might parse the form section of
|
||||
HTML::Documents using HTML::PullParser.
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
L<HTML::Parser>, L<HTML::TokeParser>
|
||||
|
||||
=head1 COPYRIGHT
|
||||
|
||||
Copyright 1998-2001 Gisle Aas.
|
||||
|
||||
This library is free software; you can redistribute it and/or
|
||||
modify it under the same terms as Perl itself.
|
||||
|
||||
=cut
|
||||
471
database/perl/vendor/lib/HTML/Tagset.pm
vendored
Normal file
471
database/perl/vendor/lib/HTML/Tagset.pm
vendored
Normal file
@@ -0,0 +1,471 @@
|
||||
package HTML::Tagset;
|
||||
|
||||
use strict;
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::Tagset - data tables useful in parsing HTML
|
||||
|
||||
=head1 VERSION
|
||||
|
||||
Version 3.20
|
||||
|
||||
=cut
|
||||
|
||||
use vars qw( $VERSION );
|
||||
|
||||
$VERSION = '3.20';
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
use HTML::Tagset;
|
||||
# Then use any of the items in the HTML::Tagset package
|
||||
# as need arises
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
This module contains several data tables useful in various kinds of
|
||||
HTML parsing operations.
|
||||
|
||||
Note that all tag names used are lowercase.
|
||||
|
||||
In the following documentation, a "hashset" is a hash being used as a
|
||||
set -- the hash conveys that its keys are there, and the actual values
|
||||
associated with the keys are not significant. (But what values are
|
||||
there, are always true.)
|
||||
|
||||
=cut
|
||||
|
||||
use vars qw(
|
||||
$VERSION
|
||||
%emptyElement %optionalEndTag %linkElements %boolean_attr
|
||||
%isHeadElement %isBodyElement %isPhraseMarkup
|
||||
%is_Possible_Strict_P_Content
|
||||
%isHeadOrBodyElement
|
||||
%isList %isTableElement %isFormElement
|
||||
%isKnown %canTighten
|
||||
@p_closure_barriers
|
||||
%isCDATA_Parent
|
||||
);
|
||||
|
||||
=head1 VARIABLES
|
||||
|
||||
Note that none of these variables are exported.
|
||||
|
||||
=head2 hashset %HTML::Tagset::emptyElement
|
||||
|
||||
This hashset has as values the tag-names (GIs) of elements that cannot
|
||||
have content. (For example, "base", "br", "hr".) So
|
||||
C<$HTML::Tagset::emptyElement{'hr'}> exists and is true.
|
||||
C<$HTML::Tagset::emptyElement{'dl'}> does not exist, and so is not true.
|
||||
|
||||
=cut
|
||||
|
||||
%emptyElement = map {; $_ => 1 } qw(base link meta isindex
|
||||
img br hr wbr
|
||||
input area param
|
||||
embed bgsound spacer
|
||||
basefont col frame
|
||||
~comment ~literal
|
||||
~declaration ~pi
|
||||
);
|
||||
# The "~"-initial names are for pseudo-elements used by HTML::Entities
|
||||
# and TreeBuilder
|
||||
|
||||
=head2 hashset %HTML::Tagset::optionalEndTag
|
||||
|
||||
This hashset lists tag-names for elements that can have content, but whose
|
||||
end-tags are generally, "safely", omissible. Example:
|
||||
C<$HTML::Tagset::emptyElement{'li'}> exists and is true.
|
||||
|
||||
=cut
|
||||
|
||||
%optionalEndTag = map {; $_ => 1 } qw(p li dt dd); # option th tr td);
|
||||
|
||||
=head2 hash %HTML::Tagset::linkElements
|
||||
|
||||
Values in this hash are tagnames for elements that might contain
|
||||
links, and the value for each is a reference to an array of the names
|
||||
of attributes whose values can be links.
|
||||
|
||||
=cut
|
||||
|
||||
%linkElements =
|
||||
(
|
||||
'a' => ['href'],
|
||||
'applet' => ['archive', 'codebase', 'code'],
|
||||
'area' => ['href'],
|
||||
'base' => ['href'],
|
||||
'bgsound' => ['src'],
|
||||
'blockquote' => ['cite'],
|
||||
'body' => ['background'],
|
||||
'del' => ['cite'],
|
||||
'embed' => ['pluginspage', 'src'],
|
||||
'form' => ['action'],
|
||||
'frame' => ['src', 'longdesc'],
|
||||
'iframe' => ['src', 'longdesc'],
|
||||
'ilayer' => ['background'],
|
||||
'img' => ['src', 'lowsrc', 'longdesc', 'usemap'],
|
||||
'input' => ['src', 'usemap'],
|
||||
'ins' => ['cite'],
|
||||
'isindex' => ['action'],
|
||||
'head' => ['profile'],
|
||||
'layer' => ['background', 'src'],
|
||||
'link' => ['href'],
|
||||
'object' => ['classid', 'codebase', 'data', 'archive', 'usemap'],
|
||||
'q' => ['cite'],
|
||||
'script' => ['src', 'for'],
|
||||
'table' => ['background'],
|
||||
'td' => ['background'],
|
||||
'th' => ['background'],
|
||||
'tr' => ['background'],
|
||||
'xmp' => ['href'],
|
||||
);
|
||||
|
||||
=head2 hash %HTML::Tagset::boolean_attr
|
||||
|
||||
This hash (not hashset) lists what attributes of what elements can be
|
||||
printed without showing the value (for example, the "noshade" attribute
|
||||
of "hr" elements). For elements with only one such attribute, its value
|
||||
is simply that attribute name. For elements with many such attributes,
|
||||
the value is a reference to a hashset containing all such attributes.
|
||||
|
||||
=cut
|
||||
|
||||
%boolean_attr = (
|
||||
# TODO: make these all hashes
|
||||
'area' => 'nohref',
|
||||
'dir' => 'compact',
|
||||
'dl' => 'compact',
|
||||
'hr' => 'noshade',
|
||||
'img' => 'ismap',
|
||||
'input' => { 'checked' => 1, 'readonly' => 1, 'disabled' => 1 },
|
||||
'menu' => 'compact',
|
||||
'ol' => 'compact',
|
||||
'option' => 'selected',
|
||||
'select' => 'multiple',
|
||||
'td' => 'nowrap',
|
||||
'th' => 'nowrap',
|
||||
'ul' => 'compact',
|
||||
);
|
||||
|
||||
#==========================================================================
|
||||
# List of all elements from Extensible HTML version 1.0 Transitional DTD:
|
||||
#
|
||||
# a abbr acronym address applet area b base basefont bdo big
|
||||
# blockquote body br button caption center cite code col colgroup
|
||||
# dd del dfn dir div dl dt em fieldset font form h1 h2 h3 h4 h5 h6
|
||||
# head hr html i iframe img input ins isindex kbd label legend li
|
||||
# link map menu meta noframes noscript object ol optgroup option p
|
||||
# param pre q s samp script select small span strike strong style
|
||||
# sub sup table tbody td textarea tfoot th thead title tr tt u ul
|
||||
# var
|
||||
#
|
||||
# Varia from Mozilla source internal table of tags:
|
||||
# Implemented:
|
||||
# xmp listing wbr nobr frame frameset noframes ilayer
|
||||
# layer nolayer spacer embed multicol
|
||||
# But these are unimplemented:
|
||||
# sound?? keygen?? server??
|
||||
# Also seen here and there:
|
||||
# marquee?? app?? (both unimplemented)
|
||||
#==========================================================================
|
||||
|
||||
=head2 hashset %HTML::Tagset::isPhraseMarkup
|
||||
|
||||
This hashset contains all phrasal-level elements.
|
||||
|
||||
=cut
|
||||
|
||||
%isPhraseMarkup = map {; $_ => 1 } qw(
|
||||
span abbr acronym q sub sup
|
||||
cite code em kbd samp strong var dfn strike
|
||||
b i u s tt small big
|
||||
a img br
|
||||
wbr nobr blink
|
||||
font basefont bdo
|
||||
spacer embed noembed
|
||||
); # had: center, hr, table
|
||||
|
||||
|
||||
=head2 hashset %HTML::Tagset::is_Possible_Strict_P_Content
|
||||
|
||||
This hashset contains all phrasal-level elements that be content of a
|
||||
P element, for a strict model of HTML.
|
||||
|
||||
=cut
|
||||
|
||||
%is_Possible_Strict_P_Content = (
|
||||
%isPhraseMarkup,
|
||||
%isFormElement,
|
||||
map {; $_ => 1} qw( object script map )
|
||||
# I've no idea why there's these latter exceptions.
|
||||
# I'm just following the HTML4.01 DTD.
|
||||
);
|
||||
|
||||
#from html4 strict:
|
||||
#<!ENTITY % fontstyle "TT | I | B | BIG | SMALL">
|
||||
#
|
||||
#<!ENTITY % phrase "EM | STRONG | DFN | CODE |
|
||||
# SAMP | KBD | VAR | CITE | ABBR | ACRONYM" >
|
||||
#
|
||||
#<!ENTITY % special
|
||||
# "A | IMG | OBJECT | BR | SCRIPT | MAP | Q | SUB | SUP | SPAN | BDO">
|
||||
#
|
||||
#<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">
|
||||
#
|
||||
#<!-- %inline; covers inline or "text-level" elements -->
|
||||
#<!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; | %formctrl;">
|
||||
|
||||
=head2 hashset %HTML::Tagset::isHeadElement
|
||||
|
||||
This hashset contains all elements that elements that should be
|
||||
present only in the 'head' element of an HTML document.
|
||||
|
||||
=cut
|
||||
|
||||
%isHeadElement = map {; $_ => 1 }
|
||||
qw(title base link meta isindex script style object bgsound);
|
||||
|
||||
=head2 hashset %HTML::Tagset::isList
|
||||
|
||||
This hashset contains all elements that can contain "li" elements.
|
||||
|
||||
=cut
|
||||
|
||||
%isList = map {; $_ => 1 } qw(ul ol dir menu);
|
||||
|
||||
=head2 hashset %HTML::Tagset::isTableElement
|
||||
|
||||
This hashset contains all elements that are to be found only in/under
|
||||
a "table" element.
|
||||
|
||||
=cut
|
||||
|
||||
%isTableElement = map {; $_ => 1 }
|
||||
qw(tr td th thead tbody tfoot caption col colgroup);
|
||||
|
||||
=head2 hashset %HTML::Tagset::isFormElement
|
||||
|
||||
This hashset contains all elements that are to be found only in/under
|
||||
a "form" element.
|
||||
|
||||
=cut
|
||||
|
||||
%isFormElement = map {; $_ => 1 }
|
||||
qw(input select option optgroup textarea button label);
|
||||
|
||||
=head2 hashset %HTML::Tagset::isBodyMarkup
|
||||
|
||||
This hashset contains all elements that are to be found only in/under
|
||||
the "body" element of an HTML document.
|
||||
|
||||
=cut
|
||||
|
||||
%isBodyElement = map {; $_ => 1 } qw(
|
||||
h1 h2 h3 h4 h5 h6
|
||||
p div pre plaintext address blockquote
|
||||
xmp listing
|
||||
center
|
||||
|
||||
multicol
|
||||
iframe ilayer nolayer
|
||||
bgsound
|
||||
|
||||
hr
|
||||
ol ul dir menu li
|
||||
dl dt dd
|
||||
ins del
|
||||
|
||||
fieldset legend
|
||||
|
||||
map area
|
||||
applet param object
|
||||
isindex script noscript
|
||||
table
|
||||
center
|
||||
form
|
||||
),
|
||||
keys %isFormElement,
|
||||
keys %isPhraseMarkup, # And everything phrasal
|
||||
keys %isTableElement,
|
||||
;
|
||||
|
||||
|
||||
=head2 hashset %HTML::Tagset::isHeadOrBodyElement
|
||||
|
||||
This hashset includes all elements that I notice can fall either in
|
||||
the head or in the body.
|
||||
|
||||
=cut
|
||||
|
||||
%isHeadOrBodyElement = map {; $_ => 1 }
|
||||
qw(script isindex style object map area param noscript bgsound);
|
||||
# i.e., if we find 'script' in the 'body' or the 'head', don't freak out.
|
||||
|
||||
|
||||
=head2 hashset %HTML::Tagset::isKnown
|
||||
|
||||
This hashset lists all known HTML elements.
|
||||
|
||||
=cut
|
||||
|
||||
%isKnown = (%isHeadElement, %isBodyElement,
|
||||
map{; $_=>1 }
|
||||
qw( head body html
|
||||
frame frameset noframes
|
||||
~comment ~pi ~directive ~literal
|
||||
));
|
||||
# that should be all known tags ever ever
|
||||
|
||||
|
||||
=head2 hashset %HTML::Tagset::canTighten
|
||||
|
||||
This hashset lists elements that might have ignorable whitespace as
|
||||
children or siblings.
|
||||
|
||||
=cut
|
||||
|
||||
%canTighten = %isKnown;
|
||||
delete @canTighten{
|
||||
keys(%isPhraseMarkup), 'input', 'select',
|
||||
'xmp', 'listing', 'plaintext', 'pre',
|
||||
};
|
||||
# xmp, listing, plaintext, and pre are untightenable, and
|
||||
# in a really special way.
|
||||
@canTighten{'hr','br'} = (1,1);
|
||||
# exceptional 'phrasal' things that ARE subject to tightening.
|
||||
|
||||
# The one case where I can think of my tightening rules failing is:
|
||||
# <p>foo bar<center> <em>baz quux</em> ...
|
||||
# ^-- that would get deleted.
|
||||
# But that's pretty gruesome code anyhow. You gets what you pays for.
|
||||
|
||||
#==========================================================================
|
||||
|
||||
=head2 array @HTML::Tagset::p_closure_barriers
|
||||
|
||||
This array has a meaning that I have only seen a need for in
|
||||
C<HTML::TreeBuilder>, but I include it here on the off chance that someone
|
||||
might find it of use:
|
||||
|
||||
When we see a "E<lt>pE<gt>" token, we go lookup up the lineage for a p
|
||||
element we might have to minimize. At first sight, we might say that
|
||||
if there's a p anywhere in the lineage of this new p, it should be
|
||||
closed. But that's wrong. Consider this document:
|
||||
|
||||
<html>
|
||||
<head>
|
||||
<title>foo</title>
|
||||
</head>
|
||||
<body>
|
||||
<p>foo
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
foo
|
||||
<p>bar
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
</p>
|
||||
</body>
|
||||
</html>
|
||||
|
||||
The second p is quite legally inside a much higher p.
|
||||
|
||||
My formalization of the reason why this is legal, but this:
|
||||
|
||||
<p>foo<p>bar</p></p>
|
||||
|
||||
isn't, is that something about the table constitutes a "barrier" to
|
||||
the application of the rule about what p must minimize.
|
||||
|
||||
So C<@HTML::Tagset::p_closure_barriers> is the list of all such
|
||||
barrier-tags.
|
||||
|
||||
=cut
|
||||
|
||||
@p_closure_barriers = qw(
|
||||
li blockquote
|
||||
ul ol menu dir
|
||||
dl dt dd
|
||||
td th tr table caption
|
||||
div
|
||||
);
|
||||
|
||||
# In an ideal world (i.e., XHTML) we wouldn't have to bother with any of this
|
||||
# monkey business of barriers to minimization!
|
||||
|
||||
=head2 hashset %isCDATA_Parent
|
||||
|
||||
This hashset includes all elements whose content is CDATA.
|
||||
|
||||
=cut
|
||||
|
||||
%isCDATA_Parent = map {; $_ => 1 }
|
||||
qw(script style xmp listing plaintext);
|
||||
|
||||
# TODO: there's nothing else that takes CDATA children, right?
|
||||
|
||||
# As the HTML3 DTD (Raggett 1995-04-24) noted:
|
||||
# The XMP, LISTING and PLAINTEXT tags are incompatible with SGML
|
||||
# and derive from very early versions of HTML. They require non-
|
||||
# standard parsers and will cause problems for processing
|
||||
# documents with standard SGML tools.
|
||||
|
||||
|
||||
=head1 CAVEATS
|
||||
|
||||
You may find it useful to alter the behavior of modules (like
|
||||
C<HTML::Element> or C<HTML::TreeBuilder>) that use C<HTML::Tagset>'s
|
||||
data tables by altering the data tables themselves. You are welcome
|
||||
to try, but be careful; and be aware that different modules may or may
|
||||
react differently to the data tables being changed.
|
||||
|
||||
Note that it may be inappropriate to use these tables for I<producing>
|
||||
HTML -- for example, C<%isHeadOrBodyElement> lists the tagnames
|
||||
for all elements that can appear either in the head or in the body,
|
||||
such as "script". That doesn't mean that I am saying your code that
|
||||
produces HTML should feel free to put script elements in either place!
|
||||
If you are producing programs that spit out HTML, you should be
|
||||
I<intimately> familiar with the DTDs for HTML or XHTML (available at
|
||||
C<http://www.w3.org/>), and you should slavishly obey them, not
|
||||
the data tables in this document.
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
L<HTML::Element>, L<HTML::TreeBuilder>, L<HTML::LinkExtor>
|
||||
|
||||
=head1 COPYRIGHT & LICENSE
|
||||
|
||||
Copyright 1995-2000 Gisle Aas.
|
||||
|
||||
Copyright 2000-2005 Sean M. Burke.
|
||||
|
||||
Copyright 2005-2008 Andy Lester.
|
||||
|
||||
This program is free software; you can redistribute it and/or modify it
|
||||
under the same terms as Perl itself.
|
||||
|
||||
=head1 ACKNOWLEDGEMENTS
|
||||
|
||||
Most of the code/data in this module was adapted from code written
|
||||
by Gisle Aas for C<HTML::Element>, C<HTML::TreeBuilder>, and
|
||||
C<HTML::LinkExtor>. Then it was maintained by Sean M. Burke.
|
||||
|
||||
=head1 AUTHOR
|
||||
|
||||
Current maintainer: Andy Lester, C<< <andy at petdance.com> >>
|
||||
|
||||
=head1 BUGS
|
||||
|
||||
Please report any bugs or feature requests to
|
||||
C<bug-html-tagset at rt.cpan.org>, or through the web interface at
|
||||
L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-Tagset>. I will
|
||||
be notified, and then you'll automatically be notified of progress on
|
||||
your bug as I make changes.
|
||||
|
||||
=cut
|
||||
|
||||
1;
|
||||
372
database/perl/vendor/lib/HTML/TokeParser.pm
vendored
Normal file
372
database/perl/vendor/lib/HTML/TokeParser.pm
vendored
Normal file
@@ -0,0 +1,372 @@
|
||||
package HTML::TokeParser;
|
||||
|
||||
use strict;
|
||||
|
||||
require HTML::PullParser;
|
||||
our @ISA = qw(HTML::PullParser);
|
||||
our $VERSION = '3.75';
|
||||
|
||||
use Carp ();
|
||||
use HTML::Entities qw(decode_entities);
|
||||
use HTML::Tagset ();
|
||||
|
||||
my %ARGS =
|
||||
(
|
||||
start => "'S',tagname,attr,attrseq,text",
|
||||
end => "'E',tagname,text",
|
||||
text => "'T',text,is_cdata",
|
||||
process => "'PI',token0,text",
|
||||
comment => "'C',text",
|
||||
declaration => "'D',text",
|
||||
|
||||
# options that default on
|
||||
unbroken_text => 1,
|
||||
);
|
||||
|
||||
|
||||
sub new
|
||||
{
|
||||
my $class = shift;
|
||||
my %cnf;
|
||||
|
||||
if (@_ == 1) {
|
||||
my $type = (ref($_[0]) eq "SCALAR") ? "doc" : "file";
|
||||
%cnf = ($type => $_[0]);
|
||||
}
|
||||
else {
|
||||
unshift @_, (ref($_[0]) eq "SCALAR") ? "doc" : "file" if(scalar(@_) % 2 == 1);
|
||||
%cnf = @_;
|
||||
}
|
||||
|
||||
my $textify = delete $cnf{textify} || {img => "alt", applet => "alt"};
|
||||
|
||||
my $self = $class->SUPER::new(%ARGS, %cnf) || return undef;
|
||||
|
||||
$self->{textify} = $textify;
|
||||
$self;
|
||||
}
|
||||
|
||||
|
||||
sub get_tag
|
||||
{
|
||||
my $self = shift;
|
||||
my $token;
|
||||
while (1) {
|
||||
$token = $self->get_token || return undef;
|
||||
my $type = shift @$token;
|
||||
next unless $type eq "S" || $type eq "E";
|
||||
substr($token->[0], 0, 0) = "/" if $type eq "E";
|
||||
return $token unless @_;
|
||||
for (@_) {
|
||||
return $token if $token->[0] eq $_;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
sub _textify {
|
||||
my($self, $token) = @_;
|
||||
my $tag = $token->[1];
|
||||
return undef unless exists $self->{textify}{$tag};
|
||||
|
||||
my $alt = $self->{textify}{$tag};
|
||||
my $text;
|
||||
if (ref($alt)) {
|
||||
$text = &$alt(@$token);
|
||||
} else {
|
||||
$text = $token->[2]{$alt || "alt"};
|
||||
$text = "[\U$tag]" unless defined $text;
|
||||
}
|
||||
return $text;
|
||||
}
|
||||
|
||||
|
||||
sub get_text
|
||||
{
|
||||
my $self = shift;
|
||||
my @text;
|
||||
while (my $token = $self->get_token) {
|
||||
my $type = $token->[0];
|
||||
if ($type eq "T") {
|
||||
my $text = $token->[1];
|
||||
decode_entities($text) unless $token->[2];
|
||||
push(@text, $text);
|
||||
} elsif ($type =~ /^[SE]$/) {
|
||||
my $tag = $token->[1];
|
||||
if ($type eq "S") {
|
||||
if (defined(my $text = _textify($self, $token))) {
|
||||
push(@text, $text);
|
||||
next;
|
||||
}
|
||||
} else {
|
||||
$tag = "/$tag";
|
||||
}
|
||||
if (!@_ || grep $_ eq $tag, @_) {
|
||||
$self->unget_token($token);
|
||||
last;
|
||||
}
|
||||
push(@text, " ")
|
||||
if $tag eq "br" || !$HTML::Tagset::isPhraseMarkup{$token->[1]};
|
||||
}
|
||||
}
|
||||
join("", @text);
|
||||
}
|
||||
|
||||
|
||||
sub get_trimmed_text
|
||||
{
|
||||
my $self = shift;
|
||||
my $text = $self->get_text(@_);
|
||||
$text =~ s/^\s+//; $text =~ s/\s+$//; $text =~ s/\s+/ /g;
|
||||
$text;
|
||||
}
|
||||
|
||||
sub get_phrase {
|
||||
my $self = shift;
|
||||
my @text;
|
||||
while (my $token = $self->get_token) {
|
||||
my $type = $token->[0];
|
||||
if ($type eq "T") {
|
||||
my $text = $token->[1];
|
||||
decode_entities($text) unless $token->[2];
|
||||
push(@text, $text);
|
||||
} elsif ($type =~ /^[SE]$/) {
|
||||
my $tag = $token->[1];
|
||||
if ($type eq "S") {
|
||||
if (defined(my $text = _textify($self, $token))) {
|
||||
push(@text, $text);
|
||||
next;
|
||||
}
|
||||
}
|
||||
if (!$HTML::Tagset::isPhraseMarkup{$tag}) {
|
||||
$self->unget_token($token);
|
||||
last;
|
||||
}
|
||||
push(@text, " ") if $tag eq "br";
|
||||
}
|
||||
}
|
||||
my $text = join("", @text);
|
||||
$text =~ s/^\s+//; $text =~ s/\s+$//; $text =~ s/\s+/ /g;
|
||||
$text;
|
||||
}
|
||||
|
||||
1;
|
||||
|
||||
|
||||
__END__
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::TokeParser - Alternative HTML::Parser interface
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
require HTML::TokeParser;
|
||||
$p = HTML::TokeParser->new("index.html") ||
|
||||
die "Can't open: $!";
|
||||
$p->empty_element_tags(1); # configure its behaviour
|
||||
|
||||
while (my $token = $p->get_token) {
|
||||
#...
|
||||
}
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
The C<HTML::TokeParser> is an alternative interface to the
|
||||
C<HTML::Parser> class. It is an C<HTML::PullParser> subclass with a
|
||||
predeclared set of token types. If you wish the tokens to be reported
|
||||
differently you probably want to use the C<HTML::PullParser> directly.
|
||||
|
||||
The following methods are available:
|
||||
|
||||
=over 4
|
||||
|
||||
=item $p = HTML::TokeParser->new( $filename, %opt );
|
||||
|
||||
=item $p = HTML::TokeParser->new( $filehandle, %opt );
|
||||
|
||||
=item $p = HTML::TokeParser->new( \$document, %opt );
|
||||
|
||||
The object constructor argument is either a file name, a file handle
|
||||
object, or the complete document to be parsed. Extra options can be
|
||||
provided as key/value pairs and are processed as documented by the base
|
||||
classes.
|
||||
|
||||
If the argument is a plain scalar, then it is taken as the name of a
|
||||
file to be opened and parsed. If the file can't be opened for
|
||||
reading, then the constructor will return C<undef> and $! will tell
|
||||
you why it failed.
|
||||
|
||||
If the argument is a reference to a plain scalar, then this scalar is
|
||||
taken to be the literal document to parse. The value of this
|
||||
scalar should not be changed before all tokens have been extracted.
|
||||
|
||||
Otherwise the argument is taken to be some object that the
|
||||
C<HTML::TokeParser> can read() from when it needs more data. Typically
|
||||
it will be a filehandle of some kind. The stream will be read() until
|
||||
EOF, but not closed.
|
||||
|
||||
A newly constructed C<HTML::TokeParser> differ from its base classes
|
||||
by having the C<unbroken_text> attribute enabled by default. See
|
||||
L<HTML::Parser> for a description of this and other attributes that
|
||||
influence how the document is parsed. It is often a good idea to enable
|
||||
C<empty_element_tags> behaviour.
|
||||
|
||||
Note that the parsing result will likely not be valid if raw undecoded
|
||||
UTF-8 is used as a source. When parsing UTF-8 encoded files turn
|
||||
on UTF-8 decoding:
|
||||
|
||||
open(my $fh, "<:utf8", "index.html") || die "Can't open 'index.html': $!";
|
||||
my $p = HTML::TokeParser->new( $fh );
|
||||
# ...
|
||||
|
||||
If a $filename is passed to the constructor the file will be opened in
|
||||
raw mode and the parsing result will only be valid if its content is
|
||||
Latin-1 or pure ASCII.
|
||||
|
||||
If parsing from an UTF-8 encoded string buffer decode it first:
|
||||
|
||||
utf8::decode($document);
|
||||
my $p = HTML::TokeParser->new( \$document );
|
||||
# ...
|
||||
|
||||
=item $p->get_token
|
||||
|
||||
This method will return the next I<token> found in the HTML document,
|
||||
or C<undef> at the end of the document. The token is returned as an
|
||||
array reference. The first element of the array will be a string
|
||||
denoting the type of this token: "S" for start tag, "E" for end tag,
|
||||
"T" for text, "C" for comment, "D" for declaration, and "PI" for
|
||||
process instructions. The rest of the token array depend on the type
|
||||
like this:
|
||||
|
||||
["S", $tag, $attr, $attrseq, $text]
|
||||
["E", $tag, $text]
|
||||
["T", $text, $is_data]
|
||||
["C", $text]
|
||||
["D", $text]
|
||||
["PI", $token0, $text]
|
||||
|
||||
where $attr is a hash reference, $attrseq is an array reference and
|
||||
the rest are plain scalars. The L<HTML::Parser/Argspec> explains the
|
||||
details.
|
||||
|
||||
=item $p->unget_token( @tokens )
|
||||
|
||||
If you find you have read too many tokens you can push them back,
|
||||
so that they are returned the next time $p->get_token is called.
|
||||
|
||||
=item $p->get_tag
|
||||
|
||||
=item $p->get_tag( @tags )
|
||||
|
||||
This method returns the next start or end tag (skipping any other
|
||||
tokens), or C<undef> if there are no more tags in the document. If
|
||||
one or more arguments are given, then we skip tokens until one of the
|
||||
specified tag types is found. For example:
|
||||
|
||||
$p->get_tag("font", "/font");
|
||||
|
||||
will find the next start or end tag for a font-element.
|
||||
|
||||
The tag information is returned as an array reference in the same form
|
||||
as for $p->get_token above, but the type code (first element) is
|
||||
missing. A start tag will be returned like this:
|
||||
|
||||
[$tag, $attr, $attrseq, $text]
|
||||
|
||||
The tagname of end tags are prefixed with "/", i.e. end tag is
|
||||
returned like this:
|
||||
|
||||
["/$tag", $text]
|
||||
|
||||
=item $p->get_text
|
||||
|
||||
=item $p->get_text( @endtags )
|
||||
|
||||
This method returns all text found at the current position. It will
|
||||
return a zero length string if the next token is not text. Any
|
||||
entities will be converted to their corresponding character.
|
||||
|
||||
If one or more arguments are given, then we return all text occurring
|
||||
before the first of the specified tags found. For example:
|
||||
|
||||
$p->get_text("p", "br");
|
||||
|
||||
will return the text up to either a paragraph of line break element.
|
||||
|
||||
The text might span tags that should be I<textified>. This is
|
||||
controlled by the $p->{textify} attribute, which is a hash that
|
||||
defines how certain tags can be treated as text. If the name of a
|
||||
start tag matches a key in this hash then this tag is converted to
|
||||
text. The hash value is used to specify which tag attribute to obtain
|
||||
the text from. If this tag attribute is missing, then the upper case
|
||||
name of the tag enclosed in brackets is returned, e.g. "[IMG]". The
|
||||
hash value can also be a subroutine reference. In this case the
|
||||
routine is called with the start tag token content as its argument and
|
||||
the return value is treated as the text.
|
||||
|
||||
The default $p->{textify} value is:
|
||||
|
||||
{img => "alt", applet => "alt"}
|
||||
|
||||
This means that <IMG> and <APPLET> tags are treated as text, and that
|
||||
the text to substitute can be found in the ALT attribute.
|
||||
|
||||
=item $p->get_trimmed_text
|
||||
|
||||
=item $p->get_trimmed_text( @endtags )
|
||||
|
||||
Same as $p->get_text above, but will collapse any sequences of white
|
||||
space to a single space character. Leading and trailing white space is
|
||||
removed.
|
||||
|
||||
=item $p->get_phrase
|
||||
|
||||
This will return all text found at the current position ignoring any
|
||||
phrasal-level tags. Text is extracted until the first non
|
||||
phrasal-level tag. Textification of tags is the same as for
|
||||
get_text(). This method will collapse white space in the same way as
|
||||
get_trimmed_text() does.
|
||||
|
||||
The definition of <i>phrasal-level tags</i> is obtained from the
|
||||
HTML::Tagset module.
|
||||
|
||||
=back
|
||||
|
||||
=head1 EXAMPLES
|
||||
|
||||
This example extracts all links from a document. It will print one
|
||||
line for each link, containing the URL and the textual description
|
||||
between the <A>...</A> tags:
|
||||
|
||||
use HTML::TokeParser;
|
||||
$p = HTML::TokeParser->new(shift||"index.html");
|
||||
|
||||
while (my $token = $p->get_tag("a")) {
|
||||
my $url = $token->[1]{href} || "-";
|
||||
my $text = $p->get_trimmed_text("/a");
|
||||
print "$url\t$text\n";
|
||||
}
|
||||
|
||||
This example extract the <TITLE> from the document:
|
||||
|
||||
use HTML::TokeParser;
|
||||
$p = HTML::TokeParser->new(shift||"index.html");
|
||||
if ($p->get_tag("title")) {
|
||||
my $title = $p->get_trimmed_text;
|
||||
print "Title: $title\n";
|
||||
}
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
L<HTML::PullParser>, L<HTML::Parser>
|
||||
|
||||
=head1 COPYRIGHT
|
||||
|
||||
Copyright 1998-2005 Gisle Aas.
|
||||
|
||||
This library is free software; you can redistribute it and/or
|
||||
modify it under the same terms as Perl itself.
|
||||
|
||||
=cut
|
||||
256
database/perl/vendor/lib/HTML/Tree.pm
vendored
Normal file
256
database/perl/vendor/lib/HTML/Tree.pm
vendored
Normal file
@@ -0,0 +1,256 @@
|
||||
package HTML::Tree;
|
||||
|
||||
# ABSTRACT: build and scan parse-trees of HTML
|
||||
|
||||
# HTML::Tree is basically just a happy alias to HTML::TreeBuilder.
|
||||
|
||||
use warnings;
|
||||
use strict;
|
||||
|
||||
our $VERSION = '5.07'; # VERSION from OurPkgVersion
|
||||
|
||||
use HTML::TreeBuilder ();
|
||||
|
||||
|
||||
sub new {
|
||||
shift;
|
||||
unshift @_, 'HTML::TreeBuilder';
|
||||
goto &HTML::TreeBuilder::new;
|
||||
}
|
||||
|
||||
sub new_from_file {
|
||||
shift;
|
||||
unshift @_, 'HTML::TreeBuilder';
|
||||
goto &HTML::TreeBuilder::new_from_file;
|
||||
}
|
||||
|
||||
sub new_from_content {
|
||||
shift;
|
||||
unshift @_, 'HTML::TreeBuilder';
|
||||
goto &HTML::TreeBuilder::new_from_content;
|
||||
}
|
||||
|
||||
sub new_from_url {
|
||||
shift;
|
||||
unshift @_, 'HTML::TreeBuilder';
|
||||
goto &HTML::TreeBuilder::new_from_url;
|
||||
}
|
||||
|
||||
1;
|
||||
|
||||
__END__
|
||||
|
||||
=pod
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::Tree - build and scan parse-trees of HTML
|
||||
|
||||
=head1 VERSION
|
||||
|
||||
This document describes version 5.07 of
|
||||
HTML::Tree, released August 31, 2017
|
||||
as part of L<HTML-Tree|HTML::Tree>.
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
use HTML::TreeBuilder;
|
||||
my $tree = HTML::TreeBuilder->new();
|
||||
$tree->parse_file($filename);
|
||||
|
||||
# Then do something with the tree, using HTML::Element
|
||||
# methods -- for example:
|
||||
|
||||
$tree->dump
|
||||
|
||||
# Finally:
|
||||
|
||||
$tree->delete;
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
HTML-Tree is a suite of Perl modules for making parse trees out of
|
||||
HTML source. It consists of mainly two modules, whose documentation
|
||||
you should refer to: L<HTML::TreeBuilder|HTML::TreeBuilder>
|
||||
and L<HTML::Element|HTML::Element>.
|
||||
|
||||
HTML::TreeBuilder is the module that builds the parse trees. (It uses
|
||||
HTML::Parser to do the work of breaking the HTML up into tokens.)
|
||||
|
||||
The tree that TreeBuilder builds for you is made up of objects of the
|
||||
class HTML::Element.
|
||||
|
||||
If you find that you do not properly understand the documentation
|
||||
for HTML::TreeBuilder and HTML::Element, it may be because you are
|
||||
unfamiliar with tree-shaped data structures, or with object-oriented
|
||||
modules in general. Sean Burke has written some articles for
|
||||
I<The Perl Journal> (C<www.tpj.com>) that seek to provide that background.
|
||||
The full text of those articles is contained in this distribution, as:
|
||||
|
||||
=over 4
|
||||
|
||||
=item L<HTML::Tree::AboutObjects|HTML::Tree::AboutObjects>
|
||||
|
||||
"User's View of Object-Oriented Modules" from TPJ17.
|
||||
|
||||
=item L<HTML::Tree::AboutTrees|HTML::Tree::AboutTrees>
|
||||
|
||||
"Trees" from TPJ18
|
||||
|
||||
=item L<HTML::Tree::Scanning|HTML::Tree::Scanning>
|
||||
|
||||
"Scanning HTML" from TPJ19
|
||||
|
||||
=back
|
||||
|
||||
Readers already familiar with object-oriented modules and tree-shaped
|
||||
data structures should read just the last article. Readers without
|
||||
that background should read the first, then the second, and then the
|
||||
third.
|
||||
|
||||
=head1 METHODS
|
||||
|
||||
All these methods simply redirect to the corresponding method in
|
||||
HTML::TreeBuilder. It's more efficient to use HTML::TreeBuilder
|
||||
directly, and skip loading HTML::Tree at all.
|
||||
|
||||
=head2 new
|
||||
|
||||
Redirects to L<HTML::TreeBuilder/new>.
|
||||
|
||||
=head2 new_from_file
|
||||
|
||||
Redirects to L<HTML::TreeBuilder/new_from_file>.
|
||||
|
||||
=head2 new_from_content
|
||||
|
||||
Redirects to L<HTML::TreeBuilder/new_from_content>.
|
||||
|
||||
=head2 new_from_url
|
||||
|
||||
Redirects to L<HTML::TreeBuilder/new_from_url>.
|
||||
|
||||
=head1 SUPPORT
|
||||
|
||||
You can find documentation for this module with the perldoc command.
|
||||
|
||||
perldoc HTML::Tree
|
||||
|
||||
You can also look for information at:
|
||||
|
||||
=over 4
|
||||
|
||||
=item * AnnoCPAN: Annotated CPAN documentation
|
||||
|
||||
L<http://annocpan.org/dist/HTML-Tree>
|
||||
|
||||
=item * CPAN Ratings
|
||||
|
||||
L<http://cpanratings.perl.org/d/HTML-Tree>
|
||||
|
||||
=item * RT: CPAN's request tracker
|
||||
|
||||
L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Tree>
|
||||
|
||||
=item * Search CPAN
|
||||
|
||||
L<http://search.cpan.org/dist/HTML-Tree>
|
||||
|
||||
=item * Stack Overflow
|
||||
|
||||
L<http://stackoverflow.com/questions/tagged/html-tree>
|
||||
|
||||
If you have a question about how to use HTML-Tree, Stack Overflow is
|
||||
the place to ask it. Make sure you tag it both C<perl> and C<html-tree>.
|
||||
|
||||
=back
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
L<HTML::TreeBuilder>, L<HTML::Element>, L<HTML::Tagset>,
|
||||
L<HTML::Parser>, L<HTML::DOMbo>
|
||||
|
||||
The book I<Perl & LWP> by Sean M. Burke published by
|
||||
O'Reilly and Associates, 2002. ISBN: 0-596-00178-9
|
||||
|
||||
It has several chapters to do with HTML processing in general,
|
||||
and HTML-Tree specifically. There's more info at:
|
||||
|
||||
http://www.oreilly.com/catalog/perllwp/
|
||||
|
||||
http://www.amazon.com/exec/obidos/ASIN/0596001789
|
||||
|
||||
=head1 SOURCE REPOSITORY
|
||||
|
||||
HTML-Tree is now maintained using Git. The main public repository is
|
||||
L<< https://github.com/kentfredric/HTML-Tree >>.
|
||||
|
||||
The best way to send a patch is to make a pull request there.
|
||||
|
||||
=head1 ACKNOWLEDGEMENTS
|
||||
|
||||
Thanks to Gisle Aas, Sean Burke and Andy Lester for their original work.
|
||||
|
||||
Thanks to Chicago Perl Mongers (http://chicago.pm.org) for their
|
||||
patches submitted to HTML::Tree as part of the Phalanx project
|
||||
(http://qa.perl.org/phalanx).
|
||||
|
||||
Thanks to the following people for additional patches and documentation:
|
||||
Terrence Brannon, Gordon Lack, Chris Madsen and Ricardo Signes.
|
||||
|
||||
=head1 AUTHOR
|
||||
|
||||
Current maintainers:
|
||||
|
||||
=over
|
||||
|
||||
=item * Christopher J. Madsen S<C<< <perl AT cjmweb.net> >>>
|
||||
|
||||
=item * Jeff Fearn S<C<< <jfearn AT cpan.org> >>>
|
||||
|
||||
=back
|
||||
|
||||
Original HTML-Tree author:
|
||||
|
||||
=over
|
||||
|
||||
=item * Gisle Aas
|
||||
|
||||
=back
|
||||
|
||||
Former maintainers:
|
||||
|
||||
=over
|
||||
|
||||
=item * Sean M. Burke
|
||||
|
||||
=item * Andy Lester
|
||||
|
||||
=item * Pete Krawczyk S<C<< <petek AT cpan.org> >>>
|
||||
|
||||
=back
|
||||
|
||||
You can follow or contribute to HTML-Tree's development at
|
||||
L<< https://github.com/kentfredric/HTML-Tree >>.
|
||||
|
||||
=head1 COPYRIGHT AND LICENSE
|
||||
|
||||
Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke,
|
||||
2005 Andy Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn,
|
||||
2012 Christopher J. Madsen.
|
||||
(Except the articles contained in HTML::Tree::AboutObjects,
|
||||
HTML::Tree::AboutTrees, and HTML::Tree::Scanning, which are all
|
||||
copyright 2000 The Perl Journal.)
|
||||
|
||||
Except for those three TPJ articles, the whole HTML-Tree distribution,
|
||||
of which this file is a part, is free software; you can redistribute
|
||||
it and/or modify it under the same terms as Perl itself.
|
||||
|
||||
Those three TPJ articles may be distributed under the same terms as
|
||||
Perl itself.
|
||||
|
||||
The programs in this library are distributed in the hope that they
|
||||
will be useful, but without any warranty; without even the implied
|
||||
warranty of merchantability or fitness for a particular purpose.
|
||||
|
||||
=cut
|
||||
686
database/perl/vendor/lib/HTML/Tree/AboutObjects.pod
vendored
Normal file
686
database/perl/vendor/lib/HTML/Tree/AboutObjects.pod
vendored
Normal file
@@ -0,0 +1,686 @@
|
||||
|
||||
#Time-stamp: "2001-02-23 20:07:25 MST" -*-Text-*-
|
||||
# This document contains text in Perl "POD" format.
|
||||
# Use a POD viewer like perldoc or perlman to render it.
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::Tree::AboutObjects -- article: "User's View of Object-Oriented Modules"
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
# This an article, not a module.
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
The following article by Sean M. Burke first appeared in I<The Perl
|
||||
Journal> #17 and is copyright 2000 The Perl Journal. It appears
|
||||
courtesy of Jon Orwant and The Perl Journal. This document may be
|
||||
distributed under the same terms as Perl itself.
|
||||
|
||||
=head1 A User's View of Object-Oriented Modules
|
||||
|
||||
-- Sean M. Burke
|
||||
|
||||
The first time that most Perl programmers run into object-oriented
|
||||
programming when they need to use a module whose interface is
|
||||
object-oriented. This is often a mystifying experience, since talk of
|
||||
"methods" and "constructors" is unintelligible to programmers who
|
||||
thought that functions and variables was all there was to worry about.
|
||||
|
||||
Articles and books that explain object-oriented programming (OOP), do so
|
||||
in terms of how to program that way. That's understandable, and if you
|
||||
learn to write object-oriented code of your own, you'd find it easy to
|
||||
use object-oriented code that others write. But this approach is the
|
||||
I<long> way around for people whose immediate goal is just to use
|
||||
existing object-oriented modules, but who don't yet want to know all the
|
||||
gory details of having to write such modules for themselves.
|
||||
|
||||
This article is for those programmers -- programmers who want to know
|
||||
about objects from the perspective of using object-oriented modules.
|
||||
|
||||
=head2 Modules and Their Functional Interfaces
|
||||
|
||||
Modules are the main way that Perl provides for bundling up code for
|
||||
later use by yourself or others. As I'm sure you can't help noticing
|
||||
from reading
|
||||
I<The Perl Journal>, CPAN (the Comprehensive Perl Archive
|
||||
Network) is the repository for modules (or groups of modules) that
|
||||
others have written, to do anything from composing music to accessing
|
||||
Web pages. A good deal of those modules even come with every
|
||||
installation of Perl.
|
||||
|
||||
One module that you may have used before, and which is fairly typical in
|
||||
its interface, is Text::Wrap. It comes with Perl, so you don't even
|
||||
need to install it from CPAN. You use it in a program of yours, by
|
||||
having your program code say early on:
|
||||
|
||||
use Text::Wrap;
|
||||
|
||||
and after that, you can access a function called C<wrap>, which inserts
|
||||
line-breaks in text that you feed it, so that the text will be wrapped to
|
||||
seventy-two (or however many) columns.
|
||||
|
||||
The way this C<use Text::Wrap> business works is that the module
|
||||
Text::Wrap exists as a file "Text/Wrap.pm" somewhere in one of your
|
||||
library directories. That file contains Perl code...
|
||||
|
||||
=over
|
||||
|
||||
Footnote: And mixed in with the Perl code, there's documentation, which
|
||||
is what you read with "perldoc Text::Wrap". The perldoc program simply
|
||||
ignores the code and formats the documentation text, whereas "use
|
||||
Text::Wrap" loads and runs the code while ignoring the documentation.
|
||||
|
||||
=back
|
||||
|
||||
...which, among other things, defines a function called C<Text::Wrap::wrap>,
|
||||
and then C<exports> that function, which means that when you say C<wrap>
|
||||
after having said "use Text::Wrap", you'll be actually calling the
|
||||
C<Text::Wrap::wrap> function. Some modules don't export their
|
||||
functions, so you have to call them by their full name, like
|
||||
C<Text::Wrap::wrap(...parameters...)>.
|
||||
|
||||
Regardless of whether the typical module exports the functions it
|
||||
provides, a module is basically just a container for chunks of code that
|
||||
do useful things. The way the module allows for you to interact with
|
||||
it, is its I<interface>. And when, like with Text::Wrap, its interface
|
||||
consists of functions, the module is said to have a B<functional
|
||||
interface>.
|
||||
|
||||
=over
|
||||
|
||||
Footnote: the term "function" (and therefore "functionI<al>") has
|
||||
various senses. I'm using the term here in its broadest sense, to
|
||||
refer to routines -- bits of code that are called by some name and
|
||||
which take parameters and return some value.
|
||||
|
||||
=back
|
||||
|
||||
Using modules with functional interfaces is straightforward -- instead
|
||||
of defining your own "wrap" function with C<sub wrap { ... }>, you
|
||||
entrust "use Text::Wrap" to do that for you, along with whatever other
|
||||
functions its defines and exports, according to the module's
|
||||
documentation. Without too much bother, you can even write your own
|
||||
modules to contain your frequently used functions; I suggest having a look at
|
||||
the C<perlmod> man page for more leads on doing this.
|
||||
|
||||
=head2 Modules with Object-Oriented Interfaces
|
||||
|
||||
So suppose that one day you want to write a program that will automate
|
||||
the process of C<ftp>ing a bunch of files from one server down to your
|
||||
local machine, and then off to another server.
|
||||
|
||||
A quick browse through search.cpan.org turns up the module "Net::FTP",
|
||||
which you can download and install it using normal installation
|
||||
instructions (unless your sysadmin has already installed it, as many
|
||||
have).
|
||||
|
||||
Like Text::Wrap or any other module with a familiarly functional
|
||||
interface, you start off using Net::FTP in your program by saying:
|
||||
|
||||
use Net::FTP;
|
||||
|
||||
However, that's where the similarity ends. The first hint of
|
||||
difference is that the documentation for Net::FTP refers to it as a
|
||||
B<class>. A class is a kind of module, but one that has an
|
||||
object-oriented interface.
|
||||
|
||||
Whereas modules like Text::Wrap
|
||||
provide bits of useful code as I<functions>, to be called like
|
||||
C<function(...parameters...)> or like
|
||||
C<PackageName::function(...parameters...)>, Net::FTP and other modules
|
||||
with object-oriented interfaces provide B<methods>. Methods are sort of
|
||||
like functions in that they have a name and parameters; but methods
|
||||
look different, and are different, because you have to call them with a
|
||||
syntax that has a class name or an object as a special argument. I'll
|
||||
explain the syntax for method calls, and then later explain what they
|
||||
all mean.
|
||||
|
||||
Some methods are meant to be called as B<class methods>, with the class
|
||||
name (same as the module name) as a special argument. Class methods
|
||||
look like this:
|
||||
|
||||
ClassName->methodname(parameter1, parameter2, ...)
|
||||
ClassName->methodname() # if no parameters
|
||||
ClassName->methodname # same as above
|
||||
|
||||
which you will sometimes see written:
|
||||
|
||||
methodname ClassName (parameter1, parameter2, ...)
|
||||
methodname ClassName # if no parameters
|
||||
|
||||
Basically all class methods are for making new objects, and methods that
|
||||
make objects are called "B<constructors>" (and the process of making them
|
||||
is called "constructing" or "instantiating"). Constructor methods
|
||||
typically have the name "new", or something including "new"
|
||||
("new_from_file", etc.); but they can conceivably be named
|
||||
anything -- DBI's constructor method is named "connect", for example.
|
||||
|
||||
The object that a constructor method returns is
|
||||
typically captured in a scalar variable:
|
||||
|
||||
$object = ClassName->new(param1, param2...);
|
||||
|
||||
Once you have an object (more later on exactly what that is), you can
|
||||
use the other kind of method call syntax, the syntax for B<object method>
|
||||
calls. Calling object methods is just like class methods, except
|
||||
that instead of the ClassName as the special argument,
|
||||
you use an expression that yeilds an "object". Usually this is
|
||||
just a scalar variable that you earlier captured the
|
||||
output of the constructor in. Object method calls look like this:
|
||||
|
||||
$object->methodname(parameter1, parameter2, ...);
|
||||
$object->methodname() # if no parameters
|
||||
$object->methodname # same as above
|
||||
|
||||
which is occasionally written as:
|
||||
|
||||
methodname $object (parameter1, parameter2, ...)
|
||||
methodname $object # if no parameters
|
||||
|
||||
Examples of method calls are:
|
||||
|
||||
my $session1 = Net::FTP->new("ftp.myhost.com");
|
||||
# Calls a class method "new", from class Net::FTP,
|
||||
# with the single parameter "ftp.myhost.com",
|
||||
# and saves the return value (which is, as usual,
|
||||
# an object), in $session1.
|
||||
# Could also be written:
|
||||
# new Net::FTP('ftp.myhost.com')
|
||||
$session1->login("sburke","aoeuaoeu")
|
||||
|| die "failed to login!\n";
|
||||
# calling the object method "login"
|
||||
print "Dir:\n", $session1->dir(), "\n";
|
||||
$session1->quit;
|
||||
# same as $session1->quit()
|
||||
print "Done\n";
|
||||
exit;
|
||||
|
||||
Incidentally, I suggest always using the syntaxes with parentheses and
|
||||
"->" in them,
|
||||
|
||||
=over
|
||||
|
||||
Footnote: the character-pair "->" is supposed to look like an
|
||||
arrow, not "negative greater-than"!
|
||||
|
||||
=back
|
||||
|
||||
and avoiding the syntaxes that start out "methodname $object" or
|
||||
"methodname ModuleName". When everything's going right, they all mean
|
||||
the same thing as the "->" variants, but the syntax with "->" is more
|
||||
visually distinct from function calls, as well as being immune to some
|
||||
kinds of rare but puzzling ambiguities that can arise when you're trying
|
||||
to call methods that have the same name as subroutines you've defined.
|
||||
|
||||
But, syntactic alternatives aside, all this talk of constructing objects
|
||||
and object methods begs the question -- what I<is> an object? There are
|
||||
several angles to this question that the rest of this article will
|
||||
answer in turn: what can you do with objects? what's in an object?
|
||||
what's an object value? and why do some modules use objects at all?
|
||||
|
||||
=head2 What Can You Do with Objects?
|
||||
|
||||
You've seen that you can make objects, and call object methods with
|
||||
them. But what are object methods for? The answer depends on the class:
|
||||
|
||||
A Net::FTP object represents a session between your computer and an FTP
|
||||
server. So the methods you call on a Net::FTP object are for doing
|
||||
whatever you'd need to do across an FTP connection. You make the
|
||||
session and log in:
|
||||
|
||||
my $session = Net::FTP->new('ftp.aol.com');
|
||||
die "Couldn't connect!" unless defined $session;
|
||||
# The class method call to "new" will return
|
||||
# the new object if it goes OK, otherwise it
|
||||
# will return undef.
|
||||
|
||||
$session->login('sburke', 'p@ssw3rD')
|
||||
|| die "Did I change my password again?";
|
||||
# The object method "login" will give a true
|
||||
# return value if actually logs in, otherwise
|
||||
# it'll return false.
|
||||
|
||||
You can use the session object to change directory on that session:
|
||||
|
||||
$session->cwd("/home/sburke/public_html")
|
||||
|| die "Hey, that was REALLY supposed to work!";
|
||||
# if the cwd fails, it'll return false
|
||||
|
||||
...get files from the machine at the other end of the session...
|
||||
|
||||
foreach my $f ('log_report_ua.txt', 'log_report_dom.txt',
|
||||
'log_report_browsers.txt')
|
||||
{
|
||||
$session->get($f) || warn "Getting $f failed!"
|
||||
};
|
||||
|
||||
...and plenty else, ending finally with closing the connection:
|
||||
|
||||
$session->quit();
|
||||
|
||||
In short, object methods are for doing things related to (or with)
|
||||
whatever the object represents. For FTP sessions, it's about sending
|
||||
commands to the server at the other end of the connection, and that's
|
||||
about it -- there, methods are for doing something to the world outside
|
||||
the object, and the objects is just something that specifies what bit
|
||||
of the world (well, what FTP session) to act upon.
|
||||
|
||||
With most other classes, however, the object itself stores some kind of
|
||||
information, and it typically makes no sense to do things with such an
|
||||
object without considering the data that's in the object.
|
||||
|
||||
=head2 What's I<in> an Object?
|
||||
|
||||
An object is (with rare exceptions) a data structure containing a
|
||||
bunch of attributes, each of which has a value, as well as a name
|
||||
that you use when you
|
||||
read or set the attribute's value. Some of the object's attributes are
|
||||
private, meaning you'll never see them documented because they're not
|
||||
for you to read or write; but most of the object's documented attributes
|
||||
are at least readable, and usually writeable, by you. Net::FTP objects
|
||||
are a bit thin on attributes, so we'll use objects from the class
|
||||
Business::US_Amort for this example. Business::US_Amort is a very
|
||||
simple class (available from CPAN) that I wrote for making calculations
|
||||
to do with loans (specifically, amortization, using US-style
|
||||
algorithms).
|
||||
|
||||
An object of the class Business::US_Amort represents a loan with
|
||||
particular parameters, i.e., attributes. The most basic attributes of a
|
||||
"loan object" are its interest rate, its principal (how much money it's
|
||||
for), and it's term (how long it'll take to repay). You need to set
|
||||
these attributes before anything else can be done with the object. The
|
||||
way to get at those attributes for loan objects is just like the
|
||||
way to get at attributes for any class's objects: through accessors.
|
||||
An B<accessor> is simply any method that accesses (whether reading or
|
||||
writing, AKA getting or putting) some attribute in the given object.
|
||||
Moreover, accessors are the B<only> way that you can change
|
||||
an object's attributes. (If a module's documentation wants you to
|
||||
know about any other way, it'll tell you.)
|
||||
|
||||
Usually, for simplicity's sake, an accessor is named after the attribute
|
||||
it reads or writes. With Business::US_Amort objects, the accessors you
|
||||
need to use first are C<principal>, C<interest_rate>, and C<term>.
|
||||
Then, with at least those attributes set, you can call the C<run> method
|
||||
to figure out several things about the loan. Then you can call various
|
||||
accessors, like C<total_paid_toward_interest>, to read the results:
|
||||
|
||||
use Business::US_Amort;
|
||||
my $loan = Business::US_Amort->new;
|
||||
# Set the necessary attributes:
|
||||
$loan->principal(123654);
|
||||
$loan->interest_rate(9.25);
|
||||
$loan->term(20); # twenty years
|
||||
|
||||
# NOW we know enough to calculate:
|
||||
$loan->run;
|
||||
|
||||
# And see what came of that:
|
||||
print
|
||||
"Total paid toward interest: A WHOPPING ",
|
||||
$loan->total_paid_interest, "!!\n";
|
||||
|
||||
This illustrates a convention that's common with accessors: calling the
|
||||
accessor with no arguments (as with $loan->total_paid_interest) usually
|
||||
means to read the value of that attribute, but providing a value (as
|
||||
with $loan->term(20)) means you want that attribute to be set to that
|
||||
value. This stands to reason: why would you be providing a value, if
|
||||
not to set the attribute to that value?
|
||||
|
||||
Although a loan's term, principal, and interest rates are all single
|
||||
numeric values, an objects values can any kind of scalar, or an array,
|
||||
or even a hash. Moreover, an attribute's value(s) can be objects
|
||||
themselves. For example, consider MIDI files (as I wrote about in
|
||||
TPJ#13): a MIDI file usually consists of several tracks. A MIDI file is
|
||||
complex enough to merit being an object with attributes like its overall
|
||||
tempo, the file-format variant it's in, and the list of instrument
|
||||
tracks in the file. But tracks themselves are complex enough to be
|
||||
objects too, with attributes like their track-type, a list of MIDI
|
||||
commands if they're a MIDI track, or raw data if they're not. So I
|
||||
ended up writing the MIDI modules so that the "tracks" attribute of a
|
||||
MIDI::Opus object is an array of objects from the class MIDI::Track.
|
||||
This may seem like a runaround -- you ask what's in one object, and get
|
||||
I<another> object, or several! But in this case, it exactly reflects
|
||||
what the module is for -- MIDI files contain MIDI tracks, which then
|
||||
contain data.
|
||||
|
||||
=head2 What is an Object Value?
|
||||
|
||||
When you call a constructor like Net::FTP->new(I<hostname>), you get
|
||||
back an object value, a value you can later use, in combination with a
|
||||
method name, to call object methods.
|
||||
|
||||
Now, so far we've been pretending, in the above examples, that the
|
||||
variables $session or $loan I<are> the objects you're dealing with.
|
||||
This idea is innocuous up to a point, but it's really a misconception
|
||||
that will, at best, limit you in what you know how to do. The reality
|
||||
is not that the variables $session or $query are objects; it's a little
|
||||
more indirect -- they I<hold> values that symbolize objects. The kind of
|
||||
value that $session or $query hold is what I'm calling an object value.
|
||||
|
||||
To understand what kind of value this is, first think about the other
|
||||
kinds of scalar values you know about: The first two scalar values you
|
||||
probably ever ran into in Perl are B<numbers> and B<strings>, which you
|
||||
learned (or just assumed) will usually turn into each other on demand;
|
||||
that is, the three-character string "2.5" can become the quantity two
|
||||
and a half, and vice versa. Then, especially if you started using
|
||||
C<perl -w> early on, you learned about the B<undefined value>, which can
|
||||
turn into 0 if you treat it as a number, or the empty-string if you
|
||||
treat it as a string.
|
||||
|
||||
=over
|
||||
|
||||
Footnote: You may I<also> have been learning about references, in which
|
||||
case you're ready to hear that object values are just a kind of
|
||||
reference, except that they reflect the class that created thing they point
|
||||
to, instead of merely being a plain old array reference, hash reference,
|
||||
etc. I<If> this makes makes sense to you, and you want to know more
|
||||
about how objects are implemented in Perl, have a look at the
|
||||
C<perltoot> man page.
|
||||
|
||||
=back
|
||||
|
||||
And now you're learning about B<object values>. An object value is a
|
||||
value that points to a data structure somewhere in memory, which is
|
||||
where all the attributes for this object are stored. That data
|
||||
structure as a whole belongs to a class (probably the one you named in
|
||||
the constructor method, like ClassName->new), so that the object value
|
||||
can be used as part of object method calls.
|
||||
|
||||
If you want to actually I<see> what an object value is, you might try
|
||||
just saying "print $object". That'll get you something like this:
|
||||
|
||||
Net::FTP=GLOB(0x20154240)
|
||||
|
||||
or
|
||||
|
||||
Business::US_Amort=HASH(0x15424020)
|
||||
|
||||
That's not very helpful if you wanted to really get at the object's
|
||||
insides, but that's because the object value is only a symbol for the
|
||||
object. This may all sound very abstruse and metaphysical, so a
|
||||
real-world allegory might be very helpful:
|
||||
|
||||
=over
|
||||
|
||||
You get an advertisement in the mail saying that you have been
|
||||
(im)personally selected to have the rare privilege of applying for a
|
||||
credit card. For whatever reason, I<this> offer sounds good to you, so you
|
||||
fill out the form and mail it back to the credit card company. They
|
||||
gleefully approve the application and create your account, and send you
|
||||
a card with a number on it.
|
||||
|
||||
Now, you can do things with the number on that card -- clerks at stores
|
||||
can ring up things you want to buy, and charge your account by keying in
|
||||
the number on the card. You can pay for things you order online by
|
||||
punching in the card number as part of your online order. You can pay
|
||||
off part of the account by sending the credit card people some of your
|
||||
money (well, a check) with some note (usually the pre-printed slip)
|
||||
that has the card number for the account you want to pay toward. And you
|
||||
should be able to call the credit card company's computer and ask it
|
||||
things about the card, like its balance, its credit limit, its APR, and
|
||||
maybe an itemization of recent purchases ad payments.
|
||||
|
||||
Now, what you're I<really> doing is manipulating a credit card
|
||||
I<account>, a completely abstract entity with some data attached to it
|
||||
(balance, APR, etc). But for ease of access, you have a credit card
|
||||
I<number> that is a symbol for that account. Now, that symbol is just a
|
||||
bunch of digits, and the number is effectively meaningless and useless
|
||||
in and of itself -- but in the appropriate context, it's understood to
|
||||
I<mean> the credit card account you're accessing.
|
||||
|
||||
=back
|
||||
|
||||
This is exactly the relationship between objects and object values, and
|
||||
from this analogy, several facts about object values are a bit more
|
||||
explicable:
|
||||
|
||||
* An object value does nothing in and of itself, but it's useful when
|
||||
you use it in the context of an $object->method call, the same way that
|
||||
a card number is useful in the context of some operation dealing with a
|
||||
card account.
|
||||
|
||||
Moreover, several copies of the same object value all refer to the same
|
||||
object, the same way that making several copies of your card number
|
||||
won't change the fact that they all still refer to the same single
|
||||
account (this is true whether you're "copying" the number by just
|
||||
writing it down on different slips of paper, or whether you go to the
|
||||
trouble of forging exact replicas of your own plastic credit card). That's
|
||||
why this:
|
||||
|
||||
$x = Net::FTP->new("ftp.aol.com");
|
||||
$x->login("sburke", "aoeuaoeu");
|
||||
|
||||
does the same thing as this:
|
||||
|
||||
$x = Net::FTP->new("ftp.aol.com");
|
||||
$y = $x;
|
||||
$z = $y;
|
||||
$z->login("sburke", "aoeuaoeu");
|
||||
|
||||
That is, $z and $y and $x are three different I<slots> for values,
|
||||
but what's in those slots are all object values pointing to the same
|
||||
object -- you don't have three different FTP connections, just three
|
||||
variables with values pointing to the some single FTP connection.
|
||||
|
||||
* You can't tell much of anything about the object just by looking at
|
||||
the object value, any more than you can see your credit account balance
|
||||
by holding the plastic card up to the light, or by adding up the digits
|
||||
in your credit card number.
|
||||
|
||||
* You can't just make up your own object values and have them work --
|
||||
they can come only from constructor methods of the appropriate class.
|
||||
Similarly, you get a credit card number I<only> by having a bank approve
|
||||
your application for a credit card account -- at which point I<they>
|
||||
let I<you> know what the number of your new card is.
|
||||
|
||||
Now, there's even more to the fact that you can't just make up your own
|
||||
object value: even though you can print an object value and get a string
|
||||
like "Net::FTP=GLOB(0x20154240)", that string is just a
|
||||
I<representation> of an object value.
|
||||
|
||||
Internally, an object value has a basically different type from a
|
||||
string, or a number, or the undefined value -- if $x holds a real
|
||||
string, then that value's slot in memory says "this is a value of type
|
||||
I<string>, and its characters are...", whereas if it's an object value,
|
||||
the value's slot in memory says, "this is a value of type I<reference>,
|
||||
and the location in memory that it points to is..." (and by looking at
|
||||
what's at that location, Perl can tell the class of what's there).
|
||||
|
||||
Perl programmers typically don't have to think about all these details
|
||||
of Perl's internals. Many other languages force you to be more
|
||||
conscious of the differences between all of these (and also between
|
||||
types of numbers, which are stored differently depending on their size
|
||||
and whether they have fractional parts). But Perl does its best to
|
||||
hide the different types of scalars from you -- it turns numbers into
|
||||
strings and back as needed, and takes the string or number
|
||||
representation of undef or of object values as needed. However, you
|
||||
can't go from a string representation of an object value, back to an
|
||||
object value. And that's why this doesn't work:
|
||||
|
||||
$x = Net::FTP->new('ftp.aol.com');
|
||||
$y = Net::FTP->new('ftp.netcom.com');
|
||||
$z = Net::FTP->new('ftp.qualcomm.com');
|
||||
$all = join(' ', $x,$y,$z); # !!!
|
||||
...later...
|
||||
($aol, $netcom, $qualcomm) = split(' ', $all); # !!!
|
||||
$aol->login("sburke", "aoeuaoeu");
|
||||
$netcom->login("sburke", "qjkxqjkx");
|
||||
$qualcomm->login("smb", "dhtndhtn");
|
||||
|
||||
This fails because $aol ends up holding merely the B<string representation>
|
||||
of the object value from $x, not the object value itself -- when
|
||||
C<join> tried to join the characters of the "strings" $x, $y, and $z,
|
||||
Perl saw that they weren't strings at all, so it gave C<join> their
|
||||
string representations.
|
||||
|
||||
Unfortunately, this distinction between object values and their string
|
||||
representations doesn't really fit into the analogy of credit card
|
||||
numbers, because credit card numbers really I<are> numbers -- even
|
||||
thought they don't express any meaningful quantity, if you stored them
|
||||
in a database as a quantity (as opposed to just an ASCII string),
|
||||
that wouldn't stop them from being valid as credit card numbers.
|
||||
|
||||
This may seem rather academic, but there's there's two common mistakes
|
||||
programmers new to objects often make, which make sense only in terms of
|
||||
the distinction between object values and their string representations:
|
||||
|
||||
The first common error involves forgetting (or never having known in the
|
||||
first place) that when you go to use a value as a hash key, Perl uses
|
||||
the string representation of that value. When you want to use the
|
||||
numeric value two and a half as a key, Perl turns it into the
|
||||
three-character string "2.5". But if you then want to use that string
|
||||
as a number, Perl will treat it as meaning two and a half, so you're
|
||||
usually none the wiser that Perl converted the number to a string and
|
||||
back. But recall that Perl can't turn strings back into objects -- so
|
||||
if you tried to use a Net::FTP object value as a hash key, Perl actually
|
||||
used its string representation, like "Net::FTP=GLOB(0x20154240)", but
|
||||
that string is unusable as an object value. (Incidentally, there's
|
||||
a module Tie::RefHash that implements hashes that I<do> let you use
|
||||
real object-values as keys.)
|
||||
|
||||
The second common error with object values is in
|
||||
trying to save an object value to disk (whether printing it to a
|
||||
file, or storing it in a conventional database file). All you'll get is the
|
||||
string, which will be useless.
|
||||
|
||||
When you want to save an object and restore it later, you may find that
|
||||
the object's class already provides a method specifically for this. For
|
||||
example, MIDI::Opus provides methods for writing an object to disk as a
|
||||
standard MIDI file. The file can later be read back into memory by
|
||||
a MIDI::Opus constructor method, which will return a new MIDI::Opus
|
||||
object representing whatever file you tell it to read into memory.
|
||||
Similar methods are available with, for example, classes that
|
||||
manipulate graphic images and can save them to files, which can be read
|
||||
back later.
|
||||
|
||||
But some classes, like Business::US_Amort, provide no such methods for
|
||||
storing an object in a file. When this is the case, you can try
|
||||
using any of the Data::Dumper, Storable, or FreezeThaw modules. Using
|
||||
these will be unproblematic for objects of most classes, but it may run
|
||||
into limitations with others. For example, a Business::US_Amort
|
||||
object can be turned into a string with Data::Dumper, and that string
|
||||
written to a file. When it's restored later, its attributes will be
|
||||
accessible as normal. But in the unlikely case that the loan object was
|
||||
saved in mid-calculation, the calculation may not be resumable. This is
|
||||
because of the way that that I<particular> class does its calculations,
|
||||
but similar limitations may occur with objects from other classses.
|
||||
|
||||
But often, even I<wanting> to save an object is basically wrong -- what would
|
||||
saving an ftp I<session> even mean? Saving the hostname, username, and
|
||||
password? current directory on both machines? the local TCP/IP port
|
||||
number? In the case of "saving" a Net::FTP object, you're better off
|
||||
just saving whatever details you actually need for your own purposes,
|
||||
so that you can make a new object later and just set those values for it.
|
||||
|
||||
=head2 So Why Do Some Modules Use Objects?
|
||||
|
||||
All these details of using objects are definitely enough to make you
|
||||
wonder -- is it worth the bother? If you're a module author, writing
|
||||
your module with an object-oriented interface restricts the audience of
|
||||
potential users to those who understand the basic concepts of objects
|
||||
and object values, as well as Perl's syntax for calling methods. Why
|
||||
complicate things by having an object-oriented interface?
|
||||
|
||||
A somewhat esoteric answer is that a module has an object-oriented
|
||||
interface because the module's insides are written in an
|
||||
object-oriented style. This article is about the basics of
|
||||
object-oriented I<interfaces>, and it'd be going far afield to explain
|
||||
what object-oriented I<design> is. But the short story is that
|
||||
object-oriented design is just one way of attacking messy problems.
|
||||
It's a way that many programmers find very helpful (and which others
|
||||
happen to find to be far more of a hassle than it's worth,
|
||||
incidentally), and it just happens to show up for you, the module user,
|
||||
as merely the style of interface.
|
||||
|
||||
But a simpler answer is that a functional interface is sometimes a
|
||||
hindrance, because it limits the number of things you can do at once --
|
||||
limiting it, in fact, to one. For many problems that some modules are
|
||||
meant to solve, doing without an object-oriented interface would be like
|
||||
wishing that Perl didn't use filehandles. The ideas are rather simpler
|
||||
-- just imagine that Perl let you access files, but I<only> one at a
|
||||
time, with code like:
|
||||
|
||||
open("foo.txt") || die "Can't open foo.txt: $!";
|
||||
while(readline) {
|
||||
print $_ if /bar/;
|
||||
}
|
||||
close;
|
||||
|
||||
That hypothetical kind of Perl would be simpler, by doing without
|
||||
filehandles. But you'd be out of luck if you wanted to read from
|
||||
one file while reading from another, or read from two and print to a
|
||||
third.
|
||||
|
||||
In the same way, a functional FTP module would be fine for just
|
||||
uploading files to one server at a time, but it wouldn't allow you to
|
||||
easily write programs that make need to use I<several> simultaneous
|
||||
sessions (like "look at server A and server B, and if A has a file
|
||||
called X.dat, then download it locally and then upload it to server B --
|
||||
except if B has a file called Y.dat, in which case do it the other way
|
||||
around").
|
||||
|
||||
Some kinds of problems that modules solve just lend themselves to an
|
||||
object-oriented interface. For those kinds of tasks, a functional
|
||||
interface would be more familiar, but less powerful. Learning to use
|
||||
object-oriented modules' interfaces does require becoming comfortable
|
||||
with the concepts from this article. But in the end it will allow you
|
||||
to use a broader range of modules and, with them, to write programs
|
||||
that can do more.
|
||||
|
||||
B<[end body of article]>
|
||||
|
||||
=head2 [Author Credit]
|
||||
|
||||
Sean M. Burke has contributed several modules to CPAN, about half of
|
||||
them object-oriented.
|
||||
|
||||
[The next section should be in a greybox:]
|
||||
|
||||
=head2 The Gory Details
|
||||
|
||||
For sake of clarity of explanation, I had to oversimplify some of the
|
||||
facts about objects. Here's a few of the gorier details:
|
||||
|
||||
* Every example I gave of a constructor was a class method. But object
|
||||
methods can be constructors, too, if the class was written to work that
|
||||
way: $new = $old->copy, $node_y = $node_x->new_subnode, or the like.
|
||||
|
||||
* I've given the impression that there's two kinds of methods: object
|
||||
methods and class methods. In fact, the same method can be both,
|
||||
because it's not the kind of method it is, but the kind of calls it's
|
||||
written to accept -- calls that pass an object, or calls that pass a
|
||||
class-name.
|
||||
|
||||
* The term "object value" isn't something you'll find used much anywhere
|
||||
else. It's just my shorthand for what would properly be called an
|
||||
"object reference" or "reference to a blessed item". In fact, people
|
||||
usually say "object" when they properly mean a reference to that object.
|
||||
|
||||
* I mentioned creating objects with I<con>structors, but I didn't
|
||||
mention destroying them with I<de>structor -- a destructor is a kind of
|
||||
method that you call to tidy up the object once you're done with it, and
|
||||
want it to neatly go away (close connections, delete temporary files,
|
||||
free up memory, etc). But because of the way Perl handles memory,
|
||||
most modules won't require the user to know about destructors.
|
||||
|
||||
* I said that class method syntax has to have the class name, as in
|
||||
$session = B<Net::FTP>->new($host). Actually, you can instead use any
|
||||
expression that returns a class name: $ftp_class = 'Net::FTP'; $session
|
||||
= B<$ftp_class>->new($host). Moreover, instead of the method name for
|
||||
object- or class-method calls, you can use a scalar holding the method
|
||||
name: $foo->B<$method>($host). But, in practice, these syntaxes are
|
||||
rarely useful.
|
||||
|
||||
And finally, to learn about objects from the perspective of writing
|
||||
your own classes, see the C<perltoot> documentation,
|
||||
or Damian Conway's exhaustive and clear book I<Object Oriented Perl>
|
||||
(Manning Publications 1999, ISBN 1-884777-79-1).
|
||||
|
||||
=head1 BACK
|
||||
|
||||
Return to the L<HTML::Tree|HTML::Tree> docs.
|
||||
|
||||
=cut
|
||||
|
||||
1369
database/perl/vendor/lib/HTML/Tree/AboutTrees.pod
vendored
Normal file
1369
database/perl/vendor/lib/HTML/Tree/AboutTrees.pod
vendored
Normal file
File diff suppressed because it is too large
Load Diff
714
database/perl/vendor/lib/HTML/Tree/Scanning.pod
vendored
Normal file
714
database/perl/vendor/lib/HTML/Tree/Scanning.pod
vendored
Normal file
@@ -0,0 +1,714 @@
|
||||
|
||||
#Time-stamp: "2001-03-10 23:19:11 MST" -*-Text-*-
|
||||
# This document contains text in Perl "POD" format.
|
||||
# Use a POD viewer like perldoc or perlman to render it.
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::Tree::Scanning -- article: "Scanning HTML"
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
# This an article, not a module.
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
The following article by Sean M. Burke first appeared in I<The Perl
|
||||
Journal> #19 and is copyright 2000 The Perl Journal. It appears
|
||||
courtesy of Jon Orwant and The Perl Journal. This document may be
|
||||
distributed under the same terms as Perl itself.
|
||||
|
||||
(Note that this is discussed in chapters 6 through 10 of the
|
||||
book I<Perl and LWP> L<http://lwp.interglacial.com/> which
|
||||
was written after the following documentation, and which is
|
||||
available free online.)
|
||||
|
||||
=head1 Scanning HTML
|
||||
|
||||
-- Sean M. Burke
|
||||
|
||||
In I<The Perl Journal> issue 17, Ken MacFarlane's article "Parsing
|
||||
HTML with HTML::Parser" describes how the HTML::Parser module scans
|
||||
HTML source as a stream of start-tags, end-tags, text, comments, etc.
|
||||
In TPJ #18, my "Trees" article kicked around the idea of tree-shaped
|
||||
data structures. Now I'll try to tie it together, in a discussion of
|
||||
HTML trees.
|
||||
|
||||
The CPAN module HTML::TreeBuilder takes the
|
||||
tags that HTML::Parser picks out, and builds a parse tree -- a
|
||||
tree-shaped network of objects...
|
||||
|
||||
=over
|
||||
|
||||
Footnote:
|
||||
And if you need a quick explanation of objects, see my TPJ17 article "A
|
||||
User's View of Object-Oriented Modules"; or go whole hog and get Damian
|
||||
Conway's excellent book I<Object-Oriented Perl>, from Manning
|
||||
Publications.
|
||||
|
||||
=back
|
||||
|
||||
...representing the structured content of the HTML document. And once
|
||||
the document is parsed as a tree, you'll find the common tasks
|
||||
of extracting data from that HTML document/tree to be quite
|
||||
straightforward.
|
||||
|
||||
=head2 HTML::Parser, HTML::TreeBuilder, and HTML::Element
|
||||
|
||||
You use HTML::TreeBuilder to make a parse tree out of an HTML source
|
||||
file, by simply saying:
|
||||
|
||||
use HTML::TreeBuilder;
|
||||
my $tree = HTML::TreeBuilder->new();
|
||||
$tree->parse_file('foo.html');
|
||||
|
||||
and then C<$tree> contains a parse tree built from the HTML source from
|
||||
the file "foo.html". The way this parse tree is represented is with a
|
||||
network of objects -- C<$tree> is the root, an element with tag-name
|
||||
"html", and its children typically include a "head" and "body" element,
|
||||
and so on. Elements in the tree are objects of the class
|
||||
HTML::Element.
|
||||
|
||||
So, if you take this source:
|
||||
|
||||
<html><head><title>Doc 1</title></head>
|
||||
<body>
|
||||
Stuff <hr> 2000-08-17
|
||||
</body></html>
|
||||
|
||||
and feed it to HTML::TreeBuilder, it'll return a tree of objects that
|
||||
looks like this:
|
||||
|
||||
html
|
||||
/ \
|
||||
head body
|
||||
/ / | \
|
||||
title "Stuff" hr "2000-08-17"
|
||||
|
|
||||
"Doc 1"
|
||||
|
||||
This is a pretty simple document, but if it were any more complex,
|
||||
it'd be a bit hard to draw in that style, since it's sprawl left and
|
||||
right. The same tree can be represented a bit more easily sideways,
|
||||
with indenting:
|
||||
|
||||
. html
|
||||
. head
|
||||
. title
|
||||
. "Doc 1"
|
||||
. body
|
||||
. "Stuff"
|
||||
. hr
|
||||
. "2000-08-17"
|
||||
|
||||
Either way expresses the same structure. In that structure, the root
|
||||
node is an object of the class HTML::Element
|
||||
|
||||
=over
|
||||
|
||||
Footnote:
|
||||
Well actually, the root is of the class HTML::TreeBuilder, but that's
|
||||
just a subclass of HTML::Element, plus the few extra methods like
|
||||
C<parse_file> that elaborate the tree
|
||||
|
||||
=back
|
||||
|
||||
, with the tag name "html", and with two children: an HTML::Element
|
||||
object whose tag names are "head" and "body". And each of those
|
||||
elements have children, and so on down. Not all elements (as we'll
|
||||
call the objects of class HTML::Element) have children -- the "hr"
|
||||
element doesn't. And note all nodes in the tree are elements -- the
|
||||
text nodes ("Doc 1", "Stuff", and "2000-08-17") are just strings.
|
||||
|
||||
Objects of the class HTML::Element each have three noteworthy attributes:
|
||||
|
||||
=over
|
||||
|
||||
=item C<_tag> -- (best accessed as C<$e-E<gt>tag>)
|
||||
this element's tag-name, lowercased (e.g., "em" for an "em" element).
|
||||
|
||||
=over
|
||||
|
||||
Footnote: Yes, this is misnamed. In proper SGML terminology, this is
|
||||
instead called a "GI", short for "generic identifier"; and the term
|
||||
"tag" is used for a token of SGML source that represents either
|
||||
the start of an element (a start-tag like "<em lang='fr'>") or the end
|
||||
of an element (an end-tag like "</em>". However, since more people
|
||||
claim to have been abducted by aliens than to have ever seen the
|
||||
SGML standard, and since both encounters typically involve a feeling of
|
||||
"missing time", it's not surprising that the terminology of the SGML
|
||||
standard is not closely followed.
|
||||
|
||||
=back
|
||||
|
||||
=item C<_parent> -- (best accessed as C<$e-E<gt>parent>)
|
||||
the element that is C<$obj>'s parent, or undef if this element is the
|
||||
root of its tree.
|
||||
|
||||
=item C<_content> -- (best accessed as C<$e-E<gt>content_list>)
|
||||
the list of nodes (i.e., elements or text segments) that are C<$e>'s
|
||||
children.
|
||||
|
||||
=back
|
||||
|
||||
Moreover, if an element object has any attributes in the SGML sense of
|
||||
the word, then those are readable as C<$e-E<gt>attr('name')> -- for
|
||||
example, with the object built from having parsed "E<lt>a
|
||||
B<id='foo'>E<gt>barE<lt>/aE<gt>", C<$e-E<gt>attr('id')> will return
|
||||
the string "foo". Moreover, C<$e-E<gt>tag> on that object returns the
|
||||
string "a", C<$e-E<gt>content_list> returns a list consisting of just
|
||||
the single scalar "bar", and C<$e-E<gt>parent> returns the object
|
||||
that's this node's parent -- which may be, for example, a "p" element.
|
||||
|
||||
And that's all that there is to it -- you throw HTML
|
||||
source at TreeBuilder, and it returns a tree built of HTML::Element
|
||||
objects and some text strings.
|
||||
|
||||
However, what do you I<do> with a tree of objects? People code
|
||||
information into HTML trees not for the fun of arranging elements, but
|
||||
to represent the structure of specific text and images -- some text is
|
||||
in this "li" element, some other text is in that heading, some
|
||||
images are in that other table cell that has those attributes, and so on.
|
||||
|
||||
Now, it may happen that you're rendering that whole HTML tree into some
|
||||
layout format. Or you could be trying to make some systematic change to
|
||||
the HTML tree before dumping it out as HTML source again. But, in my
|
||||
experience, by far the most common programming task that Perl
|
||||
programmers face with HTML is in trying to extract some piece
|
||||
of information from a larger document. Since that's so common (and
|
||||
also since it involves concepts that are basic to more complex tasks),
|
||||
that is what the rest of this article will be about.
|
||||
|
||||
=head2 Scanning HTML trees
|
||||
|
||||
Suppose you have a thousand HTML documents, each of them a press
|
||||
release. They all start out:
|
||||
|
||||
[...lots of leading images and junk...]
|
||||
<h1>ConGlomCo to Open New Corporate Office in Ougadougou</h1>
|
||||
BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in charge
|
||||
of world conquest, Rock Feldspar, announced today the opening of a
|
||||
new office in Ougadougou, the capital city of Burkino Faso, gateway
|
||||
to the bustling "Silicon Sahara" of Africa...
|
||||
[...etc...]
|
||||
|
||||
...and what you've got to do is, for each document, copy whatever text
|
||||
is in the "h1" element, so that you can, for example, make a table of
|
||||
contents of it. Now, there are three ways to do this:
|
||||
|
||||
=over
|
||||
|
||||
=item * You can just use a regexp to scan the file for a text pattern.
|
||||
|
||||
For many very simple tasks, this will do fine. Many HTML documents are,
|
||||
in practice, very consistently formatted as far as placement of
|
||||
linebreaks and whitespace, so you could just get away with scanning the
|
||||
file like so:
|
||||
|
||||
sub get_heading {
|
||||
my $filename = $_[0];
|
||||
local *HTML;
|
||||
open(HTML, $filename)
|
||||
or die "Couldn't open $filename);
|
||||
my $heading;
|
||||
Line:
|
||||
while(<HTML>) {
|
||||
if( m{<h1>(.*?)</h1>}i ) { # match it!
|
||||
$heading = $1;
|
||||
last Line;
|
||||
}
|
||||
}
|
||||
close(HTML);
|
||||
warn "No heading in $filename?"
|
||||
unless defined $heading;
|
||||
return $heading;
|
||||
}
|
||||
|
||||
This is quick and fast, but awfully fragile -- if there's a newline in
|
||||
the middle of a heading's text, it won't match the above regexp, and
|
||||
you'll get an error. The regexp will also fail if the "h1" element's
|
||||
start-tag has any attributes. If you have to adapt your code to fit
|
||||
more kinds of start-tags, you'll end up basically reinventing part of
|
||||
HTML::Parser, at which point you should probably just stop, and use
|
||||
HTML::Parser itself:
|
||||
|
||||
=item * You can use HTML::Parser to scan the file for an "h1" start-tag
|
||||
token, then capture all the text tokens until the "h1" close-tag. This
|
||||
approach is extensively covered in the Ken MacFarlane's TPJ17 article
|
||||
"Parsing HTML with HTML::Parser". (A variant of this approach is to use
|
||||
HTML::TokeParser, which presents a different and rather handier
|
||||
interface to the tokens that HTML::Parser picks out.)
|
||||
|
||||
Using HTML::Parser is less fragile than our first approach, since it's
|
||||
not sensitive to the exact internal formatting of the start-tag (much
|
||||
less whether it's split across two lines). However, when you need more
|
||||
information about the context of the "h1" element, or if you're having
|
||||
to deal with any of the tricky bits of HTML, such as parsing of tables,
|
||||
you'll find out the flat list of tokens that HTML::Parser returns
|
||||
isn't immediately useful. To get something useful out of those tokens,
|
||||
you'll need to write code that knows some things about what elements
|
||||
take no content (as with "hr" elements), and that a "</p>" end-tags
|
||||
are omissible, so a "<p>" will end any currently
|
||||
open paragraph -- and you're well on your way to pointlessly
|
||||
reinventing much of the code in HTML::TreeBuilder
|
||||
|
||||
=over
|
||||
|
||||
Footnote:
|
||||
And, as the person who last rewrote that module, I can attest that it
|
||||
wasn't terribly easy to get right! Never underestimate the perversity
|
||||
of people coding HTML.
|
||||
|
||||
=back
|
||||
|
||||
, at which point you should probably just stop, and use
|
||||
HTML::TreeBuilder itself:
|
||||
|
||||
=item * You can use HTML::Treebuilder, and scan the tree of element
|
||||
objects that you get back.
|
||||
|
||||
=back
|
||||
|
||||
The last approach, using HTML::TreeBuilder, is the diametric opposite of
|
||||
first approach: The first approach involves just elementary Perl and one
|
||||
regexp, whereas the TreeBuilder approach involves being at home with
|
||||
the concept of tree-shaped data structures and modules with
|
||||
object-oriented interfaces, as well as with the particular interfaces
|
||||
that HTML::TreeBuilder and HTML::Element provide.
|
||||
|
||||
However, what the TreeBuilder approach has going for it is that it's
|
||||
the most robust, because it involves dealing with HTML in its "native"
|
||||
format -- it deals with the tree structure that HTML code represents,
|
||||
without any consideration of how the source is coded and with what
|
||||
tags omitted.
|
||||
|
||||
So, to extract the text from the "h1" elements of an HTML document:
|
||||
|
||||
sub get_heading {
|
||||
my $tree = HTML::TreeBuilder->new;
|
||||
$tree->parse_file($_[0]); # !
|
||||
my $heading;
|
||||
my $h1 = $tree->look_down('_tag', 'h1'); # !
|
||||
if($h1) {
|
||||
$heading = $h1->as_text; # !
|
||||
} else {
|
||||
warn "No heading in $_[0]?";
|
||||
}
|
||||
$tree->delete; # clear memory!
|
||||
return $heading;
|
||||
}
|
||||
|
||||
This uses some unfamiliar methods that need explaining. The
|
||||
C<parse_file> method that we've seen before, builds a tree based on
|
||||
source from the file given. The C<delete> method is for marking a
|
||||
tree's contents as available for garbage collection, when you're done
|
||||
with the tree. The C<as_text> method returns a string that contains
|
||||
all the text bits that are children (or otherwise descendants) of the
|
||||
given node -- to get the text content of the C<$h1> object, we could
|
||||
just say:
|
||||
|
||||
$heading = join '', $h1->content_list;
|
||||
|
||||
but that will work only if we're sure that the "h1" element's children
|
||||
will be only text bits -- if the document contained:
|
||||
|
||||
<h1>Local Man Sees <cite>Blade</cite> Again</h1>
|
||||
|
||||
then the sub-tree would be:
|
||||
|
||||
. h1
|
||||
. "Local Man Sees "
|
||||
. cite
|
||||
. "Blade"
|
||||
. " Again'
|
||||
|
||||
so C<join '', $h1-E<gt>content_list> will be something like:
|
||||
|
||||
Local Man Sees HTML::Element=HASH(0x15424040) Again
|
||||
|
||||
whereas C<$h1-E<gt>as_text> would yield:
|
||||
|
||||
Local Man Sees Blade Again
|
||||
|
||||
and depending on what you're doing with the heading text, you might
|
||||
want the C<as_HTML> method instead. It returns the (sub)tree
|
||||
represented as HTML source. C<$h1-E<gt>as_HTML> would yield:
|
||||
|
||||
<h1>Local Man Sees <cite>Blade</cite> Again</h1>
|
||||
|
||||
However, if you wanted the contents of C<$h1> as HTML, but not the
|
||||
C<$h1> itself, you could say:
|
||||
|
||||
join '',
|
||||
map(
|
||||
ref($_) ? $_->as_HTML : $_,
|
||||
$h1->content_list
|
||||
)
|
||||
|
||||
This C<map> iterates over the nodes in C<$h1>'s list of children; and
|
||||
for each node that's just a text bit (as "Local Man Sees " is), it just
|
||||
passes through that string value, and for each node that's an actual
|
||||
object (causing C<ref> to be true), C<as_HTML> will used instead of the
|
||||
string value of the object itself (which would be something quite
|
||||
useless, as most object values are). So that C<as_HTML> for the "cite"
|
||||
element will be the string "<cite>BladeE<lt>/cite>". And then,
|
||||
finally, C<join> just puts into one string all the strings that the
|
||||
C<map> returns.
|
||||
|
||||
Last but not least, the most important method in our C<get_heading> sub
|
||||
is the C<look_down> method. This method looks down at the subtree
|
||||
starting at the given object (C<$h1>), looking for elements that meet
|
||||
criteria you provide.
|
||||
|
||||
The criteria are specified in the method's argument list. Each
|
||||
criterion can consist of two scalars, a key and a value, which express
|
||||
that you want elements that have that attribute (like "_tag", or
|
||||
"src") with the given value ("h1"); or the criterion can be a
|
||||
reference to a subroutine that, when called on the given element,
|
||||
returns true if that is a node you're looking for. If you specify
|
||||
several criteria, then that's taken to mean that you want all the
|
||||
elements that each satisfy I<all> the criteria. (In other words,
|
||||
there's an "implicit AND".)
|
||||
|
||||
And finally, there's a bit of an optimization -- if you call the
|
||||
C<look_down> method in a scalar context, you get just the I<first> node
|
||||
(or undef if none) -- and, in fact, once C<look_down> finds that first
|
||||
matching element, it doesn't bother looking any further.
|
||||
|
||||
So the example:
|
||||
|
||||
$h1 = $tree->look_down('_tag', 'h1');
|
||||
|
||||
returns the first element at-or-under C<$tree> whose C<"_tag">
|
||||
attribute has the value C<"h1">.
|
||||
|
||||
=head2 Complex Criteria in Tree Scanning
|
||||
|
||||
Now, the above C<look_down> code looks like a lot of bother, with
|
||||
barely more benefit than just grepping the file! But consider if your
|
||||
criteria were more complicated -- suppose you found that some of the
|
||||
press releases that you were scanning had several "h1" elements,
|
||||
possibly before or after the one you actually want. For example:
|
||||
|
||||
<h1><center>Visit Our Corporate Partner
|
||||
<br><a href="/dyna/clickthru"
|
||||
><img src="/dyna/vend_ad"></a>
|
||||
</center></h1>
|
||||
<h1><center>ConGlomCo President Schreck to Visit Regional HQ
|
||||
<br><a href="/photos/Schreck_visit_large.jpg"
|
||||
><img src="/photos/Schreck_visit.jpg"></a>
|
||||
</center></h1>
|
||||
|
||||
Here, you want to ignore the first "h1" element because it contains an
|
||||
ad, and you want the text from the second "h1". The problem is in
|
||||
formalizing the way you know that it's an ad. Since ad banners are
|
||||
always entreating you to "visit" the sponsoring site, you could exclude
|
||||
"h1" elements that contain the word "visit" under them:
|
||||
|
||||
my $real_h1 = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
sub {
|
||||
$_[0]->as_text !~ m/\bvisit/i
|
||||
}
|
||||
);
|
||||
|
||||
The first criterion looks for "h1" elements, and the second criterion
|
||||
limits those to only the ones whose text content doesn't match
|
||||
C<m/\bvisit/>. But unfortunately, that won't work for our example,
|
||||
since the second "h1" mentions "ConGlomCo President Schreck to
|
||||
I<Visit> Regional HQ".
|
||||
|
||||
Instead you could try looking for the first "h1" element that
|
||||
doesn't contain an image:
|
||||
|
||||
my $real_h1 = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
sub {
|
||||
not $_[0]->look_down('_tag', 'img')
|
||||
}
|
||||
);
|
||||
|
||||
This criterion sub might seem a bit odd, since it calls C<look_down>
|
||||
as part of a larger C<look_down> operation, but that's fine. Note that
|
||||
when considered as a boolean value, a C<look_down> in a scalar context
|
||||
value returns false (specifically, undef) if there's no matching element
|
||||
at or under the given element; and it returns the first matching
|
||||
element (which, being a reference and object, is always a true value),
|
||||
if any matches. So, here,
|
||||
|
||||
sub {
|
||||
not $_[0]->look_down('_tag', 'img')
|
||||
}
|
||||
|
||||
means "return true only if this element has no 'img' element as
|
||||
descendants (and isn't an 'img' element itself)."
|
||||
|
||||
This correctly filters out the first "h1" that contains the ad, but it
|
||||
also incorrectly filters out the second "h1" that contains a
|
||||
non-advertisement photo besides the headline text you want.
|
||||
|
||||
There clearly are detectable differences between the first and second
|
||||
"h1" elements -- the only second one contains the string "Schreck", and
|
||||
we could just test for that:
|
||||
|
||||
my $real_h1 = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
sub {
|
||||
$_[0]->as_text =~ m{Schreck}
|
||||
}
|
||||
);
|
||||
|
||||
And that works fine for this one example, but unless all thousand of
|
||||
your press releases have "Schreck" in the headline, that's just not a
|
||||
general solution. However, if all the ads-in-"h1"s that you want to
|
||||
exclude involve a link whose URL involves "/dyna/", then you can use
|
||||
that:
|
||||
|
||||
my $real_h1 = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
sub {
|
||||
my $link = $_[0]->look_down('_tag','a');
|
||||
return 1 unless $link;
|
||||
# no link means it's fine
|
||||
return 0 if $link->attr('href') =~ m{/dyna/};
|
||||
# a link to there is bad
|
||||
return 1; # otherwise okay
|
||||
}
|
||||
);
|
||||
|
||||
Or you can look at it another way and say that you want the first "h1"
|
||||
element that either contains no images, or else whose image has a "src"
|
||||
attribute whose value contains "/photos/":
|
||||
|
||||
my $real_h1 = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
sub {
|
||||
my $img = $_[0]->look_down('_tag','img');
|
||||
return 1 unless $img;
|
||||
# no image means it's fine
|
||||
return 1 if $img->attr('src') =~ m{/photos/};
|
||||
# good if a photo
|
||||
return 0; # otherwise bad
|
||||
}
|
||||
);
|
||||
|
||||
Recall that this use of C<look_down> in a scalar context means to return
|
||||
the first element at or under C<$tree> that matches all the criteria.
|
||||
But if you notice that you can formulate criteria that'll match several
|
||||
possible "h1" elements, some of which may be bogus but the I<last> one
|
||||
of which is always the one you want, then you can use C<look_down> in a
|
||||
list context, and just use the last element of that list:
|
||||
|
||||
my @h1s = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
...maybe more criteria...
|
||||
);
|
||||
die "What, no h1s here?" unless @h1s;
|
||||
my $real_h1 = $h1s[-1]; # last or only
|
||||
|
||||
=head2 A Case Study: Scanning Yahoo News's HTML
|
||||
|
||||
The above (somewhat contrived) case involves extracting data from a
|
||||
bunch of pre-existing HTML files. In that sort of situation, if your
|
||||
code works for all the files, then you know that the code I<works> --
|
||||
since the data it's meant to handle won't go changing or growing; and,
|
||||
typically, once you've used the program, you'll never need to use it
|
||||
again.
|
||||
|
||||
The other kind of situation faced in many data extraction tasks is
|
||||
where the program is used recurringly to handle new data -- such as
|
||||
from ever-changing Web pages. As a real-world example of this,
|
||||
consider a program that you could use (suppose it's crontabbed) to
|
||||
extract headline-links from subsections of Yahoo News
|
||||
(C<http://dailynews.yahoo.com/>).
|
||||
|
||||
Yahoo News has several subsections:
|
||||
|
||||
=over
|
||||
|
||||
=item http://dailynews.yahoo.com/h/tc/ for technology news
|
||||
|
||||
=item http://dailynews.yahoo.com/h/sc/ for science news
|
||||
|
||||
=item http://dailynews.yahoo.com/h/hl/ for health news
|
||||
|
||||
=item http://dailynews.yahoo.com/h/wl/ for world news
|
||||
|
||||
=item http://dailynews.yahoo.com/h/en/ for entertainment news
|
||||
|
||||
=back
|
||||
|
||||
and others. All of them are built on the same basic HTML template --
|
||||
and a scarily complicated template it is, especially when you look at
|
||||
it with an eye toward making up rules that will select where the real
|
||||
headline-links are, while screening out all the links to other parts of
|
||||
Yahoo, other news services, etc. You will need to puzzle
|
||||
over the HTML source, and scrutinize the output of
|
||||
C<$tree-E<gt>dump> on the parse tree of that HTML.
|
||||
|
||||
Sometimes the only way to pin down what you're after is by position in
|
||||
the tree. For example, headlines of interest may be in the third
|
||||
column of the second row of the second table element in a page:
|
||||
|
||||
my $table = ( $tree->look_down('_tag','table') )[1];
|
||||
my $row2 = ( $table->look_down('_tag', 'tr' ) )[1];
|
||||
my $col3 = ( $row2->look-down('_tag', 'td') )[2];
|
||||
...then do things with $col3...
|
||||
|
||||
Or they may be all the links in a "p" element that has at least three
|
||||
"br" elements as children:
|
||||
|
||||
my $p = $tree->look_down(
|
||||
'_tag', 'p',
|
||||
sub {
|
||||
2 < grep { ref($_) and $_->tag eq 'br' }
|
||||
$_[0]->content_list
|
||||
}
|
||||
);
|
||||
@links = $p->look_down('_tag', 'a');
|
||||
|
||||
But almost always, you can get away with looking for properties of the
|
||||
of the thing itself, rather than just looking for contexts. Now, if
|
||||
you're lucky, the document you're looking through has clear semantic
|
||||
tagging, such is as useful in CSS -- note the
|
||||
class="headlinelink" bit here:
|
||||
|
||||
<a href="...long_news_url..." class="headlinelink">Elvis
|
||||
seen in tortilla</a>
|
||||
|
||||
If you find anything like that, you could leap right in and select
|
||||
links with:
|
||||
|
||||
@links = $tree->look_down('class','headlinelink');
|
||||
|
||||
Regrettably, your chances of seeing any sort of semantic markup
|
||||
principles really being followed with actual HTML are pretty thin.
|
||||
|
||||
=over
|
||||
|
||||
Footnote:
|
||||
In fact, your chances of finding a page that is simply free of HTML
|
||||
errors are even thinner. And surprisingly, sites like Amazon or Yahoo
|
||||
are typically worse as far as quality of code than personal sites
|
||||
whose entire production cycle involves simply being saved and uploaded
|
||||
from Netscape Composer.
|
||||
|
||||
=back
|
||||
|
||||
The code may be sort of "accidentally semantic", however -- for example,
|
||||
in a set of pages I was scanning recently, I found that looking for
|
||||
"td" elements with a "width" attribute value of "375" got me exactly
|
||||
what I wanted. No-one designing that page ever conceived of
|
||||
"width=375" as I<meaning> "this is a headline", but if you impute it
|
||||
to mean that, it works.
|
||||
|
||||
An approach like this happens to work for the Yahoo News code, because
|
||||
the headline-links are distinguished by the fact that they (and they
|
||||
alone) contain a "b" element:
|
||||
|
||||
<a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>
|
||||
|
||||
or, diagrammed as a part of the parse tree:
|
||||
|
||||
. a [href="...long_news_url..."]
|
||||
. b
|
||||
. "Elvis seen in tortilla"
|
||||
|
||||
A rule that matches these can be formalized as "look for any 'a'
|
||||
element that has only one daughter node, which must be a 'b' element".
|
||||
And this is what it looks like when cooked up as a C<look_down>
|
||||
expression and prefaced with a bit of code that retrieves the text of
|
||||
the given Yahoo News page and feeds it to TreeBuilder:
|
||||
|
||||
use strict;
|
||||
use HTML::TreeBuilder 2.97;
|
||||
use LWP::UserAgent;
|
||||
sub get_headlines {
|
||||
my $url = $_[0] || die "What URL?";
|
||||
|
||||
my $response = LWP::UserAgent->new->request(
|
||||
HTTP::Request->new( GET => $url )
|
||||
);
|
||||
unless($response->is_success) {
|
||||
warn "Couldn't get $url: ", $response->status_line, "\n";
|
||||
return;
|
||||
}
|
||||
|
||||
my $tree = HTML::TreeBuilder->new();
|
||||
$tree->parse($response->content);
|
||||
$tree->eof;
|
||||
|
||||
my @out;
|
||||
foreach my $link (
|
||||
$tree->look_down( # !
|
||||
'_tag', 'a',
|
||||
sub {
|
||||
return unless $_[0]->attr('href');
|
||||
my @c = $_[0]->content_list;
|
||||
@c == 1 and ref $c[0] and $c[0]->tag eq 'b';
|
||||
}
|
||||
)
|
||||
) {
|
||||
push @out, [ $link->attr('href'), $link->as_text ];
|
||||
}
|
||||
|
||||
warn "Odd, fewer than 6 stories in $url!" if @out < 6;
|
||||
$tree->delete;
|
||||
return @out;
|
||||
}
|
||||
|
||||
...and add a bit of code to actually call that routine and display the
|
||||
results...
|
||||
|
||||
foreach my $section (qw[tc sc hl wl en]) {
|
||||
my @links = get_headlines(
|
||||
"http://dailynews.yahoo.com/h/$section/"
|
||||
);
|
||||
print
|
||||
$section, ": ", scalar(@links), " stories\n",
|
||||
map((" ", $_->[0], " : ", $_->[1], "\n"), @links),
|
||||
"\n";
|
||||
}
|
||||
|
||||
And we've got our own headline-extractor service! This in and of
|
||||
itself isn't no amazingly useful (since if you want to see the
|
||||
headlines, you I<can> just look at the Yahoo News pages), but it could
|
||||
easily be the basis for quite useful features like filtering the
|
||||
headlines for matching certain keywords of interest to you.
|
||||
|
||||
Now, one of these days, Yahoo News will decide to change its HTML
|
||||
template. When this happens, this will appear to the above program as
|
||||
there being no links that meet the given criteria; or, less likely,
|
||||
dozens of erroneous links will meet the criteria. In either case, the
|
||||
criteria will have to be changed for the new template; they may just
|
||||
need adjustment, or you may need to scrap them and start over.
|
||||
|
||||
=head2 I<Regardez, duvet!>
|
||||
|
||||
It's often quite a challenge to write criteria to match the desired
|
||||
parts of an HTML parse tree. Very often you I<can> pull it off with a
|
||||
simple C<$tree-E<gt>look_down('_tag', 'h1')>, but sometimes you do
|
||||
have to keep adding and refining criteria, until you might end up with
|
||||
complex filters like what I've shown in this article. The
|
||||
benefit to learning how to deal with HTML parse trees is that one main
|
||||
search tool, the C<look_down> method, can do most of the work, making
|
||||
simple things easy, while still making hard things possible.
|
||||
|
||||
B<[end body of article]>
|
||||
|
||||
=head2 [Author Credit]
|
||||
|
||||
Sean M. Burke (C<sburke@cpan.org>) is the current maintainer of
|
||||
C<HTML::TreeBuilder> and C<HTML::Element>, both originally by
|
||||
Gisle Aas.
|
||||
|
||||
Sean adds: "I'd like to thank the folks who listened to me ramble
|
||||
incessantly about HTML::TreeBuilder and HTML::Element at this year's Yet
|
||||
Another Perl Conference and O'Reilly Open Source Software Convention."
|
||||
|
||||
=head1 BACK
|
||||
|
||||
Return to the L<HTML::Tree|HTML::Tree> docs.
|
||||
|
||||
=cut
|
||||
|
||||
2297
database/perl/vendor/lib/HTML/TreeBuilder.pm
vendored
Normal file
2297
database/perl/vendor/lib/HTML/TreeBuilder.pm
vendored
Normal file
File diff suppressed because it is too large
Load Diff
Reference in New Issue
Block a user