Initial Commit

This commit is contained in:
Riley Schneider
2025-12-03 16:38:10 +01:00
parent c5e26bf594
commit b732d8d4b5
17680 changed files with 5977495 additions and 2 deletions

203
database/perl/vendor/lib/HTML/AsSubs.pm vendored Normal file
View File

@@ -0,0 +1,203 @@
package HTML::AsSubs;
# ABSTRACT: functions that construct a HTML syntax tree
use warnings;
use strict;
use vars qw(@ISA @EXPORT);
our $VERSION = '5.07'; # VERSION from OurPkgVersion
require HTML::Element;
require Exporter;
@ISA = qw(Exporter);
# Problem: exports so damned much. Has no concept of "export only HTML4
# elements". TODO:?? make something that make functions that just
# wrap XML::Generator calls?
use vars qw(@TAGS);
@TAGS = qw(html
head title base link meta isindex nextid script style
body h1 h2 h3 h4 h5 h6 p pre div blockquote
a img br hr
ol ul dir menu li
dl dt dd
dfn cite code em kbd samp strong var address span
b i u tt
center font big small strike
sub sup
table tr td th caption
form input select option textarea
object applet param
map area
frame frameset noframe
);
for (@TAGS) {
my $code;
$code = "sub $_ { _elem('$_', \@_); }\n";
push( @EXPORT, $_ );
## no critic
eval $code;
## use critic
if ($@) {
die $@;
}
}
sub _elem {
my $tag = shift;
my $attributes;
if ( @_ and defined $_[0] and ref( $_[0] ) eq "HASH" ) {
$attributes = shift;
}
my $elem = HTML::Element->new( $tag, %$attributes );
$elem->push_content(@_);
$elem;
}
1;
__END__
=pod
=head1 NAME
HTML::AsSubs - functions that construct a HTML syntax tree
=head1 VERSION
This document describes version 5.07 of
HTML::AsSubs, released August 31, 2017
as part of L<HTML-Tree|HTML::Tree>.
=head1 SYNOPSIS
use HTML::AsSubs;
$h = body(
h1("This is the heading"),
p("This is the first paragraph which contains a ",
a({href=>'link.html'}, "link"),
" and an ",
img({src=>'img.gif', alt=>'image'}),
"."
),
);
print $h->as_HTML;
=head1 DESCRIPTION
This module exports functions that can be used to construct various
HTML elements. The functions are named after the tags of the
corresponding HTML element and are all written in lower case. If the
first argument is a hash reference then it will be used to initialize the
attributes of this element. The remaining arguments are regarded as
content.
For a similar idea (i.e., it's another case where the syntax tree
of the Perl source mirrors the syntax tree of the HTML produced),
see HTML::Element's C<new_from_lol> method.
For what I now think is a cleaner implementation of this same idea,
see the excellent module C<XML::Generator>, which is what I suggest
for actual real-life use. (I suggest this over C<HTML::AsSubs> and
over C<CGI.pm>'s HTML-making functions.)
=head1 ACKNOWLEDGEMENT
This module was inspired by the following message:
Date: Tue, 4 Oct 1994 16:11:30 +0100
Subject: Wow! I have a large lightbulb above my head!
Take a moment to consider these lines:
%OVERLOAD=( '""' => sub { join("", @{$_[0]}) } );
sub html { my($type)=shift; bless ["<$type>", @_, "</$type>"]; }
:-) I *love* Perl 5! Thankyou Larry and Ilya.
Regards,
Tim Bunce.
p.s. If you didn't get it, think about recursive data types: html(html())
p.p.s. I'll turn this into a much more practical example in a day or two.
p.p.p.s. It's a pity that overloads are not inherited. Is this a bug?
=head1 BUGS
The exported link() function overrides the builtin link() function.
The exported tr() function must be called using &tr(...) syntax
because it clashes with the builtin tr/../../ operator.
=head1 SEE ALSO
L<HTML::Element>, L<XML::Generator>
=head2 html head title base link meta isindex nextid script style body h1 h2 h3 h4 h5 h6 p pre div blockquote a img br hr ol ul dir menu li dl dt dd dfn cite code em kbd samp strong var address span b i u tt center font big small strike sub sup table tr td th caption form input select option textarea object applet param map area frame frameset noframe
A bunch of methods for creating tags.
=head1 Private Functions
=head2 _elem()
The _elem() function is wrapped by all the html 'tag' functions. It
takes a tag-name, optional hashref of attributes and a list of content
as parameters.
=head1 AUTHOR
Current maintainers:
=over
=item * Christopher J. Madsen S<C<< <perl AT cjmweb.net> >>>
=item * Jeff Fearn S<C<< <jfearn AT cpan.org> >>>
=back
Original HTML-Tree author:
=over
=item * Gisle Aas
=back
Former maintainers:
=over
=item * Sean M. Burke
=item * Andy Lester
=item * Pete Krawczyk S<C<< <petek AT cpan.org> >>>
=back
You can follow or contribute to HTML-Tree's development at
L<< https://github.com/kentfredric/HTML-Tree >>.
=head1 COPYRIGHT AND LICENSE
Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke,
2005 Andy Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn,
2012 Christopher J. Madsen.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
The programs in this library are distributed in the hope that they
will be useful, but without any warranty; without even the implied
warranty of merchantability or fitness for a particular purpose.
=cut

4486
database/perl/vendor/lib/HTML/Element.pm vendored Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,374 @@
# This is a .pm just to (try to) make some CPAN document converters
# convert it happily as part of the dist's documentation tree.
package HTML::Element::traverse;
# ABSTRACT: discussion of HTML::Element's traverse method
use warnings;
use strict;
our $VERSION = '5.07'; # VERSION from OurPkgVersion
use HTML::Element ();
1;
__END__
=pod
=head1 NAME
HTML::Element::traverse - discussion of HTML::Element's traverse method
=head1 VERSION
This document describes version 5.07 of
HTML::Element::traverse, released August 31, 2017
as part of L<HTML-Tree|HTML::Tree>.
=head1 SYNOPSIS
# $element->traverse is unnecessary and obscure.
# Don't use it in new code.
=head1 DESCRIPTION
C<HTML::Element> provides a method C<traverse> that traverses the tree
and calls user-specified callbacks for each node, in pre- or
post-order. However, use of the method is quite superfluous: if you
want to recursively visit every node in the tree, it's almost always
simpler to write a subroutine does just that, than it is to bundle up
the pre- and/or post-order code in callbacks for the C<traverse>
method.
=head1 EXAMPLES
Suppose you want to traverse at/under a node $tree and give elements
an 'id' attribute unless they already have one.
You can use the C<traverse> method:
{
my $counter = 'x0000';
$start_node->traverse(
[ # Callbacks;
# pre-order callback:
sub {
my $x = $_[0];
$x->attr('id', $counter++) unless defined $x->attr('id');
return HTML::Element::OK; # keep traversing
},
# post-order callback:
undef
],
1, # don't call the callbacks for text nodes
);
}
or you can just be simple and clear (and not have to understand the
calling format for C<traverse>) by writing a sub that traverses the
tree by just calling itself:
{
my $counter = 'x0000';
sub give_id {
my $x = $_[0];
$x->attr('id', $counter++) unless defined $x->attr('id');
foreach my $c ($x->content_list) {
give_id($c) if ref $c; # ignore text nodes
}
};
give_id($start_node);
}
See, isn't that nice and clear?
But, if you really need to know:
=head1 THE TRAVERSE METHOD
The C<traverse()> method is a general object-method for traversing a
tree or subtree and calling user-specified callbacks. It accepts the
following syntaxes:
=over
=item $h->traverse(\&callback)
=item or $h->traverse(\&callback, $ignore_text)
=item or $h->traverse( [\&pre_callback,\&post_callback] , $ignore_text)
=back
These all mean to traverse the element and all of its children. That
is, this method starts at node $h, "pre-order visits" $h, traverses its
children, and then will "post-order visit" $h. "Visiting" means that
the callback routine is called, with these arguments:
$_[0] : the node (element or text segment),
$_[1] : a startflag, and
$_[2] : the depth
If the $ignore_text parameter is given and true, then the pre-order
call I<will not> be happen for text content.
The startflag is 1 when we enter a node (i.e., in pre-order calls) and
0 when we leave the node (in post-order calls).
Note, however, that post-order calls don't happen for nodes that are
text segments or are elements that are prototypically empty (like "br",
"hr", etc.).
If we visit text nodes (i.e., unless $ignore_text is given and true),
then when text nodes are visited, we will also pass two extra
arguments to the callback:
$_[3] : the element that's the parent
of this text node
$_[4] : the index of this text node
in its parent's content list
Note that you can specify that the pre-order routine can
be a different routine from the post-order one:
$h->traverse( [\&pre_callback,\&post_callback], ...);
You can also specify that no post-order calls are to be made,
by providing a false value as the post-order routine:
$h->traverse([ \&pre_callback,0 ], ...);
And similarly for suppressing pre-order callbacks:
$h->traverse([ 0,\&post_callback ], ...);
Note that these two syntaxes specify the same operation:
$h->traverse([\&foo,\&foo], ...);
$h->traverse( \&foo , ...);
The return values from calls to your pre- or post-order
routines are significant, and are used to control recursion
into the tree.
These are the values you can return, listed in descending order
of my estimation of their usefulness:
=over
=item HTML::Element::OK, 1, or any other true value
...to keep on traversing.
Note that C<HTML::Element::OK> et
al are constants. So if you're running under C<use strict>
(as I hope you are), and you say:
C<return HTML::Element::PRUEN>
the compiler will flag this as an error (an unallowable
bareword, specifically), whereas if you spell PRUNE correctly,
the compiler will not complain.
=item undef, 0, '0', '', or HTML::Element::PRUNE
...to block traversing under the current element's content.
(This is ignored if received from a post-order callback,
since by then the recursion has already happened.)
If this is returned by a pre-order callback, no
post-order callback for the current node will happen.
(Recall that if your callback exits with just C<return;>,
it is returning undef -- at least in scalar context, and
C<traverse> always calls your callbacks in scalar context.)
=item HTML::Element::ABORT
...to abort the whole traversal immediately.
This is often useful when you're looking for just the first
node in the tree that meets some criterion of yours.
=item HTML::Element::PRUNE_UP
...to abort continued traversal into this node and its parent
node. No post-order callback for the current or parent
node will happen.
=item HTML::Element::PRUNE_SOFTLY
Like PRUNE, except that the post-order call for the current
node is not blocked.
=back
Almost every task to do with extracting information from a tree can be
expressed in terms of traverse operations (usually in only one pass,
and usually paying attention to only pre-order, or to only
post-order), or operations based on traversing. (In fact, many of the
other methods in this class are basically calls to traverse() with
particular arguments.)
The source code for HTML::Element and HTML::TreeBuilder contain
several examples of the use of the "traverse" method to gather
information about the content of trees and subtrees.
(Note: you should not change the structure of a tree I<while> you are
traversing it.)
[End of documentation for the C<traverse()> method]
=head2 Traversing with Recursive Anonymous Routines
Now, if you've been reading
I<Structure and Interpretation of Computer Programs> too much, maybe
you even want a recursive lambda. Go ahead:
{
my $counter = 'x0000';
my $give_id;
$give_id = sub {
my $x = $_[0];
$x->attr('id', $counter++) unless defined $x->attr('id');
foreach my $c ($x->content_list) {
$give_id->($c) if ref $c; # ignore text nodes
}
};
$give_id->($start_node);
undef $give_id;
}
It's a bit nutty, and it's I<still> more concise than a call to the
C<traverse> method!
It is left as an exercise to the reader to figure out how to do the
same thing without using a C<$give_id> symbol at all.
It is also left as an exercise to the reader to figure out why I
undefine C<$give_id>, above; and why I could achieved the same effect
with any of:
$give_id = 'I like pie!';
# or...
$give_id = [];
# or even;
$give_id = sub { print "Mmmm pie!\n" };
But not:
$give_id = sub { print "I'm $give_id and I like pie!\n" };
# nor...
$give_id = \$give_id;
# nor...
$give_id = { 'pie' => \$give_id, 'mode' => 'a la' };
=head2 Doing Recursive Things Iteratively
Note that you may at times see an iterative implementation of
pre-order traversal, like so:
{
my @to_do = ($tree); # start-node
while(@to_do) {
my $this = shift @to_do;
# "Visit" the node:
$this->attr('id', $counter++)
unless defined $this->attr('id');
unshift @to_do, grep ref $_, $this->content_list;
# Put children on the stack -- they'll be visited next
}
}
This can I<under certain circumstances> be more efficient than just a
normal recursive routine, but at the cost of being rather obscure. It
gains efficiency by avoiding the overhead of function-calling, but
since there are several method dispatches however you do it (to
C<attr> and C<content_list>), the overhead for a simple function call
is insignificant.
=head2 Pruning and Whatnot
The C<traverse> method does have the fairly neat features of
the C<ABORT>, C<PRUNE_UP> and C<PRUNE_SOFTLY> signals. None of these
can be implemented I<totally> straightforwardly with recursive
routines, but it is quite possible. C<ABORT>-like behavior can be
implemented either with using non-local returning with C<eval>/C<die>:
my $died_on; # if you need to know where...
sub thing {
... visits $_[0]...
... maybe set $died_on to $_[0] and die "ABORT_TRAV" ...
... else call thing($child) for each child...
...any post-order visiting $_[0]...
}
eval { thing($node) };
if($@) {
if($@ =~ m<^ABORT_TRAV>) {
...it died (aborted) on $died_on...
} else {
die $@; # some REAL error happened
}
}
or you can just do it with flags:
my($abort_flag, $died_on);
sub thing {
... visits $_[0]...
... maybe set $abort_flag = 1; $died_on = $_[0]; return;
foreach my $c ($_[0]->content_list) {
thing($c);
return if $abort_flag;
}
...any post-order visiting $_[0]...
return;
}
$abort_flag = $died_on = undef;
thing($node);
...if defined $abort_flag, it died on $died_on
=head1 SEE ALSO
L<HTML::Element>
=head1 AUTHOR
Current maintainers:
=over
=item * Christopher J. Madsen S<C<< <perl AT cjmweb.net> >>>
=item * Jeff Fearn S<C<< <jfearn AT cpan.org> >>>
=back
Original HTML-Tree author:
=over
=item * Gisle Aas
=back
Former maintainers:
=over
=item * Sean M. Burke
=item * Andy Lester
=item * Pete Krawczyk S<C<< <petek AT cpan.org> >>>
=back
You can follow or contribute to HTML-Tree's development at
L<< https://github.com/kentfredric/HTML-Tree >>.
=head1 COPYRIGHT
Copyright 2000,2001 Sean M. Burke
=cut

View File

@@ -0,0 +1,482 @@
package HTML::Entities;
=encoding utf8
=head1 NAME
HTML::Entities - Encode or decode strings with HTML entities
=head1 SYNOPSIS
use HTML::Entities;
$a = "V&aring;re norske tegn b&oslash;r &#230res";
decode_entities($a);
encode_entities($a, "\200-\377");
For example, this:
$input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
print encode_entities($input), "\n"
Prints this out:
vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve
papier-m&acirc;ch&eacute; r&eacute;sum&eacute;
=head1 DESCRIPTION
This module deals with encoding and decoding of strings with HTML
character entities. The module provides the following functions:
=over 4
=item decode_entities( $string, ... )
This routine replaces HTML entities found in the $string with the
corresponding Unicode character. Unrecognized entities are left alone.
If multiple strings are provided as argument they are each decoded
separately and the same number of strings are returned.
If called in void context the arguments are decoded in-place.
This routine is exported by default.
=item _decode_entities( $string, \%entity2char )
=item _decode_entities( $string, \%entity2char, $expand_prefix )
This will in-place replace HTML entities in $string. The %entity2char
hash must be provided. Named entities not found in the %entity2char
hash are left alone. Numeric entities are expanded unless their value
overflow.
The keys in %entity2char are the entity names to be expanded and their
values are what they should expand into. The values do not have to be
single character strings. If a key has ";" as suffix,
then occurrences in $string are only expanded if properly terminated
with ";". Entities without ";" will be expanded regardless of how
they are terminated for compatibility with how common browsers treat
entities in the Latin-1 range.
If $expand_prefix is TRUE then entities without trailing ";" in
%entity2char will even be expanded as a prefix of a longer
unrecognized name. The longest matching name in %entity2char will be
used. This is mainly present for compatibility with an MSIE
misfeature.
$string = "foo&nbspbar";
_decode_entities($string, { nb => "@", nbsp => "\xA0" }, 1);
print $string; # will print "foo bar"
This routine is exported by default.
=item encode_entities( $string )
=item encode_entities( $string, $unsafe_chars )
This routine replaces unsafe characters in $string with their entity
representation. A second argument can be given to specify which characters to
consider unsafe. The unsafe characters is specified using the regular
expression character class syntax (what you find within brackets in regular
expressions).
The default set of characters to encode are control chars, high-bit chars, and
the C<< < >>, C<< & >>, C<< > >>, C<< ' >> and C<< " >> characters. But this,
for example, would encode I<just> the C<< < >>, C<< & >>, C<< > >>, and C<< "
>> characters:
$encoded = encode_entities($input, '<>&"');
and this would only encode non-plain ASCII:
$encoded = encode_entities($input, '^\n\x20-\x25\x27-\x7e');
This routine is exported by default.
=item encode_entities_numeric( $string )
=item encode_entities_numeric( $string, $unsafe_chars )
This routine works just like encode_entities, except that the replacement
entities are always C<&#xI<hexnum>;> and never C<&I<entname>;>. For
example, C<encode_entities("r\xF4le")> returns "r&ocirc;le", but
C<encode_entities_numeric("r\xF4le")> returns "r&#xF4;le".
This routine is I<not> exported by default. But you can always
export it with C<use HTML::Entities qw(encode_entities_numeric);>
or even C<use HTML::Entities qw(:DEFAULT encode_entities_numeric);>
=back
All these routines modify the string passed as the first argument, if
called in a void context. In scalar and array contexts, the encoded or
decoded string is returned (without changing the input string).
If you prefer not to import these routines into your namespace, you can
call them as:
use HTML::Entities ();
$decoded = HTML::Entities::decode($a);
$encoded = HTML::Entities::encode($a);
$encoded = HTML::Entities::encode_numeric($a);
The module can also export the %char2entity and the %entity2char
hashes, which contain the mapping from all characters to the
corresponding entities (and vice versa, respectively).
=head1 COPYRIGHT
Copyright 1995-2006 Gisle Aas. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=cut
use strict;
our $VERSION = '3.75';
use vars qw(%entity2char %char2entity);
require 5.004;
require Exporter;
our @ISA = qw(Exporter);
our @EXPORT = qw(encode_entities decode_entities _decode_entities);
our @EXPORT_OK = qw(%entity2char %char2entity encode_entities_numeric);
sub Version { $VERSION; }
require HTML::Parser; # for fast XS implemented decode_entities
%entity2char = (
# Some normal chars that have special meaning in SGML context
amp => '&', # ampersand
'gt' => '>', # greater than
'lt' => '<', # less than
quot => '"', # double quote
apos => "'", # single quote
# PUBLIC ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML
AElig => chr(198), # capital AE diphthong (ligature)
Aacute => chr(193), # capital A, acute accent
Acirc => chr(194), # capital A, circumflex accent
Agrave => chr(192), # capital A, grave accent
Aring => chr(197), # capital A, ring
Atilde => chr(195), # capital A, tilde
Auml => chr(196), # capital A, dieresis or umlaut mark
Ccedil => chr(199), # capital C, cedilla
ETH => chr(208), # capital Eth, Icelandic
Eacute => chr(201), # capital E, acute accent
Ecirc => chr(202), # capital E, circumflex accent
Egrave => chr(200), # capital E, grave accent
Euml => chr(203), # capital E, dieresis or umlaut mark
Iacute => chr(205), # capital I, acute accent
Icirc => chr(206), # capital I, circumflex accent
Igrave => chr(204), # capital I, grave accent
Iuml => chr(207), # capital I, dieresis or umlaut mark
Ntilde => chr(209), # capital N, tilde
Oacute => chr(211), # capital O, acute accent
Ocirc => chr(212), # capital O, circumflex accent
Ograve => chr(210), # capital O, grave accent
Oslash => chr(216), # capital O, slash
Otilde => chr(213), # capital O, tilde
Ouml => chr(214), # capital O, dieresis or umlaut mark
THORN => chr(222), # capital THORN, Icelandic
Uacute => chr(218), # capital U, acute accent
Ucirc => chr(219), # capital U, circumflex accent
Ugrave => chr(217), # capital U, grave accent
Uuml => chr(220), # capital U, dieresis or umlaut mark
Yacute => chr(221), # capital Y, acute accent
aacute => chr(225), # small a, acute accent
acirc => chr(226), # small a, circumflex accent
aelig => chr(230), # small ae diphthong (ligature)
agrave => chr(224), # small a, grave accent
aring => chr(229), # small a, ring
atilde => chr(227), # small a, tilde
auml => chr(228), # small a, dieresis or umlaut mark
ccedil => chr(231), # small c, cedilla
eacute => chr(233), # small e, acute accent
ecirc => chr(234), # small e, circumflex accent
egrave => chr(232), # small e, grave accent
eth => chr(240), # small eth, Icelandic
euml => chr(235), # small e, dieresis or umlaut mark
iacute => chr(237), # small i, acute accent
icirc => chr(238), # small i, circumflex accent
igrave => chr(236), # small i, grave accent
iuml => chr(239), # small i, dieresis or umlaut mark
ntilde => chr(241), # small n, tilde
oacute => chr(243), # small o, acute accent
ocirc => chr(244), # small o, circumflex accent
ograve => chr(242), # small o, grave accent
oslash => chr(248), # small o, slash
otilde => chr(245), # small o, tilde
ouml => chr(246), # small o, dieresis or umlaut mark
szlig => chr(223), # small sharp s, German (sz ligature)
thorn => chr(254), # small thorn, Icelandic
uacute => chr(250), # small u, acute accent
ucirc => chr(251), # small u, circumflex accent
ugrave => chr(249), # small u, grave accent
uuml => chr(252), # small u, dieresis or umlaut mark
yacute => chr(253), # small y, acute accent
yuml => chr(255), # small y, dieresis or umlaut mark
# Some extra Latin 1 chars that are listed in the HTML3.2 draft (21-May-96)
copy => chr(169), # copyright sign
reg => chr(174), # registered sign
nbsp => chr(160), # non breaking space
# Additional ISO-8859/1 entities listed in rfc1866 (section 14)
iexcl => chr(161),
cent => chr(162),
pound => chr(163),
curren => chr(164),
yen => chr(165),
brvbar => chr(166),
sect => chr(167),
uml => chr(168),
ordf => chr(170),
laquo => chr(171),
'not' => chr(172), # not is a keyword in perl
shy => chr(173),
macr => chr(175),
deg => chr(176),
plusmn => chr(177),
sup1 => chr(185),
sup2 => chr(178),
sup3 => chr(179),
acute => chr(180),
micro => chr(181),
para => chr(182),
middot => chr(183),
cedil => chr(184),
ordm => chr(186),
raquo => chr(187),
frac14 => chr(188),
frac12 => chr(189),
frac34 => chr(190),
iquest => chr(191),
'times' => chr(215), # times is a keyword in perl
divide => chr(247),
( $] > 5.007 ? (
'OElig;' => chr(338),
'oelig;' => chr(339),
'Scaron;' => chr(352),
'scaron;' => chr(353),
'Yuml;' => chr(376),
'fnof;' => chr(402),
'circ;' => chr(710),
'tilde;' => chr(732),
'Alpha;' => chr(913),
'Beta;' => chr(914),
'Gamma;' => chr(915),
'Delta;' => chr(916),
'Epsilon;' => chr(917),
'Zeta;' => chr(918),
'Eta;' => chr(919),
'Theta;' => chr(920),
'Iota;' => chr(921),
'Kappa;' => chr(922),
'Lambda;' => chr(923),
'Mu;' => chr(924),
'Nu;' => chr(925),
'Xi;' => chr(926),
'Omicron;' => chr(927),
'Pi;' => chr(928),
'Rho;' => chr(929),
'Sigma;' => chr(931),
'Tau;' => chr(932),
'Upsilon;' => chr(933),
'Phi;' => chr(934),
'Chi;' => chr(935),
'Psi;' => chr(936),
'Omega;' => chr(937),
'alpha;' => chr(945),
'beta;' => chr(946),
'gamma;' => chr(947),
'delta;' => chr(948),
'epsilon;' => chr(949),
'zeta;' => chr(950),
'eta;' => chr(951),
'theta;' => chr(952),
'iota;' => chr(953),
'kappa;' => chr(954),
'lambda;' => chr(955),
'mu;' => chr(956),
'nu;' => chr(957),
'xi;' => chr(958),
'omicron;' => chr(959),
'pi;' => chr(960),
'rho;' => chr(961),
'sigmaf;' => chr(962),
'sigma;' => chr(963),
'tau;' => chr(964),
'upsilon;' => chr(965),
'phi;' => chr(966),
'chi;' => chr(967),
'psi;' => chr(968),
'omega;' => chr(969),
'thetasym;' => chr(977),
'upsih;' => chr(978),
'piv;' => chr(982),
'ensp;' => chr(8194),
'emsp;' => chr(8195),
'thinsp;' => chr(8201),
'zwnj;' => chr(8204),
'zwj;' => chr(8205),
'lrm;' => chr(8206),
'rlm;' => chr(8207),
'ndash;' => chr(8211),
'mdash;' => chr(8212),
'lsquo;' => chr(8216),
'rsquo;' => chr(8217),
'sbquo;' => chr(8218),
'ldquo;' => chr(8220),
'rdquo;' => chr(8221),
'bdquo;' => chr(8222),
'dagger;' => chr(8224),
'Dagger;' => chr(8225),
'bull;' => chr(8226),
'hellip;' => chr(8230),
'permil;' => chr(8240),
'prime;' => chr(8242),
'Prime;' => chr(8243),
'lsaquo;' => chr(8249),
'rsaquo;' => chr(8250),
'oline;' => chr(8254),
'frasl;' => chr(8260),
'euro;' => chr(8364),
'image;' => chr(8465),
'weierp;' => chr(8472),
'real;' => chr(8476),
'trade;' => chr(8482),
'alefsym;' => chr(8501),
'larr;' => chr(8592),
'uarr;' => chr(8593),
'rarr;' => chr(8594),
'darr;' => chr(8595),
'harr;' => chr(8596),
'crarr;' => chr(8629),
'lArr;' => chr(8656),
'uArr;' => chr(8657),
'rArr;' => chr(8658),
'dArr;' => chr(8659),
'hArr;' => chr(8660),
'forall;' => chr(8704),
'part;' => chr(8706),
'exist;' => chr(8707),
'empty;' => chr(8709),
'nabla;' => chr(8711),
'isin;' => chr(8712),
'notin;' => chr(8713),
'ni;' => chr(8715),
'prod;' => chr(8719),
'sum;' => chr(8721),
'minus;' => chr(8722),
'lowast;' => chr(8727),
'radic;' => chr(8730),
'prop;' => chr(8733),
'infin;' => chr(8734),
'ang;' => chr(8736),
'and;' => chr(8743),
'or;' => chr(8744),
'cap;' => chr(8745),
'cup;' => chr(8746),
'int;' => chr(8747),
'there4;' => chr(8756),
'sim;' => chr(8764),
'cong;' => chr(8773),
'asymp;' => chr(8776),
'ne;' => chr(8800),
'equiv;' => chr(8801),
'le;' => chr(8804),
'ge;' => chr(8805),
'sub;' => chr(8834),
'sup;' => chr(8835),
'nsub;' => chr(8836),
'sube;' => chr(8838),
'supe;' => chr(8839),
'oplus;' => chr(8853),
'otimes;' => chr(8855),
'perp;' => chr(8869),
'sdot;' => chr(8901),
'lceil;' => chr(8968),
'rceil;' => chr(8969),
'lfloor;' => chr(8970),
'rfloor;' => chr(8971),
'lang;' => chr(9001),
'rang;' => chr(9002),
'loz;' => chr(9674),
'spades;' => chr(9824),
'clubs;' => chr(9827),
'hearts;' => chr(9829),
'diams;' => chr(9830),
) : ())
);
# Make the opposite mapping
while (my($entity, $char) = each(%entity2char)) {
$entity =~ s/;\z//;
$char2entity{$char} = "&$entity;";
}
delete $char2entity{"'"}; # only one-way decoding
# Fill in missing entities
for (0 .. 255) {
next if exists $char2entity{chr($_)};
$char2entity{chr($_)} = "&#$_;";
}
my %subst; # compiled encoding regexps
sub encode_entities
{
return undef unless defined $_[0];
my $ref;
if (defined wantarray) {
my $x = $_[0];
$ref = \$x; # copy
} else {
$ref = \$_[0]; # modify in-place
}
if (defined $_[1] and length $_[1]) {
unless (exists $subst{$_[1]}) {
# Because we can't compile regex we fake it with a cached sub
my $chars = $_[1];
$chars =~ s,(?<!\\)([]/]),\\$1,g;
$chars =~ s,(?<!\\)\\\z,\\\\,;
my $code = "sub {\$_[0] =~ s/([$chars])/\$char2entity{\$1} || num_entity(\$1)/ge; }";
$subst{$_[1]} = eval $code;
die( $@ . " while trying to turn range: \"$_[1]\"\n "
. "into code: $code\n "
) if $@;
}
&{$subst{$_[1]}}($$ref);
} else {
# Encode control chars, high bit chars and '<', '&', '>', ''' and '"'
$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;
}
$$ref;
}
sub encode_entities_numeric {
local %char2entity;
return &encode_entities; # a goto &encode_entities wouldn't work
}
sub num_entity {
sprintf "&#x%X;", ord($_[0]);
}
# Set up aliases
*encode = \&encode_entities;
*encode_numeric = \&encode_entities_numeric;
*encode_numerically = \&encode_entities_numeric;
*decode = \&decode_entities;
1;

110
database/perl/vendor/lib/HTML/Filter.pm vendored Normal file
View File

@@ -0,0 +1,110 @@
package HTML::Filter;
use strict;
require HTML::Parser;
our @ISA = qw(HTML::Parser);
our $VERSION = '3.75';
sub declaration { $_[0]->output("<!$_[1]>") }
sub process { $_[0]->output($_[2]) }
sub comment { $_[0]->output("<!--$_[1]-->") }
sub start { $_[0]->output($_[4]) }
sub end { $_[0]->output($_[2]) }
sub text { $_[0]->output($_[1]) }
sub output { print $_[1] }
1;
__END__
=head1 NAME
HTML::Filter - Filter HTML text through the parser
=head1 NOTE
B<This module is deprecated.> The C<HTML::Parser> now provides the
functionally of C<HTML::Filter> much more efficiently with the
C<default> handler.
=head1 SYNOPSIS
require HTML::Filter;
$p = HTML::Filter->new->parse_file("index.html");
=head1 DESCRIPTION
C<HTML::Filter> is an HTML parser that by default prints the
original text of each HTML element (a slow version of cat(1) basically).
The callback methods may be overridden to modify the filtering for some
HTML elements and you can override output() method which is called to
print the HTML text.
C<HTML::Filter> is a subclass of C<HTML::Parser>. This means that
the document should be given to the parser by calling the $p->parse()
or $p->parse_file() methods.
=head1 EXAMPLES
The first example is a filter that will remove all comments from an
HTML file. This is achieved by simply overriding the comment method
to do nothing.
package CommentStripper;
require HTML::Filter;
@ISA=qw(HTML::Filter);
sub comment { } # ignore comments
The second example shows a filter that will remove any E<lt>TABLE>s
found in the HTML file. We specialize the start() and end() methods
to count table tags and then make output not happen when inside a
table.
package TableStripper;
require HTML::Filter;
@ISA=qw(HTML::Filter);
sub start
{
my $self = shift;
$self->{table_seen}++ if $_[0] eq "table";
$self->SUPER::start(@_);
}
sub end
{
my $self = shift;
$self->SUPER::end(@_);
$self->{table_seen}-- if $_[0] eq "table";
}
sub output
{
my $self = shift;
unless ($self->{table_seen}) {
$self->SUPER::output(@_);
}
}
If you want to collect the parsed text internally you might want to do
something like this:
package FilterIntoString;
require HTML::Filter;
@ISA=qw(HTML::Filter);
sub output { push(@{$_[0]->{fhtml}}, $_[1]) }
sub filtered_html { join("", @{$_[0]->{fhtml}}) }
=head1 SEE ALSO
L<HTML::Parser>
=head1 COPYRIGHT
Copyright 1997-1999 Gisle Aas.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=cut

1580
database/perl/vendor/lib/HTML/Form.pm vendored Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,314 @@
package HTML::HeadParser;
=head1 NAME
HTML::HeadParser - Parse <HEAD> section of a HTML document
=head1 SYNOPSIS
require HTML::HeadParser;
$p = HTML::HeadParser->new;
$p->parse($text) and print "not finished";
$p->header('Title') # to access <title>....</title>
$p->header('Content-Base') # to access <base href="http://...">
$p->header('Foo') # to access <meta http-equiv="Foo" content="...">
$p->header('X-Meta-Author') # to access <meta name="author" content="...">
$p->header('X-Meta-Charset') # to access <meta charset="...">
=head1 DESCRIPTION
The C<HTML::HeadParser> is a specialized (and lightweight)
C<HTML::Parser> that will only parse the E<lt>HEAD>...E<lt>/HEAD>
section of an HTML document. The parse() method
will return a FALSE value as soon as some E<lt>BODY> element or body
text are found, and should not be called again after this.
Note that the C<HTML::HeadParser> might get confused if raw undecoded
UTF-8 is passed to the parse() method. Make sure the strings are
properly decoded before passing them on.
The C<HTML::HeadParser> keeps a reference to a header object, and the
parser will update this header object as the various elements of the
E<lt>HEAD> section of the HTML document are recognized. The following
header fields are affected:
=over 4
=item Content-Base:
The I<Content-Base> header is initialized from the E<lt>base
href="..."> element.
=item Title:
The I<Title> header is initialized from the E<lt>title>...E<lt>/title>
element.
=item Isindex:
The I<Isindex> header will be added if there is a E<lt>isindex>
element in the E<lt>head>. The header value is initialized from the
I<prompt> attribute if it is present. If no I<prompt> attribute is
given it will have '?' as the value.
=item X-Meta-Foo:
All E<lt>meta> elements containing a C<name> attribute will result in
headers using the prefix C<X-Meta-> appended with the value of the
C<name> attribute as the name of the header, and the value of the
C<content> attribute as the pushed header value.
E<lt>meta> elements containing a C<http-equiv> attribute will result
in headers as in above, but without the C<X-Meta-> prefix in the
header name.
E<lt>meta> elements containing a C<charset> attribute will result in
an C<X-Meta-Charset> header, using the value of the C<charset>
attribute as the pushed header value.
The ':' character can't be represented in header field names, so
if the meta element contains this char it's substituted with '-'
before forming the field name.
=back
=head1 METHODS
The following methods (in addition to those provided by the
superclass) are available:
=over 4
=cut
require HTML::Parser;
our @ISA = qw(HTML::Parser);
use HTML::Entities ();
use strict;
use vars qw($DEBUG);
#$DEBUG = 1;
our $VERSION = '3.75';
=item $hp = HTML::HeadParser->new
=item $hp = HTML::HeadParser->new( $header )
The object constructor. The optional $header argument should be a
reference to an object that implement the header() and push_header()
methods as defined by the C<HTTP::Headers> class. Normally it will be
of some class that is a or delegates to the C<HTTP::Headers> class.
If no $header is given C<HTML::HeadParser> will create an
C<HTTP::Headers> object by itself (initially empty).
=cut
sub new
{
my($class, $header) = @_;
unless ($header) {
require HTTP::Headers;
$header = HTTP::Headers->new;
}
my $self = $class->SUPER::new(api_version => 3,
start_h => ["start", "self,tagname,attr"],
end_h => ["end", "self,tagname"],
text_h => ["text", "self,text"],
ignore_elements => [qw(script style)],
);
$self->{'header'} = $header;
$self->{'tag'} = ''; # name of active element that takes textual content
$self->{'text'} = ''; # the accumulated text associated with the element
$self;
}
=item $hp->header;
Returns a reference to the header object.
=item $hp->header( $key )
Returns a header value. It is just a shorter way to write
C<$hp-E<gt>header-E<gt>header($key)>.
=cut
sub header
{
my $self = shift;
return $self->{'header'} unless @_;
$self->{'header'}->header(@_);
}
sub as_string # legacy
{
my $self = shift;
$self->{'header'}->as_string;
}
sub flush_text # internal
{
my $self = shift;
my $tag = $self->{'tag'};
my $text = $self->{'text'};
$text =~ s/^\s+//;
$text =~ s/\s+$//;
$text =~ s/\s+/ /g;
print "FLUSH $tag => '$text'\n" if $DEBUG;
if ($tag eq 'title') {
my $decoded;
$decoded = utf8::decode($text) if $self->utf8_mode && defined &utf8::decode;
HTML::Entities::decode($text);
utf8::encode($text) if $decoded;
$self->{'header'}->push_header(Title => $text);
}
$self->{'tag'} = $self->{'text'} = '';
}
# This is an quote from the HTML3.2 DTD which shows which elements
# that might be present in a <HEAD>...</HEAD>. Also note that the
# <HEAD> tags themselves might be missing:
#
# <!ENTITY % head.content "TITLE & ISINDEX? & BASE? & STYLE? &
# SCRIPT* & META* & LINK*">
#
# <!ELEMENT HEAD O O (%head.content)>
#
# From HTML 4.01:
#
# <!ENTITY % head.misc "SCRIPT|STYLE|META|LINK|OBJECT">
# <!ENTITY % head.content "TITLE & BASE?">
# <!ELEMENT HEAD O O (%head.content;) +(%head.misc;)>
#
# From HTML 5 as of WD-html5-20090825:
#
# One or more elements of metadata content, [...]
# => base, command, link, meta, noscript, script, style, title
sub start
{
my($self, $tag, $attr) = @_; # $attr is reference to a HASH
print "START[$tag]\n" if $DEBUG;
$self->flush_text if $self->{'tag'};
if ($tag eq 'meta') {
my $key = $attr->{'http-equiv'};
if (!defined($key) || !length($key)) {
if ($attr->{name}) {
$key = "X-Meta-\u$attr->{name}";
} elsif ($attr->{charset}) { # HTML 5 <meta charset="...">
$key = "X-Meta-Charset";
$self->{header}->push_header($key => $attr->{charset});
return;
} else {
return;
}
}
$key =~ s/:/-/g;
$self->{'header'}->push_header($key => $attr->{content});
} elsif ($tag eq 'base') {
return unless exists $attr->{href};
(my $base = $attr->{href}) =~ s/^\s+//; $base =~ s/\s+$//; # HTML5
$self->{'header'}->push_header('Content-Base' => $base);
} elsif ($tag eq 'isindex') {
# This is a non-standard header. Perhaps we should just ignore
# this element
$self->{'header'}->push_header(Isindex => $attr->{prompt} || '?');
} elsif ($tag =~ /^(?:title|noscript|object|command)$/) {
# Just remember tag. Initialize header when we see the end tag.
$self->{'tag'} = $tag;
} elsif ($tag eq 'link') {
return unless exists $attr->{href};
# <link href="http:..." rel="xxx" rev="xxx" title="xxx">
my $href = delete($attr->{href});
$href =~ s/^\s+//; $href =~ s/\s+$//; # HTML5
my $h_val = "<$href>";
for (sort keys %{$attr}) {
next if $_ eq "/"; # XHTML junk
$h_val .= qq(; $_="$attr->{$_}");
}
$self->{'header'}->push_header(Link => $h_val);
} elsif ($tag eq 'head' || $tag eq 'html') {
# ignore
} else {
# stop parsing
$self->eof;
}
}
sub end
{
my($self, $tag) = @_;
print "END[$tag]\n" if $DEBUG;
$self->flush_text if $self->{'tag'};
$self->eof if $tag eq 'head';
}
sub text
{
my($self, $text) = @_;
print "TEXT[$text]\n" if $DEBUG;
unless ($self->{first_chunk}) {
# drop Unicode BOM if found
if ($self->utf8_mode) {
$text =~ s/^\xEF\xBB\xBF//;
}
else {
$text =~ s/^\x{FEFF}//;
}
$self->{first_chunk}++;
}
my $tag = $self->{tag};
if (!$tag && $text =~ /\S/) {
# Normal text means start of body
$self->eof;
return;
}
return if $tag ne 'title';
$self->{'text'} .= $text;
}
BEGIN {
*utf8_mode = sub { 1 } unless HTML::Entities::UNICODE_SUPPORT;
}
1;
__END__
=back
=head1 EXAMPLE
$h = HTTP::Headers->new;
$p = HTML::HeadParser->new($h);
$p->parse(<<EOT);
<title>Stupid example</title>
<base href="http://www.linpro.no/lwp/">
Normal text starts here.
EOT
undef $p;
print $h->title; # should print "Stupid example"
=head1 SEE ALSO
L<HTML::Parser>, L<HTTP::Headers>
The C<HTTP::Headers> class is distributed as part of the
I<libwww-perl> package. If you don't have that distribution installed
you need to provide the $header argument to the C<HTML::HeadParser>
constructor with your own object that implements the documented
protocol.
=head1 COPYRIGHT
Copyright 1996-2001 Gisle Aas. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=cut

View File

@@ -0,0 +1,185 @@
package HTML::LinkExtor;
require HTML::Parser;
our @ISA = qw(HTML::Parser);
our $VERSION = '3.75';
=head1 NAME
HTML::LinkExtor - Extract links from an HTML document
=head1 SYNOPSIS
require HTML::LinkExtor;
$p = HTML::LinkExtor->new(\&cb, "http://www.perl.org/");
sub cb {
my($tag, %links) = @_;
print "$tag @{[%links]}\n";
}
$p->parse_file("index.html");
=head1 DESCRIPTION
I<HTML::LinkExtor> is an HTML parser that extracts links from an
HTML document. The I<HTML::LinkExtor> is a subclass of
I<HTML::Parser>. This means that the document should be given to the
parser by calling the $p->parse() or $p->parse_file() methods.
=cut
use strict;
use HTML::Tagset ();
# legacy (some applications grabs this hash directly)
use vars qw(%LINK_ELEMENT);
*LINK_ELEMENT = \%HTML::Tagset::linkElements;
=over 4
=item $p = HTML::LinkExtor->new
=item $p = HTML::LinkExtor->new( $callback )
=item $p = HTML::LinkExtor->new( $callback, $base )
The constructor takes two optional arguments. The first is a reference
to a callback routine. It will be called as links are found. If a
callback is not provided, then links are just accumulated internally
and can be retrieved by calling the $p->links() method.
The $base argument is an optional base URL used to absolutize all URLs found.
You need to have the I<URI> module installed if you provide $base.
The callback is called with the lowercase tag name as first argument,
and then all link attributes as separate key/value pairs. All
non-link attributes are removed.
=cut
sub new
{
my($class, $cb, $base) = @_;
my $self = $class->SUPER::new(
start_h => ["_start_tag", "self,tagname,attr"],
report_tags => [keys %HTML::Tagset::linkElements],
);
$self->{extractlink_cb} = $cb;
if ($base) {
require URI;
$self->{extractlink_base} = URI->new($base);
}
$self;
}
sub _start_tag
{
my($self, $tag, $attr) = @_;
my $base = $self->{extractlink_base};
my $links = $HTML::Tagset::linkElements{$tag};
$links = [$links] unless ref $links;
my @links;
my $a;
for $a (@$links) {
next unless exists $attr->{$a};
(my $link = $attr->{$a}) =~ s/^\s+//; $link =~ s/\s+$//; # HTML5
push(@links, $a, $base ? URI->new($link, $base)->abs($base) : $link);
}
return unless @links;
$self->_found_link($tag, @links);
}
sub _found_link
{
my $self = shift;
my $cb = $self->{extractlink_cb};
if ($cb) {
&$cb(@_);
} else {
push(@{$self->{'links'}}, [@_]);
}
}
=item $p->links
Returns a list of all links found in the document. The returned
values will be anonymous arrays with the following elements:
[$tag, $attr => $url1, $attr2 => $url2,...]
The $p->links method will also truncate the internal link list. This
means that if the method is called twice without any parsing
between them the second call will return an empty list.
Also note that $p->links will always be empty if a callback routine
was provided when the I<HTML::LinkExtor> was created.
=cut
sub links
{
my $self = shift;
exists($self->{'links'}) ? @{delete $self->{'links'}} : ();
}
# We override the parse_file() method so that we can clear the links
# before we start a new file.
sub parse_file
{
my $self = shift;
delete $self->{'links'};
$self->SUPER::parse_file(@_);
}
=back
=head1 EXAMPLE
This is an example showing how you can extract links from a document
received using LWP:
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;
$url = "http://www.perl.org/"; # for instance
$ua = LWP::UserAgent->new;
# Set up a callback that collect image links
my @imgs = ();
sub callback {
my($tag, %attr) = @_;
return if $tag ne 'img'; # we only look closer at <img ...>
push(@imgs, values %attr);
}
# Make the parser. Unfortunately, we don't know the base yet
# (it might be different from $url)
$p = HTML::LinkExtor->new(\&callback);
# Request document and parse it as it arrives
$res = $ua->request(HTTP::Request->new(GET => $url),
sub {$p->parse($_[0])});
# Expand all image URLs to absolute ones
my $base = $res->base;
@imgs = map { $_ = url($_, $base)->abs; } @imgs;
# Print them out
print join("\n", @imgs), "\n";
=head1 SEE ALSO
L<HTML::Parser>, L<HTML::Tagset>, L<LWP>, L<URI::URL>
=head1 COPYRIGHT
Copyright 1996-2001 Gisle Aas.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=cut
1;

196
database/perl/vendor/lib/HTML/Parse.pm vendored Normal file
View File

@@ -0,0 +1,196 @@
package HTML::Parse;
use 5.008;
#ABSTRACT: Deprecated, a wrapper around HTML::TreeBuilder
use warnings;
use strict;
our $VERSION = '5.07'; # VERSION from OurPkgVersion
use vars qw(@ISA @EXPORT
$IMPLICIT_TAGS $IGNORE_UNKNOWN $IGNORE_TEXT $WARN
);
require Exporter;
@ISA = qw(Exporter);
@EXPORT = qw(parse_html parse_htmlfile);
# Backwards compatability
$IMPLICIT_TAGS = 1;
$IGNORE_UNKNOWN = 1;
$IGNORE_TEXT = 0;
$WARN = 0;
require HTML::TreeBuilder;
sub parse_html {
my $p = $_[1];
$p = _new_tree_maker() unless $p;
$p->parse( $_[0] );
}
sub parse_htmlfile {
my ( $file, $p ) = @_;
my ($HTML);
open( $HTML, "<", $file ) or return;
$p = _new_tree_maker() unless $p;
$p->parse_file($HTML);
}
sub _new_tree_maker {
my $p = HTML::TreeBuilder->new(
implicit_tags => $IMPLICIT_TAGS,
ignore_unknown => $IGNORE_UNKNOWN,
ignore_text => $IGNORE_TEXT,
'warn' => $WARN,
);
$p->strict_comment(1);
$p;
}
1;
__END__
=pod
=head1 NAME
HTML::Parse - Deprecated, a wrapper around HTML::TreeBuilder
=head1 VERSION
This document describes version 5.07 of
HTML::Parse, released August 31, 2017
as part of L<HTML-Tree|HTML::Tree>.
=head1 SYNOPSIS
See the documentation for HTML::TreeBuilder
=head1 DESCRIPTION
Disclaimer: This module is provided only for backwards compatibility
with earlier versions of this library. New code should I<not> use
this module, and should really use the HTML::Parser and
HTML::TreeBuilder modules directly, instead.
The C<HTML::Parse> module provides functions to parse HTML documents.
There are two functions exported by this module:
=over 4
=item parse_html($html) or parse_html($html, $obj)
This function is really just a synonym for $obj->parse($html) and $obj
is assumed to be a subclass of C<HTML::Parser>. Refer to
L<HTML::Parser> for more documentation.
If $obj is not specified, the $obj will default to an internally
created new C<HTML::TreeBuilder> object configured with strict_comment()
turned on. That class implements a parser that builds (and is) a HTML
syntax tree with HTML::Element objects as nodes.
The return value from parse_html() is $obj.
=item parse_htmlfile($file, [$obj])
Same as parse_html(), but pulls the HTML to parse, from the named file.
Returns C<undef> if the file could not be opened, or $obj otherwise.
=back
When a C<HTML::TreeBuilder> object is created, the following variables
control how parsing takes place:
=over 4
=item $HTML::Parse::IMPLICIT_TAGS
Setting this variable to true will instruct the parser to try to
deduce implicit elements and implicit end tags. If this variable is
false you get a parse tree that just reflects the text as it stands.
Might be useful for quick & dirty parsing. Default is true.
Implicit elements have the implicit() attribute set.
=item $HTML::Parse::IGNORE_UNKNOWN
This variable contols whether unknow tags should be represented as
elements in the parse tree. Default is true.
=item $HTML::Parse::IGNORE_TEXT
Do not represent the text content of elements. This saves space if
all you want is to examine the structure of the document. Default is
false.
=item $HTML::Parse::WARN
Call warn() with an appropriate message for syntax errors. Default is
false.
=back
=head1 REMEMBER!
HTML::TreeBuilder objects should be explicitly destroyed when you're
finished with them. See L<HTML::TreeBuilder>.
=head1 SEE ALSO
L<HTML::Parser>, L<HTML::TreeBuilder>, L<HTML::Element>
=head1 AUTHOR
Current maintainers:
=over
=item * Christopher J. Madsen S<C<< <perl AT cjmweb.net> >>>
=item * Jeff Fearn S<C<< <jfearn AT cpan.org> >>>
=back
Original HTML-Tree author:
=over
=item * Gisle Aas
=back
Former maintainers:
=over
=item * Sean M. Burke
=item * Andy Lester
=item * Pete Krawczyk S<C<< <petek AT cpan.org> >>>
=back
You can follow or contribute to HTML-Tree's development at
L<< https://github.com/kentfredric/HTML-Tree >>.
=head1 COPYRIGHT AND LICENSE
Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke,
2005 Andy Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn,
2012 Christopher J. Madsen.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
The programs in this library are distributed in the hope that they
will be useful, but without any warranty; without even the implied
warranty of merchantability or fitness for a particular purpose.
=cut

1252
database/perl/vendor/lib/HTML/Parser.pm vendored Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,210 @@
package HTML::PullParser;
use strict;
require HTML::Parser;
our @ISA = qw(HTML::Parser);
our $VERSION = '3.75';
use Carp ();
sub new
{
my($class, %cnf) = @_;
# Construct argspecs for the various events
my %argspec;
for (qw(start end text declaration comment process default)) {
my $tmp = delete $cnf{$_};
next unless defined $tmp;
$argspec{$_} = $tmp;
}
Carp::croak("Info not collected for any events")
unless %argspec;
my $file = delete $cnf{file};
my $doc = delete $cnf{doc};
Carp::croak("Can't parse from both 'doc' and 'file' at the same time")
if defined($file) && defined($doc);
Carp::croak("No 'doc' or 'file' given to parse from")
unless defined($file) || defined($doc);
# Create object
$cnf{api_version} = 3;
my $self = $class->SUPER::new(%cnf);
my $accum = $self->{pullparser_accum} = [];
while (my($event, $argspec) = each %argspec) {
$self->SUPER::handler($event => $accum, $argspec);
}
if (defined $doc) {
$self->{pullparser_str_ref} = ref($doc) ? $doc : \$doc;
$self->{pullparser_str_pos} = 0;
}
else {
if (!ref($file) && ref(\$file) ne "GLOB") {
require IO::File;
$file = IO::File->new($file, "r") || return;
}
$self->{pullparser_file} = $file;
}
$self;
}
sub handler
{
Carp::croak("Can't set handlers for HTML::PullParser");
}
sub get_token
{
my $self = shift;
while (!@{$self->{pullparser_accum}} && !$self->{pullparser_eof}) {
if (my $f = $self->{pullparser_file}) {
# must try to parse more from the file
my $buf;
if (read($f, $buf, 512)) {
$self->parse($buf);
} else {
$self->eof;
$self->{pullparser_eof}++;
delete $self->{pullparser_file};
}
}
elsif (my $sref = $self->{pullparser_str_ref}) {
# must try to parse more from the scalar
my $pos = $self->{pullparser_str_pos};
my $chunk = substr($$sref, $pos, 512);
$self->parse($chunk);
$pos += length($chunk);
if ($pos < length($$sref)) {
$self->{pullparser_str_pos} = $pos;
}
else {
$self->eof;
$self->{pullparser_eof}++;
delete $self->{pullparser_str_ref};
delete $self->{pullparser_str_pos};
}
}
else {
die;
}
}
shift @{$self->{pullparser_accum}};
}
sub unget_token
{
my $self = shift;
unshift @{$self->{pullparser_accum}}, @_;
$self;
}
1;
__END__
=head1 NAME
HTML::PullParser - Alternative HTML::Parser interface
=head1 SYNOPSIS
use HTML::PullParser;
$p = HTML::PullParser->new(file => "index.html",
start => 'event, tagname, @attr',
end => 'event, tagname',
ignore_elements => [qw(script style)],
) || die "Can't open: $!";
while (my $token = $p->get_token) {
#...do something with $token
}
=head1 DESCRIPTION
The HTML::PullParser is an alternative interface to the HTML::Parser class.
It basically turns the HTML::Parser inside out. You associate a file
(or any IO::Handle object or string) with the parser at construction time and
then repeatedly call $parser->get_token to obtain the tags and text
found in the parsed document.
The following methods are provided:
=over 4
=item $p = HTML::PullParser->new( file => $file, %options )
=item $p = HTML::PullParser->new( doc => \$doc, %options )
A C<HTML::PullParser> can be made to parse from either a file or a
literal document based on whether the C<file> or C<doc> option is
passed to the parser's constructor.
The C<file> passed in can either be a file name or a file handle
object. If a file name is passed, and it can't be opened for reading,
then the constructor will return an undefined value and $! will tell
you why it failed. Otherwise the argument is taken to be some object
that the C<HTML::PullParser> can read() from when it needs more data.
The stream will be read() until EOF, but not closed.
A C<doc> can be passed plain or as a reference
to a scalar. If a reference is passed then the value of this scalar
should not be changed before all tokens have been extracted.
Next the information to be returned for the different token types must
be set up. This is done by simply associating an argspec (as defined
in L<HTML::Parser>) with the events you have an interest in. For
instance, if you want C<start> tokens to be reported as the string
C<'S'> followed by the tagname and the attributes you might pass an
C<start>-option like this:
$p = HTML::PullParser->new(
doc => $document_to_parse,
start => '"S", tagname, @attr',
end => '"E", tagname',
);
At last other C<HTML::Parser> options, like C<ignore_tags>, and
C<unbroken_text>, can be passed in. Note that you should not use the
I<event>_h options to set up parser handlers. That would confuse the
inner logic of C<HTML::PullParser>.
=item $token = $p->get_token
This method will return the next I<token> found in the HTML document,
or C<undef> at the end of the document. The token is returned as an
array reference. The content of this array match the argspec set up
during C<HTML::PullParser> construction.
=item $p->unget_token( @tokens )
If you find out you have read too many tokens you can push them back,
so that they are returned again the next time $p->get_token is called.
=back
=head1 EXAMPLES
The 'eg/hform' script shows how we might parse the form section of
HTML::Documents using HTML::PullParser.
=head1 SEE ALSO
L<HTML::Parser>, L<HTML::TokeParser>
=head1 COPYRIGHT
Copyright 1998-2001 Gisle Aas.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=cut

471
database/perl/vendor/lib/HTML/Tagset.pm vendored Normal file
View File

@@ -0,0 +1,471 @@
package HTML::Tagset;
use strict;
=head1 NAME
HTML::Tagset - data tables useful in parsing HTML
=head1 VERSION
Version 3.20
=cut
use vars qw( $VERSION );
$VERSION = '3.20';
=head1 SYNOPSIS
use HTML::Tagset;
# Then use any of the items in the HTML::Tagset package
# as need arises
=head1 DESCRIPTION
This module contains several data tables useful in various kinds of
HTML parsing operations.
Note that all tag names used are lowercase.
In the following documentation, a "hashset" is a hash being used as a
set -- the hash conveys that its keys are there, and the actual values
associated with the keys are not significant. (But what values are
there, are always true.)
=cut
use vars qw(
$VERSION
%emptyElement %optionalEndTag %linkElements %boolean_attr
%isHeadElement %isBodyElement %isPhraseMarkup
%is_Possible_Strict_P_Content
%isHeadOrBodyElement
%isList %isTableElement %isFormElement
%isKnown %canTighten
@p_closure_barriers
%isCDATA_Parent
);
=head1 VARIABLES
Note that none of these variables are exported.
=head2 hashset %HTML::Tagset::emptyElement
This hashset has as values the tag-names (GIs) of elements that cannot
have content. (For example, "base", "br", "hr".) So
C<$HTML::Tagset::emptyElement{'hr'}> exists and is true.
C<$HTML::Tagset::emptyElement{'dl'}> does not exist, and so is not true.
=cut
%emptyElement = map {; $_ => 1 } qw(base link meta isindex
img br hr wbr
input area param
embed bgsound spacer
basefont col frame
~comment ~literal
~declaration ~pi
);
# The "~"-initial names are for pseudo-elements used by HTML::Entities
# and TreeBuilder
=head2 hashset %HTML::Tagset::optionalEndTag
This hashset lists tag-names for elements that can have content, but whose
end-tags are generally, "safely", omissible. Example:
C<$HTML::Tagset::emptyElement{'li'}> exists and is true.
=cut
%optionalEndTag = map {; $_ => 1 } qw(p li dt dd); # option th tr td);
=head2 hash %HTML::Tagset::linkElements
Values in this hash are tagnames for elements that might contain
links, and the value for each is a reference to an array of the names
of attributes whose values can be links.
=cut
%linkElements =
(
'a' => ['href'],
'applet' => ['archive', 'codebase', 'code'],
'area' => ['href'],
'base' => ['href'],
'bgsound' => ['src'],
'blockquote' => ['cite'],
'body' => ['background'],
'del' => ['cite'],
'embed' => ['pluginspage', 'src'],
'form' => ['action'],
'frame' => ['src', 'longdesc'],
'iframe' => ['src', 'longdesc'],
'ilayer' => ['background'],
'img' => ['src', 'lowsrc', 'longdesc', 'usemap'],
'input' => ['src', 'usemap'],
'ins' => ['cite'],
'isindex' => ['action'],
'head' => ['profile'],
'layer' => ['background', 'src'],
'link' => ['href'],
'object' => ['classid', 'codebase', 'data', 'archive', 'usemap'],
'q' => ['cite'],
'script' => ['src', 'for'],
'table' => ['background'],
'td' => ['background'],
'th' => ['background'],
'tr' => ['background'],
'xmp' => ['href'],
);
=head2 hash %HTML::Tagset::boolean_attr
This hash (not hashset) lists what attributes of what elements can be
printed without showing the value (for example, the "noshade" attribute
of "hr" elements). For elements with only one such attribute, its value
is simply that attribute name. For elements with many such attributes,
the value is a reference to a hashset containing all such attributes.
=cut
%boolean_attr = (
# TODO: make these all hashes
'area' => 'nohref',
'dir' => 'compact',
'dl' => 'compact',
'hr' => 'noshade',
'img' => 'ismap',
'input' => { 'checked' => 1, 'readonly' => 1, 'disabled' => 1 },
'menu' => 'compact',
'ol' => 'compact',
'option' => 'selected',
'select' => 'multiple',
'td' => 'nowrap',
'th' => 'nowrap',
'ul' => 'compact',
);
#==========================================================================
# List of all elements from Extensible HTML version 1.0 Transitional DTD:
#
# a abbr acronym address applet area b base basefont bdo big
# blockquote body br button caption center cite code col colgroup
# dd del dfn dir div dl dt em fieldset font form h1 h2 h3 h4 h5 h6
# head hr html i iframe img input ins isindex kbd label legend li
# link map menu meta noframes noscript object ol optgroup option p
# param pre q s samp script select small span strike strong style
# sub sup table tbody td textarea tfoot th thead title tr tt u ul
# var
#
# Varia from Mozilla source internal table of tags:
# Implemented:
# xmp listing wbr nobr frame frameset noframes ilayer
# layer nolayer spacer embed multicol
# But these are unimplemented:
# sound?? keygen?? server??
# Also seen here and there:
# marquee?? app?? (both unimplemented)
#==========================================================================
=head2 hashset %HTML::Tagset::isPhraseMarkup
This hashset contains all phrasal-level elements.
=cut
%isPhraseMarkup = map {; $_ => 1 } qw(
span abbr acronym q sub sup
cite code em kbd samp strong var dfn strike
b i u s tt small big
a img br
wbr nobr blink
font basefont bdo
spacer embed noembed
); # had: center, hr, table
=head2 hashset %HTML::Tagset::is_Possible_Strict_P_Content
This hashset contains all phrasal-level elements that be content of a
P element, for a strict model of HTML.
=cut
%is_Possible_Strict_P_Content = (
%isPhraseMarkup,
%isFormElement,
map {; $_ => 1} qw( object script map )
# I've no idea why there's these latter exceptions.
# I'm just following the HTML4.01 DTD.
);
#from html4 strict:
#<!ENTITY % fontstyle "TT | I | B | BIG | SMALL">
#
#<!ENTITY % phrase "EM | STRONG | DFN | CODE |
# SAMP | KBD | VAR | CITE | ABBR | ACRONYM" >
#
#<!ENTITY % special
# "A | IMG | OBJECT | BR | SCRIPT | MAP | Q | SUB | SUP | SPAN | BDO">
#
#<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">
#
#<!-- %inline; covers inline or "text-level" elements -->
#<!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; | %formctrl;">
=head2 hashset %HTML::Tagset::isHeadElement
This hashset contains all elements that elements that should be
present only in the 'head' element of an HTML document.
=cut
%isHeadElement = map {; $_ => 1 }
qw(title base link meta isindex script style object bgsound);
=head2 hashset %HTML::Tagset::isList
This hashset contains all elements that can contain "li" elements.
=cut
%isList = map {; $_ => 1 } qw(ul ol dir menu);
=head2 hashset %HTML::Tagset::isTableElement
This hashset contains all elements that are to be found only in/under
a "table" element.
=cut
%isTableElement = map {; $_ => 1 }
qw(tr td th thead tbody tfoot caption col colgroup);
=head2 hashset %HTML::Tagset::isFormElement
This hashset contains all elements that are to be found only in/under
a "form" element.
=cut
%isFormElement = map {; $_ => 1 }
qw(input select option optgroup textarea button label);
=head2 hashset %HTML::Tagset::isBodyMarkup
This hashset contains all elements that are to be found only in/under
the "body" element of an HTML document.
=cut
%isBodyElement = map {; $_ => 1 } qw(
h1 h2 h3 h4 h5 h6
p div pre plaintext address blockquote
xmp listing
center
multicol
iframe ilayer nolayer
bgsound
hr
ol ul dir menu li
dl dt dd
ins del
fieldset legend
map area
applet param object
isindex script noscript
table
center
form
),
keys %isFormElement,
keys %isPhraseMarkup, # And everything phrasal
keys %isTableElement,
;
=head2 hashset %HTML::Tagset::isHeadOrBodyElement
This hashset includes all elements that I notice can fall either in
the head or in the body.
=cut
%isHeadOrBodyElement = map {; $_ => 1 }
qw(script isindex style object map area param noscript bgsound);
# i.e., if we find 'script' in the 'body' or the 'head', don't freak out.
=head2 hashset %HTML::Tagset::isKnown
This hashset lists all known HTML elements.
=cut
%isKnown = (%isHeadElement, %isBodyElement,
map{; $_=>1 }
qw( head body html
frame frameset noframes
~comment ~pi ~directive ~literal
));
# that should be all known tags ever ever
=head2 hashset %HTML::Tagset::canTighten
This hashset lists elements that might have ignorable whitespace as
children or siblings.
=cut
%canTighten = %isKnown;
delete @canTighten{
keys(%isPhraseMarkup), 'input', 'select',
'xmp', 'listing', 'plaintext', 'pre',
};
# xmp, listing, plaintext, and pre are untightenable, and
# in a really special way.
@canTighten{'hr','br'} = (1,1);
# exceptional 'phrasal' things that ARE subject to tightening.
# The one case where I can think of my tightening rules failing is:
# <p>foo bar<center> <em>baz quux</em> ...
# ^-- that would get deleted.
# But that's pretty gruesome code anyhow. You gets what you pays for.
#==========================================================================
=head2 array @HTML::Tagset::p_closure_barriers
This array has a meaning that I have only seen a need for in
C<HTML::TreeBuilder>, but I include it here on the off chance that someone
might find it of use:
When we see a "E<lt>pE<gt>" token, we go lookup up the lineage for a p
element we might have to minimize. At first sight, we might say that
if there's a p anywhere in the lineage of this new p, it should be
closed. But that's wrong. Consider this document:
<html>
<head>
<title>foo</title>
</head>
<body>
<p>foo
<table>
<tr>
<td>
foo
<p>bar
</td>
</tr>
</table>
</p>
</body>
</html>
The second p is quite legally inside a much higher p.
My formalization of the reason why this is legal, but this:
<p>foo<p>bar</p></p>
isn't, is that something about the table constitutes a "barrier" to
the application of the rule about what p must minimize.
So C<@HTML::Tagset::p_closure_barriers> is the list of all such
barrier-tags.
=cut
@p_closure_barriers = qw(
li blockquote
ul ol menu dir
dl dt dd
td th tr table caption
div
);
# In an ideal world (i.e., XHTML) we wouldn't have to bother with any of this
# monkey business of barriers to minimization!
=head2 hashset %isCDATA_Parent
This hashset includes all elements whose content is CDATA.
=cut
%isCDATA_Parent = map {; $_ => 1 }
qw(script style xmp listing plaintext);
# TODO: there's nothing else that takes CDATA children, right?
# As the HTML3 DTD (Raggett 1995-04-24) noted:
# The XMP, LISTING and PLAINTEXT tags are incompatible with SGML
# and derive from very early versions of HTML. They require non-
# standard parsers and will cause problems for processing
# documents with standard SGML tools.
=head1 CAVEATS
You may find it useful to alter the behavior of modules (like
C<HTML::Element> or C<HTML::TreeBuilder>) that use C<HTML::Tagset>'s
data tables by altering the data tables themselves. You are welcome
to try, but be careful; and be aware that different modules may or may
react differently to the data tables being changed.
Note that it may be inappropriate to use these tables for I<producing>
HTML -- for example, C<%isHeadOrBodyElement> lists the tagnames
for all elements that can appear either in the head or in the body,
such as "script". That doesn't mean that I am saying your code that
produces HTML should feel free to put script elements in either place!
If you are producing programs that spit out HTML, you should be
I<intimately> familiar with the DTDs for HTML or XHTML (available at
C<http://www.w3.org/>), and you should slavishly obey them, not
the data tables in this document.
=head1 SEE ALSO
L<HTML::Element>, L<HTML::TreeBuilder>, L<HTML::LinkExtor>
=head1 COPYRIGHT & LICENSE
Copyright 1995-2000 Gisle Aas.
Copyright 2000-2005 Sean M. Burke.
Copyright 2005-2008 Andy Lester.
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
=head1 ACKNOWLEDGEMENTS
Most of the code/data in this module was adapted from code written
by Gisle Aas for C<HTML::Element>, C<HTML::TreeBuilder>, and
C<HTML::LinkExtor>. Then it was maintained by Sean M. Burke.
=head1 AUTHOR
Current maintainer: Andy Lester, C<< <andy at petdance.com> >>
=head1 BUGS
Please report any bugs or feature requests to
C<bug-html-tagset at rt.cpan.org>, or through the web interface at
L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-Tagset>. I will
be notified, and then you'll automatically be notified of progress on
your bug as I make changes.
=cut
1;

View File

@@ -0,0 +1,372 @@
package HTML::TokeParser;
use strict;
require HTML::PullParser;
our @ISA = qw(HTML::PullParser);
our $VERSION = '3.75';
use Carp ();
use HTML::Entities qw(decode_entities);
use HTML::Tagset ();
my %ARGS =
(
start => "'S',tagname,attr,attrseq,text",
end => "'E',tagname,text",
text => "'T',text,is_cdata",
process => "'PI',token0,text",
comment => "'C',text",
declaration => "'D',text",
# options that default on
unbroken_text => 1,
);
sub new
{
my $class = shift;
my %cnf;
if (@_ == 1) {
my $type = (ref($_[0]) eq "SCALAR") ? "doc" : "file";
%cnf = ($type => $_[0]);
}
else {
unshift @_, (ref($_[0]) eq "SCALAR") ? "doc" : "file" if(scalar(@_) % 2 == 1);
%cnf = @_;
}
my $textify = delete $cnf{textify} || {img => "alt", applet => "alt"};
my $self = $class->SUPER::new(%ARGS, %cnf) || return undef;
$self->{textify} = $textify;
$self;
}
sub get_tag
{
my $self = shift;
my $token;
while (1) {
$token = $self->get_token || return undef;
my $type = shift @$token;
next unless $type eq "S" || $type eq "E";
substr($token->[0], 0, 0) = "/" if $type eq "E";
return $token unless @_;
for (@_) {
return $token if $token->[0] eq $_;
}
}
}
sub _textify {
my($self, $token) = @_;
my $tag = $token->[1];
return undef unless exists $self->{textify}{$tag};
my $alt = $self->{textify}{$tag};
my $text;
if (ref($alt)) {
$text = &$alt(@$token);
} else {
$text = $token->[2]{$alt || "alt"};
$text = "[\U$tag]" unless defined $text;
}
return $text;
}
sub get_text
{
my $self = shift;
my @text;
while (my $token = $self->get_token) {
my $type = $token->[0];
if ($type eq "T") {
my $text = $token->[1];
decode_entities($text) unless $token->[2];
push(@text, $text);
} elsif ($type =~ /^[SE]$/) {
my $tag = $token->[1];
if ($type eq "S") {
if (defined(my $text = _textify($self, $token))) {
push(@text, $text);
next;
}
} else {
$tag = "/$tag";
}
if (!@_ || grep $_ eq $tag, @_) {
$self->unget_token($token);
last;
}
push(@text, " ")
if $tag eq "br" || !$HTML::Tagset::isPhraseMarkup{$token->[1]};
}
}
join("", @text);
}
sub get_trimmed_text
{
my $self = shift;
my $text = $self->get_text(@_);
$text =~ s/^\s+//; $text =~ s/\s+$//; $text =~ s/\s+/ /g;
$text;
}
sub get_phrase {
my $self = shift;
my @text;
while (my $token = $self->get_token) {
my $type = $token->[0];
if ($type eq "T") {
my $text = $token->[1];
decode_entities($text) unless $token->[2];
push(@text, $text);
} elsif ($type =~ /^[SE]$/) {
my $tag = $token->[1];
if ($type eq "S") {
if (defined(my $text = _textify($self, $token))) {
push(@text, $text);
next;
}
}
if (!$HTML::Tagset::isPhraseMarkup{$tag}) {
$self->unget_token($token);
last;
}
push(@text, " ") if $tag eq "br";
}
}
my $text = join("", @text);
$text =~ s/^\s+//; $text =~ s/\s+$//; $text =~ s/\s+/ /g;
$text;
}
1;
__END__
=head1 NAME
HTML::TokeParser - Alternative HTML::Parser interface
=head1 SYNOPSIS
require HTML::TokeParser;
$p = HTML::TokeParser->new("index.html") ||
die "Can't open: $!";
$p->empty_element_tags(1); # configure its behaviour
while (my $token = $p->get_token) {
#...
}
=head1 DESCRIPTION
The C<HTML::TokeParser> is an alternative interface to the
C<HTML::Parser> class. It is an C<HTML::PullParser> subclass with a
predeclared set of token types. If you wish the tokens to be reported
differently you probably want to use the C<HTML::PullParser> directly.
The following methods are available:
=over 4
=item $p = HTML::TokeParser->new( $filename, %opt );
=item $p = HTML::TokeParser->new( $filehandle, %opt );
=item $p = HTML::TokeParser->new( \$document, %opt );
The object constructor argument is either a file name, a file handle
object, or the complete document to be parsed. Extra options can be
provided as key/value pairs and are processed as documented by the base
classes.
If the argument is a plain scalar, then it is taken as the name of a
file to be opened and parsed. If the file can't be opened for
reading, then the constructor will return C<undef> and $! will tell
you why it failed.
If the argument is a reference to a plain scalar, then this scalar is
taken to be the literal document to parse. The value of this
scalar should not be changed before all tokens have been extracted.
Otherwise the argument is taken to be some object that the
C<HTML::TokeParser> can read() from when it needs more data. Typically
it will be a filehandle of some kind. The stream will be read() until
EOF, but not closed.
A newly constructed C<HTML::TokeParser> differ from its base classes
by having the C<unbroken_text> attribute enabled by default. See
L<HTML::Parser> for a description of this and other attributes that
influence how the document is parsed. It is often a good idea to enable
C<empty_element_tags> behaviour.
Note that the parsing result will likely not be valid if raw undecoded
UTF-8 is used as a source. When parsing UTF-8 encoded files turn
on UTF-8 decoding:
open(my $fh, "<:utf8", "index.html") || die "Can't open 'index.html': $!";
my $p = HTML::TokeParser->new( $fh );
# ...
If a $filename is passed to the constructor the file will be opened in
raw mode and the parsing result will only be valid if its content is
Latin-1 or pure ASCII.
If parsing from an UTF-8 encoded string buffer decode it first:
utf8::decode($document);
my $p = HTML::TokeParser->new( \$document );
# ...
=item $p->get_token
This method will return the next I<token> found in the HTML document,
or C<undef> at the end of the document. The token is returned as an
array reference. The first element of the array will be a string
denoting the type of this token: "S" for start tag, "E" for end tag,
"T" for text, "C" for comment, "D" for declaration, and "PI" for
process instructions. The rest of the token array depend on the type
like this:
["S", $tag, $attr, $attrseq, $text]
["E", $tag, $text]
["T", $text, $is_data]
["C", $text]
["D", $text]
["PI", $token0, $text]
where $attr is a hash reference, $attrseq is an array reference and
the rest are plain scalars. The L<HTML::Parser/Argspec> explains the
details.
=item $p->unget_token( @tokens )
If you find you have read too many tokens you can push them back,
so that they are returned the next time $p->get_token is called.
=item $p->get_tag
=item $p->get_tag( @tags )
This method returns the next start or end tag (skipping any other
tokens), or C<undef> if there are no more tags in the document. If
one or more arguments are given, then we skip tokens until one of the
specified tag types is found. For example:
$p->get_tag("font", "/font");
will find the next start or end tag for a font-element.
The tag information is returned as an array reference in the same form
as for $p->get_token above, but the type code (first element) is
missing. A start tag will be returned like this:
[$tag, $attr, $attrseq, $text]
The tagname of end tags are prefixed with "/", i.e. end tag is
returned like this:
["/$tag", $text]
=item $p->get_text
=item $p->get_text( @endtags )
This method returns all text found at the current position. It will
return a zero length string if the next token is not text. Any
entities will be converted to their corresponding character.
If one or more arguments are given, then we return all text occurring
before the first of the specified tags found. For example:
$p->get_text("p", "br");
will return the text up to either a paragraph of line break element.
The text might span tags that should be I<textified>. This is
controlled by the $p->{textify} attribute, which is a hash that
defines how certain tags can be treated as text. If the name of a
start tag matches a key in this hash then this tag is converted to
text. The hash value is used to specify which tag attribute to obtain
the text from. If this tag attribute is missing, then the upper case
name of the tag enclosed in brackets is returned, e.g. "[IMG]". The
hash value can also be a subroutine reference. In this case the
routine is called with the start tag token content as its argument and
the return value is treated as the text.
The default $p->{textify} value is:
{img => "alt", applet => "alt"}
This means that <IMG> and <APPLET> tags are treated as text, and that
the text to substitute can be found in the ALT attribute.
=item $p->get_trimmed_text
=item $p->get_trimmed_text( @endtags )
Same as $p->get_text above, but will collapse any sequences of white
space to a single space character. Leading and trailing white space is
removed.
=item $p->get_phrase
This will return all text found at the current position ignoring any
phrasal-level tags. Text is extracted until the first non
phrasal-level tag. Textification of tags is the same as for
get_text(). This method will collapse white space in the same way as
get_trimmed_text() does.
The definition of <i>phrasal-level tags</i> is obtained from the
HTML::Tagset module.
=back
=head1 EXAMPLES
This example extracts all links from a document. It will print one
line for each link, containing the URL and the textual description
between the <A>...</A> tags:
use HTML::TokeParser;
$p = HTML::TokeParser->new(shift||"index.html");
while (my $token = $p->get_tag("a")) {
my $url = $token->[1]{href} || "-";
my $text = $p->get_trimmed_text("/a");
print "$url\t$text\n";
}
This example extract the <TITLE> from the document:
use HTML::TokeParser;
$p = HTML::TokeParser->new(shift||"index.html");
if ($p->get_tag("title")) {
my $title = $p->get_trimmed_text;
print "Title: $title\n";
}
=head1 SEE ALSO
L<HTML::PullParser>, L<HTML::Parser>
=head1 COPYRIGHT
Copyright 1998-2005 Gisle Aas.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=cut

256
database/perl/vendor/lib/HTML/Tree.pm vendored Normal file
View File

@@ -0,0 +1,256 @@
package HTML::Tree;
# ABSTRACT: build and scan parse-trees of HTML
# HTML::Tree is basically just a happy alias to HTML::TreeBuilder.
use warnings;
use strict;
our $VERSION = '5.07'; # VERSION from OurPkgVersion
use HTML::TreeBuilder ();
sub new {
shift;
unshift @_, 'HTML::TreeBuilder';
goto &HTML::TreeBuilder::new;
}
sub new_from_file {
shift;
unshift @_, 'HTML::TreeBuilder';
goto &HTML::TreeBuilder::new_from_file;
}
sub new_from_content {
shift;
unshift @_, 'HTML::TreeBuilder';
goto &HTML::TreeBuilder::new_from_content;
}
sub new_from_url {
shift;
unshift @_, 'HTML::TreeBuilder';
goto &HTML::TreeBuilder::new_from_url;
}
1;
__END__
=pod
=head1 NAME
HTML::Tree - build and scan parse-trees of HTML
=head1 VERSION
This document describes version 5.07 of
HTML::Tree, released August 31, 2017
as part of L<HTML-Tree|HTML::Tree>.
=head1 SYNOPSIS
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse_file($filename);
# Then do something with the tree, using HTML::Element
# methods -- for example:
$tree->dump
# Finally:
$tree->delete;
=head1 DESCRIPTION
HTML-Tree is a suite of Perl modules for making parse trees out of
HTML source. It consists of mainly two modules, whose documentation
you should refer to: L<HTML::TreeBuilder|HTML::TreeBuilder>
and L<HTML::Element|HTML::Element>.
HTML::TreeBuilder is the module that builds the parse trees. (It uses
HTML::Parser to do the work of breaking the HTML up into tokens.)
The tree that TreeBuilder builds for you is made up of objects of the
class HTML::Element.
If you find that you do not properly understand the documentation
for HTML::TreeBuilder and HTML::Element, it may be because you are
unfamiliar with tree-shaped data structures, or with object-oriented
modules in general. Sean Burke has written some articles for
I<The Perl Journal> (C<www.tpj.com>) that seek to provide that background.
The full text of those articles is contained in this distribution, as:
=over 4
=item L<HTML::Tree::AboutObjects|HTML::Tree::AboutObjects>
"User's View of Object-Oriented Modules" from TPJ17.
=item L<HTML::Tree::AboutTrees|HTML::Tree::AboutTrees>
"Trees" from TPJ18
=item L<HTML::Tree::Scanning|HTML::Tree::Scanning>
"Scanning HTML" from TPJ19
=back
Readers already familiar with object-oriented modules and tree-shaped
data structures should read just the last article. Readers without
that background should read the first, then the second, and then the
third.
=head1 METHODS
All these methods simply redirect to the corresponding method in
HTML::TreeBuilder. It's more efficient to use HTML::TreeBuilder
directly, and skip loading HTML::Tree at all.
=head2 new
Redirects to L<HTML::TreeBuilder/new>.
=head2 new_from_file
Redirects to L<HTML::TreeBuilder/new_from_file>.
=head2 new_from_content
Redirects to L<HTML::TreeBuilder/new_from_content>.
=head2 new_from_url
Redirects to L<HTML::TreeBuilder/new_from_url>.
=head1 SUPPORT
You can find documentation for this module with the perldoc command.
perldoc HTML::Tree
You can also look for information at:
=over 4
=item * AnnoCPAN: Annotated CPAN documentation
L<http://annocpan.org/dist/HTML-Tree>
=item * CPAN Ratings
L<http://cpanratings.perl.org/d/HTML-Tree>
=item * RT: CPAN's request tracker
L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Tree>
=item * Search CPAN
L<http://search.cpan.org/dist/HTML-Tree>
=item * Stack Overflow
L<http://stackoverflow.com/questions/tagged/html-tree>
If you have a question about how to use HTML-Tree, Stack Overflow is
the place to ask it. Make sure you tag it both C<perl> and C<html-tree>.
=back
=head1 SEE ALSO
L<HTML::TreeBuilder>, L<HTML::Element>, L<HTML::Tagset>,
L<HTML::Parser>, L<HTML::DOMbo>
The book I<Perl & LWP> by Sean M. Burke published by
O'Reilly and Associates, 2002. ISBN: 0-596-00178-9
It has several chapters to do with HTML processing in general,
and HTML-Tree specifically. There's more info at:
http://www.oreilly.com/catalog/perllwp/
http://www.amazon.com/exec/obidos/ASIN/0596001789
=head1 SOURCE REPOSITORY
HTML-Tree is now maintained using Git. The main public repository is
L<< https://github.com/kentfredric/HTML-Tree >>.
The best way to send a patch is to make a pull request there.
=head1 ACKNOWLEDGEMENTS
Thanks to Gisle Aas, Sean Burke and Andy Lester for their original work.
Thanks to Chicago Perl Mongers (http://chicago.pm.org) for their
patches submitted to HTML::Tree as part of the Phalanx project
(http://qa.perl.org/phalanx).
Thanks to the following people for additional patches and documentation:
Terrence Brannon, Gordon Lack, Chris Madsen and Ricardo Signes.
=head1 AUTHOR
Current maintainers:
=over
=item * Christopher J. Madsen S<C<< <perl AT cjmweb.net> >>>
=item * Jeff Fearn S<C<< <jfearn AT cpan.org> >>>
=back
Original HTML-Tree author:
=over
=item * Gisle Aas
=back
Former maintainers:
=over
=item * Sean M. Burke
=item * Andy Lester
=item * Pete Krawczyk S<C<< <petek AT cpan.org> >>>
=back
You can follow or contribute to HTML-Tree's development at
L<< https://github.com/kentfredric/HTML-Tree >>.
=head1 COPYRIGHT AND LICENSE
Copyright 1995-1998 Gisle Aas, 1999-2004 Sean M. Burke,
2005 Andy Lester, 2006 Pete Krawczyk, 2010 Jeff Fearn,
2012 Christopher J. Madsen.
(Except the articles contained in HTML::Tree::AboutObjects,
HTML::Tree::AboutTrees, and HTML::Tree::Scanning, which are all
copyright 2000 The Perl Journal.)
Except for those three TPJ articles, the whole HTML-Tree distribution,
of which this file is a part, is free software; you can redistribute
it and/or modify it under the same terms as Perl itself.
Those three TPJ articles may be distributed under the same terms as
Perl itself.
The programs in this library are distributed in the hope that they
will be useful, but without any warranty; without even the implied
warranty of merchantability or fitness for a particular purpose.
=cut

View File

@@ -0,0 +1,686 @@
#Time-stamp: "2001-02-23 20:07:25 MST" -*-Text-*-
# This document contains text in Perl "POD" format.
# Use a POD viewer like perldoc or perlman to render it.
=head1 NAME
HTML::Tree::AboutObjects -- article: "User's View of Object-Oriented Modules"
=head1 SYNOPSIS
# This an article, not a module.
=head1 DESCRIPTION
The following article by Sean M. Burke first appeared in I<The Perl
Journal> #17 and is copyright 2000 The Perl Journal. It appears
courtesy of Jon Orwant and The Perl Journal. This document may be
distributed under the same terms as Perl itself.
=head1 A User's View of Object-Oriented Modules
-- Sean M. Burke
The first time that most Perl programmers run into object-oriented
programming when they need to use a module whose interface is
object-oriented. This is often a mystifying experience, since talk of
"methods" and "constructors" is unintelligible to programmers who
thought that functions and variables was all there was to worry about.
Articles and books that explain object-oriented programming (OOP), do so
in terms of how to program that way. That's understandable, and if you
learn to write object-oriented code of your own, you'd find it easy to
use object-oriented code that others write. But this approach is the
I<long> way around for people whose immediate goal is just to use
existing object-oriented modules, but who don't yet want to know all the
gory details of having to write such modules for themselves.
This article is for those programmers -- programmers who want to know
about objects from the perspective of using object-oriented modules.
=head2 Modules and Their Functional Interfaces
Modules are the main way that Perl provides for bundling up code for
later use by yourself or others. As I'm sure you can't help noticing
from reading
I<The Perl Journal>, CPAN (the Comprehensive Perl Archive
Network) is the repository for modules (or groups of modules) that
others have written, to do anything from composing music to accessing
Web pages. A good deal of those modules even come with every
installation of Perl.
One module that you may have used before, and which is fairly typical in
its interface, is Text::Wrap. It comes with Perl, so you don't even
need to install it from CPAN. You use it in a program of yours, by
having your program code say early on:
use Text::Wrap;
and after that, you can access a function called C<wrap>, which inserts
line-breaks in text that you feed it, so that the text will be wrapped to
seventy-two (or however many) columns.
The way this C<use Text::Wrap> business works is that the module
Text::Wrap exists as a file "Text/Wrap.pm" somewhere in one of your
library directories. That file contains Perl code...
=over
Footnote: And mixed in with the Perl code, there's documentation, which
is what you read with "perldoc Text::Wrap". The perldoc program simply
ignores the code and formats the documentation text, whereas "use
Text::Wrap" loads and runs the code while ignoring the documentation.
=back
...which, among other things, defines a function called C<Text::Wrap::wrap>,
and then C<exports> that function, which means that when you say C<wrap>
after having said "use Text::Wrap", you'll be actually calling the
C<Text::Wrap::wrap> function. Some modules don't export their
functions, so you have to call them by their full name, like
C<Text::Wrap::wrap(...parameters...)>.
Regardless of whether the typical module exports the functions it
provides, a module is basically just a container for chunks of code that
do useful things. The way the module allows for you to interact with
it, is its I<interface>. And when, like with Text::Wrap, its interface
consists of functions, the module is said to have a B<functional
interface>.
=over
Footnote: the term "function" (and therefore "functionI<al>") has
various senses. I'm using the term here in its broadest sense, to
refer to routines -- bits of code that are called by some name and
which take parameters and return some value.
=back
Using modules with functional interfaces is straightforward -- instead
of defining your own "wrap" function with C<sub wrap { ... }>, you
entrust "use Text::Wrap" to do that for you, along with whatever other
functions its defines and exports, according to the module's
documentation. Without too much bother, you can even write your own
modules to contain your frequently used functions; I suggest having a look at
the C<perlmod> man page for more leads on doing this.
=head2 Modules with Object-Oriented Interfaces
So suppose that one day you want to write a program that will automate
the process of C<ftp>ing a bunch of files from one server down to your
local machine, and then off to another server.
A quick browse through search.cpan.org turns up the module "Net::FTP",
which you can download and install it using normal installation
instructions (unless your sysadmin has already installed it, as many
have).
Like Text::Wrap or any other module with a familiarly functional
interface, you start off using Net::FTP in your program by saying:
use Net::FTP;
However, that's where the similarity ends. The first hint of
difference is that the documentation for Net::FTP refers to it as a
B<class>. A class is a kind of module, but one that has an
object-oriented interface.
Whereas modules like Text::Wrap
provide bits of useful code as I<functions>, to be called like
C<function(...parameters...)> or like
C<PackageName::function(...parameters...)>, Net::FTP and other modules
with object-oriented interfaces provide B<methods>. Methods are sort of
like functions in that they have a name and parameters; but methods
look different, and are different, because you have to call them with a
syntax that has a class name or an object as a special argument. I'll
explain the syntax for method calls, and then later explain what they
all mean.
Some methods are meant to be called as B<class methods>, with the class
name (same as the module name) as a special argument. Class methods
look like this:
ClassName->methodname(parameter1, parameter2, ...)
ClassName->methodname() # if no parameters
ClassName->methodname # same as above
which you will sometimes see written:
methodname ClassName (parameter1, parameter2, ...)
methodname ClassName # if no parameters
Basically all class methods are for making new objects, and methods that
make objects are called "B<constructors>" (and the process of making them
is called "constructing" or "instantiating"). Constructor methods
typically have the name "new", or something including "new"
("new_from_file", etc.); but they can conceivably be named
anything -- DBI's constructor method is named "connect", for example.
The object that a constructor method returns is
typically captured in a scalar variable:
$object = ClassName->new(param1, param2...);
Once you have an object (more later on exactly what that is), you can
use the other kind of method call syntax, the syntax for B<object method>
calls. Calling object methods is just like class methods, except
that instead of the ClassName as the special argument,
you use an expression that yeilds an "object". Usually this is
just a scalar variable that you earlier captured the
output of the constructor in. Object method calls look like this:
$object->methodname(parameter1, parameter2, ...);
$object->methodname() # if no parameters
$object->methodname # same as above
which is occasionally written as:
methodname $object (parameter1, parameter2, ...)
methodname $object # if no parameters
Examples of method calls are:
my $session1 = Net::FTP->new("ftp.myhost.com");
# Calls a class method "new", from class Net::FTP,
# with the single parameter "ftp.myhost.com",
# and saves the return value (which is, as usual,
# an object), in $session1.
# Could also be written:
# new Net::FTP('ftp.myhost.com')
$session1->login("sburke","aoeuaoeu")
|| die "failed to login!\n";
# calling the object method "login"
print "Dir:\n", $session1->dir(), "\n";
$session1->quit;
# same as $session1->quit()
print "Done\n";
exit;
Incidentally, I suggest always using the syntaxes with parentheses and
"->" in them,
=over
Footnote: the character-pair "->" is supposed to look like an
arrow, not "negative greater-than"!
=back
and avoiding the syntaxes that start out "methodname $object" or
"methodname ModuleName". When everything's going right, they all mean
the same thing as the "->" variants, but the syntax with "->" is more
visually distinct from function calls, as well as being immune to some
kinds of rare but puzzling ambiguities that can arise when you're trying
to call methods that have the same name as subroutines you've defined.
But, syntactic alternatives aside, all this talk of constructing objects
and object methods begs the question -- what I<is> an object? There are
several angles to this question that the rest of this article will
answer in turn: what can you do with objects? what's in an object?
what's an object value? and why do some modules use objects at all?
=head2 What Can You Do with Objects?
You've seen that you can make objects, and call object methods with
them. But what are object methods for? The answer depends on the class:
A Net::FTP object represents a session between your computer and an FTP
server. So the methods you call on a Net::FTP object are for doing
whatever you'd need to do across an FTP connection. You make the
session and log in:
my $session = Net::FTP->new('ftp.aol.com');
die "Couldn't connect!" unless defined $session;
# The class method call to "new" will return
# the new object if it goes OK, otherwise it
# will return undef.
$session->login('sburke', 'p@ssw3rD')
|| die "Did I change my password again?";
# The object method "login" will give a true
# return value if actually logs in, otherwise
# it'll return false.
You can use the session object to change directory on that session:
$session->cwd("/home/sburke/public_html")
|| die "Hey, that was REALLY supposed to work!";
# if the cwd fails, it'll return false
...get files from the machine at the other end of the session...
foreach my $f ('log_report_ua.txt', 'log_report_dom.txt',
'log_report_browsers.txt')
{
$session->get($f) || warn "Getting $f failed!"
};
...and plenty else, ending finally with closing the connection:
$session->quit();
In short, object methods are for doing things related to (or with)
whatever the object represents. For FTP sessions, it's about sending
commands to the server at the other end of the connection, and that's
about it -- there, methods are for doing something to the world outside
the object, and the objects is just something that specifies what bit
of the world (well, what FTP session) to act upon.
With most other classes, however, the object itself stores some kind of
information, and it typically makes no sense to do things with such an
object without considering the data that's in the object.
=head2 What's I<in> an Object?
An object is (with rare exceptions) a data structure containing a
bunch of attributes, each of which has a value, as well as a name
that you use when you
read or set the attribute's value. Some of the object's attributes are
private, meaning you'll never see them documented because they're not
for you to read or write; but most of the object's documented attributes
are at least readable, and usually writeable, by you. Net::FTP objects
are a bit thin on attributes, so we'll use objects from the class
Business::US_Amort for this example. Business::US_Amort is a very
simple class (available from CPAN) that I wrote for making calculations
to do with loans (specifically, amortization, using US-style
algorithms).
An object of the class Business::US_Amort represents a loan with
particular parameters, i.e., attributes. The most basic attributes of a
"loan object" are its interest rate, its principal (how much money it's
for), and it's term (how long it'll take to repay). You need to set
these attributes before anything else can be done with the object. The
way to get at those attributes for loan objects is just like the
way to get at attributes for any class's objects: through accessors.
An B<accessor> is simply any method that accesses (whether reading or
writing, AKA getting or putting) some attribute in the given object.
Moreover, accessors are the B<only> way that you can change
an object's attributes. (If a module's documentation wants you to
know about any other way, it'll tell you.)
Usually, for simplicity's sake, an accessor is named after the attribute
it reads or writes. With Business::US_Amort objects, the accessors you
need to use first are C<principal>, C<interest_rate>, and C<term>.
Then, with at least those attributes set, you can call the C<run> method
to figure out several things about the loan. Then you can call various
accessors, like C<total_paid_toward_interest>, to read the results:
use Business::US_Amort;
my $loan = Business::US_Amort->new;
# Set the necessary attributes:
$loan->principal(123654);
$loan->interest_rate(9.25);
$loan->term(20); # twenty years
# NOW we know enough to calculate:
$loan->run;
# And see what came of that:
print
"Total paid toward interest: A WHOPPING ",
$loan->total_paid_interest, "!!\n";
This illustrates a convention that's common with accessors: calling the
accessor with no arguments (as with $loan->total_paid_interest) usually
means to read the value of that attribute, but providing a value (as
with $loan->term(20)) means you want that attribute to be set to that
value. This stands to reason: why would you be providing a value, if
not to set the attribute to that value?
Although a loan's term, principal, and interest rates are all single
numeric values, an objects values can any kind of scalar, or an array,
or even a hash. Moreover, an attribute's value(s) can be objects
themselves. For example, consider MIDI files (as I wrote about in
TPJ#13): a MIDI file usually consists of several tracks. A MIDI file is
complex enough to merit being an object with attributes like its overall
tempo, the file-format variant it's in, and the list of instrument
tracks in the file. But tracks themselves are complex enough to be
objects too, with attributes like their track-type, a list of MIDI
commands if they're a MIDI track, or raw data if they're not. So I
ended up writing the MIDI modules so that the "tracks" attribute of a
MIDI::Opus object is an array of objects from the class MIDI::Track.
This may seem like a runaround -- you ask what's in one object, and get
I<another> object, or several! But in this case, it exactly reflects
what the module is for -- MIDI files contain MIDI tracks, which then
contain data.
=head2 What is an Object Value?
When you call a constructor like Net::FTP->new(I<hostname>), you get
back an object value, a value you can later use, in combination with a
method name, to call object methods.
Now, so far we've been pretending, in the above examples, that the
variables $session or $loan I<are> the objects you're dealing with.
This idea is innocuous up to a point, but it's really a misconception
that will, at best, limit you in what you know how to do. The reality
is not that the variables $session or $query are objects; it's a little
more indirect -- they I<hold> values that symbolize objects. The kind of
value that $session or $query hold is what I'm calling an object value.
To understand what kind of value this is, first think about the other
kinds of scalar values you know about: The first two scalar values you
probably ever ran into in Perl are B<numbers> and B<strings>, which you
learned (or just assumed) will usually turn into each other on demand;
that is, the three-character string "2.5" can become the quantity two
and a half, and vice versa. Then, especially if you started using
C<perl -w> early on, you learned about the B<undefined value>, which can
turn into 0 if you treat it as a number, or the empty-string if you
treat it as a string.
=over
Footnote: You may I<also> have been learning about references, in which
case you're ready to hear that object values are just a kind of
reference, except that they reflect the class that created thing they point
to, instead of merely being a plain old array reference, hash reference,
etc. I<If> this makes makes sense to you, and you want to know more
about how objects are implemented in Perl, have a look at the
C<perltoot> man page.
=back
And now you're learning about B<object values>. An object value is a
value that points to a data structure somewhere in memory, which is
where all the attributes for this object are stored. That data
structure as a whole belongs to a class (probably the one you named in
the constructor method, like ClassName->new), so that the object value
can be used as part of object method calls.
If you want to actually I<see> what an object value is, you might try
just saying "print $object". That'll get you something like this:
Net::FTP=GLOB(0x20154240)
or
Business::US_Amort=HASH(0x15424020)
That's not very helpful if you wanted to really get at the object's
insides, but that's because the object value is only a symbol for the
object. This may all sound very abstruse and metaphysical, so a
real-world allegory might be very helpful:
=over
You get an advertisement in the mail saying that you have been
(im)personally selected to have the rare privilege of applying for a
credit card. For whatever reason, I<this> offer sounds good to you, so you
fill out the form and mail it back to the credit card company. They
gleefully approve the application and create your account, and send you
a card with a number on it.
Now, you can do things with the number on that card -- clerks at stores
can ring up things you want to buy, and charge your account by keying in
the number on the card. You can pay for things you order online by
punching in the card number as part of your online order. You can pay
off part of the account by sending the credit card people some of your
money (well, a check) with some note (usually the pre-printed slip)
that has the card number for the account you want to pay toward. And you
should be able to call the credit card company's computer and ask it
things about the card, like its balance, its credit limit, its APR, and
maybe an itemization of recent purchases ad payments.
Now, what you're I<really> doing is manipulating a credit card
I<account>, a completely abstract entity with some data attached to it
(balance, APR, etc). But for ease of access, you have a credit card
I<number> that is a symbol for that account. Now, that symbol is just a
bunch of digits, and the number is effectively meaningless and useless
in and of itself -- but in the appropriate context, it's understood to
I<mean> the credit card account you're accessing.
=back
This is exactly the relationship between objects and object values, and
from this analogy, several facts about object values are a bit more
explicable:
* An object value does nothing in and of itself, but it's useful when
you use it in the context of an $object->method call, the same way that
a card number is useful in the context of some operation dealing with a
card account.
Moreover, several copies of the same object value all refer to the same
object, the same way that making several copies of your card number
won't change the fact that they all still refer to the same single
account (this is true whether you're "copying" the number by just
writing it down on different slips of paper, or whether you go to the
trouble of forging exact replicas of your own plastic credit card). That's
why this:
$x = Net::FTP->new("ftp.aol.com");
$x->login("sburke", "aoeuaoeu");
does the same thing as this:
$x = Net::FTP->new("ftp.aol.com");
$y = $x;
$z = $y;
$z->login("sburke", "aoeuaoeu");
That is, $z and $y and $x are three different I<slots> for values,
but what's in those slots are all object values pointing to the same
object -- you don't have three different FTP connections, just three
variables with values pointing to the some single FTP connection.
* You can't tell much of anything about the object just by looking at
the object value, any more than you can see your credit account balance
by holding the plastic card up to the light, or by adding up the digits
in your credit card number.
* You can't just make up your own object values and have them work --
they can come only from constructor methods of the appropriate class.
Similarly, you get a credit card number I<only> by having a bank approve
your application for a credit card account -- at which point I<they>
let I<you> know what the number of your new card is.
Now, there's even more to the fact that you can't just make up your own
object value: even though you can print an object value and get a string
like "Net::FTP=GLOB(0x20154240)", that string is just a
I<representation> of an object value.
Internally, an object value has a basically different type from a
string, or a number, or the undefined value -- if $x holds a real
string, then that value's slot in memory says "this is a value of type
I<string>, and its characters are...", whereas if it's an object value,
the value's slot in memory says, "this is a value of type I<reference>,
and the location in memory that it points to is..." (and by looking at
what's at that location, Perl can tell the class of what's there).
Perl programmers typically don't have to think about all these details
of Perl's internals. Many other languages force you to be more
conscious of the differences between all of these (and also between
types of numbers, which are stored differently depending on their size
and whether they have fractional parts). But Perl does its best to
hide the different types of scalars from you -- it turns numbers into
strings and back as needed, and takes the string or number
representation of undef or of object values as needed. However, you
can't go from a string representation of an object value, back to an
object value. And that's why this doesn't work:
$x = Net::FTP->new('ftp.aol.com');
$y = Net::FTP->new('ftp.netcom.com');
$z = Net::FTP->new('ftp.qualcomm.com');
$all = join(' ', $x,$y,$z); # !!!
...later...
($aol, $netcom, $qualcomm) = split(' ', $all); # !!!
$aol->login("sburke", "aoeuaoeu");
$netcom->login("sburke", "qjkxqjkx");
$qualcomm->login("smb", "dhtndhtn");
This fails because $aol ends up holding merely the B<string representation>
of the object value from $x, not the object value itself -- when
C<join> tried to join the characters of the "strings" $x, $y, and $z,
Perl saw that they weren't strings at all, so it gave C<join> their
string representations.
Unfortunately, this distinction between object values and their string
representations doesn't really fit into the analogy of credit card
numbers, because credit card numbers really I<are> numbers -- even
thought they don't express any meaningful quantity, if you stored them
in a database as a quantity (as opposed to just an ASCII string),
that wouldn't stop them from being valid as credit card numbers.
This may seem rather academic, but there's there's two common mistakes
programmers new to objects often make, which make sense only in terms of
the distinction between object values and their string representations:
The first common error involves forgetting (or never having known in the
first place) that when you go to use a value as a hash key, Perl uses
the string representation of that value. When you want to use the
numeric value two and a half as a key, Perl turns it into the
three-character string "2.5". But if you then want to use that string
as a number, Perl will treat it as meaning two and a half, so you're
usually none the wiser that Perl converted the number to a string and
back. But recall that Perl can't turn strings back into objects -- so
if you tried to use a Net::FTP object value as a hash key, Perl actually
used its string representation, like "Net::FTP=GLOB(0x20154240)", but
that string is unusable as an object value. (Incidentally, there's
a module Tie::RefHash that implements hashes that I<do> let you use
real object-values as keys.)
The second common error with object values is in
trying to save an object value to disk (whether printing it to a
file, or storing it in a conventional database file). All you'll get is the
string, which will be useless.
When you want to save an object and restore it later, you may find that
the object's class already provides a method specifically for this. For
example, MIDI::Opus provides methods for writing an object to disk as a
standard MIDI file. The file can later be read back into memory by
a MIDI::Opus constructor method, which will return a new MIDI::Opus
object representing whatever file you tell it to read into memory.
Similar methods are available with, for example, classes that
manipulate graphic images and can save them to files, which can be read
back later.
But some classes, like Business::US_Amort, provide no such methods for
storing an object in a file. When this is the case, you can try
using any of the Data::Dumper, Storable, or FreezeThaw modules. Using
these will be unproblematic for objects of most classes, but it may run
into limitations with others. For example, a Business::US_Amort
object can be turned into a string with Data::Dumper, and that string
written to a file. When it's restored later, its attributes will be
accessible as normal. But in the unlikely case that the loan object was
saved in mid-calculation, the calculation may not be resumable. This is
because of the way that that I<particular> class does its calculations,
but similar limitations may occur with objects from other classses.
But often, even I<wanting> to save an object is basically wrong -- what would
saving an ftp I<session> even mean? Saving the hostname, username, and
password? current directory on both machines? the local TCP/IP port
number? In the case of "saving" a Net::FTP object, you're better off
just saving whatever details you actually need for your own purposes,
so that you can make a new object later and just set those values for it.
=head2 So Why Do Some Modules Use Objects?
All these details of using objects are definitely enough to make you
wonder -- is it worth the bother? If you're a module author, writing
your module with an object-oriented interface restricts the audience of
potential users to those who understand the basic concepts of objects
and object values, as well as Perl's syntax for calling methods. Why
complicate things by having an object-oriented interface?
A somewhat esoteric answer is that a module has an object-oriented
interface because the module's insides are written in an
object-oriented style. This article is about the basics of
object-oriented I<interfaces>, and it'd be going far afield to explain
what object-oriented I<design> is. But the short story is that
object-oriented design is just one way of attacking messy problems.
It's a way that many programmers find very helpful (and which others
happen to find to be far more of a hassle than it's worth,
incidentally), and it just happens to show up for you, the module user,
as merely the style of interface.
But a simpler answer is that a functional interface is sometimes a
hindrance, because it limits the number of things you can do at once --
limiting it, in fact, to one. For many problems that some modules are
meant to solve, doing without an object-oriented interface would be like
wishing that Perl didn't use filehandles. The ideas are rather simpler
-- just imagine that Perl let you access files, but I<only> one at a
time, with code like:
open("foo.txt") || die "Can't open foo.txt: $!";
while(readline) {
print $_ if /bar/;
}
close;
That hypothetical kind of Perl would be simpler, by doing without
filehandles. But you'd be out of luck if you wanted to read from
one file while reading from another, or read from two and print to a
third.
In the same way, a functional FTP module would be fine for just
uploading files to one server at a time, but it wouldn't allow you to
easily write programs that make need to use I<several> simultaneous
sessions (like "look at server A and server B, and if A has a file
called X.dat, then download it locally and then upload it to server B --
except if B has a file called Y.dat, in which case do it the other way
around").
Some kinds of problems that modules solve just lend themselves to an
object-oriented interface. For those kinds of tasks, a functional
interface would be more familiar, but less powerful. Learning to use
object-oriented modules' interfaces does require becoming comfortable
with the concepts from this article. But in the end it will allow you
to use a broader range of modules and, with them, to write programs
that can do more.
B<[end body of article]>
=head2 [Author Credit]
Sean M. Burke has contributed several modules to CPAN, about half of
them object-oriented.
[The next section should be in a greybox:]
=head2 The Gory Details
For sake of clarity of explanation, I had to oversimplify some of the
facts about objects. Here's a few of the gorier details:
* Every example I gave of a constructor was a class method. But object
methods can be constructors, too, if the class was written to work that
way: $new = $old->copy, $node_y = $node_x->new_subnode, or the like.
* I've given the impression that there's two kinds of methods: object
methods and class methods. In fact, the same method can be both,
because it's not the kind of method it is, but the kind of calls it's
written to accept -- calls that pass an object, or calls that pass a
class-name.
* The term "object value" isn't something you'll find used much anywhere
else. It's just my shorthand for what would properly be called an
"object reference" or "reference to a blessed item". In fact, people
usually say "object" when they properly mean a reference to that object.
* I mentioned creating objects with I<con>structors, but I didn't
mention destroying them with I<de>structor -- a destructor is a kind of
method that you call to tidy up the object once you're done with it, and
want it to neatly go away (close connections, delete temporary files,
free up memory, etc). But because of the way Perl handles memory,
most modules won't require the user to know about destructors.
* I said that class method syntax has to have the class name, as in
$session = B<Net::FTP>->new($host). Actually, you can instead use any
expression that returns a class name: $ftp_class = 'Net::FTP'; $session
= B<$ftp_class>->new($host). Moreover, instead of the method name for
object- or class-method calls, you can use a scalar holding the method
name: $foo->B<$method>($host). But, in practice, these syntaxes are
rarely useful.
And finally, to learn about objects from the perspective of writing
your own classes, see the C<perltoot> documentation,
or Damian Conway's exhaustive and clear book I<Object Oriented Perl>
(Manning Publications 1999, ISBN 1-884777-79-1).
=head1 BACK
Return to the L<HTML::Tree|HTML::Tree> docs.
=cut

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,714 @@
#Time-stamp: "2001-03-10 23:19:11 MST" -*-Text-*-
# This document contains text in Perl "POD" format.
# Use a POD viewer like perldoc or perlman to render it.
=head1 NAME
HTML::Tree::Scanning -- article: "Scanning HTML"
=head1 SYNOPSIS
# This an article, not a module.
=head1 DESCRIPTION
The following article by Sean M. Burke first appeared in I<The Perl
Journal> #19 and is copyright 2000 The Perl Journal. It appears
courtesy of Jon Orwant and The Perl Journal. This document may be
distributed under the same terms as Perl itself.
(Note that this is discussed in chapters 6 through 10 of the
book I<Perl and LWP> L<http://lwp.interglacial.com/> which
was written after the following documentation, and which is
available free online.)
=head1 Scanning HTML
-- Sean M. Burke
In I<The Perl Journal> issue 17, Ken MacFarlane's article "Parsing
HTML with HTML::Parser" describes how the HTML::Parser module scans
HTML source as a stream of start-tags, end-tags, text, comments, etc.
In TPJ #18, my "Trees" article kicked around the idea of tree-shaped
data structures. Now I'll try to tie it together, in a discussion of
HTML trees.
The CPAN module HTML::TreeBuilder takes the
tags that HTML::Parser picks out, and builds a parse tree -- a
tree-shaped network of objects...
=over
Footnote:
And if you need a quick explanation of objects, see my TPJ17 article "A
User's View of Object-Oriented Modules"; or go whole hog and get Damian
Conway's excellent book I<Object-Oriented Perl>, from Manning
Publications.
=back
...representing the structured content of the HTML document. And once
the document is parsed as a tree, you'll find the common tasks
of extracting data from that HTML document/tree to be quite
straightforward.
=head2 HTML::Parser, HTML::TreeBuilder, and HTML::Element
You use HTML::TreeBuilder to make a parse tree out of an HTML source
file, by simply saying:
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse_file('foo.html');
and then C<$tree> contains a parse tree built from the HTML source from
the file "foo.html". The way this parse tree is represented is with a
network of objects -- C<$tree> is the root, an element with tag-name
"html", and its children typically include a "head" and "body" element,
and so on. Elements in the tree are objects of the class
HTML::Element.
So, if you take this source:
<html><head><title>Doc 1</title></head>
<body>
Stuff <hr> 2000-08-17
</body></html>
and feed it to HTML::TreeBuilder, it'll return a tree of objects that
looks like this:
html
/ \
head body
/ / | \
title "Stuff" hr "2000-08-17"
|
"Doc 1"
This is a pretty simple document, but if it were any more complex,
it'd be a bit hard to draw in that style, since it's sprawl left and
right. The same tree can be represented a bit more easily sideways,
with indenting:
. html
. head
. title
. "Doc 1"
. body
. "Stuff"
. hr
. "2000-08-17"
Either way expresses the same structure. In that structure, the root
node is an object of the class HTML::Element
=over
Footnote:
Well actually, the root is of the class HTML::TreeBuilder, but that's
just a subclass of HTML::Element, plus the few extra methods like
C<parse_file> that elaborate the tree
=back
, with the tag name "html", and with two children: an HTML::Element
object whose tag names are "head" and "body". And each of those
elements have children, and so on down. Not all elements (as we'll
call the objects of class HTML::Element) have children -- the "hr"
element doesn't. And note all nodes in the tree are elements -- the
text nodes ("Doc 1", "Stuff", and "2000-08-17") are just strings.
Objects of the class HTML::Element each have three noteworthy attributes:
=over
=item C<_tag> -- (best accessed as C<$e-E<gt>tag>)
this element's tag-name, lowercased (e.g., "em" for an "em" element).
=over
Footnote: Yes, this is misnamed. In proper SGML terminology, this is
instead called a "GI", short for "generic identifier"; and the term
"tag" is used for a token of SGML source that represents either
the start of an element (a start-tag like "<em lang='fr'>") or the end
of an element (an end-tag like "</em>". However, since more people
claim to have been abducted by aliens than to have ever seen the
SGML standard, and since both encounters typically involve a feeling of
"missing time", it's not surprising that the terminology of the SGML
standard is not closely followed.
=back
=item C<_parent> -- (best accessed as C<$e-E<gt>parent>)
the element that is C<$obj>'s parent, or undef if this element is the
root of its tree.
=item C<_content> -- (best accessed as C<$e-E<gt>content_list>)
the list of nodes (i.e., elements or text segments) that are C<$e>'s
children.
=back
Moreover, if an element object has any attributes in the SGML sense of
the word, then those are readable as C<$e-E<gt>attr('name')> -- for
example, with the object built from having parsed "E<lt>a
B<id='foo'>E<gt>barE<lt>/aE<gt>", C<$e-E<gt>attr('id')> will return
the string "foo". Moreover, C<$e-E<gt>tag> on that object returns the
string "a", C<$e-E<gt>content_list> returns a list consisting of just
the single scalar "bar", and C<$e-E<gt>parent> returns the object
that's this node's parent -- which may be, for example, a "p" element.
And that's all that there is to it -- you throw HTML
source at TreeBuilder, and it returns a tree built of HTML::Element
objects and some text strings.
However, what do you I<do> with a tree of objects? People code
information into HTML trees not for the fun of arranging elements, but
to represent the structure of specific text and images -- some text is
in this "li" element, some other text is in that heading, some
images are in that other table cell that has those attributes, and so on.
Now, it may happen that you're rendering that whole HTML tree into some
layout format. Or you could be trying to make some systematic change to
the HTML tree before dumping it out as HTML source again. But, in my
experience, by far the most common programming task that Perl
programmers face with HTML is in trying to extract some piece
of information from a larger document. Since that's so common (and
also since it involves concepts that are basic to more complex tasks),
that is what the rest of this article will be about.
=head2 Scanning HTML trees
Suppose you have a thousand HTML documents, each of them a press
release. They all start out:
[...lots of leading images and junk...]
<h1>ConGlomCo to Open New Corporate Office in Ougadougou</h1>
BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in charge
of world conquest, Rock Feldspar, announced today the opening of a
new office in Ougadougou, the capital city of Burkino Faso, gateway
to the bustling "Silicon Sahara" of Africa...
[...etc...]
...and what you've got to do is, for each document, copy whatever text
is in the "h1" element, so that you can, for example, make a table of
contents of it. Now, there are three ways to do this:
=over
=item * You can just use a regexp to scan the file for a text pattern.
For many very simple tasks, this will do fine. Many HTML documents are,
in practice, very consistently formatted as far as placement of
linebreaks and whitespace, so you could just get away with scanning the
file like so:
sub get_heading {
my $filename = $_[0];
local *HTML;
open(HTML, $filename)
or die "Couldn't open $filename);
my $heading;
Line:
while(<HTML>) {
if( m{<h1>(.*?)</h1>}i ) { # match it!
$heading = $1;
last Line;
}
}
close(HTML);
warn "No heading in $filename?"
unless defined $heading;
return $heading;
}
This is quick and fast, but awfully fragile -- if there's a newline in
the middle of a heading's text, it won't match the above regexp, and
you'll get an error. The regexp will also fail if the "h1" element's
start-tag has any attributes. If you have to adapt your code to fit
more kinds of start-tags, you'll end up basically reinventing part of
HTML::Parser, at which point you should probably just stop, and use
HTML::Parser itself:
=item * You can use HTML::Parser to scan the file for an "h1" start-tag
token, then capture all the text tokens until the "h1" close-tag. This
approach is extensively covered in the Ken MacFarlane's TPJ17 article
"Parsing HTML with HTML::Parser". (A variant of this approach is to use
HTML::TokeParser, which presents a different and rather handier
interface to the tokens that HTML::Parser picks out.)
Using HTML::Parser is less fragile than our first approach, since it's
not sensitive to the exact internal formatting of the start-tag (much
less whether it's split across two lines). However, when you need more
information about the context of the "h1" element, or if you're having
to deal with any of the tricky bits of HTML, such as parsing of tables,
you'll find out the flat list of tokens that HTML::Parser returns
isn't immediately useful. To get something useful out of those tokens,
you'll need to write code that knows some things about what elements
take no content (as with "hr" elements), and that a "</p>" end-tags
are omissible, so a "<p>" will end any currently
open paragraph -- and you're well on your way to pointlessly
reinventing much of the code in HTML::TreeBuilder
=over
Footnote:
And, as the person who last rewrote that module, I can attest that it
wasn't terribly easy to get right! Never underestimate the perversity
of people coding HTML.
=back
, at which point you should probably just stop, and use
HTML::TreeBuilder itself:
=item * You can use HTML::Treebuilder, and scan the tree of element
objects that you get back.
=back
The last approach, using HTML::TreeBuilder, is the diametric opposite of
first approach: The first approach involves just elementary Perl and one
regexp, whereas the TreeBuilder approach involves being at home with
the concept of tree-shaped data structures and modules with
object-oriented interfaces, as well as with the particular interfaces
that HTML::TreeBuilder and HTML::Element provide.
However, what the TreeBuilder approach has going for it is that it's
the most robust, because it involves dealing with HTML in its "native"
format -- it deals with the tree structure that HTML code represents,
without any consideration of how the source is coded and with what
tags omitted.
So, to extract the text from the "h1" elements of an HTML document:
sub get_heading {
my $tree = HTML::TreeBuilder->new;
$tree->parse_file($_[0]); # !
my $heading;
my $h1 = $tree->look_down('_tag', 'h1'); # !
if($h1) {
$heading = $h1->as_text; # !
} else {
warn "No heading in $_[0]?";
}
$tree->delete; # clear memory!
return $heading;
}
This uses some unfamiliar methods that need explaining. The
C<parse_file> method that we've seen before, builds a tree based on
source from the file given. The C<delete> method is for marking a
tree's contents as available for garbage collection, when you're done
with the tree. The C<as_text> method returns a string that contains
all the text bits that are children (or otherwise descendants) of the
given node -- to get the text content of the C<$h1> object, we could
just say:
$heading = join '', $h1->content_list;
but that will work only if we're sure that the "h1" element's children
will be only text bits -- if the document contained:
<h1>Local Man Sees <cite>Blade</cite> Again</h1>
then the sub-tree would be:
. h1
. "Local Man Sees "
. cite
. "Blade"
. " Again'
so C<join '', $h1-E<gt>content_list> will be something like:
Local Man Sees HTML::Element=HASH(0x15424040) Again
whereas C<$h1-E<gt>as_text> would yield:
Local Man Sees Blade Again
and depending on what you're doing with the heading text, you might
want the C<as_HTML> method instead. It returns the (sub)tree
represented as HTML source. C<$h1-E<gt>as_HTML> would yield:
<h1>Local Man Sees <cite>Blade</cite> Again</h1>
However, if you wanted the contents of C<$h1> as HTML, but not the
C<$h1> itself, you could say:
join '',
map(
ref($_) ? $_->as_HTML : $_,
$h1->content_list
)
This C<map> iterates over the nodes in C<$h1>'s list of children; and
for each node that's just a text bit (as "Local Man Sees " is), it just
passes through that string value, and for each node that's an actual
object (causing C<ref> to be true), C<as_HTML> will used instead of the
string value of the object itself (which would be something quite
useless, as most object values are). So that C<as_HTML> for the "cite"
element will be the string "<cite>BladeE<lt>/cite>". And then,
finally, C<join> just puts into one string all the strings that the
C<map> returns.
Last but not least, the most important method in our C<get_heading> sub
is the C<look_down> method. This method looks down at the subtree
starting at the given object (C<$h1>), looking for elements that meet
criteria you provide.
The criteria are specified in the method's argument list. Each
criterion can consist of two scalars, a key and a value, which express
that you want elements that have that attribute (like "_tag", or
"src") with the given value ("h1"); or the criterion can be a
reference to a subroutine that, when called on the given element,
returns true if that is a node you're looking for. If you specify
several criteria, then that's taken to mean that you want all the
elements that each satisfy I<all> the criteria. (In other words,
there's an "implicit AND".)
And finally, there's a bit of an optimization -- if you call the
C<look_down> method in a scalar context, you get just the I<first> node
(or undef if none) -- and, in fact, once C<look_down> finds that first
matching element, it doesn't bother looking any further.
So the example:
$h1 = $tree->look_down('_tag', 'h1');
returns the first element at-or-under C<$tree> whose C<"_tag">
attribute has the value C<"h1">.
=head2 Complex Criteria in Tree Scanning
Now, the above C<look_down> code looks like a lot of bother, with
barely more benefit than just grepping the file! But consider if your
criteria were more complicated -- suppose you found that some of the
press releases that you were scanning had several "h1" elements,
possibly before or after the one you actually want. For example:
<h1><center>Visit Our Corporate Partner
<br><a href="/dyna/clickthru"
><img src="/dyna/vend_ad"></a>
</center></h1>
<h1><center>ConGlomCo President Schreck to Visit Regional HQ
<br><a href="/photos/Schreck_visit_large.jpg"
><img src="/photos/Schreck_visit.jpg"></a>
</center></h1>
Here, you want to ignore the first "h1" element because it contains an
ad, and you want the text from the second "h1". The problem is in
formalizing the way you know that it's an ad. Since ad banners are
always entreating you to "visit" the sponsoring site, you could exclude
"h1" elements that contain the word "visit" under them:
my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub {
$_[0]->as_text !~ m/\bvisit/i
}
);
The first criterion looks for "h1" elements, and the second criterion
limits those to only the ones whose text content doesn't match
C<m/\bvisit/>. But unfortunately, that won't work for our example,
since the second "h1" mentions "ConGlomCo President Schreck to
I<Visit> Regional HQ".
Instead you could try looking for the first "h1" element that
doesn't contain an image:
my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub {
not $_[0]->look_down('_tag', 'img')
}
);
This criterion sub might seem a bit odd, since it calls C<look_down>
as part of a larger C<look_down> operation, but that's fine. Note that
when considered as a boolean value, a C<look_down> in a scalar context
value returns false (specifically, undef) if there's no matching element
at or under the given element; and it returns the first matching
element (which, being a reference and object, is always a true value),
if any matches. So, here,
sub {
not $_[0]->look_down('_tag', 'img')
}
means "return true only if this element has no 'img' element as
descendants (and isn't an 'img' element itself)."
This correctly filters out the first "h1" that contains the ad, but it
also incorrectly filters out the second "h1" that contains a
non-advertisement photo besides the headline text you want.
There clearly are detectable differences between the first and second
"h1" elements -- the only second one contains the string "Schreck", and
we could just test for that:
my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub {
$_[0]->as_text =~ m{Schreck}
}
);
And that works fine for this one example, but unless all thousand of
your press releases have "Schreck" in the headline, that's just not a
general solution. However, if all the ads-in-"h1"s that you want to
exclude involve a link whose URL involves "/dyna/", then you can use
that:
my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub {
my $link = $_[0]->look_down('_tag','a');
return 1 unless $link;
# no link means it's fine
return 0 if $link->attr('href') =~ m{/dyna/};
# a link to there is bad
return 1; # otherwise okay
}
);
Or you can look at it another way and say that you want the first "h1"
element that either contains no images, or else whose image has a "src"
attribute whose value contains "/photos/":
my $real_h1 = $tree->look_down(
'_tag', 'h1',
sub {
my $img = $_[0]->look_down('_tag','img');
return 1 unless $img;
# no image means it's fine
return 1 if $img->attr('src') =~ m{/photos/};
# good if a photo
return 0; # otherwise bad
}
);
Recall that this use of C<look_down> in a scalar context means to return
the first element at or under C<$tree> that matches all the criteria.
But if you notice that you can formulate criteria that'll match several
possible "h1" elements, some of which may be bogus but the I<last> one
of which is always the one you want, then you can use C<look_down> in a
list context, and just use the last element of that list:
my @h1s = $tree->look_down(
'_tag', 'h1',
...maybe more criteria...
);
die "What, no h1s here?" unless @h1s;
my $real_h1 = $h1s[-1]; # last or only
=head2 A Case Study: Scanning Yahoo News's HTML
The above (somewhat contrived) case involves extracting data from a
bunch of pre-existing HTML files. In that sort of situation, if your
code works for all the files, then you know that the code I<works> --
since the data it's meant to handle won't go changing or growing; and,
typically, once you've used the program, you'll never need to use it
again.
The other kind of situation faced in many data extraction tasks is
where the program is used recurringly to handle new data -- such as
from ever-changing Web pages. As a real-world example of this,
consider a program that you could use (suppose it's crontabbed) to
extract headline-links from subsections of Yahoo News
(C<http://dailynews.yahoo.com/>).
Yahoo News has several subsections:
=over
=item http://dailynews.yahoo.com/h/tc/ for technology news
=item http://dailynews.yahoo.com/h/sc/ for science news
=item http://dailynews.yahoo.com/h/hl/ for health news
=item http://dailynews.yahoo.com/h/wl/ for world news
=item http://dailynews.yahoo.com/h/en/ for entertainment news
=back
and others. All of them are built on the same basic HTML template --
and a scarily complicated template it is, especially when you look at
it with an eye toward making up rules that will select where the real
headline-links are, while screening out all the links to other parts of
Yahoo, other news services, etc. You will need to puzzle
over the HTML source, and scrutinize the output of
C<$tree-E<gt>dump> on the parse tree of that HTML.
Sometimes the only way to pin down what you're after is by position in
the tree. For example, headlines of interest may be in the third
column of the second row of the second table element in a page:
my $table = ( $tree->look_down('_tag','table') )[1];
my $row2 = ( $table->look_down('_tag', 'tr' ) )[1];
my $col3 = ( $row2->look-down('_tag', 'td') )[2];
...then do things with $col3...
Or they may be all the links in a "p" element that has at least three
"br" elements as children:
my $p = $tree->look_down(
'_tag', 'p',
sub {
2 < grep { ref($_) and $_->tag eq 'br' }
$_[0]->content_list
}
);
@links = $p->look_down('_tag', 'a');
But almost always, you can get away with looking for properties of the
of the thing itself, rather than just looking for contexts. Now, if
you're lucky, the document you're looking through has clear semantic
tagging, such is as useful in CSS -- note the
class="headlinelink" bit here:
<a href="...long_news_url..." class="headlinelink">Elvis
seen in tortilla</a>
If you find anything like that, you could leap right in and select
links with:
@links = $tree->look_down('class','headlinelink');
Regrettably, your chances of seeing any sort of semantic markup
principles really being followed with actual HTML are pretty thin.
=over
Footnote:
In fact, your chances of finding a page that is simply free of HTML
errors are even thinner. And surprisingly, sites like Amazon or Yahoo
are typically worse as far as quality of code than personal sites
whose entire production cycle involves simply being saved and uploaded
from Netscape Composer.
=back
The code may be sort of "accidentally semantic", however -- for example,
in a set of pages I was scanning recently, I found that looking for
"td" elements with a "width" attribute value of "375" got me exactly
what I wanted. No-one designing that page ever conceived of
"width=375" as I<meaning> "this is a headline", but if you impute it
to mean that, it works.
An approach like this happens to work for the Yahoo News code, because
the headline-links are distinguished by the fact that they (and they
alone) contain a "b" element:
<a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>
or, diagrammed as a part of the parse tree:
. a [href="...long_news_url..."]
. b
. "Elvis seen in tortilla"
A rule that matches these can be formalized as "look for any 'a'
element that has only one daughter node, which must be a 'b' element".
And this is what it looks like when cooked up as a C<look_down>
expression and prefaced with a bit of code that retrieves the text of
the given Yahoo News page and feeds it to TreeBuilder:
use strict;
use HTML::TreeBuilder 2.97;
use LWP::UserAgent;
sub get_headlines {
my $url = $_[0] || die "What URL?";
my $response = LWP::UserAgent->new->request(
HTTP::Request->new( GET => $url )
);
unless($response->is_success) {
warn "Couldn't get $url: ", $response->status_line, "\n";
return;
}
my $tree = HTML::TreeBuilder->new();
$tree->parse($response->content);
$tree->eof;
my @out;
foreach my $link (
$tree->look_down( # !
'_tag', 'a',
sub {
return unless $_[0]->attr('href');
my @c = $_[0]->content_list;
@c == 1 and ref $c[0] and $c[0]->tag eq 'b';
}
)
) {
push @out, [ $link->attr('href'), $link->as_text ];
}
warn "Odd, fewer than 6 stories in $url!" if @out < 6;
$tree->delete;
return @out;
}
...and add a bit of code to actually call that routine and display the
results...
foreach my $section (qw[tc sc hl wl en]) {
my @links = get_headlines(
"http://dailynews.yahoo.com/h/$section/"
);
print
$section, ": ", scalar(@links), " stories\n",
map((" ", $_->[0], " : ", $_->[1], "\n"), @links),
"\n";
}
And we've got our own headline-extractor service! This in and of
itself isn't no amazingly useful (since if you want to see the
headlines, you I<can> just look at the Yahoo News pages), but it could
easily be the basis for quite useful features like filtering the
headlines for matching certain keywords of interest to you.
Now, one of these days, Yahoo News will decide to change its HTML
template. When this happens, this will appear to the above program as
there being no links that meet the given criteria; or, less likely,
dozens of erroneous links will meet the criteria. In either case, the
criteria will have to be changed for the new template; they may just
need adjustment, or you may need to scrap them and start over.
=head2 I<Regardez, duvet!>
It's often quite a challenge to write criteria to match the desired
parts of an HTML parse tree. Very often you I<can> pull it off with a
simple C<$tree-E<gt>look_down('_tag', 'h1')>, but sometimes you do
have to keep adding and refining criteria, until you might end up with
complex filters like what I've shown in this article. The
benefit to learning how to deal with HTML parse trees is that one main
search tool, the C<look_down> method, can do most of the work, making
simple things easy, while still making hard things possible.
B<[end body of article]>
=head2 [Author Credit]
Sean M. Burke (C<sburke@cpan.org>) is the current maintainer of
C<HTML::TreeBuilder> and C<HTML::Element>, both originally by
Gisle Aas.
Sean adds: "I'd like to thank the folks who listened to me ramble
incessantly about HTML::TreeBuilder and HTML::Element at this year's Yet
Another Perl Conference and O'Reilly Open Source Software Convention."
=head1 BACK
Return to the L<HTML::Tree|HTML::Tree> docs.
=cut

File diff suppressed because it is too large Load Diff