Initial Commit
686
database/perl/vendor/lib/HTML/Tree/AboutObjects.pod
vendored
Normal file
@@ -0,0 +1,686 @@
|
||||
|
||||
#Time-stamp: "2001-02-23 20:07:25 MST" -*-Text-*-
|
||||
# This document contains text in Perl "POD" format.
|
||||
# Use a POD viewer like perldoc or perlman to render it.
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::Tree::AboutObjects -- article: "User's View of Object-Oriented Modules"
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
  # This is an article, not a module.
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
The following article by Sean M. Burke first appeared in I<The Perl
|
||||
Journal> #17 and is copyright 2000 The Perl Journal. It appears
|
||||
courtesy of Jon Orwant and The Perl Journal. This document may be
|
||||
distributed under the same terms as Perl itself.
|
||||
|
||||
=head1 A User's View of Object-Oriented Modules
|
||||
|
||||
-- Sean M. Burke
|
||||
|
||||
The first time that most Perl programmers run into object-oriented
programming is when they need to use a module whose interface is
object-oriented. This is often a mystifying experience, since talk of
"methods" and "constructors" is unintelligible to programmers who
thought that functions and variables were all there was to worry about.
|
||||
|
||||
Articles and books that explain object-oriented programming (OOP) do so
in terms of how to program that way. That's understandable, and if you
learn to write object-oriented code of your own, you'll find it easy to
|
||||
use object-oriented code that others write. But this approach is the
|
||||
I<long> way around for people whose immediate goal is just to use
|
||||
existing object-oriented modules, but who don't yet want to know all the
|
||||
gory details of having to write such modules for themselves.
|
||||
|
||||
This article is for those programmers -- programmers who want to know
|
||||
about objects from the perspective of using object-oriented modules.
|
||||
|
||||
=head2 Modules and Their Functional Interfaces
|
||||
|
||||
Modules are the main way that Perl provides for bundling up code for
|
||||
later use by yourself or others. As I'm sure you can't help noticing
|
||||
from reading
|
||||
I<The Perl Journal>, CPAN (the Comprehensive Perl Archive
|
||||
Network) is the repository for modules (or groups of modules) that
|
||||
others have written, to do anything from composing music to accessing
|
||||
Web pages. A good many of those modules even come with every
|
||||
installation of Perl.
|
||||
|
||||
One module that you may have used before, and which is fairly typical in
|
||||
its interface, is Text::Wrap. It comes with Perl, so you don't even
|
||||
need to install it from CPAN. You use it in a program of yours, by
|
||||
having your program code say early on:
|
||||
|
||||
  use Text::Wrap;
|
||||
|
||||
and after that, you can access a function called C<wrap>, which inserts
|
||||
line-breaks in text that you feed it, so that the text will be wrapped to
|
||||
seventy-two (or however many) columns.
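
As a minimal sketch of that call (the sample text and the forty-column
width are made up for illustration): C<wrap> takes an indent string for
the first line, an indent string for the remaining lines, and the text
to wrap. The width comes from the variable C<$Text::Wrap::columns>,
which has to be imported by name if you want to set it as plain
C<$columns>:

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap $columns);

# Illustrative value only: ask for a narrow 40-column layout.
$columns = 40;

my $text = "Modules are the main way that Perl provides for bundling "
         . "up code for later use by yourself or others.";

# First argument: indent for the first line; second: indent for the rest.
my $wrapped = wrap('', '  ', $text);
print $wrapped, "\n";
```

Text::Wrap keeps every output line shorter than C<$columns>, so the
sentence above comes out as several short lines, with the second and
later lines indented by two spaces.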
|
||||
|
||||
The way this C<use Text::Wrap> business works is that the module
|
||||
Text::Wrap exists as a file "Text/Wrap.pm" somewhere in one of your
|
||||
library directories. That file contains Perl code...
|
||||
|
||||
=over
|
||||
|
||||
Footnote: And mixed in with the Perl code, there's documentation, which
|
||||
is what you read with "perldoc Text::Wrap". The perldoc program simply
|
||||
ignores the code and formats the documentation text, whereas "use
|
||||
Text::Wrap" loads and runs the code while ignoring the documentation.
|
||||
|
||||
=back
|
||||
|
||||
...which, among other things, defines a function called C<Text::Wrap::wrap>,
|
||||
and then C<exports> that function, which means that when you say C<wrap>
|
||||
after having said "use Text::Wrap", you'll be actually calling the
|
||||
C<Text::Wrap::wrap> function. Some modules don't export their
|
||||
functions, so you have to call them by their full name, like
|
||||
C<Text::Wrap::wrap(...parameters...)>.
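
To make the difference concrete, here is a small sketch (the sample
string is invented) that loads Text::Wrap with an empty import list, so
that nothing is exported and the function has to be called by its full
name:

```perl
use strict;
use warnings;
use Text::Wrap ();   # empty list: load the module, import nothing

my $text = "You can always call a function by its full name.";

# No "wrap" was imported here, so use the package-qualified name.
my $wrapped = Text::Wrap::wrap('', '', $text);
print $wrapped, "\n";
```

The fully qualified call works whether or not the module exports
anything, which is why documentation sometimes shows it that way.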
|
||||
|
||||
Regardless of whether the typical module exports the functions it
|
||||
provides, a module is basically just a container for chunks of code that
|
||||
do useful things. The way a module lets you interact with
it is its I<interface>. And when, as with Text::Wrap, its interface
|
||||
consists of functions, the module is said to have a B<functional
|
||||
interface>.
|
||||
|
||||
=over
|
||||
|
||||
Footnote: the term "function" (and therefore "functionI<al>") has
|
||||
various senses. I'm using the term here in its broadest sense, to
|
||||
refer to routines -- bits of code that are called by some name and
|
||||
which take parameters and return some value.
|
||||
|
||||
=back
|
||||
|
||||
Using modules with functional interfaces is straightforward -- instead
|
||||
of defining your own "wrap" function with C<sub wrap { ... }>, you
|
||||
entrust "use Text::Wrap" to do that for you, along with whatever other
|
||||
functions it defines and exports, according to the module's
|
||||
documentation. Without too much bother, you can even write your own
|
||||
modules to contain your frequently used functions; I suggest having a look at
|
||||
the C<perlmod> man page for more leads on doing this.
|
||||
|
||||
=head2 Modules with Object-Oriented Interfaces
|
||||
|
||||
So suppose that one day you want to write a program that will automate
|
||||
the process of C<ftp>ing a bunch of files from one server down to your
|
||||
local machine, and then off to another server.
|
||||
|
||||
A quick browse through search.cpan.org turns up the module "Net::FTP",
|
||||
which you can download and install using normal installation
|
||||
instructions (unless your sysadmin has already installed it, as many
|
||||
have).
|
||||
|
||||
Like Text::Wrap or any other module with a familiarly functional
|
||||
interface, you start off using Net::FTP in your program by saying:
|
||||
|
||||
  use Net::FTP;
|
||||
|
||||
However, that's where the similarity ends. The first hint of
|
||||
difference is that the documentation for Net::FTP refers to it as a
|
||||
B<class>. A class is a kind of module, but one that has an
|
||||
object-oriented interface.
|
||||
|
||||
Whereas modules like Text::Wrap
|
||||
provide bits of useful code as I<functions>, to be called like
|
||||
C<function(...parameters...)> or like
|
||||
C<PackageName::function(...parameters...)>, Net::FTP and other modules
|
||||
with object-oriented interfaces provide B<methods>. Methods are sort of
|
||||
like functions in that they have a name and parameters; but methods
|
||||
look different, and are different, because you have to call them with a
|
||||
syntax that has a class name or an object as a special argument. I'll
|
||||
explain the syntax for method calls, and then later explain what they
|
||||
all mean.
|
||||
|
||||
Some methods are meant to be called as B<class methods>, with the class
|
||||
name (same as the module name) as a special argument. Class methods
|
||||
look like this:
|
||||
|
||||
  ClassName->methodname(parameter1, parameter2, ...)
  ClassName->methodname()   # if no parameters
  ClassName->methodname     # same as above
|
||||
|
||||
which you will sometimes see written:
|
||||
|
||||
  methodname ClassName (parameter1, parameter2, ...)
  methodname ClassName      # if no parameters
|
||||
|
||||
Basically all class methods are for making new objects, and methods that
|
||||
make objects are called "B<constructors>" (and the process of making them
|
||||
is called "constructing" or "instantiating"). Constructor methods
|
||||
typically have the name "new", or something including "new"
|
||||
("new_from_file", etc.); but they can conceivably be named
|
||||
anything -- DBI's constructor method is named "connect", for example.
|
||||
|
||||
The object that a constructor method returns is
|
||||
typically captured in a scalar variable:
|
||||
|
||||
  $object = ClassName->new(param1, param2...);
|
||||
|
||||
Once you have an object (more later on exactly what that is), you can
|
||||
use the other kind of method call syntax, the syntax for B<object method>
|
||||
calls. Calling object methods is just like calling class methods, except
that instead of the ClassName as the special argument,
you use an expression that yields an "object". Usually this is
|
||||
just a scalar variable that you earlier captured the
|
||||
output of the constructor in. Object method calls look like this:
|
||||
|
||||
  $object->methodname(parameter1, parameter2, ...);
  $object->methodname()   # if no parameters
  $object->methodname     # same as above
|
||||
|
||||
which is occasionally written as:
|
||||
|
||||
  methodname $object (parameter1, parameter2, ...)
  methodname $object      # if no parameters
|
||||
|
||||
Examples of method calls are:
|
||||
|
||||
  my $session1 = Net::FTP->new("ftp.myhost.com");
    # Calls a class method "new", from class Net::FTP,
    #  with the single parameter "ftp.myhost.com",
    #  and saves the return value (which is, as usual,
    #  an object), in $session1.
    # Could also be written:
    #  new Net::FTP('ftp.myhost.com')
  $session1->login("sburke","aoeuaoeu")
    || die "failed to login!\n";
    # calling the object method "login"
  print "Dir:\n", $session1->dir(), "\n";
  $session1->quit;
    # same as $session1->quit()
  print "Done\n";
  exit;
|
||||
|
||||
Incidentally, I suggest always using the syntaxes with parentheses and
|
||||
"->" in them,
|
||||
|
||||
=over
|
||||
|
||||
Footnote: the character-pair "->" is supposed to look like an
|
||||
arrow, not "negative greater-than"!
|
||||
|
||||
=back
|
||||
|
||||
and avoiding the syntaxes that start out "methodname $object" or
|
||||
"methodname ModuleName". When everything's going right, they all mean
|
||||
the same thing as the "->" variants, but the syntax with "->" is more
|
||||
visually distinct from function calls, as well as being immune to some
|
||||
kinds of rare but puzzling ambiguities that can arise when you're trying
|
||||
to call methods that have the same name as subroutines you've defined.
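
For a version of those two call styles that runs without a network
connection, here is a sketch using Math::BigInt, an object-oriented
module that comes with Perl; the number chosen is arbitrary. The
constructor is a class method, and everything after it is an object
method called on the value that the constructor returned:

```perl
use strict;
use warnings;
use Math::BigInt;

# Class method call: the class name goes before the arrow.
my $n = Math::BigInt->new("12345678901234567890");

# Object method calls: the object value goes before the arrow.
$n->badd(1);                 # add 1 to the number held by the object
my $digits = $n->length();   # how many digits it has now
my $string = $n->bstr;       # same as $n->bstr()

print "$string has $digits digits\n";
```

Every call here uses the "->" form with parentheses wherever there are
arguments, which is the style recommended above.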
|
||||
|
||||
But, syntactic alternatives aside, all this talk of constructing objects
|
||||
and object methods begs the question -- what I<is> an object? There are
|
||||
several angles to this question that the rest of this article will
|
||||
answer in turn: what can you do with objects? what's in an object?
|
||||
what's an object value? and why do some modules use objects at all?
|
||||
|
||||
=head2 What Can You Do with Objects?
|
||||
|
||||
You've seen that you can make objects, and call object methods with
|
||||
them. But what are object methods for? The answer depends on the class:
|
||||
|
||||
A Net::FTP object represents a session between your computer and an FTP
|
||||
server. So the methods you call on a Net::FTP object are for doing
|
||||
whatever you'd need to do across an FTP connection. You make the
|
||||
session and log in:
|
||||
|
||||
  my $session = Net::FTP->new('ftp.aol.com');
  die "Couldn't connect!" unless defined $session;
    # The class method call to "new" will return
    # the new object if it goes OK, otherwise it
    # will return undef.

  $session->login('sburke', 'p@ssw3rD')
    || die "Did I change my password again?";
    # The object method "login" will give a true
    # return value if it actually logs in, otherwise
    # it'll return false.
|
||||
|
||||
You can use the session object to change directory on that session:
|
||||
|
||||
  $session->cwd("/home/sburke/public_html")
    || die "Hey, that was REALLY supposed to work!";
    # if the cwd fails, it'll return false
|
||||
|
||||
...get files from the machine at the other end of the session...
|
||||
|
||||
  foreach my $f ('log_report_ua.txt', 'log_report_dom.txt',
                 'log_report_browsers.txt')
  {
    $session->get($f) || warn "Getting $f failed!"
  };
|
||||
|
||||
...and plenty else, ending finally with closing the connection:
|
||||
|
||||
  $session->quit();
|
||||
|
||||
In short, object methods are for doing things related to (or with)
|
||||
whatever the object represents. For FTP sessions, it's about sending
|
||||
commands to the server at the other end of the connection, and that's
|
||||
about it -- there, methods are for doing something to the world outside
|
||||
the object, and the object is just something that specifies what bit
|
||||
of the world (well, what FTP session) to act upon.
|
||||
|
||||
With most other classes, however, the object itself stores some kind of
|
||||
information, and it typically makes no sense to do things with such an
|
||||
object without considering the data that's in the object.
|
||||
|
||||
=head2 What's I<in> an Object?
|
||||
|
||||
An object is (with rare exceptions) a data structure containing a
|
||||
bunch of attributes, each of which has a value, as well as a name
|
||||
that you use when you
|
||||
read or set the attribute's value. Some of the object's attributes are
|
||||
private, meaning you'll never see them documented because they're not
|
||||
for you to read or write; but most of the object's documented attributes
|
||||
are at least readable, and usually writeable, by you. Net::FTP objects
|
||||
are a bit thin on attributes, so we'll use objects from the class
|
||||
Business::US_Amort for this example. Business::US_Amort is a very
|
||||
simple class (available from CPAN) that I wrote for making calculations
|
||||
to do with loans (specifically, amortization, using US-style
|
||||
algorithms).
|
||||
|
||||
An object of the class Business::US_Amort represents a loan with
|
||||
particular parameters, i.e., attributes. The most basic attributes of a
|
||||
"loan object" are its interest rate, its principal (how much money it's
|
||||
for), and its term (how long it'll take to repay). You need to set
|
||||
these attributes before anything else can be done with the object. The
|
||||
way to get at those attributes for loan objects is just like the
|
||||
way to get at attributes for any class's objects: through accessors.
|
||||
An B<accessor> is simply any method that accesses (whether reading or
|
||||
writing, AKA getting or putting) some attribute in the given object.
|
||||
Moreover, accessors are the B<only> way that you can change
|
||||
an object's attributes. (If a module's documentation wants you to
|
||||
know about any other way, it'll tell you.)
|
||||
|
||||
Usually, for simplicity's sake, an accessor is named after the attribute
|
||||
it reads or writes. With Business::US_Amort objects, the accessors you
|
||||
need to use first are C<principal>, C<interest_rate>, and C<term>.
|
||||
Then, with at least those attributes set, you can call the C<run> method
|
||||
to figure out several things about the loan. Then you can call various
|
||||
accessors, like C<total_paid_interest>, to read the results:
|
||||
|
||||
  use Business::US_Amort;
  my $loan = Business::US_Amort->new;
  # Set the necessary attributes:
  $loan->principal(123654);
  $loan->interest_rate(9.25);
  $loan->term(20); # twenty years

  # NOW we know enough to calculate:
  $loan->run;

  # And see what came of that:
  print
    "Total paid toward interest: A WHOPPING ",
    $loan->total_paid_interest, "!!\n";
|
||||
|
||||
This illustrates a convention that's common with accessors: calling the
|
||||
accessor with no arguments (as with $loan->total_paid_interest) usually
|
||||
means to read the value of that attribute, but providing a value (as
|
||||
with $loan->term(20)) means you want that attribute to be set to that
|
||||
value. This stands to reason: why would you be providing a value, if
|
||||
not to set the attribute to that value?
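
Here is a self-contained sketch of that get-or-set convention. The tiny
C<Demo::Loan> class below is hypothetical, written only so that the
example can run anywhere; it is not Business::US_Amort and makes no
claim about how that module is implemented. The part to look at is the
last few lines, which use the accessor exactly as a module user would:

```perl
use strict;
use warnings;

# Hypothetical stand-in class, just enough to demonstrate accessors.
package Demo::Loan;

sub new { my ($class) = @_; return bless {}, $class }

# One combined getter/setter: with an argument it sets, without it reads.
sub term {
    my $self = shift;
    $self->{term} = shift if @_;
    return $self->{term};
}

package main;

my $loan = Demo::Loan->new;
$loan->term(20);            # a value is given, so this sets the attribute
my $years = $loan->term;    # no value given, so this only reads it
print "Term: $years years\n";
```

Not every class follows this convention (some use separate methods for
getting and setting), so a module's own documentation has the final say.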
|
||||
|
||||
Although a loan's term, principal, and interest rate are all single
numeric values, an object's attribute values can be any kind of scalar, or an array,
|
||||
or even a hash. Moreover, an attribute's value(s) can be objects
|
||||
themselves. For example, consider MIDI files (as I wrote about in
|
||||
TPJ#13): a MIDI file usually consists of several tracks. A MIDI file is
|
||||
complex enough to merit being an object with attributes like its overall
|
||||
tempo, the file-format variant it's in, and the list of instrument
|
||||
tracks in the file. But tracks themselves are complex enough to be
|
||||
objects too, with attributes like their track-type, a list of MIDI
|
||||
commands if they're a MIDI track, or raw data if they're not. So I
|
||||
ended up writing the MIDI modules so that the "tracks" attribute of a
|
||||
MIDI::Opus object is an array of objects from the class MIDI::Track.
|
||||
This may seem like a runaround -- you ask what's in one object, and get
|
||||
I<another> object, or several! But in this case, it exactly reflects
|
||||
what the module is for -- MIDI files contain MIDI tracks, which then
|
||||
contain data.
|
||||
|
||||
=head2 What is an Object Value?
|
||||
|
||||
When you call a constructor like Net::FTP->new(I<hostname>), you get
|
||||
back an object value, a value you can later use, in combination with a
|
||||
method name, to call object methods.
|
||||
|
||||
Now, so far we've been pretending, in the above examples, that the
|
||||
variables $session or $loan I<are> the objects you're dealing with.
|
||||
This idea is innocuous up to a point, but it's really a misconception
|
||||
that will, at best, limit you in what you know how to do. The reality
|
||||
is not that the variables $session or $loan are objects; it's a little
more indirect -- they I<hold> values that symbolize objects. The kind of
value that $session or $loan hold is what I'm calling an object value.
|
||||
|
||||
To understand what kind of value this is, first think about the other
|
||||
kinds of scalar values you know about: The first two scalar values you
|
||||
probably ever ran into in Perl are B<numbers> and B<strings>, which you
|
||||
learned (or just assumed) will usually turn into each other on demand;
|
||||
that is, the three-character string "2.5" can become the quantity two
|
||||
and a half, and vice versa. Then, especially if you started using
|
||||
C<perl -w> early on, you learned about the B<undefined value>, which can
|
||||
turn into 0 if you treat it as a number, or the empty-string if you
|
||||
treat it as a string.
|
||||
|
||||
=over
|
||||
|
||||
Footnote: You may I<also> have been learning about references, in which
|
||||
case you're ready to hear that object values are just a kind of
|
||||
reference, except that they reflect the class that created the thing they point
to, instead of merely being a plain old array reference, hash reference,
etc. I<If> this makes sense to you, and you want to know more
|
||||
about how objects are implemented in Perl, have a look at the
|
||||
C<perltoot> man page.
|
||||
|
||||
=back
|
||||
|
||||
And now you're learning about B<object values>. An object value is a
|
||||
value that points to a data structure somewhere in memory, which is
|
||||
where all the attributes for this object are stored. That data
|
||||
structure as a whole belongs to a class (probably the one you named in
|
||||
the constructor method, like ClassName->new), so that the object value
|
||||
can be used as part of object method calls.
|
||||
|
||||
If you want to actually I<see> what an object value is, you might try
|
||||
just saying "print $object". That'll get you something like this:
|
||||
|
||||
  Net::FTP=GLOB(0x20154240)
|
||||
|
||||
or
|
||||
|
||||
  Business::US_Amort=HASH(0x15424020)
|
||||
|
||||
That's not very helpful if you wanted to really get at the object's
|
||||
insides, but that's because the object value is only a symbol for the
|
||||
object. This may all sound very abstruse and metaphysical, so a
|
||||
real-world allegory might be very helpful:
|
||||
|
||||
=over
|
||||
|
||||
You get an advertisement in the mail saying that you have been
|
||||
(im)personally selected to have the rare privilege of applying for a
|
||||
credit card. For whatever reason, I<this> offer sounds good to you, so you
|
||||
fill out the form and mail it back to the credit card company. They
|
||||
gleefully approve the application and create your account, and send you
|
||||
a card with a number on it.
|
||||
|
||||
Now, you can do things with the number on that card -- clerks at stores
|
||||
can ring up things you want to buy, and charge your account by keying in
|
||||
the number on the card. You can pay for things you order online by
|
||||
punching in the card number as part of your online order. You can pay
|
||||
off part of the account by sending the credit card people some of your
|
||||
money (well, a check) with some note (usually the pre-printed slip)
|
||||
that has the card number for the account you want to pay toward. And you
|
||||
should be able to call the credit card company's computer and ask it
|
||||
things about the card, like its balance, its credit limit, its APR, and
|
||||
maybe an itemization of recent purchases and payments.
|
||||
|
||||
Now, what you're I<really> doing is manipulating a credit card
|
||||
I<account>, a completely abstract entity with some data attached to it
|
||||
(balance, APR, etc). But for ease of access, you have a credit card
|
||||
I<number> that is a symbol for that account. Now, that symbol is just a
|
||||
bunch of digits, and the number is effectively meaningless and useless
|
||||
in and of itself -- but in the appropriate context, it's understood to
|
||||
I<mean> the credit card account you're accessing.
|
||||
|
||||
=back
|
||||
|
||||
This is exactly the relationship between objects and object values, and
|
||||
from this analogy, several facts about object values are a bit more
|
||||
explicable:
|
||||
|
||||
* An object value does nothing in and of itself, but it's useful when
|
||||
you use it in the context of an $object->method call, the same way that
|
||||
a card number is useful in the context of some operation dealing with a
|
||||
card account.
|
||||
|
||||
Moreover, several copies of the same object value all refer to the same
|
||||
object, the same way that making several copies of your card number
|
||||
won't change the fact that they all still refer to the same single
|
||||
account (this is true whether you're "copying" the number by just
|
||||
writing it down on different slips of paper, or whether you go to the
|
||||
trouble of forging exact replicas of your own plastic credit card). That's
|
||||
why this:
|
||||
|
||||
  $x = Net::FTP->new("ftp.aol.com");
  $x->login("sburke", "aoeuaoeu");
|
||||
|
||||
does the same thing as this:
|
||||
|
||||
  $x = Net::FTP->new("ftp.aol.com");
  $y = $x;
  $z = $y;
  $z->login("sburke", "aoeuaoeu");
|
||||
|
||||
That is, $z and $y and $x are three different I<slots> for values,
|
||||
but what's in those slots are all object values pointing to the same
|
||||
object -- you don't have three different FTP connections, just three
|
||||
variables with values pointing to the same single FTP connection.
|
||||
|
||||
* You can't tell much of anything about the object just by looking at
|
||||
the object value, any more than you can see your credit account balance
|
||||
by holding the plastic card up to the light, or by adding up the digits
|
||||
in your credit card number.
|
||||
|
||||
* You can't just make up your own object values and have them work --
|
||||
they can come only from constructor methods of the appropriate class.
|
||||
Similarly, you get a credit card number I<only> by having a bank approve
|
||||
your application for a credit card account -- at which point I<they>
|
||||
let I<you> know what the number of your new card is.
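
The point about copies can be checked with any class whose methods
change the object. Here is a sketch using Math::BigInt (bundled with
Perl; the numbers are arbitrary), where C<badd> modifies the object in
place and C<copy> is that class's way of asking for a genuinely separate
object:

```perl
use strict;
use warnings;
use Math::BigInt;
use Scalar::Util qw(refaddr);

my $x = Math::BigInt->new(10);
my $y = $x;          # copies the object value, not the object
my $z = $y;

$z->badd(5);         # changes the one object that all three point to
print "x is now ", $x->bstr, "\n";

# refaddr gives the memory address that an object value points to.
my $same = refaddr($x) == refaddr($z);

my $separate = $x->copy;   # a new object with the same contents
$separate->badd(100);
print "x is still ", $x->bstr, ", the copy is ", $separate->bstr, "\n";
```

Adding through C<$z> shows up when reading through C<$x>, because there
was only ever one object; only the explicit C<copy> call produced a
second one.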
|
||||
|
||||
Now, there's even more to the fact that you can't just make up your own
|
||||
object value: even though you can print an object value and get a string
|
||||
like "Net::FTP=GLOB(0x20154240)", that string is just a
|
||||
I<representation> of an object value.
|
||||
|
||||
Internally, an object value has a basically different type from a
|
||||
string, or a number, or the undefined value -- if $x holds a real
|
||||
string, then that value's slot in memory says "this is a value of type
|
||||
I<string>, and its characters are...", whereas if it's an object value,
|
||||
the value's slot in memory says, "this is a value of type I<reference>,
|
||||
and the location in memory that it points to is..." (and by looking at
|
||||
what's at that location, Perl can tell the class of what's there).
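
Perl will tell you what it sees at the other end of an object value if
you ask. Here is a sketch using an IO::Handle object (a class that ships
with Perl, chosen only because it needs no network connection) and two
helper functions from the core Scalar::Util module:

```perl
use strict;
use warnings;
use IO::Handle;
use Scalar::Util qw(blessed reftype);

my $object = IO::Handle->new;
my $string = "just a string";

print "As a string: $object\n";                 # e.g. IO::Handle=GLOB(0x...)
print "Class:       ", ref($object), "\n";      # class the object belongs to
print "Built on:    ", reftype($object), "\n";  # underlying data structure

# blessed() returns the class for object values, undef for anything else.
my $object_class = blessed($object);
my $string_class = blessed($string);
```

The hexadecimal address in the first line of output will differ from
run to run; the class name and the underlying type will not.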
|
||||
|
||||
Perl programmers typically don't have to think about all these details
|
||||
of Perl's internals. Many other languages force you to be more
|
||||
conscious of the differences between all of these (and also between
|
||||
types of numbers, which are stored differently depending on their size
|
||||
and whether they have fractional parts). But Perl does its best to
|
||||
hide the different types of scalars from you -- it turns numbers into
|
||||
strings and back as needed, and takes the string or number
|
||||
representation of undef or of object values as needed. However, you
|
||||
can't go from a string representation of an object value, back to an
|
||||
object value. And that's why this doesn't work:
|
||||
|
||||
  $x = Net::FTP->new('ftp.aol.com');
  $y = Net::FTP->new('ftp.netcom.com');
  $z = Net::FTP->new('ftp.qualcomm.com');
  $all = join(' ', $x,$y,$z);           # !!!
  ...later...
  ($aol, $netcom, $qualcomm) = split(' ', $all);  # !!!
  $aol->login("sburke", "aoeuaoeu");
  $netcom->login("sburke", "qjkxqjkx");
  $qualcomm->login("smb", "dhtndhtn");
|
||||
|
||||
This fails because $aol ends up holding merely the B<string representation>
|
||||
of the object value from $x, not the object value itself -- when
|
||||
C<join> tried to join the characters of the "strings" $x, $y, and $z,
|
||||
Perl saw that they weren't strings at all, so it gave C<join> their
|
||||
string representations.
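
Here is a network-free sketch of the same failure, using IO::Handle
objects in place of FTP sessions (a substitution made only so that the
code can run anywhere). After the round trip through C<join> and
C<split>, what comes back is an ordinary string, and trying to call a
method on it fails:

```perl
use strict;
use warnings;
use IO::Handle;

my $x = IO::Handle->new;
my $y = IO::Handle->new;

my $all = join(' ', $x, $y);               # objects become strings here
my ($first, $second) = split(' ', $all);

my $is_still_object = ref($first) ? 1 : 0;

# Perl treats the string as a (nonexistent) class name, so this dies.
my $ok    = eval { $first->opened; 1 };
my $error = $ok ? '' : $@;

print "Still an object? ", ($is_still_object ? "yes" : "no"), "\n";
print "Error: $error";
```

The error message complains that it cannot find the method in a package
whose "name" is the whole string representation, which is a handy
symptom to recognize when this mistake turns up in real code.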
|
||||
|
||||
Unfortunately, this distinction between object values and their string
|
||||
representations doesn't really fit into the analogy of credit card
|
||||
numbers, because credit card numbers really I<are> numbers -- even
|
||||
though they don't express any meaningful quantity, if you stored them
|
||||
in a database as a quantity (as opposed to just an ASCII string),
|
||||
that wouldn't stop them from being valid as credit card numbers.
|
||||
|
||||
This may seem rather academic, but there are two common mistakes
|
||||
programmers new to objects often make, which make sense only in terms of
|
||||
the distinction between object values and their string representations:
|
||||
|
||||
The first common error involves forgetting (or never having known in the
|
||||
first place) that when you go to use a value as a hash key, Perl uses
|
||||
the string representation of that value. When you want to use the
|
||||
numeric value two and a half as a key, Perl turns it into the
|
||||
three-character string "2.5". But if you then want to use that string
|
||||
as a number, Perl will treat it as meaning two and a half, so you're
|
||||
usually none the wiser that Perl converted the number to a string and
|
||||
back. But recall that Perl can't turn strings back into objects -- so
|
||||
if you tried to use a Net::FTP object value as a hash key, Perl actually
|
||||
used its string representation, like "Net::FTP=GLOB(0x20154240)", but
|
||||
that string is unusable as an object value. (Incidentally, there's
|
||||
a module Tie::RefHash that implements hashes that I<do> let you use
|
||||
real object-values as keys.)
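
Here is a brief sketch of both behaviours, with IO::Handle standing in
for whatever class you are actually using (that substitution, and the
"primary" label, are just for illustration). Tie::RefHash comes with
Perl:

```perl
use strict;
use warnings;
use IO::Handle;
use Tie::RefHash;

my $object = IO::Handle->new;

# Ordinary hash: the key is silently turned into a string.
my %plain = ($object => 'primary');
my ($plain_key) = keys %plain;

# Tied hash: keys that are references come back as references.
tie my %by_object, 'Tie::RefHash';
$by_object{$object} = 'primary';
my ($real_key) = keys %by_object;

print "plain key is ", (ref($plain_key) ? "an object" : "a string"), "\n";
print "tied key is ",  (ref($real_key)  ? "an object" : "a string"), "\n";
```

With the tied hash you can loop over the keys and call methods on each
one, which is usually what people wanted from the ordinary hash in the
first place.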
|
||||
|
||||
The second common error with object values is in
|
||||
trying to save an object value to disk (whether printing it to a
|
||||
file, or storing it in a conventional database file). All you'll get is the
|
||||
string, which will be useless.
|
||||
|
||||
When you want to save an object and restore it later, you may find that
|
||||
the object's class already provides a method specifically for this. For
|
||||
example, MIDI::Opus provides methods for writing an object to disk as a
|
||||
standard MIDI file. The file can later be read back into memory by
|
||||
a MIDI::Opus constructor method, which will return a new MIDI::Opus
|
||||
object representing whatever file you tell it to read into memory.
|
||||
Similar methods are available with, for example, classes that
|
||||
manipulate graphic images and can save them to files, which can be read
|
||||
back later.
|
||||
|
||||
But some classes, like Business::US_Amort, provide no such methods for
|
||||
storing an object in a file. When this is the case, you can try
|
||||
using any of the Data::Dumper, Storable, or FreezeThaw modules. Using
|
||||
these will be unproblematic for objects of most classes, but it may run
|
||||
into limitations with others. For example, a Business::US_Amort
|
||||
object can be turned into a string with Data::Dumper, and that string
|
||||
written to a file. When it's restored later, its attributes will be
|
||||
accessible as normal. But in the unlikely case that the loan object was
|
||||
saved in mid-calculation, the calculation may not be resumable. This is
|
||||
because of the way that that I<particular> class does its calculations,
|
||||
but similar limitations may occur with objects from other classes.
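
As a rough sketch of that approach with Storable (bundled with Perl):
the C<Demo::Loan> class here is hypothetical and stands in for any
simple hash-based class, and the temporary file comes from the core
File::Temp module. Whether this works for a given real class depends on
how that class stores its data, so treat it as a starting point rather
than a guarantee:

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Hypothetical hash-based class with a single read-only accessor.
package Demo::Loan;
sub new       { my ($class, %args) = @_; return bless {%args}, $class }
sub principal { return $_[0]{principal} }

package main;

my $loan = Demo::Loan->new(principal => 123654);

my ($fh, $filename) = tempfile(UNLINK => 1);
close $fh;
store($loan, $filename);              # write the object's data to disk

my $restored = retrieve($filename);   # a new object, same class and data
print ref($restored), " with principal ", $restored->principal, "\n";
```

Note that the program doing the restoring must have loaded the class
itself; the file holds the object's data and class name, not the
class's code.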
|
||||
|
||||
But often, even I<wanting> to save an object is basically wrong -- what would
|
||||
saving an ftp I<session> even mean? Saving the hostname, username, and
|
||||
password? current directory on both machines? the local TCP/IP port
|
||||
number? In the case of "saving" a Net::FTP object, you're better off
|
||||
just saving whatever details you actually need for your own purposes,
|
||||
so that you can make a new object later and just set those values for it.
|
||||
|
||||
=head2 So Why Do Some Modules Use Objects?
|
||||
|
||||
All these details of using objects are definitely enough to make you
|
||||
wonder -- is it worth the bother? If you're a module author, writing
|
||||
your module with an object-oriented interface restricts the audience of
|
||||
potential users to those who understand the basic concepts of objects
|
||||
and object values, as well as Perl's syntax for calling methods. Why
|
||||
complicate things by having an object-oriented interface?
|
||||
|
||||
A somewhat esoteric answer is that a module has an object-oriented
|
||||
interface because the module's insides are written in an
|
||||
object-oriented style. This article is about the basics of
|
||||
object-oriented I<interfaces>, and it'd be going far afield to explain
|
||||
what object-oriented I<design> is. But the short story is that
|
||||
object-oriented design is just one way of attacking messy problems.
|
||||
It's a way that many programmers find very helpful (and which others
|
||||
happen to find to be far more of a hassle than it's worth,
|
||||
incidentally), and it just happens to show up for you, the module user,
|
||||
as merely the style of interface.
|
||||
|
||||
But a simpler answer is that a functional interface is sometimes a
|
||||
hindrance, because it limits the number of things you can do at once --
|
||||
limiting it, in fact, to one. For many problems that some modules are
|
||||
meant to solve, doing without an object-oriented interface would be like
|
||||
wishing that Perl didn't use filehandles. The idea is rather simple
|
||||
-- just imagine that Perl let you access files, but I<only> one at a
|
||||
time, with code like:
|
||||
|
||||
  open("foo.txt") || die "Can't open foo.txt: $!";
  while(readline) {
    print $_ if /bar/;
  }
  close;
|
||||
|
||||
That hypothetical kind of Perl would be simpler, by doing without
|
||||
filehandles. But you'd be out of luck if you wanted to read from
|
||||
one file while reading from another, or read from two and print to a
|
||||
third.
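
For contrast with that hypothetical one-file-at-a-time Perl, here is a minimal sketch of what real filehandles allow: reading from two sources and printing to a third in the same loop. The in-memory strings and the variable names are assumptions, chosen only so that the example runs without any files on disk.

```perl
use strict;
use warnings;

# Hypothetical sample data; in-memory "files" keep the sketch self-contained.
my $foo_text = "bar one\nnothing\nbar two\n";
my $baz_text = "alpha\nbeta\ngamma\n";
my $out_text = '';

open(my $foo, '<', \$foo_text) or die "Can't open foo: $!";
open(my $baz, '<', \$baz_text) or die "Can't open baz: $!";
open(my $out, '>', \$out_text) or die "Can't open out: $!";

# Read the two inputs in step, and print to the third handle.
while (defined(my $line = <$foo>)) {
    my $other = <$baz>;
    $other = '' unless defined $other;
    chomp $other;
    print {$out} "$other: $line" if $line =~ /bar/;
}

close $_ for $foo, $baz, $out;
print $out_text;
```

Each handle keeps track of its own position, which is exactly the kind of per-thing state that an object keeps for you.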
|
||||
|
||||
In the same way, a functional FTP module would be fine for just
|
||||
uploading files to one server at a time, but it wouldn't allow you to
|
||||
easily write programs that need to use I<several> simultaneous
|
||||
sessions (like "look at server A and server B, and if A has a file
|
||||
called X.dat, then download it locally and then upload it to server B --
|
||||
except if B has a file called Y.dat, in which case do it the other way
|
||||
around").
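
The "A and B" scenario can be sketched without a network. Toy::Session below is a made-up stand-in for a session class (it is not Net::FTP, and its C<put>, C<get>, and C<has> methods are assumptions for illustration only); the point is just that two objects can be alive at once, each with its own state.

```perl
use strict;
use warnings;

# Toy::Session is a made-up stand-in for something like an FTP session class.
package Toy::Session;

sub new {
    my ($class, $host) = @_;
    return bless { host => $host, files => {} }, $class;
}
sub put { my ($self, $name, $data) = @_; $self->{files}{$name} = $data; return 1 }
sub get { my ($self, $name) = @_; return $self->{files}{$name} }
sub has { my ($self, $name) = @_; return exists $self->{files}{$name} }

package main;

# Two sessions, alive at the same time, each with its own state.
my $server_a = Toy::Session->new('a.example.com');
my $server_b = Toy::Session->new('b.example.com');
$server_a->put('X.dat', 'payload');

# If B has Y.dat, copy it to A; otherwise, if A has X.dat, copy it to B.
if ($server_b->has('Y.dat')) {
    $server_a->put('Y.dat', $server_b->get('Y.dat'));
} elsif ($server_a->has('X.dat')) {
    $server_b->put('X.dat', $server_a->get('X.dat'));
}

print "B has X.dat\n" if $server_b->has('X.dat');
```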
|
||||
|
||||
Some kinds of problems that modules solve just lend themselves to an
|
||||
object-oriented interface. For those kinds of tasks, a functional
|
||||
interface would be more familiar, but less powerful. Learning to use
|
||||
object-oriented modules' interfaces does require becoming comfortable
|
||||
with the concepts from this article. But in the end it will allow you
|
||||
to use a broader range of modules and, with them, to write programs
|
||||
that can do more.
|
||||
|
||||
B<[end body of article]>
|
||||
|
||||
=head2 [Author Credit]
|
||||
|
||||
Sean M. Burke has contributed several modules to CPAN, about half of
|
||||
them object-oriented.
|
||||
|
||||
[The next section should be in a greybox:]
|
||||
|
||||
=head2 The Gory Details
|
||||
|
||||
For sake of clarity of explanation, I had to oversimplify some of the
|
||||
facts about objects. Here's a few of the gorier details:
|
||||
|
||||
* Every example I gave of a constructor was a class method. But object
|
||||
methods can be constructors, too, if the class was written to work that
|
||||
way: $new = $old->copy, $node_y = $node_x->new_subnode, or the like.
|
||||
|
||||
* I've given the impression that there's two kinds of methods: object
|
||||
methods and class methods. In fact, the same method can be both,
|
||||
because it's not the kind of method it is, but the kind of calls it's
|
||||
written to accept -- calls that pass an object, or calls that pass a
|
||||
class-name.
|
||||
|
||||
* The term "object value" isn't something you'll find used much anywhere
|
||||
else. It's just my shorthand for what would properly be called an
|
||||
"object reference" or "reference to a blessed item". In fact, people
|
||||
usually say "object" when they properly mean a reference to that object.
|
||||
|
||||
* I mentioned creating objects with I<con>structors, but I didn't
|
||||
mention destroying them with I<de>structors -- a destructor is a kind of
|
||||
method that you call to tidy up the object once you're done with it, and
|
||||
want it to neatly go away (close connections, delete temporary files,
|
||||
free up memory, etc). But because of the way Perl handles memory,
|
||||
most modules won't require the user to know about destructors.
|
||||
|
||||
* I said that class method syntax has to have the class name, as in
|
||||
$session = B<Net::FTP>->new($host). Actually, you can instead use any
|
||||
expression that returns a class name: $ftp_class = 'Net::FTP'; $session
|
||||
= B<$ftp_class>->new($host). Moreover, instead of the method name for
|
||||
object- or class-method calls, you can use a scalar holding the method
|
||||
name: $foo->B<$method>($host). But, in practice, these syntaxes are
|
||||
rarely useful.
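
A few of the details above can be seen in one short sketch. Toy::Node is a made-up class, not part of any real module; its C<new> accepts both class-method and object-method calls, and the last lines use a scalar for the class name and another for the method name.

```perl
use strict;
use warnings;

# Toy::Node is a made-up class, not part of any real module.
package Toy::Node;

# Works as a class method (Toy::Node->new) or as an object method
# ($node->new), because it checks what kind of invocant it was given.
sub new {
    my ($invocant, %args) = @_;
    my $class = ref($invocant) || $invocant;
    my $self  = bless { name => $args{name} }, $class;
    # When called on an object, remember that object as the parent.
    $self->{parent} = $invocant if ref $invocant;
    return $self;
}
sub name   { return $_[0]{name} }
sub parent { return $_[0]{parent} }

package main;

my $root  = Toy::Node->new(name => 'root');   # class-method call
my $child = $root->new(name => 'child');      # object-method call

# The class name and the method name can both come from scalars.
my $class  = 'Toy::Node';
my $method = 'name';
my $other  = $class->new(name => 'other');
print $other->$method(), "\n";                # prints "other"
print $child->parent->name, "\n";             # prints "root"
```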
|
||||
|
||||
And finally, to learn about objects from the perspective of writing
|
||||
your own classes, see the C<perltoot> documentation,
|
||||
or Damian Conway's exhaustive and clear book I<Object Oriented Perl>
|
||||
(Manning Publications 1999, ISBN 1-884777-79-1).
|
||||
|
||||
=head1 BACK
|
||||
|
||||
Return to the L<HTML::Tree|HTML::Tree> docs.
|
||||
|
||||
=cut
|
||||
|
||||
1369
database/perl/vendor/lib/HTML/Tree/AboutTrees.pod
vendored
Normal file
File diff suppressed because it is too large
714
database/perl/vendor/lib/HTML/Tree/Scanning.pod
vendored
Normal file
@@ -0,0 +1,714 @@
|
||||
|
||||
#Time-stamp: "2001-03-10 23:19:11 MST" -*-Text-*-
|
||||
# This document contains text in Perl "POD" format.
|
||||
# Use a POD viewer like perldoc or perlman to render it.
|
||||
|
||||
=head1 NAME
|
||||
|
||||
HTML::Tree::Scanning -- article: "Scanning HTML"
|
||||
|
||||
=head1 SYNOPSIS
|
||||
|
||||
  # This is an article, not a module.
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
The following article by Sean M. Burke first appeared in I<The Perl
|
||||
Journal> #19 and is copyright 2000 The Perl Journal. It appears
|
||||
courtesy of Jon Orwant and The Perl Journal. This document may be
|
||||
distributed under the same terms as Perl itself.
|
||||
|
||||
(Note that this is discussed in chapters 6 through 10 of the
|
||||
book I<Perl and LWP> L<http://lwp.interglacial.com/> which
|
||||
was written after the following documentation, and which is
|
||||
available free online.)
|
||||
|
||||
=head1 Scanning HTML
|
||||
|
||||
-- Sean M. Burke
|
||||
|
||||
In I<The Perl Journal> issue 17, Ken MacFarlane's article "Parsing
|
||||
HTML with HTML::Parser" describes how the HTML::Parser module scans
|
||||
HTML source as a stream of start-tags, end-tags, text, comments, etc.
|
||||
In TPJ #18, my "Trees" article kicked around the idea of tree-shaped
|
||||
data structures. Now I'll try to tie it together, in a discussion of
|
||||
HTML trees.
|
||||
|
||||
The CPAN module HTML::TreeBuilder takes the
|
||||
tags that HTML::Parser picks out, and builds a parse tree -- a
|
||||
tree-shaped network of objects...
|
||||
|
||||
=over
|
||||
|
||||
Footnote:
|
||||
And if you need a quick explanation of objects, see my TPJ17 article "A
|
||||
User's View of Object-Oriented Modules"; or go whole hog and get Damian
|
||||
Conway's excellent book I<Object-Oriented Perl>, from Manning
|
||||
Publications.
|
||||
|
||||
=back
|
||||
|
||||
...representing the structured content of the HTML document. And once
|
||||
the document is parsed as a tree, you'll find the common tasks
|
||||
of extracting data from that HTML document/tree to be quite
|
||||
straightforward.
|
||||
|
||||
=head2 HTML::Parser, HTML::TreeBuilder, and HTML::Element
|
||||
|
||||
You use HTML::TreeBuilder to make a parse tree out of an HTML source
|
||||
file, by simply saying:
|
||||
|
||||
use HTML::TreeBuilder;
|
||||
my $tree = HTML::TreeBuilder->new();
|
||||
$tree->parse_file('foo.html');
|
||||
|
||||
and then C<$tree> contains a parse tree built from the HTML source from
|
||||
the file "foo.html". The way this parse tree is represented is with a
|
||||
network of objects -- C<$tree> is the root, an element with tag-name
|
||||
"html", and its children typically include a "head" and "body" element,
|
||||
and so on. Elements in the tree are objects of the class
|
||||
HTML::Element.
|
||||
|
||||
So, if you take this source:
|
||||
|
||||
<html><head><title>Doc 1</title></head>
|
||||
<body>
|
||||
Stuff <hr> 2000-08-17
|
||||
</body></html>
|
||||
|
||||
and feed it to HTML::TreeBuilder, it'll return a tree of objects that
|
||||
looks like this:
|
||||
|
||||
               html
             /      \
         head        body
        /          /   |   \
     title    "Stuff"  hr   "2000-08-17"
       |
    "Doc 1"
|
||||
|
||||
This is a pretty simple document, but if it were any more complex,
|
||||
it'd be a bit hard to draw in that style, since it'd sprawl left and
|
||||
right. The same tree can be represented a bit more easily sideways,
|
||||
with indenting:
|
||||
|
||||
  . html
     . head
        . title
           . "Doc 1"
     . body
        . "Stuff"
        . hr
        . "2000-08-17"
|
||||
|
||||
Either way expresses the same structure. In that structure, the root
|
||||
node is an object of the class HTML::Element
|
||||
|
||||
=over
|
||||
|
||||
Footnote:
|
||||
Well actually, the root is of the class HTML::TreeBuilder, but that's
|
||||
just a subclass of HTML::Element, plus the few extra methods like
|
||||
C<parse_file> that elaborate the tree
|
||||
|
||||
=back
|
||||
|
||||
, with the tag name "html", and with two children: HTML::Element
objects whose tag names are "head" and "body". And each of those
elements has children, and so on down. Not all elements (as we'll
|
||||
call the objects of class HTML::Element) have children -- the "hr"
|
||||
element doesn't. And not all nodes in the tree are elements -- the
|
||||
text nodes ("Doc 1", "Stuff", and "2000-08-17") are just strings.
|
||||
|
||||
Objects of the class HTML::Element each have three noteworthy attributes:
|
||||
|
||||
=over
|
||||
|
||||
=item C<_tag> -- (best accessed as C<$e-E<gt>tag>)
|
||||
this element's tag-name, lowercased (e.g., "em" for an "em" element).
|
||||
|
||||
=over
|
||||
|
||||
Footnote: Yes, this is misnamed. In proper SGML terminology, this is
|
||||
instead called a "GI", short for "generic identifier"; and the term
|
||||
"tag" is used for a token of SGML source that represents either
|
||||
the start of an element (a start-tag like "<em lang='fr'>") or the end
|
||||
of an element (an end-tag like "</em>"). However, since more people
|
||||
claim to have been abducted by aliens than to have ever seen the
|
||||
SGML standard, and since both encounters typically involve a feeling of
|
||||
"missing time", it's not surprising that the terminology of the SGML
|
||||
standard is not closely followed.
|
||||
|
||||
=back
|
||||
|
||||
=item C<_parent> -- (best accessed as C<$e-E<gt>parent>)
|
||||
the element that is C<$e>'s parent, or undef if this element is the
|
||||
root of its tree.
|
||||
|
||||
=item C<_content> -- (best accessed as C<$e-E<gt>content_list>)
|
||||
the list of nodes (i.e., elements or text segments) that are C<$e>'s
|
||||
children.
|
||||
|
||||
=back
|
||||
|
||||
Moreover, if an element object has any attributes in the SGML sense of
|
||||
the word, then those are readable as C<$e-E<gt>attr('name')> -- for
|
||||
example, with the object built from having parsed "E<lt>a
|
||||
B<id='foo'>E<gt>barE<lt>/aE<gt>", C<$e-E<gt>attr('id')> will return
|
||||
the string "foo". Moreover, C<$e-E<gt>tag> on that object returns the
|
||||
string "a", C<$e-E<gt>content_list> returns a list consisting of just
|
||||
the single scalar "bar", and C<$e-E<gt>parent> returns the object
|
||||
that's this node's parent -- which may be, for example, a "p" element.
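
Here is a small sketch of those accessors at work, assuming HTML::TreeBuilder is installed from CPAN. The one-line HTML string is a made-up sample, and C<look_down> (explained later in this article) is used only to fetch the "a" element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse("<p><a id='foo'>bar</a></p>");   # made-up sample source
$tree->eof;                                   # no more input is coming

my $e = $tree->look_down('_tag', 'a');        # the "a" element
my @children = $e->content_list;

my ($id, $tag, $parent_tag) = ($e->attr('id'), $e->tag, $e->parent->tag);

print "tag:      $tag\n";
print "id:       $id\n";
print "children: @children\n";
print "parent:   $parent_tag\n";

$tree->delete;                                # release the tree
```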
|
||||
|
||||
And that's all that there is to it -- you throw HTML
|
||||
source at TreeBuilder, and it returns a tree built of HTML::Element
|
||||
objects and some text strings.
|
||||
|
||||
However, what do you I<do> with a tree of objects? People code
|
||||
information into HTML trees not for the fun of arranging elements, but
|
||||
to represent the structure of specific text and images -- some text is
|
||||
in this "li" element, some other text is in that heading, some
|
||||
images are in that other table cell that has those attributes, and so on.
|
||||
|
||||
Now, it may happen that you're rendering that whole HTML tree into some
|
||||
layout format. Or you could be trying to make some systematic change to
|
||||
the HTML tree before dumping it out as HTML source again. But, in my
|
||||
experience, by far the most common programming task that Perl
|
||||
programmers face with HTML is in trying to extract some piece
|
||||
of information from a larger document. Since that's so common (and
|
||||
also since it involves concepts that are basic to more complex tasks),
|
||||
that is what the rest of this article will be about.
|
||||
|
||||
=head2 Scanning HTML trees
|
||||
|
||||
Suppose you have a thousand HTML documents, each of them a press
|
||||
release. They all start out:
|
||||
|
||||
  [...lots of leading images and junk...]
  <h1>ConGlomCo to Open New Corporate Office in Ouagadougou</h1>
  BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in charge
  of world conquest, Rock Feldspar, announced today the opening of a
  new office in Ouagadougou, the capital city of Burkina Faso, gateway
  to the bustling "Silicon Sahara" of Africa...
  [...etc...]
|
||||
|
||||
...and what you've got to do is, for each document, copy whatever text
|
||||
is in the "h1" element, so that you can, for example, make a table of
|
||||
contents of it. Now, there are three ways to do this:
|
||||
|
||||
=over
|
||||
|
||||
=item * You can just use a regexp to scan the file for a text pattern.
|
||||
|
||||
For many very simple tasks, this will do fine. Many HTML documents are,
|
||||
in practice, very consistently formatted as far as placement of
|
||||
linebreaks and whitespace, so you could just get away with scanning the
|
||||
file like so:
|
||||
|
||||
  sub get_heading {
    my $filename = $_[0];
    local *HTML;
    open(HTML, $filename)
      or die "Couldn't open $filename: $!";
    my $heading;
   Line:
    while(<HTML>) {
      if( m{<h1>(.*?)</h1>}i ) {  # match it!
        $heading = $1;
        last Line;
      }
    }
    close(HTML);
    warn "No heading in $filename?"
     unless defined $heading;
    return $heading;
  }
|
||||
|
||||
This is quick and fast, but awfully fragile -- if there's a newline in
|
||||
the middle of a heading's text, it won't match the above regexp, and
|
||||
you'll get a "No heading" warning and no text. The regexp will also fail if the "h1" element's
|
||||
start-tag has any attributes. If you have to adapt your code to fit
|
||||
more kinds of start-tags, you'll end up basically reinventing part of
|
||||
HTML::Parser, at which point you should probably just stop, and use
|
||||
HTML::Parser itself:
|
||||
|
||||
=item * You can use HTML::Parser to scan the file for an "h1" start-tag
|
||||
token, then capture all the text tokens until the "h1" close-tag. This
|
||||
approach is extensively covered in Ken MacFarlane's TPJ17 article
|
||||
"Parsing HTML with HTML::Parser". (A variant of this approach is to use
|
||||
HTML::TokeParser, which presents a different and rather handier
|
||||
interface to the tokens that HTML::Parser picks out.)
|
||||
|
||||
Using HTML::Parser is less fragile than our first approach, since it's
|
||||
not sensitive to the exact internal formatting of the start-tag (much
|
||||
less whether it's split across two lines). However, when you need more
|
||||
information about the context of the "h1" element, or if you're having
|
||||
to deal with any of the tricky bits of HTML, such as parsing of tables,
|
||||
you'll find that the flat list of tokens that HTML::Parser returns
|
||||
isn't immediately useful. To get something useful out of those tokens,
|
||||
you'll need to write code that knows some things about what elements
|
||||
take no content (as with "hr" elements), and that "</p>" end-tags
|
||||
are omissible, so a "<p>" will end any currently
|
||||
open paragraph -- and you're well on your way to pointlessly
|
||||
reinventing much of the code in HTML::TreeBuilder
|
||||
|
||||
=over
|
||||
|
||||
Footnote:
|
||||
And, as the person who last rewrote that module, I can attest that it
|
||||
wasn't terribly easy to get right! Never underestimate the perversity
|
||||
of people coding HTML.
|
||||
|
||||
=back
|
||||
|
||||
, at which point you should probably just stop, and use
|
||||
HTML::TreeBuilder itself:
|
||||
|
||||
=item * You can use HTML::TreeBuilder, and scan the tree of element
|
||||
objects that you get back.
|
||||
|
||||
=back
|
||||
|
||||
The last approach, using HTML::TreeBuilder, is the diametric opposite of
|
||||
the first approach: The first approach involves just elementary Perl and one
|
||||
regexp, whereas the TreeBuilder approach involves being at home with
|
||||
the concept of tree-shaped data structures and modules with
|
||||
object-oriented interfaces, as well as with the particular interfaces
|
||||
that HTML::TreeBuilder and HTML::Element provide.
|
||||
|
||||
However, what the TreeBuilder approach has going for it is that it's
|
||||
the most robust, because it involves dealing with HTML in its "native"
|
||||
format -- it deals with the tree structure that HTML code represents,
|
||||
without any consideration of how the source is coded and with what
|
||||
tags omitted.
|
||||
|
||||
So, to extract the text from the "h1" elements of an HTML document:
|
||||
|
||||
sub get_heading {
|
||||
my $tree = HTML::TreeBuilder->new;
|
||||
$tree->parse_file($_[0]); # !
|
||||
my $heading;
|
||||
my $h1 = $tree->look_down('_tag', 'h1'); # !
|
||||
if($h1) {
|
||||
$heading = $h1->as_text; # !
|
||||
} else {
|
||||
warn "No heading in $_[0]?";
|
||||
}
|
||||
$tree->delete; # clear memory!
|
||||
return $heading;
|
||||
}
|
||||
|
||||
This uses some unfamiliar methods that need explaining. The
|
||||
C<parse_file> method, which we've seen before, builds a tree based on
|
||||
source from the file given. The C<delete> method is for marking a
|
||||
tree's contents as available for garbage collection, when you're done
|
||||
with the tree. The C<as_text> method returns a string that contains
|
||||
all the text bits that are children (or otherwise descendants) of the
|
||||
given node -- to get the text content of the C<$h1> object, we could
|
||||
just say:
|
||||
|
||||
$heading = join '', $h1->content_list;
|
||||
|
||||
but that will work only if we're sure that the "h1" element's children
|
||||
will be only text bits -- if the document contained:
|
||||
|
||||
<h1>Local Man Sees <cite>Blade</cite> Again</h1>
|
||||
|
||||
then the sub-tree would be:
|
||||
|
||||
  . h1
     . "Local Man Sees "
     . cite
        . "Blade"
     . " Again"
|
||||
|
||||
so C<join '', $h1-E<gt>content_list> will be something like:
|
||||
|
||||
Local Man Sees HTML::Element=HASH(0x15424040) Again
|
||||
|
||||
whereas C<$h1-E<gt>as_text> would yield:
|
||||
|
||||
Local Man Sees Blade Again
|
||||
|
||||
and depending on what you're doing with the heading text, you might
|
||||
want the C<as_HTML> method instead. It returns the (sub)tree
|
||||
represented as HTML source. C<$h1-E<gt>as_HTML> would yield:
|
||||
|
||||
<h1>Local Man Sees <cite>Blade</cite> Again</h1>
|
||||
|
||||
However, if you wanted the contents of C<$h1> as HTML, but not the
|
||||
C<$h1> itself, you could say:
|
||||
|
||||
join '',
|
||||
map(
|
||||
ref($_) ? $_->as_HTML : $_,
|
||||
$h1->content_list
|
||||
)
|
||||
|
||||
This C<map> iterates over the nodes in C<$h1>'s list of children; and
|
||||
for each node that's just a text bit (as "Local Man Sees " is), it just
|
||||
passes through that string value, and for each node that's an actual
|
||||
object (causing C<ref> to be true), C<as_HTML> will be used instead of the
|
||||
string value of the object itself (which would be something quite
|
||||
useless, as most object values are). So that C<as_HTML> for the "cite"
|
||||
element will be the string "<cite>BladeE<lt>/cite>". And then,
|
||||
finally, C<join> just puts into one string all the strings that the
|
||||
C<map> returns.
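
The difference between these ways of reading a heading can be tried out with a short sketch, assuming HTML::TreeBuilder is installed. The variable names are made up for illustration, and the exact whitespace that C<as_HTML> emits may vary between versions, so the comments don't promise exact output.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<h1>Local Man Sees <cite>Blade</cite> Again</h1>');
$tree->eof;
my $h1 = $tree->look_down('_tag', 'h1');

# The children are a mix of plain strings and element objects.
my @kids     = $h1->content_list;
my $elements = grep { ref $_ } @kids;

# All the text under the heading, however deeply nested.
my $text = $h1->as_text;

# The contents as HTML source, without the "h1" tags themselves.
my $inner = join '', map { ref($_) ? $_->as_HTML : $_ } @kids;

print "element children: $elements\n";
print "as_text: $text\n";
print "inner:   $inner\n";
$tree->delete;
```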
|
||||
|
||||
Last but not least, the most important method in our C<get_heading> sub
|
||||
is the C<look_down> method. This method looks down at the subtree
|
||||
starting at the given object (C<$tree>), looking for elements that meet
|
||||
criteria you provide.
|
||||
|
||||
The criteria are specified in the method's argument list. Each
|
||||
criterion can consist of two scalars, a key and a value, which express
|
||||
that you want elements that have that attribute (like "_tag", or
|
||||
"src") with the given value ("h1"); or the criterion can be a
|
||||
reference to a subroutine that, when called on the given element,
|
||||
returns true if that is a node you're looking for. If you specify
|
||||
several criteria, then that's taken to mean that you want all the
|
||||
elements that each satisfy I<all> the criteria. (In other words,
|
||||
there's an "implicit AND".)
|
||||
|
||||
And finally, there's a bit of an optimization -- if you call the
|
||||
C<look_down> method in a scalar context, you get just the I<first> node
|
||||
(or undef if none) -- and, in fact, once C<look_down> finds that first
|
||||
matching element, it doesn't bother looking any further.
|
||||
|
||||
So the example:
|
||||
|
||||
$h1 = $tree->look_down('_tag', 'h1');
|
||||
|
||||
returns the first element at-or-under C<$tree> whose C<"_tag">
|
||||
attribute has the value C<"h1">.
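
To make the robustness claim concrete, here is a sketch that runs the earlier line-by-line regexp and C<look_down> over the same source. It assumes HTML::TreeBuilder is installed; the sample document (an "h1" with an attribute and a newline in its text) is made up for the comparison.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input: an attribute in the start-tag, a newline in the text.
my $html = <<'END_HTML';
<html><body>
<h1 class="lead">ConGlomCo to Open
New Office</h1>
<p>BAKERSFIELD, CA ...</p>
</body></html>
END_HTML

# The regexp approach, applied line by line, finds nothing here.
my $by_regexp;
foreach my $line (split /^/, $html) {
    if ($line =~ m{<h1>(.*?)</h1>}i) { $by_regexp = $1; last }
}

# The tree approach doesn't care how the source was laid out.
my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;
my $h1      = $tree->look_down('_tag', 'h1');
my $by_tree = $h1 ? $h1->as_text : '';
$by_tree =~ s/\s+/ /g;                 # fold any leftover whitespace
$tree->delete;

print defined $by_regexp ? "regexp: $by_regexp\n" : "regexp: no match\n";
print "tree:   $by_tree\n";
```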
|
||||
|
||||
=head2 Complex Criteria in Tree Scanning
|
||||
|
||||
Now, the above C<look_down> code looks like a lot of bother, with
|
||||
barely more benefit than just grepping the file! But consider if your
|
||||
criteria were more complicated -- suppose you found that some of the
|
||||
press releases that you were scanning had several "h1" elements,
|
||||
possibly before or after the one you actually want. For example:
|
||||
|
||||
<h1><center>Visit Our Corporate Partner
|
||||
<br><a href="/dyna/clickthru"
|
||||
><img src="/dyna/vend_ad"></a>
|
||||
</center></h1>
|
||||
<h1><center>ConGlomCo President Schreck to Visit Regional HQ
|
||||
<br><a href="/photos/Schreck_visit_large.jpg"
|
||||
><img src="/photos/Schreck_visit.jpg"></a>
|
||||
</center></h1>
|
||||
|
||||
Here, you want to ignore the first "h1" element because it contains an
|
||||
ad, and you want the text from the second "h1". The problem is in
|
||||
formalizing the way you know that it's an ad. Since ad banners are
|
||||
always entreating you to "visit" the sponsoring site, you could exclude
|
||||
"h1" elements that contain the word "visit" under them:
|
||||
|
||||
my $real_h1 = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
sub {
|
||||
$_[0]->as_text !~ m/\bvisit/i
|
||||
}
|
||||
);
|
||||
|
||||
The first criterion looks for "h1" elements, and the second criterion
|
||||
limits those to only the ones whose text content doesn't match
|
||||
C<m/\bvisit/>. But unfortunately, that won't work for our example,
|
||||
since the second "h1" mentions "ConGlomCo President Schreck to
|
||||
I<Visit> Regional HQ".
|
||||
|
||||
Instead you could try looking for the first "h1" element that
|
||||
doesn't contain an image:
|
||||
|
||||
my $real_h1 = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
sub {
|
||||
not $_[0]->look_down('_tag', 'img')
|
||||
}
|
||||
);
|
||||
|
||||
This criterion sub might seem a bit odd, since it calls C<look_down>
|
||||
as part of a larger C<look_down> operation, but that's fine. Note that
|
||||
when considered as a boolean value, a C<look_down> in a scalar context
|
||||
returns false (specifically, undef) if there's no matching element
|
||||
at or under the given element; and it returns the first matching
|
||||
element (which, being a reference and object, is always a true value),
|
||||
if any matches. So, here,
|
||||
|
||||
sub {
|
||||
not $_[0]->look_down('_tag', 'img')
|
||||
}
|
||||
|
||||
means "return true only if this element has no 'img' elements as
|
||||
descendants (and isn't an 'img' element itself)."
|
||||
|
||||
This correctly filters out the first "h1" that contains the ad, but it
|
||||
also incorrectly filters out the second "h1" that contains a
|
||||
non-advertisement photo besides the headline text you want.
|
||||
|
||||
There clearly are detectable differences between the first and second
|
||||
"h1" elements -- only the second one contains the string "Schreck", and
|
||||
we could just test for that:
|
||||
|
||||
my $real_h1 = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
sub {
|
||||
$_[0]->as_text =~ m{Schreck}
|
||||
}
|
||||
);
|
||||
|
||||
And that works fine for this one example, but unless all thousand of
|
||||
your press releases have "Schreck" in the headline, that's just not a
|
||||
general solution. However, if all the ads-in-"h1"s that you want to
|
||||
exclude involve a link whose URL involves "/dyna/", then you can use
|
||||
that:
|
||||
|
||||
my $real_h1 = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
sub {
|
||||
my $link = $_[0]->look_down('_tag','a');
|
||||
return 1 unless $link;
|
||||
# no link means it's fine
|
||||
return 0 if $link->attr('href') =~ m{/dyna/};
|
||||
# a link to there is bad
|
||||
return 1; # otherwise okay
|
||||
}
|
||||
);
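
As a check on that "/dyna/" criterion, here is a self-contained sketch, assuming HTML::TreeBuilder is installed. The source is a trimmed-down version of the two-heading sample (the "center" and "br" elements are left out to keep it short), and the extra C<defined> test is an added guard for links that have no "href" attribute.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A trimmed-down version of the two-heading sample.
my $html = join '',
    '<h1>Visit Our Corporate Partner ',
    '<a href="/dyna/clickthru"><img src="/dyna/vend_ad"></a></h1>',
    '<h1>ConGlomCo President Schreck to Visit Regional HQ ',
    '<a href="/photos/Schreck_visit_large.jpg">',
    '<img src="/photos/Schreck_visit.jpg"></a></h1>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
        my $link = $_[0]->look_down('_tag', 'a');
        return 1 unless $link;                  # no link means it's fine
        my $href = $link->attr('href');
        return 0 if defined $href and $href =~ m{/dyna/};
        return 1;                               # otherwise okay
    }
);

my $headline = $real_h1 ? $real_h1->as_text : '';
print "$headline\n";
$tree->delete;
```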
|
||||
|
||||
Or you can look at it another way and say that you want the first "h1"
|
||||
element that either contains no images, or else whose image has a "src"
|
||||
attribute whose value contains "/photos/":
|
||||
|
||||
my $real_h1 = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
sub {
|
||||
my $img = $_[0]->look_down('_tag','img');
|
||||
return 1 unless $img;
|
||||
# no image means it's fine
|
||||
return 1 if $img->attr('src') =~ m{/photos/};
|
||||
# good if a photo
|
||||
return 0; # otherwise bad
|
||||
}
|
||||
);
|
||||
|
||||
Recall that this use of C<look_down> in a scalar context means to return
|
||||
the first element at or under C<$tree> that matches all the criteria.
|
||||
But if you notice that you can formulate criteria that'll match several
|
||||
possible "h1" elements, some of which may be bogus but the I<last> one
|
||||
of which is always the one you want, then you can use C<look_down> in a
|
||||
list context, and just use the last element of that list:
|
||||
|
||||
my @h1s = $tree->look_down(
|
||||
'_tag', 'h1',
|
||||
...maybe more criteria...
|
||||
);
|
||||
die "What, no h1s here?" unless @h1s;
|
||||
my $real_h1 = $h1s[-1]; # last or only
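
A runnable version of that list-context idiom might look like the sketch below, assuming HTML::TreeBuilder is installed. The three-heading source is made up, with the wanted heading placed last.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input: two bogus headings, then the one that's wanted.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<h1>Ad</h1><h1>Another ad</h1><h1>Real headline</h1>');
$tree->eof;

my @h1s = $tree->look_down('_tag', 'h1');   # list context: every match
die "What, no h1s here?" unless @h1s;

my $count    = @h1s;
my $real_h1  = $h1s[-1];                    # last or only
my $headline = $real_h1->as_text;

print "$count headings, using: $headline\n";
$tree->delete;
```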
|
||||
|
||||
=head2 A Case Study: Scanning Yahoo News's HTML
|
||||
|
||||
The above (somewhat contrived) case involves extracting data from a
|
||||
bunch of pre-existing HTML files. In that sort of situation, if your
|
||||
code works for all the files, then you know that the code I<works> --
|
||||
since the data it's meant to handle won't go changing or growing; and,
|
||||
typically, once you've used the program, you'll never need to use it
|
||||
again.
|
||||
|
||||
The other kind of situation faced in many data extraction tasks is
|
||||
where the program is used recurringly to handle new data -- such as
|
||||
from ever-changing Web pages. As a real-world example of this,
|
||||
consider a program that you could use (suppose it's crontabbed) to
|
||||
extract headline-links from subsections of Yahoo News
|
||||
(C<http://dailynews.yahoo.com/>).
|
||||
|
||||
Yahoo News has several subsections:
|
||||
|
||||
=over
|
||||
|
||||
=item http://dailynews.yahoo.com/h/tc/ for technology news
|
||||
|
||||
=item http://dailynews.yahoo.com/h/sc/ for science news
|
||||
|
||||
=item http://dailynews.yahoo.com/h/hl/ for health news
|
||||
|
||||
=item http://dailynews.yahoo.com/h/wl/ for world news
|
||||
|
||||
=item http://dailynews.yahoo.com/h/en/ for entertainment news
|
||||
|
||||
=back
|
||||
|
||||
and others. All of them are built on the same basic HTML template --
|
||||
and a scarily complicated template it is, especially when you look at
|
||||
it with an eye toward making up rules that will select where the real
|
||||
headline-links are, while screening out all the links to other parts of
|
||||
Yahoo, other news services, etc. You will need to puzzle
|
||||
over the HTML source, and scrutinize the output of
|
||||
C<$tree-E<gt>dump> on the parse tree of that HTML.
|
||||
|
||||
Sometimes the only way to pin down what you're after is by position in
|
||||
the tree. For example, headlines of interest may be in the third
|
||||
column of the second row of the second table element in a page:
|
||||
|
||||
  my $table = ( $tree->look_down('_tag','table') )[1];
  my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
  my $col3  = ( $row2->look_down('_tag', 'td')   )[2];
  ...then do things with $col3...
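
Selecting by position can be tried on a small made-up page, as in this sketch (assuming HTML::TreeBuilder is installed). The table contents are invented so that the selected cell is easy to recognize.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page: a navigation table, then the table of interest.
my $html = join '',
    '<table><tr><td>nav</td></tr></table>',
    '<table>',
    '<tr><td>r1c1</td><td>r1c2</td><td>r1c3</td></tr>',
    '<tr><td>r2c1</td><td>r2c2</td><td>r2c3</td></tr>',
    '</table>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# List-context look_down, indexed from zero: [1] is the second match.
my $table = ( $tree->look_down('_tag', 'table') )[1];
my $row2  = ( $table->look_down('_tag', 'tr')   )[1];
my $col3  = ( $row2->look_down('_tag', 'td')    )[2];

my $cell = $col3->as_text;
print "$cell\n";
$tree->delete;
```

Note that each index would be undef if the page had fewer tables, rows, or cells than expected, so real code would want to check each step before calling a method on it.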
|
||||
|
||||
Or they may be all the links in a "p" element that has at least three
|
||||
"br" elements as children:
|
||||
|
||||
my $p = $tree->look_down(
|
||||
'_tag', 'p',
|
||||
sub {
|
||||
2 < grep { ref($_) and $_->tag eq 'br' }
|
||||
$_[0]->content_list
|
||||
}
|
||||
);
|
||||
@links = $p->look_down('_tag', 'a');
|
||||
|
||||
But almost always, you can get away with looking for properties of the
|
||||
thing itself, rather than just looking for contexts. Now, if
|
||||
you're lucky, the document you're looking through has clear semantic
|
||||
tagging, such as is useful in CSS -- note the
|
||||
class="headlinelink" bit here:
|
||||
|
||||
<a href="...long_news_url..." class="headlinelink">Elvis
|
||||
seen in tortilla</a>
|
||||
|
||||
If you find anything like that, you could leap right in and select
|
||||
links with:
|
||||
|
||||
@links = $tree->look_down('class','headlinelink');
|
||||
|
||||
Regrettably, your chances of seeing any sort of semantic markup
|
||||
principles really being followed with actual HTML are pretty thin.
|
||||
|
||||
=over
|
||||
|
||||
Footnote:
|
||||
In fact, your chances of finding a page that is simply free of HTML
|
||||
errors are even thinner. And surprisingly, sites like Amazon or Yahoo
|
||||
are typically worse as far as quality of code than personal sites
|
||||
whose entire production cycle involves simply being saved and uploaded
|
||||
from Netscape Composer.
|
||||
|
||||
=back
|
||||
|
||||
The code may be sort of "accidentally semantic", however -- for example,
|
||||
in a set of pages I was scanning recently, I found that looking for
|
||||
"td" elements with a "width" attribute value of "375" got me exactly
|
||||
what I wanted. No-one designing that page ever conceived of
|
||||
"width=375" as I<meaning> "this is a headline", but if you impute it
|
||||
to mean that, it works.
|
||||
|
||||
An approach like this happens to work for the Yahoo News code, because
|
||||
the headline-links are distinguished by the fact that they (and they
|
||||
alone) contain a "b" element:
|
||||
|
||||
<a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>
|
||||
|
||||
or, diagrammed as a part of the parse tree:
|
||||
|
||||
  . a  [href="...long_news_url..."]
     . b
        . "Elvis seen in tortilla"
|
||||
|
||||
A rule that matches these can be formalized as "look for any 'a'
|
||||
element that has only one daughter node, which must be a 'b' element".
|
||||
And this is what it looks like when cooked up as a C<look_down>
|
||||
expression and prefaced with a bit of code that retrieves the text of
|
||||
the given Yahoo News page and feeds it to TreeBuilder:
|
||||
|
||||
use strict;
|
||||
use HTML::TreeBuilder 2.97;
|
||||
use LWP::UserAgent;
|
||||
sub get_headlines {
|
||||
my $url = $_[0] || die "What URL?";
|
||||
|
||||
my $response = LWP::UserAgent->new->request(
|
||||
HTTP::Request->new( GET => $url )
|
||||
);
|
||||
unless($response->is_success) {
|
||||
warn "Couldn't get $url: ", $response->status_line, "\n";
|
||||
return;
|
||||
}
|
||||
|
||||
my $tree = HTML::TreeBuilder->new();
|
||||
$tree->parse($response->content);
|
||||
$tree->eof;
|
||||
|
||||
my @out;
|
||||
foreach my $link (
|
||||
$tree->look_down( # !
|
||||
'_tag', 'a',
|
||||
sub {
|
||||
return unless $_[0]->attr('href');
|
||||
my @c = $_[0]->content_list;
|
||||
@c == 1 and ref $c[0] and $c[0]->tag eq 'b';
|
||||
}
|
||||
)
|
||||
) {
|
||||
push @out, [ $link->attr('href'), $link->as_text ];
|
||||
}
|
||||
|
||||
warn "Odd, fewer than 6 stories in $url!" if @out < 6;
|
||||
$tree->delete;
|
||||
return @out;
|
||||
}
|
||||
|
||||
...and add a bit of code to actually call that routine and display the
|
||||
results...
|
||||
|
||||
foreach my $section (qw[tc sc hl wl en]) {
|
||||
my @links = get_headlines(
|
||||
"http://dailynews.yahoo.com/h/$section/"
|
||||
);
|
||||
print
|
||||
$section, ": ", scalar(@links), " stories\n",
|
||||
map((" ", $_->[0], " : ", $_->[1], "\n"), @links),
|
||||
"\n";
|
||||
}
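
Since the live pages may have changed or gone away, the same criterion can be exercised offline. This sketch assumes HTML::TreeBuilder is installed and parses a made-up string instead of fetching a URL; the link targets and headline text are invented.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up markup: one headline-link, and two links that shouldn't match.
my $html = join '',
    '<p><a href="/h/tc/story1.html"><b>Elvis seen in tortilla</b></a></p>',
    '<p><a href="/other/">More <b>news</b></a></p>',
    '<p><a name="top"><b>Not a link</b></a></p>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

my @out;
foreach my $link (
    $tree->look_down(
        '_tag', 'a',
        sub {
            return unless $_[0]->attr('href');      # must be a real link
            my @c = $_[0]->content_list;
            # exactly one child, and that child is a "b" element
            @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
        }
    )
) {
    push @out, [ $link->attr('href'), $link->as_text ];
}

print "$_->[0] : $_->[1]\n" for @out;
$tree->delete;
```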
|
||||
|
||||
And we've got our own headline-extractor service! This in and of
|
||||
itself isn't so amazingly useful (since if you want to see the
|
||||
headlines, you I<can> just look at the Yahoo News pages), but it could
|
||||
easily be the basis for quite useful features like filtering the
|
||||
headlines for matching certain keywords of interest to you.
|
||||
|
||||
Now, one of these days, Yahoo News will decide to change its HTML
|
||||
template. When this happens, this will appear to the above program as
|
||||
there being no links that meet the given criteria; or, less likely,
|
||||
dozens of erroneous links will meet the criteria. In either case, the
|
||||
criteria will have to be changed for the new template; they may just
|
||||
need adjustment, or you may need to scrap them and start over.
|
||||
|
||||
=head2 I<Regardez, duvet!>
|
||||
|
||||
It's often quite a challenge to write criteria to match the desired
|
||||
parts of an HTML parse tree. Very often you I<can> pull it off with a
|
||||
simple C<$tree-E<gt>look_down('_tag', 'h1')>, but sometimes you do
|
||||
have to keep adding and refining criteria, until you might end up with
|
||||
complex filters like what I've shown in this article. The
|
||||
benefit to learning how to deal with HTML parse trees is that one main
|
||||
search tool, the C<look_down> method, can do most of the work, making
|
||||
simple things easy, while still making hard things possible.
|
||||
|
||||
B<[end body of article]>
|
||||
|
||||
=head2 [Author Credit]
|
||||
|
||||
Sean M. Burke (C<sburke@cpan.org>) is the current maintainer of
|
||||
C<HTML::TreeBuilder> and C<HTML::Element>, both originally by
|
||||
Gisle Aas.
|
||||
|
||||
Sean adds: "I'd like to thank the folks who listened to me ramble
|
||||
incessantly about HTML::TreeBuilder and HTML::Element at this year's Yet
|
||||
Another Perl Conference and O'Reilly Open Source Software Convention."
|
||||
|
||||
=head1 BACK
|
||||
|
||||
Return to the L<HTML::Tree|HTML::Tree> docs.
|
||||
|
||||
=cut
|
||||
|
||||