Initial Commit
310
database/perl/vendor/lib/libwww/lwpcook.pod
vendored
Normal file
@@ -0,0 +1,310 @@
|
||||
=head1 NAME
|
||||
|
||||
lwpcook - The libwww-perl cookbook
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
This document contains some examples that show typical usage of the
|
||||
libwww-perl library. You should consult the documentation for the
|
||||
individual modules for more detail.
|
||||
|
||||
All examples should be runnable programs. You can, in most cases, test
|
||||
the code sections by piping the program text directly to perl.
|
||||
|
||||
|
||||
|
||||
=head1 GET
|
||||
|
||||
It is very easy to use this library to just fetch documents from the
|
||||
net. The LWP::Simple module provides the get() function that returns
|
||||
the document specified by its URL argument:
|
||||
|
||||
use LWP::Simple;
|
||||
$doc = get 'http://search.cpan.org/dist/libwww-perl/';
|
||||
|
||||
or, as a perl one-liner using the getprint() function:
|
||||
|
||||
perl -MLWP::Simple -e 'getprint "http://search.cpan.org/dist/libwww-perl/"'
|
||||
|
||||
or, how about fetching the latest perl by running this command:
|
||||
|
||||
perl -MLWP::Simple -e '
|
||||
getstore "ftp://ftp.sunet.se/pub/lang/perl/CPAN/src/latest.tar.gz",
|
||||
"perl.tar.gz"'
|
||||
|
||||
You will probably first want to find a CPAN site closer to you by
|
||||
running something like the following command:
|
||||
|
||||
perl -MLWP::Simple -e 'getprint "http://www.cpan.org/SITES.html"'
|
||||
|
||||
Enough of this simple stuff! The LWP object-oriented interface gives
|
||||
you more control over the request sent to the server. Using this
|
||||
interface you have full control over headers sent and how you want to
|
||||
handle the response returned.
|
||||
|
||||
use LWP::UserAgent;
|
||||
$ua = LWP::UserAgent->new;
|
||||
$ua->agent("$0/0.1 " . $ua->agent);
|
||||
# $ua->agent("Mozilla/8.0"); # pretend we are a very capable browser
|
||||
|
||||
$req = HTTP::Request->new(
|
||||
GET => 'http://search.cpan.org/dist/libwww-perl/');
|
||||
$req->header('Accept' => 'text/html');
|
||||
|
||||
# send request
|
||||
$res = $ua->request($req);
|
||||
|
||||
# check the outcome
|
||||
if ($res->is_success) {
|
||||
print $res->decoded_content;
|
||||
}
|
||||
else {
|
||||
print "Error: " . $res->status_line . "\n";
|
||||
}
|
||||
|
||||
The lwp-request program (alias GET) that is distributed with the
|
||||
library can also be used to fetch documents from WWW servers.
|
||||
|
||||
|
||||
|
||||
=head1 HEAD
|
||||
|
||||
If you just want to check if a document is present (i.e. the URL is
|
||||
valid) try to run code that looks like this:
|
||||
|
||||
use LWP::Simple;
|
||||
|
||||
if (head($url)) {
|
||||
# ok document exists
|
||||
}
|
||||
|
||||
The head() function really returns a list of meta-information about
|
||||
the document. The first three values of the list returned are the
|
||||
document type, the size of the document, and its last-modified time.
|
||||
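As a rough illustration of that return list, the sketch below calls head() on a throwaway local file through a file: URL, so it can be tried without network access. The temporary file and its contents are invented for the example; the third value comes back as an epoch timestamp.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use LWP::Simple;
use URI::file;

# Create a small throwaway file to stand in for a remote document
# (hypothetical content, used only for this sketch).
my ($fh, $path) = tempfile(SUFFIX => '.txt', UNLINK => 1);
print {$fh} "Hello, libwww-perl!\n";
close $fh or die "Cannot close $path: $!";

my $url = URI::file->new($path);

# In list context head() returns content type, length, modification
# time, expiry time and server name; only the first three are used here.
my ($type, $length, $mtime) = head($url);
die "HEAD request for $url failed" unless defined $length;

print "type:   ", ($type // 'unknown'), "\n";
print "length: $length bytes\n";
print "mtime:  ", scalar(localtime $mtime), "\n";
```

An empty list signals failure, which is why the sketch tests one of the returned values rather than relying on the values being true.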
|
||||
More control over the request or access to all header values returned
|
||||
requires that you use the object-oriented interface described for GET
|
||||
above. Just s/GET/HEAD/g.
|
||||
|
||||
|
||||
=head1 POST
|
||||
|
||||
There is no simple procedural interface for posting data to a WWW server. You
|
||||
must use the object oriented interface for this. The most common POST
|
||||
operation is to access a WWW form application:
|
||||
|
||||
use LWP::UserAgent;
|
||||
$ua = LWP::UserAgent->new;
|
||||
|
||||
my $req = HTTP::Request->new(
|
||||
POST => 'https://rt.cpan.org/Public/Dist/Display.html');
|
||||
$req->content_type('application/x-www-form-urlencoded');
|
||||
$req->content('Status=Active&Name=libwww-perl');
|
||||
|
||||
my $res = $ua->request($req);
|
||||
print $res->as_string;
|
||||
|
||||
Lazy people use the HTTP::Request::Common module to set up a suitable
|
||||
POST request message; it handles all the escaping issues and has a
|
||||
suitable default for the content_type:
|
||||
|
||||
use HTTP::Request::Common qw(POST);
|
||||
use LWP::UserAgent;
|
||||
$ua = LWP::UserAgent->new;
|
||||
|
||||
my $req = POST 'https://rt.cpan.org/Public/Dist/Display.html',
|
||||
[ Status => 'Active', Name => 'libwww-perl' ];
|
||||
|
||||
print $ua->request($req)->as_string;
|
||||
|
||||
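To see what HTTP::Request::Common does with the form fields, the request can be built and inspected without sending it. This is only a sketch: the field values below are made up, and the ampersand and spaces are included to show the escaping.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Hypothetical form values; the '&' and the spaces need escaping.
my $req = POST 'https://rt.cpan.org/Public/Dist/Display.html',
    [ Status => 'Active', Name => 'libwww-perl & friends' ];

# Force scalar context: content_type() returns extra values in list context.
my $type = $req->content_type;

print $req->method, " ", $req->uri, "\n";
print "Content-Type: $type\n";
print "Body: ", $req->content, "\n";
```

Nothing is transmitted until the request object is handed to $ua->request(), so this is a convenient way to check the encoded body first.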
The lwp-request program (alias POST) that is distributed with the
|
||||
library can also be used for posting data.
|
||||
|
||||
|
||||
|
||||
=head1 PROXIES
|
||||
|
||||
Some sites use proxies to go through firewall machines, or just as
|
||||
a cache in order to improve performance. Proxies can also be used for
|
||||
accessing resources through protocols not supported directly (or
|
||||
supported badly :-) by the libwww-perl library.
|
||||
|
||||
You should initialize your proxy setting before you start sending
|
||||
requests:
|
||||
|
||||
use LWP::UserAgent;
|
||||
$ua = LWP::UserAgent->new;
|
||||
$ua->env_proxy; # initialize from environment variables
|
||||
# or
|
||||
$ua->proxy(ftp => 'http://proxy.myorg.com');
|
||||
$ua->proxy(wais => 'http://proxy.myorg.com');
|
||||
$ua->no_proxy(qw(no se fi));
|
||||
|
||||
my $req = HTTP::Request->new(GET => 'wais://xxx.com/');
|
||||
print $ua->request($req)->as_string;
|
||||
|
||||
The LWP::Simple interface will call env_proxy() for you automatically.
|
||||
Applications that use the $ua->env_proxy() method will normally not
|
||||
use the $ua->proxy() and $ua->no_proxy() methods.
|
||||
|
||||
Some proxies also require that you send them a username/password in
|
||||
order to let requests through. You should be able to add the
|
||||
required header, with something like this:
|
||||
|
||||
use LWP::UserAgent;
|
||||
|
||||
$ua = LWP::UserAgent->new;
|
||||
$ua->proxy(['http', 'ftp'] => 'http://username:password@proxy.myorg.com');
|
||||
|
||||
$req = HTTP::Request->new('GET',"http://www.perl.com");
|
||||
|
||||
$res = $ua->request($req);
|
||||
print $res->decoded_content if $res->is_success;
|
||||
|
||||
Replace C<proxy.myorg.com>, C<username> and
|
||||
C<password> with something suitable for your site.
|
||||
|
||||
|
||||
=head1 ACCESS TO PROTECTED DOCUMENTS
|
||||
|
||||
Documents protected by basic authorization can easily be accessed
|
||||
like this:
|
||||
|
||||
use LWP::UserAgent;
|
||||
$ua = LWP::UserAgent->new;
|
||||
$req = HTTP::Request->new(GET => 'http://www.linpro.no/secret/');
|
||||
$req->authorization_basic('aas', 'mypassword');
|
||||
print $ua->request($req)->as_string;
|
||||
|
||||
The other alternative is to provide a subclass of I<LWP::UserAgent> that
|
||||
overrides the get_basic_credentials() method. Study the I<lwp-request>
|
||||
program for an example of this.
|
||||
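A minimal sketch of such a subclass might look like the following. The package name MyAgent and the host check are assumptions made for illustration; only the get_basic_credentials() hook itself comes from LWP::UserAgent.

```perl
use strict;
use warnings;

package MyAgent;    # hypothetical subclass name
use parent 'LWP::UserAgent';

# LWP calls this method when a server (or proxy) answers with an
# authentication challenge.  An empty list means "no credentials".
sub get_basic_credentials {
    my ($self, $realm, $uri, $isproxy) = @_;
    return if $isproxy;    # this sketch has no proxy credentials
    return ('aas', 'mypassword') if $uri->host eq 'www.linpro.no';
    return;
}

package main;

my $ua = MyAgent->new;
# Requests made through $ua now supply the credentials on demand, e.g.:
#   my $res = $ua->get('http://www.linpro.no/secret/');
```

Compared with authorization_basic(), this keeps the password out of requests to servers that never ask for it.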
|
||||
|
||||
=head1 COOKIES
|
||||
|
||||
Some sites like to play games with cookies. By default LWP ignores
|
||||
cookies provided by the servers it visits. LWP will collect cookies
|
||||
and respond to cookie requests if you set up a cookie jar. LWP doesn't
|
||||
provide a cookie jar itself, but if you install L<HTTP::CookieJar::LWP>,
|
||||
it can be used like this:
|
||||
|
||||
use LWP::UserAgent;
|
||||
use HTTP::CookieJar::LWP;
|
||||
|
||||
$ua = LWP::UserAgent->new(
|
||||
cookie_jar => HTTP::CookieJar::LWP->new,
|
||||
);
|
||||
|
||||
# and then send requests just as you used to do
|
||||
$res = $ua->request(HTTP::Request->new(GET => "http://no.yahoo.com/"));
|
||||
print $res->status_line, "\n";
|
||||
|
||||
=head1 HTTPS
|
||||
|
||||
URLs with the https scheme are accessed in exactly the same way as those with the
|
||||
http scheme, provided that an SSL interface module for LWP has been
|
||||
properly installed (see the F<README.SSL> file found in the
|
||||
libwww-perl distribution for more details). If no SSL interface is
|
||||
installed for LWP to use, then you will get "501 Protocol scheme
|
||||
'https' is not supported" errors when accessing such URLs.
|
||||
|
||||
Here's an example of fetching and printing a WWW page using SSL:
|
||||
|
||||
use LWP::UserAgent;
|
||||
|
||||
my $ua = LWP::UserAgent->new;
|
||||
my $req = HTTP::Request->new(GET => 'https://www.helsinki.fi/');
|
||||
my $res = $ua->request($req);
|
||||
if ($res->is_success) {
|
||||
print $res->as_string;
|
||||
}
|
||||
else {
|
||||
print "Failed: ", $res->status_line, "\n";
|
||||
}
|
||||
|
||||
=head1 MIRRORING
|
||||
|
||||
If you want to mirror documents from a WWW server, then try to run
|
||||
code similar to this at regular intervals:
|
||||
|
||||
use LWP::Simple;
|
||||
|
||||
%mirrors = (
|
||||
'http://www.sn.no/' => 'sn.html',
|
||||
'http://www.perl.com/' => 'perl.html',
|
||||
'http://search.cpan.org/dist/libwww-perl/' => 'lwp.html',
|
||||
'gopher://gopher.sn.no/' => 'gopher.html',
|
||||
);
|
||||
|
||||
while (($url, $localfile) = each(%mirrors)) {
|
||||
mirror($url, $localfile);
|
||||
}
|
||||
|
||||
Or, as a perl one-liner:
|
||||
|
||||
perl -MLWP::Simple -e 'mirror("http://www.perl.com/", "perl.html")';
|
||||
|
||||
The document will not be transferred unless it has been updated.
|
||||
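The return value of mirror() is the HTTP status code, which makes the "not transferred" case visible. The sketch below mirrors a local file through a file: URL so that it runs offline; the file names are invented for the example, and it assumes the file: protocol handler answers conditional requests the way an HTTP server would.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use LWP::Simple;
use URI::file;

# Hypothetical source and target files inside a scratch directory.
my $dir    = tempdir(CLEANUP => 1);
my $source = "$dir/source.txt";
my $target = "$dir/copy.txt";

open my $fh, '>', $source or die "Cannot write $source: $!";
print {$fh} "some document\n";
close $fh or die "Cannot close $source: $!";

my $url = URI::file->new($source);

my $first  = mirror($url, $target);    # target missing: document is copied
my $second = mirror($url, $target);    # target up to date: nothing copied

print "first:  $first\n";
print "second: $second\n";
print "unchanged on second run\n" if $second == RC_NOT_MODIFIED;
```

A script run from cron can use the same comparison against RC_NOT_MODIFIED to skip any post-processing when nothing changed.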
|
||||
|
||||
|
||||
=head1 LARGE DOCUMENTS
|
||||
|
||||
If the document you want to fetch is too large to be kept in memory,
|
||||
then you have two alternatives. You can instruct the library to write
|
||||
the document content to a file (second $ua->request() argument is a file
|
||||
name):
|
||||
|
||||
use LWP::UserAgent;
|
||||
$ua = LWP::UserAgent->new;
|
||||
|
||||
my $req = HTTP::Request->new(GET =>
|
||||
'http://www.cpan.org/CPAN/authors/id/O/OA/OALDERS/libwww-perl-6.26.tar.gz');
|
||||
$res = $ua->request($req, "libwww-perl.tar.gz");
|
||||
if ($res->is_success) {
|
||||
print "ok\n";
|
||||
}
|
||||
else {
|
||||
print $res->status_line, "\n";
|
||||
}
|
||||
|
||||
|
||||
Or you can process the document as it arrives (second $ua->request()
|
||||
argument is a code reference):
|
||||
|
||||
use LWP::UserAgent;
|
||||
$ua = LWP::UserAgent->new;
|
||||
$URL = 'ftp://ftp.isc.org/pub/rfc/rfc-index.txt';
|
||||
|
||||
my $expected_length;
|
||||
my $bytes_received = 0;
|
||||
my $res =
|
||||
$ua->request(HTTP::Request->new(GET => $URL),
|
||||
sub {
|
||||
my($chunk, $res) = @_;
|
||||
$bytes_received += length($chunk);
|
||||
unless (defined $expected_length) {
|
||||
$expected_length = $res->content_length || 0;
|
||||
}
|
||||
if ($expected_length) {
|
||||
printf STDERR "%d%% - ",
|
||||
100 * $bytes_received / $expected_length;
|
||||
}
|
||||
print STDERR "$bytes_received bytes received\n";
|
||||
|
||||
# XXX Should really do something with the chunk itself
|
||||
# print $chunk;
|
||||
});
|
||||
print $res->status_line, "\n";
|
||||
|
||||
|
||||
|
||||
=head1 COPYRIGHT
|
||||
|
||||
Copyright 1996-2001, Gisle Aas
|
||||
|
||||
This library is free software; you can redistribute it and/or
|
||||
modify it under the same terms as Perl itself.
|
||||
|
||||
|
||||
820
database/perl/vendor/lib/libwww/lwptut.pod
vendored
Normal file
@@ -0,0 +1,820 @@
|
||||
=head1 NAME
|
||||
|
||||
lwptut -- An LWP Tutorial
|
||||
|
||||
=head1 DESCRIPTION
|
||||
|
||||
LWP (short for "Library for WWW in Perl") is a very popular group of
|
||||
Perl modules for accessing data on the Web. Like most Perl
|
||||
module-distributions, each of LWP's component modules comes with
|
||||
documentation that is a complete reference to its interface. However,
|
||||
there are so many modules in LWP that it's hard to know where to start
|
||||
looking for information on how to do even the simplest, most common
|
||||
things.
|
||||
|
||||
Really introducing you to using LWP would require a whole book -- a book
|
||||
that just happens to exist, called I<Perl & LWP>. But this article
|
||||
should give you a taste of how you can go about some common tasks with
|
||||
LWP.
|
||||
|
||||
|
||||
=head2 Getting documents with LWP::Simple
|
||||
|
||||
If you just want to get what's at a particular URL, the simplest way
|
||||
to do it is with LWP::Simple's functions.
|
||||
|
||||
In a Perl program, you can call its C<get($url)> function. It will try
|
||||
getting that URL's content. If it works, then it'll return the
|
||||
content; but if there's some error, it'll return undef.
|
||||
|
||||
my $url = 'http://www.npr.org/programs/fa/?todayDate=current';
|
||||
# Just an example: the URL for the most recent /Fresh Air/ show
|
||||
|
||||
use LWP::Simple;
|
||||
my $content = get $url;
|
||||
die "Couldn't get $url" unless defined $content;
|
||||
|
||||
# Then go do things with $content, like this:
|
||||
|
||||
if($content =~ m/jazz/i) {
|
||||
print "They're talking about jazz today on Fresh Air!\n";
|
||||
}
|
||||
else {
|
||||
print "Fresh Air is apparently jazzless today.\n";
|
||||
}
|
||||
|
||||
The handiest variant on C<get> is C<getprint>, which is useful in Perl
|
||||
one-liners. If it can get the page whose URL you provide, it sends it
|
||||
to STDOUT; otherwise it complains to STDERR.
|
||||
|
||||
% perl -MLWP::Simple -e "getprint 'http://www.cpan.org/RECENT'"
|
||||
|
||||
That is the URL of a plain text file that lists new files in CPAN in
|
||||
the past two weeks. You can easily make it part of a tidy little
|
||||
shell command, like this one that mails you the list of new
|
||||
C<Acme::> modules:
|
||||
|
||||
% perl -MLWP::Simple -e "getprint 'http://www.cpan.org/RECENT'" \
|
||||
| grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER
|
||||
|
||||
There are other useful functions in LWP::Simple, including one function
|
||||
for running a HEAD request on a URL (useful for checking links, or
|
||||
getting the last-revised time of a URL), and two functions for
|
||||
saving/mirroring a URL to a local file. See L<the LWP::Simple
|
||||
documentation|LWP::Simple> for the full details, or chapter 2 of I<Perl
|
||||
& LWP> for more examples.
|
||||
|
||||
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
|
||||
|
||||
=head2 The Basics of the LWP Class Model
|
||||
|
||||
LWP::Simple's functions are handy for simple cases, but its functions
|
||||
don't support cookies or authorization, don't support setting header
|
||||
lines in the HTTP request, and generally don't support reading header lines
|
||||
in the HTTP response (notably the full HTTP error message, in case of an
|
||||
error). To get at all those features, you'll have to use the full LWP
|
||||
class model.
|
||||
|
||||
While LWP consists of dozens of classes, the main two that you have to
|
||||
understand are L<LWP::UserAgent> and L<HTTP::Response>. LWP::UserAgent
|
||||
is a class for "virtual browsers" which you use for performing requests,
|
||||
and L<HTTP::Response> is a class for the responses (or error messages)
|
||||
that you get back from those requests.
|
||||
|
||||
The basic idiom is C<< $response = $browser->get($url) >>, or more fully
|
||||
illustrated:
|
||||
|
||||
# Early in your program:
|
||||
|
||||
use LWP 5.64; # Loads all important LWP classes, and makes
|
||||
# sure your version is reasonably recent.
|
||||
|
||||
my $browser = LWP::UserAgent->new;
|
||||
|
||||
...
|
||||
|
||||
# Then later, whenever you need to make a get request:
|
||||
my $url = 'http://www.npr.org/programs/fa/?todayDate=current';
|
||||
|
||||
my $response = $browser->get( $url );
|
||||
die "Can't get $url -- ", $response->status_line
|
||||
unless $response->is_success;
|
||||
|
||||
die "Hey, I was expecting HTML, not ", $response->content_type
|
||||
unless $response->content_type eq 'text/html';
|
||||
# or whatever content-type you're equipped to deal with
|
||||
|
||||
# Otherwise, process the content somehow:
|
||||
|
||||
if($response->decoded_content =~ m/jazz/i) {
|
||||
print "They're talking about jazz today on Fresh Air!\n";
|
||||
}
|
||||
else {
|
||||
print "Fresh Air is apparently jazzless today.\n";
|
||||
}
|
||||
|
||||
There are two objects involved: C<$browser>, which holds an object of
|
||||
class LWP::UserAgent, and then the C<$response> object, which is of
|
||||
class HTTP::Response. You really need only one browser object per
|
||||
program; but every time you make a request, you get back a new
|
||||
HTTP::Response object, which will have some interesting attributes:
|
||||
|
||||
=over
|
||||
|
||||
=item *
|
||||
|
||||
A status code indicating
|
||||
success or failure
|
||||
(which you can test with C<< $response->is_success >>).
|
||||
|
||||
=item *
|
||||
|
||||
An HTTP status
|
||||
line that is hopefully informative if there's failure (which you can
|
||||
see with C<< $response->status_line >>,
|
||||
returning something like "404 Not Found").
|
||||
|
||||
=item *
|
||||
|
||||
A MIME content-type like "text/html", "image/gif",
|
||||
"application/xml", etc., which you can see with
|
||||
C<< $response->content_type >>
|
||||
|
||||
=item *
|
||||
|
||||
The actual content of the response, in C<< $response->decoded_content >>.
|
||||
If the response is HTML, that's where the HTML source will be; if
|
||||
it's a GIF, then C<< $response->decoded_content >> will be the binary
|
||||
GIF data.
|
||||
|
||||
=item *
|
||||
|
||||
And dozens of other convenient and more specific methods that are
|
||||
documented in the docs for L<HTTP::Response>, and its superclasses
|
||||
L<HTTP::Message> and L<HTTP::Headers>.
|
||||
|
||||
=back
|
||||
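The accessors listed above can be tried without any network traffic by constructing a response by hand. This is only a sketch: the status, header and body below are invented, whereas in a real program the object would come back from a call such as C<< $browser->get($url) >>.

```perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand (hypothetical status, header and body).
my $response = HTTP::Response->new(
    404, 'Not Found',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    '<html><body>No such page</body></html>',
);

# Scalar context gives just the media type, without its parameters.
my $type = $response->content_type;

print "success?     ", ($response->is_success ? 'yes' : 'no'), "\n";
print "status line: ", $response->status_line, "\n";
print "type:        $type\n";
print "content:     ", $response->decoded_content, "\n";
```

Note that C<content_type> is assigned to a scalar here: in list context it also returns the parameters that follow the media type, such as the charset.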
|
||||
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
|
||||
|
||||
=head2 Adding Other HTTP Request Headers
|
||||
|
||||
The most commonly used syntax for requests is C<< $response =
|
||||
$browser->get($url) >>, but in truth, you can add extra HTTP header
|
||||
lines to the request by adding a list of key-value pairs after the URL,
|
||||
like so:
|
||||
|
||||
$response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );
|
||||
|
||||
For example, here's how to send some commonly used headers, in case
|
||||
you're dealing with a site that would otherwise reject your request:
|
||||
|
||||
|
||||
my @ns_headers = (
|
||||
'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
|
||||
'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
|
||||
'Accept-Charset' => 'iso-8859-1,*,utf-8',
|
||||
'Accept-Language' => 'en-US',
|
||||
);
|
||||
|
||||
...
|
||||
|
||||
$response = $browser->get($url, @ns_headers);
|
||||
|
||||
If you weren't reusing that array, you could just go ahead and do this:
|
||||
|
||||
$response = $browser->get($url,
|
||||
'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
|
||||
'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
|
||||
'Accept-Charset' => 'iso-8859-1,*,utf-8',
|
||||
'Accept-Language' => 'en-US',
|
||||
);
|
||||
|
||||
If you were only ever changing the 'User-Agent' line, you could just change
|
||||
the C<$browser> object's default line from "libwww-perl/5.65" (or the like)
|
||||
to whatever you like, using the LWP::UserAgent C<agent> method:
|
||||
|
||||
$browser->agent('Mozilla/4.76 [en] (Win98; U)');
|
||||
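To check which headers actually go out, one option is to intercept the request before it is sent. The sketch below assumes a C<request_send> handler registered with C<add_handler>; because the handler returns a response itself, no connection is made. The agent string and the C<example.invalid> host are placeholders.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

my $browser = LWP::UserAgent->new;
$browser->agent('SomeName/1.23');    # hypothetical agent string

# Capture the outgoing request instead of sending it: a request_send
# handler that returns a response short-circuits the network access.
my $sent;
$browser->add_handler(request_send => sub {
    my ($request) = @_;
    $sent = $request;
    return HTTP::Response->new(200, 'OK');
});

# example.invalid is a placeholder host; it is never contacted.
$browser->get('http://example.invalid/', 'Accept-Language' => 'en-US');

print $sent->headers->as_string;
```

The captured request carries both the per-request header passed to C<get> and the default "User-Agent" line set through C<agent>.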
|
||||
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
|
||||
|
||||
=head2 Enabling Cookies
|
||||
|
||||
A default LWP::UserAgent object acts like a browser with its cookie
|
||||
support turned off. There are various ways of turning it on, by setting
|
||||
its C<cookie_jar> attribute. A "cookie jar" is an object representing
|
||||
a little database of all
|
||||
the HTTP cookies that a browser knows about. It can correspond to a
|
||||
file on disk or
|
||||
an in-memory object that starts out empty, and whose collection of
|
||||
cookies will disappear once the program is finished running.
|
||||
|
||||
To give a browser an in-memory empty cookie jar, you set its C<cookie_jar>
|
||||
attribute like so:
|
||||
|
||||
use HTTP::CookieJar::LWP;
|
||||
$browser->cookie_jar( HTTP::CookieJar::LWP->new );
|
||||
|
||||
To save a cookie jar to disk, see L<< HTTP::CookieJar/dump_cookies >>.
|
||||
To load cookies from disk into a jar, see L<<
|
||||
HTTP::CookieJar/load_cookies >>.
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
|
||||
|
||||
=head2 Posting Form Data
|
||||
|
||||
Many HTML forms send data to their server using an HTTP POST request, which
|
||||
you can send with this syntax:
|
||||
|
||||
$response = $browser->post( $url,
|
||||
[
|
||||
formkey1 => value1,
|
||||
formkey2 => value2,
|
||||
...
|
||||
],
|
||||
);
|
||||
|
||||
Or if you need to send HTTP headers:
|
||||
|
||||
$response = $browser->post( $url,
|
||||
[
|
||||
formkey1 => value1,
|
||||
formkey2 => value2,
|
||||
...
|
||||
],
|
||||
headerkey1 => value1,
|
||||
headerkey2 => value2,
|
||||
);
|
||||
|
||||
For example, the following program makes a search request to AltaVista
|
||||
(by sending some form data via an HTTP POST request), and extracts from
|
||||
the HTML the report of the number of matches:
|
||||
|
||||
use strict;
|
||||
use warnings;
|
||||
use LWP 5.64;
|
||||
my $browser = LWP::UserAgent->new;
|
||||
|
||||
my $word = 'tarragon';
|
||||
|
||||
my $url = 'http://search.yahoo.com/yhs/search';
|
||||
my $response = $browser->post( $url,
|
||||
[ 'q' => $word, # the Altavista query string
|
||||
'fr' => 'altavista', 'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
|
||||
]
|
||||
);
|
||||
die "$url error: ", $response->status_line
|
||||
unless $response->is_success;
|
||||
die "Weird content type at $url -- ", $response->content_type
|
||||
unless $response->content_is_html;
|
||||
|
||||
if( $response->decoded_content =~ m{([0-9,]+)(?:<.*?>)? results for} ) {
|
||||
# The substring will be like "996,000</strong> results for"
|
||||
print "$word: $1\n";
|
||||
}
|
||||
else {
|
||||
print "Couldn't find the match-string in the response\n";
|
||||
}
|
||||
|
||||
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
|
||||
|
||||
=head2 Sending GET Form Data
|
||||
|
||||
Some HTML forms convey their form data not by sending the data
|
||||
in an HTTP POST request, but by making a normal GET request with
|
||||
the data stuck on the end of the URL. For example, if you went to
|
||||
C<www.imdb.com> and ran a search on "Blade Runner", the URL you'd see
|
||||
in your browser window would be:
|
||||
|
||||
http://www.imdb.com/find?s=all&q=Blade+Runner
|
||||
|
||||
To run the same search with LWP, you'd use this idiom, which involves
|
||||
the URI class:
|
||||
|
||||
use URI;
|
||||
my $url = URI->new( 'http://www.imdb.com/find' );
|
||||
# makes an object representing the URL
|
||||
|
||||
$url->query_form( # And here the form data pairs:
|
||||
'q' => 'Blade Runner',
|
||||
's' => 'all',
|
||||
);
|
||||
|
||||
my $response = $browser->get($url);
|
||||
|
||||
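Printing the URI object shows exactly what C<query_form> produced, which is handy when comparing against the URL seen in a real browser. A small offline sketch, using the same IMDb search as above:

```perl
use strict;
use warnings;
use URI;

my $url = URI->new('http://www.imdb.com/find');
$url->query_form(
    'q' => 'Blade Runner',
    's' => 'all',
);

# A URI object stringifies to the full URL.
print "$url\n";    # http://www.imdb.com/find?q=Blade+Runner&s=all

# Called without arguments, query_form returns the decoded pairs.
my %form = $url->query_form;
print "q is '$form{q}'\n";
```

The pairs appear in the order they were given, so the result can differ from the browser's URL in parameter order while still describing the same form data.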
See chapter 5 of I<Perl & LWP> for a longer discussion of HTML forms
|
||||
and of form data, and chapters 6 through 9 for a longer discussion of
|
||||
extracting data from HTML.
|
||||
|
||||
|
||||
|
||||
=head2 Absolutizing URLs
|
||||
|
||||
The URI class that we just mentioned above provides all sorts of methods
|
||||
for accessing and modifying parts of URLs (such as asking what sort of URL it
|
||||
is with C<< $url->scheme >>, and asking what host it refers to with C<<
|
||||
$url->host >>, and so on, as described in L<the docs for the URI
|
||||
class|URI>). However, the methods of most immediate interest
|
||||
are the C<query_form> method seen above, and now the C<new_abs> method
|
||||
for taking a probably-relative URL string (like "../foo.html") and getting
|
||||
back an absolute URL (like "http://www.perl.com/stuff/foo.html"), as
|
||||
shown here:
|
||||
|
||||
use URI;
|
||||
$abs = URI->new_abs($maybe_relative, $base);
|
||||
|
||||
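A few cases show how C<new_abs> treats different kinds of links. The base URL below is a made-up example chosen so that "../foo.html" resolves to the URL mentioned above.

```perl
use strict;
use warnings;
use URI;

# Hypothetical base URL for the page the links were found on.
my $base = 'http://www.perl.com/stuff/docs/index.html';

my @links = (
    '../foo.html',                   # relative, up one directory
    'bar.html',                      # relative, same directory
    '/top.html',                     # relative to the server root
    'http://www.cpan.org/RECENT',    # already absolute
);

for my $link (@links) {
    my $abs = URI->new_abs($link, $base);
    print "$link => $abs\n";
}
```

An already absolute URL comes back unchanged, so it is safe to pass every link through C<new_abs> without checking it first.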
For example, consider this program that matches URLs in the HTML
|
||||
list of new modules in CPAN:
|
||||
|
||||
use strict;
|
||||
use warnings;
|
||||
use LWP;
|
||||
my $browser = LWP::UserAgent->new;
|
||||
|
||||
my $url = 'http://www.cpan.org/RECENT.html';
|
||||
my $response = $browser->get($url);
|
||||
die "Can't get $url -- ", $response->status_line
|
||||
unless $response->is_success;
|
||||
|
||||
my $html = $response->decoded_content;
|
||||
while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
|
||||
print "$1\n";
|
||||
}
|
||||
|
||||
When run, it emits output that starts out something like this:
|
||||
|
||||
MIRRORING.FROM
|
||||
RECENT
|
||||
RECENT.html
|
||||
authors/00whois.html
|
||||
authors/01mailrc.txt.gz
|
||||
authors/id/A/AA/AASSAD/CHECKSUMS
|
||||
...
|
||||
|
||||
However, if you actually want to have those be absolute URLs, you
|
||||
can use the URI module's C<new_abs> method, by changing the C<while>
|
||||
loop to this:
|
||||
|
||||
while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
|
||||
print URI->new_abs( $1, $response->base ) ,"\n";
|
||||
}
|
||||
|
||||
(The C<< $response->base >> method from L<HTTP::Message|HTTP::Message>
|
||||
is for returning what URL
|
||||
should be used for resolving relative URLs -- it's usually just
|
||||
the same as the URL that you requested.)
|
||||
|
||||
That program then emits nicely absolute URLs:
|
||||
|
||||
http://www.cpan.org/MIRRORING.FROM
|
||||
http://www.cpan.org/RECENT
|
||||
http://www.cpan.org/RECENT.html
|
||||
http://www.cpan.org/authors/00whois.html
|
||||
http://www.cpan.org/authors/01mailrc.txt.gz
|
||||
http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
|
||||
...
|
||||
|
||||
See chapter 4 of I<Perl & LWP> for a longer discussion of URI objects.
|
||||
|
||||
Of course, using a regexp to match hrefs is a bit simplistic, and for
|
||||
more robust programs, you'll probably want to use an HTML-parsing module
|
||||
like L<HTML::LinkExtor> or L<HTML::TokeParser> or even maybe
|
||||
L<HTML::TreeBuilder>.
|
||||
|
||||
|
||||
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
=head2 Other Browser Attributes
|
||||
|
||||
LWP::UserAgent objects have many attributes for controlling how they
|
||||
work. Here are a few notable ones:
|
||||
|
||||
=over
|
||||
|
||||
=item *
|
||||
|
||||
C<< $browser->timeout(15); >>
|
||||
|
||||
This sets this browser object to give up on requests that don't answer
|
||||
within 15 seconds.
|
||||
|
||||
|
||||
=item *
|
||||
|
||||
C<< $browser->protocols_allowed( [ 'http', 'gopher'] ); >>
|
||||
|
||||
This sets this browser object to not speak any protocols other than HTTP
|
||||
and gopher. If it tries accessing any other kind of URL (like an "ftp:"
|
||||
or "mailto:" or "news:" URL), then it won't actually try connecting, but
|
||||
instead will immediately return an error code 500, with a message like
|
||||
"Access to 'ftp' URIs has been disabled".
|
||||
|
||||
|
||||
=item *
|
||||
|
||||
C<< use LWP::ConnCache; $browser->conn_cache(LWP::ConnCache->new()); >>
|
||||
|
||||
This tells the browser object to try using the HTTP/1.1 "Keep-Alive"
|
||||
feature, which speeds up requests by reusing the same socket connection
|
||||
for multiple requests to the same server.
|
||||
|
||||
|
||||
=item *
|
||||
|
||||
C<< $browser->agent( 'SomeName/1.23 (more info here maybe)' ) >>
|
||||
|
||||
This changes how the browser object will identify itself in
|
||||
the default "User-Agent" line in its HTTP requests. By default,
|
||||
it'll send "libwww-perl/I<versionnumber>", like
|
||||
"libwww-perl/5.65". You can change that to something more descriptive
|
||||
like this:
|
||||
|
||||
$browser->agent( 'SomeName/3.14 (contact@robotplexus.int)' );
|
||||
|
||||
Or if need be, you can go in disguise, like this:
|
||||
|
||||
$browser->agent( 'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)' );
|
||||
|
||||
|
||||
=item *
|
||||
|
||||
C<< push @{ $browser->requests_redirectable }, 'POST'; >>
|
||||
|
||||
This tells this browser to obey redirection responses to POST requests
|
||||
(like most modern interactive browsers), even though the HTTP RFC says
|
||||
that should not normally be done.
|
||||
|
||||
|
||||
=back
|
||||
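The effect of C<protocols_allowed> is easy to observe, because a disallowed request fails before any connection is attempted. A minimal sketch, with C<ftp.example.com> as a placeholder host:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->timeout(15);
$browser->protocols_allowed( [ 'http', 'gopher' ] );

# ftp is not in the allowed list, so no connection is attempted;
# ftp.example.com is a placeholder host.
my $response = $browser->get('ftp://ftp.example.com/README');

print $response->status_line, "\n";
```

Since the refusal arrives as an ordinary response object, the usual C<is_success> check covers it without special handling.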
|
||||
|
||||
For more options and information, see L<the full documentation for
|
||||
LWP::UserAgent|LWP::UserAgent>.
|
||||
|
||||
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
|
||||
|
||||
=head2 Writing Polite Robots
|
||||
|
||||
If you want to make sure that your LWP-based program respects F<robots.txt>
|
||||
files and doesn't make too many requests too fast, you can use the LWP::RobotUA
|
||||
class instead of the LWP::UserAgent class.
|
||||
|
||||
The LWP::RobotUA class is used just like LWP::UserAgent, and you can use it like so:
|
||||
|
||||
use LWP::RobotUA;
|
||||
my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');
|
||||
# Your bot's name and your email address
|
||||
|
||||
my $response = $browser->get($url);
|
||||
|
||||
But LWP::RobotUA adds these features:
|
||||
|
||||
|
||||
=over
|
||||
|
||||
=item *
|
||||
|
||||
If the F<robots.txt> on C<$url>'s server forbids you from accessing
|
||||
C<$url>, then the C<$browser> object (assuming it's of class LWP::RobotUA)
|
||||
won't actually request it, but instead will give you back (in C<$response>) a 403 error
|
||||
with a message "Forbidden by robots.txt". That is, if you have this line:
|
||||
|
||||
die "$url -- ", $response->status_line, "\nAborted"
|
||||
unless $response->is_success;
|
||||
|
||||
then the program would die with an error message like this:
|
||||
|
||||
http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt
|
||||
Aborted at whateverprogram.pl line 1234
|
||||
|
||||
=item *
|
||||
|
||||
If this C<$browser> object sees that the last time it talked to
|
||||
C<$url>'s server was too recently, then it will pause (via C<sleep>) to
|
||||
avoid making too many requests too often. The length of the pause is
|
||||
by default one minute -- but you can control it with the C<<
|
||||
$browser->delay( I<minutes> ) >> attribute.
|
||||
|
||||
For example, this code:
|
||||
|
||||
$browser->delay( 7/60 );
|
||||
|
||||
...means that this browser will pause when it needs to avoid talking to
|
||||
any given server more than once every 7 seconds.
|
||||
|
||||
=back
|
||||
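Because C<delay> is expressed in minutes, it can help to print the setting back in seconds when configuring a robot. The sketch below only constructs the object and reads its attributes, so it makes no requests; the bot name and address are the placeholders used above.

```perl
use strict;
use warnings;
use LWP::RobotUA;

my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');

$browser->delay( 7/60 );    # the delay is given in minutes

printf "minimum gap between requests to one server: %.0f seconds\n",
    $browser->delay * 60;

print "agent: ", $browser->agent, "\n";
print "from:  ", $browser->from, "\n";
```

The name and email address passed to the constructor are what a site administrator would see in the "User-Agent" and "From" request headers.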
|
||||
For more options and information, see L<the full documentation for
|
||||
LWP::RobotUA|LWP::RobotUA>.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
=head2 Using Proxies
|
||||
|
||||
In some cases, you will want to (or will have to) use proxies for
|
||||
accessing certain sites and/or using certain protocols. This is most
|
||||
commonly the case when your LWP program is running (or could be running)
|
||||
on a machine that is behind a firewall.
|
||||
|
||||
To make a browser object use proxies that are defined in the usual
|
||||
environment variables (C<HTTP_PROXY>, etc.), just call the C<env_proxy>
|
||||
method on a user-agent object before you go making any requests on it.
|
||||
Specifically:
|
||||
|
||||
use LWP::UserAgent;
|
||||
my $browser = LWP::UserAgent->new;
|
||||
|
||||
# And before you go making any requests:
|
||||
$browser->env_proxy;
|
||||
|
||||
For more information on proxy parameters, see L<the LWP::UserAgent
|
||||
documentation|LWP::UserAgent>, specifically the C<proxy>, C<env_proxy>,
|
||||
and C<no_proxy> methods.
|
||||
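To confirm what C<env_proxy> picked up, the C<proxy> method can be called with just a scheme name. In this sketch the environment variables are set inside the script purely so that it is self-contained; the proxy host and port are made up, and normally these values would come from the shell.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Start from a clean slate so proxy settings inherited from the shell
# do not interfere, then set hypothetical values for this sketch.
delete @ENV{ grep { /_proxy\z/i } keys %ENV };
$ENV{http_proxy} = 'http://proxy.myorg.com:3128';
$ENV{no_proxy}   = 'localhost';

my $browser = LWP::UserAgent->new;
$browser->env_proxy;

# Called with just a scheme, proxy() reports the current setting.
print "http proxy: ", $browser->proxy('http'), "\n";
```

Changes to C<%ENV> made this way affect only the running program and any processes it starts.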
|
||||
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
=head2 HTTP Authentication
|
||||
|
||||
Many web sites restrict access to documents by using "HTTP
|
||||
Authentication". This isn't just any form of "enter your password"
|
||||
restriction, but is a specific mechanism where the HTTP server sends the
|
||||
browser an HTTP code that says "That document is part of a protected
|
||||
'realm', and you can access it only if you re-request it and add some
|
||||
special authorization headers to your request".
|
||||
|
||||
For example, the Unicode.org admins stop email-harvesting bots from
|
||||
harvesting the contents of their mailing list archives, by protecting
|
||||
them with HTTP Authentication, and then publicly stating the username
|
||||
and password (at C<http://www.unicode.org/mail-arch/>) -- namely
|
||||
username "unicode-ml" and password "unicode".
|
||||
|
||||
Consider this URL, which is part of the protected
|
||||
area of the web site:
|
||||
|
||||
http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
|
||||
|
||||
If you access that with a browser, you'll get a prompt
|
||||
like
|
||||
"Enter username and password for 'Unicode-MailList-Archives' at server
|
||||
'www.unicode.org'".
|
||||
|
||||
In LWP, if you just request that URL, like this:
|
||||
|
||||
use LWP;
|
||||
my $browser = LWP::UserAgent->new;
|
||||
|
||||
my $url =
|
||||
'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
|
||||
my $response = $browser->get($url);
|
||||
|
||||
die "Error: ", $response->header('WWW-Authenticate') || 'Error accessing',
|
||||
# (the 'WWW-Authenticate' header carries the realm name)
|
||||
"\n ", $response->status_line, "\n at $url\n Aborting"
|
||||
unless $response->is_success;
|
||||
|
||||
Then you'll get this error:
|
||||
|
||||
Error: Basic realm="Unicode-MailList-Archives"
|
||||
401 Authorization Required
|
||||
at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
|
||||
Aborting at auth1.pl line 9. [or wherever]
|
||||
|
||||
...because the C<$browser> doesn't know the username and password
|
||||
for that realm ("Unicode-MailList-Archives") at that host
|
||||
("www.unicode.org"). The simplest way to let the browser know about this
|
||||
is to use the C<credentials> method to let it know about a username and
|
||||
password that it can try using for that realm at that host. The syntax is:
|
||||
|
||||
$browser->credentials(
|
||||
'servername:portnumber',
|
||||
'realm-name',
|
||||
'username' => 'password'
|
||||
);
|
||||
|
||||
In most cases, the port number is 80, the default TCP port for HTTP (or 443 for HTTPS URLs); and
|
||||
you usually call the C<credentials> method before you make any requests.
|
||||
For example:
|
||||
|
||||
$browser->credentials(
|
||||
'reports.mybazouki.com:80',
|
||||
'web_server_usage_reports',
|
||||
'plinky' => 'banjo123'
|
||||
);
|
||||
|
||||
So if we add the following to the program above, right after the C<<
|
||||
$browser = LWP::UserAgent->new; >> line...
|
||||
|
||||
$browser->credentials( # add this to our $browser 's "key ring"
|
||||
'www.unicode.org:80',
|
||||
'Unicode-MailList-Archives',
|
||||
'unicode-ml' => 'unicode'
|
||||
);
|
||||
|
||||
...then when we run it, the request succeeds, instead of causing the
|
||||
C<die> to be called.
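
One way to avoid typos in the C<'servername:portnumber'> string is to
derive it from the URL itself. This sketch assumes the L<URI> module's
C<host_port> method for that, and uses C<get_basic_credentials> only to
confirm what was stored; it makes no network request.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $url = 'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
my $realm = 'Unicode-MailList-Archives';
my $browser = LWP::UserAgent->new;

# host_port gives "host:port", filling in the scheme's default port.
my $netloc = URI->new($url)->host_port;

$browser->credentials($netloc, $realm, 'unicode-ml' => 'unicode');

# Look up what the browser would send for this realm at this URL
# (the final 0 means "not a proxy").
my ($user, $pass) = $browser->get_basic_credentials($realm, URI->new($url), 0);
print "Will log in to $netloc as $user\n";
```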
|
||||
|
||||
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
=head2 Accessing HTTPS URLs
|
||||
|
||||
When you access an HTTPS URL, it'll work for you just like an HTTP URL
|
||||
would -- if your LWP installation has HTTPS support (via an appropriate
|
||||
Secure Sockets Layer library). For example:
|
||||
|
||||
use LWP;
|
||||
my $url = 'https://www.paypal.com/'; # Yes, HTTPS!
|
||||
my $browser = LWP::UserAgent->new;
|
||||
my $response = $browser->get($url);
|
||||
die "Error at $url\n ", $response->status_line, "\n Aborting"
|
||||
unless $response->is_success;
|
||||
print "Whee, it worked! I got that ",
|
||||
$response->content_type, " document!\n";
|
||||
|
||||
If your LWP installation doesn't have HTTPS support set up, then the
|
||||
response will be unsuccessful, and you'll get this error message:
|
||||
|
||||
Error at https://www.paypal.com/
|
||||
501 Protocol scheme 'https' is not supported
|
||||
Aborting at paypal.pl line 7. [or whatever program and line]
|
||||
|
||||
If your LWP installation I<does> have HTTPS support installed, then the
|
||||
response should be successful, and you should be able to consult
|
||||
C<$response> just like with any normal HTTP response.
|
||||
|
||||
For information about installing HTTPS support for your LWP
|
||||
installation, see the helpful F<README.SSL> file that comes in the
|
||||
libwww-perl distribution.
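
In libwww-perl 6 and later, HTTPS support normally comes from the
separately installed C<LWP::Protocol::https> module. As a rough check
before making any requests, you can ask the browser object whether it
can handle a given scheme; this sketch uses the C<is_protocol_supported>
method of C<LWP::UserAgent>:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Report, for each scheme, whether this LWP installation can handle it.
for my $scheme ('http', 'https') {
    my $status = $browser->is_protocol_supported($scheme)
        ? 'supported'
        : 'NOT supported';
    print "$scheme: $status\n";
}
```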
|
||||
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
|
||||
|
||||
=head2 Getting Large Documents
|
||||
|
||||
When you're requesting a large (or at least potentially large) document,
|
||||
a problem with the normal way of using the request methods (like C<<
|
||||
$response = $browser->get($url) >>) is that the response object in
|
||||
memory will have to hold the whole document -- I<in memory>. If the
|
||||
response is a thirty megabyte file, this is likely to be quite an
|
||||
imposition on this process's memory usage.
|
||||
|
||||
A notable alternative is to have LWP save the content to a file on disk,
|
||||
instead of saving it up in memory. This is the syntax to use:
|
||||
|
||||
$response = $ua->get($url,
|
||||
':content_file' => $filespec,
|
||||
);
|
||||
|
||||
For example,
|
||||
|
||||
$response = $ua->get('http://search.cpan.org/',
|
||||
':content_file' => '/tmp/sco.html'
|
||||
);
|
||||
|
||||
When you use this C<:content_file> option, the C<$response> will have
|
||||
all the normal header lines, but C<< $response->content >> will be
|
||||
empty. Errors writing to the content file (for example due to
|
||||
permission denied or the filesystem being full) will be reported via
|
||||
the C<Client-Aborted> or C<X-Died> response headers, and not the
|
||||
C<is_success> method:
|
||||
|
||||
if (($response->header('Client-Aborted') || '') eq 'die') {
|
||||
# handle error ...
|
||||
}
|
||||
|
||||
Note that this ":content_file" option isn't supported under older
|
||||
versions of LWP, so you should consider adding C<use LWP 5.66;> to check
|
||||
the LWP version, if you think your program might run on systems with
|
||||
older versions.
|
||||
|
||||
If you need to be compatible with older LWP versions, then use
|
||||
this syntax, which does the same thing:
|
||||
|
||||
use HTTP::Request::Common;
|
||||
$response = $ua->request( GET($url), $filespec );
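
Putting the pieces together, the C<:content_file> option and its error
headers might be wrapped up as follows. The C<fetch_to_file> helper is
a hypothetical name invented for this sketch, not part of LWP; it
returns nothing on success and an error string otherwise.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: save $url to $file; return an error string, or
# an empty list if all went well.
sub fetch_to_file {
    my ($ua, $url, $file) = @_;
    my $response = $ua->get($url, ':content_file' => $file);

    # Ordinary failures (404, 501, ...) show up in the status line.
    return $response->status_line unless $response->is_success;

    # Failures while writing the file show up only in these headers.
    if (($response->header('Client-Aborted') || '') eq 'die') {
        return 'write failed: '
            . ($response->header('X-Died') || 'unknown reason');
    }
    return;
}
```

A caller would then write something like
C<< my $error = fetch_to_file($ua, $url, '/tmp/sco.html'); >> and
C<die> or retry if C<$error> is defined.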
|
||||
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
|
||||
=head1 SEE ALSO
|
||||
|
||||
Remember, this article is just the most rudimentary introduction to
|
||||
LWP -- to learn more about LWP and LWP-related tasks, you really
|
||||
must read from the following:
|
||||
|
||||
=over
|
||||
|
||||
=item *
|
||||
|
||||
L<LWP::Simple> -- simple functions for getting/heading/mirroring URLs
|
||||
|
||||
=item *
|
||||
|
||||
L<LWP> -- overview of the libwww-perl modules
|
||||
|
||||
=item *
|
||||
|
||||
L<LWP::UserAgent> -- the class for objects that represent "virtual browsers"
|
||||
|
||||
=item *
|
||||
|
||||
L<HTTP::Response> -- the class for objects that represent the response to
|
||||
an LWP request, as in C<< $response = $browser->get(...) >>
|
||||
|
||||
=item *
|
||||
|
||||
L<HTTP::Message> and L<HTTP::Headers> -- classes that provide more methods
|
||||
to HTTP::Response.
|
||||
|
||||
=item *
|
||||
|
||||
L<URI> -- class for objects that represent absolute or relative URLs
|
||||
|
||||
=item *
|
||||
|
||||
L<URI::Escape> -- functions for URL-escaping and URL-unescaping strings
|
||||
(like turning "this & that" to and from "this%20%26%20that").
|
||||
|
||||
=item *
|
||||
|
||||
L<HTML::Entities> -- functions for HTML-escaping and HTML-unescaping strings
|
||||
(like turning "C. & E. BrontE<euml>" to and from "C. &amp; E. Bront&euml;")
|
||||
|
||||
=item *
|
||||
|
||||
L<HTML::TokeParser> and L<HTML::TreeBuilder> -- classes for parsing HTML
|
||||
|
||||
=item *
|
||||
|
||||
L<HTML::LinkExtor> -- class for finding links in HTML documents
|
||||
|
||||
=item *
|
||||
|
||||
The book I<Perl & LWP> by Sean M. Burke. O'Reilly & Associates,
|
||||
2002. ISBN: 0-596-00178-9, L<http://oreilly.com/catalog/perllwp/>. The
|
||||
whole book is also available free online:
|
||||
L<http://lwp.interglacial.com>.
|
||||
|
||||
=back
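
As a quick illustration of the two escaping modules in the list above,
this sketch round-trips the same example strings through L<URI::Escape>
and L<HTML::Entities>:

```perl
use strict;
use warnings;
use URI::Escape    qw(uri_escape uri_unescape);
use HTML::Entities qw(encode_entities decode_entities);

# URL-escaping: spaces and "&" become %XX sequences.
my $escaped = uri_escape('this & that');
print "$escaped\n";                     # this%20%26%20that
print uri_unescape($escaped), "\n";     # this & that

# HTML-escaping: "&" and the e-diaeresis (\xEB) become named entities.
my $original = "C. & E. Bront\xEB";
my $html     = encode_entities($original);
print "$html\n";                        # C. &amp; E. Bront&euml;

# Decoding restores the original string.
my $decoded = decode_entities($html);
print "Round trip OK\n" if $decoded eq $original;
```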
|
||||
|
||||
|
||||
=head1 COPYRIGHT
|
||||
|
||||
Copyright 2002, Sean M. Burke. You can redistribute this document and/or
|
||||
modify it, but only under the same terms as Perl itself.
|
||||
|
||||
=head1 AUTHOR
|
||||
|
||||
Sean M. Burke C<sburke@cpan.org>
|
||||
|
||||
=for comment
|
||||
##########################################################################
|
||||
|
||||
=cut
|
||||
|
||||
# End of Pod
|
||||