Requests

The HTTP protocol is intended to give two parties a common method of communication: web clients and web servers. Clients are programs or scripts that send requests to servers. Examples of clients include web browsers, such as Internet Explorer and Mozilla Firefox, and crawlers, like those used by Yahoo! and Google to expand their search engine offerings. Servers are programs that run indefinitely and do nothing but receive client requests and send responses to them. Popular examples include Microsoft IIS and the Apache HTTP Server.

You must be familiar enough with the anatomy and nuances of HTTP requests and responses to do two things. First, you must be able to configure and use your preferred client to view requests and responses that pass between it and the server hosting the target application as you access it. This is essential to developing your web scraping application without expending an excessive amount of time and energy on your part.

Second, you must be able to use most of the features offered by a PHP HTTP client library. Ideally, you would know HTTP and PHP well enough to build your own client library or fix issues with an existing one if necessary. In practice, however, you should first look for an adequate existing library and build your own reusable one only as a last resort. We will examine some of these libraries in the next few chapters.

Supplemental References

This book will cover HTTP in sufficient depth as it relates to web scraping, but should not in any respect be considered a comprehensive guide on the subject. Here are a few recommended references to supplement the material covered in this book.

  • RFC 2616 HyperText Transfer Protocol - HTTP/1.1 (http://www.ietf.org/rfc/rfc2616.txt)

  • RFC 3986 Uniform Resource Identifiers (URI): Generic Syntax (http://www.ietf.org/rfc/rfc3986.txt)

  • "HTTP: The Definitive Guide" (ISBN 1565925092)

  • "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)

  • "HTTP Developer’s Handbook" (ISBN 0672324547)

  • Ben Ramsey’s blog series on HTTP (http://benramsey.com/http-status-codes)

GET Requests

Let’s start with a very simple HTTP request, one to retrieve the main landing page of the Wikipedia web site in English.

GET /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org

The individual components of this request are as follows.

  • GET is the method or operation. Think of it as a verb in a sentence, an action that you want to perform on something. Other examples of methods include POST and HEAD. These will be covered in more detail later in the chapter.
  • /wiki/Main_Page is the Uniform Resource Identifier or URI. It provides a unique point of reference for the resource, the object or target of the operation.
  • HTTP/1.1 specifies the HTTP protocol version in use by the client, which will be detailed further a little later in this chapter.
  • The method, URI, and HTTP version collectively make up the request line, which ends with a <CR><LF> (carriage return-line feed) sequence corresponding to ASCII characters 13 and 10, or Unicode characters U+000D and U+000A, respectively. (See RFC 2616 Section 2.2 for more information.)
  • A single header, Host, and its associated value, en.wikipedia.org, follow the request line. More header-value pairs may follow.
  • Based on the resource, the value of the Host header, and the protocol in use (HTTP, as opposed to HTTPS, which is HTTP over SSL), http://en.wikipedia.org/wiki/Main_Page is the resulting full URL of the requested resource.
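
The pieces listed above can be put together by hand. What follows is a minimal sketch in Python rather than a definitive implementation; the function names build_request and full_url are made up for this illustration and do not come from any library.

```python
CRLF = "\r\n"  # ASCII 13 followed by ASCII 10


def build_request(method, uri, host, version="HTTP/1.1"):
    """Assemble a bare HTTP request (hypothetical helper for illustration)."""
    request_line = f"{method} {uri} {version}"
    headers = [f"Host: {host}"]
    # Every line ends with CRLF, and one extra CRLF (an empty line)
    # marks the end of the header block.
    return (CRLF.join([request_line, *headers]) + CRLF + CRLF).encode("ascii")


def full_url(host, uri, scheme="http"):
    """Combine the protocol, the Host header value, and the URI."""
    return f"{scheme}://{host}{uri}"


raw = build_request("GET", "/wiki/Main_Page", "en.wikipedia.org")
print(raw)
print(full_url("en.wikipedia.org", "/wiki/Main_Page"))
```

Printing the bytes makes the otherwise invisible line endings show up as \r\n.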

 

URI vs URL

URI is sometimes used interchangeably with URL, which frequently leads to confusion about the exact nature of either. A URI is used to uniquely identify a resource, indicate how to locate a resource, or both. URLs are the subset of URIs that do both (as opposed to either), which is what makes them usable by humans. After all, what’s the use of being able to identify a resource if you can’t access it? See sections 1.1.3 and 1.2.2 of RFC 3986 for more information.

GET is by far the most commonly used operation in the HTTP protocol. According to the HTTP specification, the intent of GET is to request a representation of a resource, essentially to “read” it as you would a file on a file system. Common examples of formats for such representations include HTML and XML-based formats such as XHTML, RSS, and Atom.

In principle, GET should not modify any existing data exposed by the application. For this reason, it is considered to be what is called a safe operation. It is worth noting that as you examine your target applications, you may encounter situations where GET operations are used incorrectly to modify data rather than simply returning it. This indicates poor application design and should be avoided when developing your own applications.

Anatomy of a URL

If you aren’t already familiar with all the components of a URL, this will likely be useful in later chapters.

http://user:pass@www.domain.com:8080/path/to/file.ext?query=&var=value#anchor
  • http is the protocol used to interact with the resource. Another example is https, which is equivalent to http on a connection using an SSL certificate for encryption.
  • user:pass@ is an optional component used to instruct the client that Basic HTTP authentication is required to access the resource and that user and pass should be used for the username and password respectively when authenticating. HTTP authentication will be covered in more detail toward the end of this chapter.
  • www.domain.com is the host name of the server on which the resource resides.
  • :8080 is another optional segment used to instruct the client that 8080 is the port on which the web server listens for incoming requests. In the absence of this segment, most clients will use the standard HTTP port 80.
  • /path/to/file.ext specifies the resource to access.
  • query=&var=value is the query string, which will be covered in more depth in the next section.
  • #anchor is the fragment, which points to a specific location within or state of the current resource.
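
To see the same breakdown programmatically, here is a short sketch that splits the example URL into its components. It is written in Python and uses urlsplit from the standard urllib.parse module; PHP clients would rely on their own equivalent.

```python
from urllib.parse import urlsplit

url = "http://user:pass@www.domain.com:8080/path/to/file.ext?query=&var=value#anchor"
parts = urlsplit(url)

# Each attribute corresponds to one of the components described above.
print("protocol:", parts.scheme)
print("username:", parts.username)
print("password:", parts.password)
print("host:    ", parts.hostname)
print("port:    ", parts.port)
print("path:    ", parts.path)
print("query:   ", parts.query)
print("fragment:", parts.fragment)
```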

Query Strings

Another provision of URLs is a mechanism called the query string that is used to pass request parameters to web applications. Below is a GET request that includes a query string and is used to request a form to edit a page on Wikipedia.

GET /w/index.php?title=Query_string&action=edit HTTP/1.1
Host: en.wikipedia.org

There are a few notable traits of this URL.

  • A question mark denotes the end of the resource path and the beginning of the query string.
  • The query string is composed of key-value pairs where each pair is separated by an ampersand.
  • Keys and values are separated by an equal sign.
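
These three rules are enough to take a query string apart and put it back together. The sketch below does so in Python with parse_qsl and urlencode from the standard urllib.parse module; the variable names are chosen only for this illustration.

```python
from urllib.parse import parse_qsl, urlencode

request_uri = "/w/index.php?title=Query_string&action=edit"

# The question mark separates the resource path from the query string.
path, _, query = request_uri.partition("?")

# Pairs are split on ampersands, and keys from values on equal signs.
params = dict(parse_qsl(query))
print(path)
print(params)

# Going the other way rebuilds the original request URI.
rebuilt = path + "?" + urlencode(params)
print(rebuilt)
```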

Query strings are not specific to GET operations and can be used in other operations as well. Speaking of which, let’s move on.

Query String Limits

Most mainstream browsers impose a limit on the maximum character length of a query string. There is no standardized value for this, but Internet Explorer 7 appears to hold the current least common denominator of 2,047 bytes at the time of this writing. Querying a search engine should turn up your preferred browser’s limit. It’s rare for this to become an issue during development, but it is a circumstance worth knowing.

POST Requests

The next most common HTTP operation after GET is POST, which is used to submit data to a specified resource. When using a web browser as a client, this is most often done via an HTML form. POST is intended to add to or alter data exposed by the application, a potential result of which is that a new resource is created or an existing resource is changed. One major difference between a GET request and a POST request is that the latter includes a body following the request headers to contain the data to be submitted.

POST /w/index.php?title=Wikipedia:Sandbox&action=submit HTTP/1.1
Host: en.wikipedia.org

wpAntispam=&wpSection=&wpStarttime=20080719022313&wpEdittime=20080719022100&&wpScrolltop=&wpTextbox1=%7B%7BPlease+leave+this+line+alone+%28sandbox+heading%29%7D%7D+%3C%21--+Hello%21+Feel+free+to+try+your+formatting+and+editing+skills+below+this+line.+As+this+page+is+for+editing+experiments%2C+this+page+will+automatically+be+cleaned+every+12+hours.+--%3E+&wpSummary=&wpAutoSummary=d41d8cd98f00b204e9800998ecf8427e&wpSave=Save+page&wpPreview=Show+preview&wpDiff=Show+changes&wpEditToken=%5C%2B

A single blank line separates the headers from the body. The body should look familiar, as it is formatted identically to the query string with the exception that it is not prefixed with a question mark.
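
As a rough illustration of that layout, the following Python sketch assembles a small POST request by hand. The helper name build_post is hypothetical, and only three of the form fields from the example are used. The Content-Type and Content-Length headers do not appear in the abbreviated example above; they are included here on the assumption that the request should be complete enough for a server to read the body.

```python
from urllib.parse import urlencode


def build_post(uri, host, fields):
    """Assemble a form-encoded POST request (hypothetical helper)."""
    # The body uses the same key=value&key=value format as a query string.
    body = urlencode(fields).encode("ascii")
    head = "\r\n".join([
        f"POST {uri} HTTP/1.1",
        f"Host: {host}",
        # Assumed headers, not shown in the example request above.
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body)}",
    ])
    # A single blank line separates the headers from the body.
    return head.encode("ascii") + b"\r\n\r\n" + body


fields = {"wpSummary": "", "wpSave": "Save page", "wpPreview": "Show preview"}
raw = build_post(
    "/w/index.php?title=Wikipedia:Sandbox&action=submit",
    "en.wikipedia.org",
    fields,
)
headers, _, body = raw.partition(b"\r\n\r\n")
print(headers.decode("ascii"))
print(body.decode("ascii"))
```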

URL Encoding

One trait of query strings is that parameter values are encoded using percent-encoding or, as it’s more commonly known, URL encoding. The PHP functions urlencode and urldecode are a convenient way to handle string values encoded in this manner. Most HTTP client libraries handle encoding request parameters for you. Though it’s called URL encoding, the technical details for it are actually more closely associated with the URI as shown in section 2.1 of RFC 3986.
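
To make the encoding concrete, here is a small round trip in Python. It uses quote_plus and unquote_plus from the standard urllib.parse module, which behave much like PHP's urlencode and urldecode for this input, although the two languages may differ on a few characters. The sample text is the opening of the wpTextbox1 value from the POST body shown earlier.

```python
from urllib.parse import quote_plus, unquote_plus

text = "{{Please leave this line alone (sandbox heading)}}"

# Spaces become plus signs; braces and parentheses become %XX sequences.
encoded = quote_plus(text)
print(encoded)

# Decoding reverses the process and yields the original text.
decoded = unquote_plus(encoded)
print(decoded)
```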

HEAD Requests

Though not common when accessing target web applications, HEAD requests are useful in web scraping applications in several ways. They function in the same way as a GET request with one exception: when the server delivers its response, it will not deliver the resource representation that normally comprises the response body. The reason this is useful is that it allows a client to get at the data present in the response headers without having to download the entire response, which is liable to be significantly larger. Such data can include whether or not the resource is still available for access and, if it is, when it was last modified.

HEAD /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org
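
A HEAD request like this one can be sketched with Python's standard http.client module. The function name head and its return shape are choices made for this illustration only. Last-Modified is just one header a server might send, so the second value may well be None.

```python
import http.client


def head(host, path, port=80, timeout=10):
    """Send a HEAD request and return (status, Last-Modified, body length)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        # The response to a HEAD request carries headers only,
        # so reading the body yields nothing.
        body = response.read()
        return response.status, response.getheader("Last-Modified"), len(body)
    finally:
        conn.close()


# Example call (requires network access, so it is left commented out):
# status, last_modified, body_length = head("en.wikipedia.org", "/wiki/Main_Page")
```

Because only headers are transferred, a scraper can use a call like this to check on a resource cheaply before deciding whether a full GET is worthwhile.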

Speaking of responses, now would be a good time to investigate those in more detail.

 


Category: Article | Added by: Marsipan (30.08.2014)