HTTP is intended to give two parties a common method of communication: web clients and web servers. Clients are programs or scripts that send requests to servers. Examples of clients include web browsers, such as Internet Explorer and Mozilla Firefox, and crawlers, like those used by Yahoo! and Google to expand their search engine offerings. Servers are programs that run indefinitely and do nothing but receive client requests and send responses to them. Popular examples include Microsoft IIS and the Apache HTTP Server.

You must be familiar enough with the anatomy and nuances of HTTP requests and responses to do two things. First, you must be able to configure and use your preferred client to view the requests and responses that pass between it and the server hosting the target application as you access it. This is essential to developing your web scraping application without expending an excessive amount of time and energy. Second, you must be able to use most of the features offered by a PHP HTTP client library.

Ideally, you would know HTTP and PHP well enough to build your own client library or fix issues with an existing one if necessary. In principle, however, you should first look for an adequate existing library and build your own reusable one only as a last resort. We will examine some of these libraries in the next few chapters.

Supplemental References

This book will cover HTTP in sufficient depth as it relates to web scraping, but it should not in any respect be considered a comprehensive guide on the subject. Here are a few recommended references to supplement the material covered in this book.
GET Requests

Let’s start with a very simple HTTP request, one to retrieve the main landing page of the Wikipedia web site in English.

    GET /wiki/Main_Page HTTP/1.1
    Host:

The individual components of this request are as follows.
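As a rough illustration of those components, the sketch below splits a raw request into its method, request target, protocol version, and headers. It is written in Python purely for illustration and is not a full HTTP parser. The Host value `en.wikipedia.org` and the helper name `parse_request` are assumptions made for the example; the request shown above leaves the Host value blank.

```python
# Minimal sketch: split a raw HTTP/1.1 request into its components.
# ASSUMPTION: the Host value below is filled in only for illustration.
RAW_REQUEST = (
    "GET /wiki/Main_Page HTTP/1.1\r\n"
    "Host: en.wikipedia.org\r\n"
    "\r\n"
)


def parse_request(raw):
    """Return (method, target, version, headers) for a raw request string."""
    # The header section ends at the first blank line (CRLF CRLF).
    head, _, _body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Request line: METHOD SP request-target SP HTTP-version
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize them to lowercase.
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers


method, target, version, headers = parse_request(RAW_REQUEST)
print(method, target, version, headers)
```

The request line carries the operation, the resource being requested, and the protocol version, while each following line is one name/value header pair.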
GET is by far the most commonly used operation in the HTTP protocol. According to the HTTP specification, the intent of GET is to request a representation of a resource, essentially to “read” it as you would a file on a file system. Common examples of formats for such representations include HTML and XML-based formats such as XHTML, RSS, and Atom.

In principle, GET should not modify any existing data exposed by the application. For this reason, it is considered to be what is called a safe operation. It is worth noting that as you examine your target applications, you may encounter situations where GET operations are used incorrectly to modify data rather than simply returning it. This indicates poor application design and should be avoided when developing your own applications.

Anatomy of a URL

If you aren’t already familiar with all the components of a URL, this will likely be useful in later chapters.
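One way to see the components of a URL is to let a standard library take one apart. The following is a minimal Python sketch using `urllib.parse.urlsplit`; the URL itself, including the host `example.com`, the port, and the `#top` fragment, is hypothetical and chosen only so that every component is present.

```python
from urllib.parse import urlsplit

# ASSUMPTION: a made-up URL that happens to contain every common component.
url = "http://example.com:8080/w/index.php?title=Query_string&action=edit#top"

parts = urlsplit(url)

print("scheme:  ", parts.scheme)    # protocol used to reach the resource
print("host:    ", parts.hostname)  # server that receives the request
print("port:    ", parts.port)      # TCP port, when given explicitly
print("path:    ", parts.path)      # location of the resource on the server
print("query:   ", parts.query)     # request parameters after the "?"
print("fragment:", parts.fragment)  # in-page anchor after the "#"
```

Note that the fragment is used by the client only and is not sent to the server as part of the request.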
Query Strings

Another provision of URLs is a mechanism called the query string that is used to pass request parameters to web applications. Below is a GET request that includes a query string and is used to request a form to edit a page on Wikipedia.

    GET /w/index.php?title=Query_string&action=edit HTTP/1.1
    Host:

There are a few notable traits of this URL.
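To make the structure of that query string concrete, here is a small Python sketch that separates the path from the query string and decodes the parameters into name/value pairs, then encodes them again. It uses only the request target from the example above; the variable names are illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Request target taken from the GET request above.
target = "/w/index.php?title=Query_string&action=edit"

parts = urlsplit(target)

# The "?" separates the path from the query string; "&" separates
# parameters, and "=" separates each parameter name from its value.
pairs = parse_qsl(parts.query, keep_blank_values=True)

# Encoding the pairs again and re-attaching them to the path
# reproduces the original request target.
rebuilt = parts.path + "?" + urlencode(pairs)

print(parts.path)
print(pairs)
print(rebuilt)
```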
Query strings are not specific to GET operations and can be used in other operations as well. Speaking of which, let’s move on.
POST Requests

The next most common HTTP operation after GET is POST, which is used to submit data to a specified resource. When using a web browser as a client, this is most often done via an HTML form. POST is intended to add to or alter data exposed by the application, a potential result of which is that a new resource is created or an existing resource is changed. One major difference between a GET request and a POST request is that the latter includes a body following the request headers to contain the data to be submitted.

    POST /w/index.php?title=Wikipedia:Sandbox&action=submit HTTP/1.1
    Host:

    wpAntispam=&wpSection=&wpStarttime=20080719022313&wpEdittime=20080719022100&&wpScrolltop=&wpTextbox1=%7B%7BPlease+leave+this+line+alone+%28sandbox+heading%29%7D%7D+%3C%21--+Hello%21+Feel+free+to+try+your+formatting+and+editing+skills+below+this+line.+As+this+page+is+for+editing+experiments%2C+this+page+will+automatically+be+cleaned+every+12+hours.+--%3E+&wpSummary=&wpAutoSummary=d41d8cd98f00b204e9800998ecf8427e&wpSave=Save+page&wpPreview=Show+preview&wpDiff=Show+changes&wpEditToken=%5C%2B

A single blank line separates the headers from the body. The body should look familiar, as it is formatted identically to the query string with the exception that it is not prefixed with a question mark.
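The following Python sketch shows how a body like the one above can be produced and attached to a request. It encodes a handful of the form fields with `urllib.parse.urlencode` and assembles the raw request text without sending it. The Host value is hypothetical, only a subset of the fields is used, and the `Content-Type` and `Content-Length` headers are additions that clients typically send with form submissions; they do not appear in the request shown above.

```python
from urllib.parse import urlencode

# A few of the form fields from the request above, in decoded form.
fields = [
    ("wpTextbox1", "{{Please leave this line alone (sandbox heading)}}"),
    ("wpSummary", ""),
    ("wpSave", "Save page"),
]

# urlencode applies the same encoding used for query strings:
# spaces become "+" and reserved characters become %XX escapes.
body = urlencode(fields)

request = (
    "POST /w/index.php?title=Wikipedia:Sandbox&action=submit HTTP/1.1\r\n"
    "Host: en.wikipedia.org\r\n"  # ASSUMPTION: hypothetical Host value
    "Content-Type: application/x-www-form-urlencoded\r\n"
    f"Content-Length: {len(body.encode('ascii'))}\r\n"
    "\r\n"  # the single blank line that separates headers from body
    + body
)

print(request)
```

Splitting the assembled text at the first blank line yields the header section on one side and the encoded body on the other, which is how a server locates the submitted data.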
HEAD Requests

Though not common when accessing target web applications, HEAD requests are useful in web scraping applications in several ways. They function in the same way as a GET request with one exception: when the server delivers its response, it will not deliver the resource representation that normally comprises the response body. The reason this is useful is that it allows a client to get at the data present in the response headers without having to download the entire response, which is liable to be significantly larger. Such data can include whether or not the resource is still available for access and, if it is, when it was last modified.

    HEAD /wiki/Main_Page HTTP/1.1
    Host:

Speaking of responses, now would be a good time to investigate those in more detail.
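As a first look at a response, the Python sketch below compares what a client receives for GET and HEAD requests to the same resource. To stay self-contained it starts a throwaway server on the local machine rather than contacting a real site. The handler, the page content, and the `Last-Modified` value are all made up for the demonstration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>Hello</body></html>"  # hypothetical resource


class DemoHandler(BaseHTTPRequestHandler):
    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        # ASSUMPTION: an arbitrary timestamp used only for the demo.
        self.send_header("Last-Modified", "Sat, 19 Jul 2008 02:21:00 GMT")
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(PAGE)  # GET delivers the representation

    def do_HEAD(self):
        self._send_headers()    # HEAD delivers the same headers, no body

    def log_message(self, format, *args):
        pass                    # keep the demo output quiet


def fetch(port, method):
    """Send one request to the local demo server and collect the response."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(method, "/wiki/Main_Page")
    response = conn.getresponse()
    body = response.read()
    headers = dict(response.getheaders())
    conn.close()
    return response.status, headers, body


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

get_status, get_headers, get_body = fetch(port, "GET")
head_status, head_headers, head_body = fetch(port, "HEAD")

server.shutdown()
server.server_close()

print("GET body bytes: ", len(get_body))
print("HEAD body bytes:", len(head_body))
print("Last-Modified:  ", head_headers["Last-Modified"])
```

Both responses carry the same status and the same `Content-Length` and `Last-Modified` headers, but only the GET response includes the page itself.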