Headers

An all-purpose method of communicating a variety of information related to requests and responses, headers are used by the client and server to accomplish a number of things including retention of state using cookies and identity verification using HTTP authentication. This section will deal with those that are particularly applicable to web scraping applications. For more information, see section 14 of RFC 2616.

Cookies

HTTP is designed to be a stateless protocol. That is, once a server returns the response for a request, it effectively “forgets” about the request. It may log information about the request and the response it delivered, but it does not retain any sense of state for the same client between requests. Cookies are a method of circumventing this using headers. Here is how they work.

  • The client issues a request.
  • In its response, the server includes a Set-Cookie header. The header value consists of one or more name-value pairs, each with optional associated attribute-value pairs.
  • In subsequent requests, the client will include a Cookie header that contains the data it received in the Set-Cookie response header.

Cookies are frequently used to restrict access to certain content, most often by requiring some form of identity authentication before the target application will indicate that a cookie should be set. Most client libraries have the capability to handle parsing and resending cookie data as appropriate, though some require explicit instruction before they will do so. For more information on cookies, see RFC 2109 or its later (though less widely adopted) rendition RFC 2965.

One of the aforementioned attributes, “expires,” indicates when the client should dispose of the cookie and stop sending its data in subsequent requests. This attribute is optional, and its presence or absence determines whether the cookie is what’s called a session cookie. If a cookie has no expiration value set, it persists for the duration of the client session. For normal web browsers, the session generally ends when all instances of the browser application have been closed.
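As a rough illustration of the exchange described above, the following sketch parses a single Set-Cookie header value and builds the matching Cookie request header. The function names parseSetCookie and buildCookieHeader are hypothetical, and the parsing is deliberately simplified: it ignores quoting and other edge cases that a real client library handles, and it looks only at the expires attribute when deciding whether a cookie is a session cookie.

```php
<?php
// Hypothetical helper: split one Set-Cookie header value into the cookie's
// name, value, and attributes. Simplified; not a complete parser.
function parseSetCookie($headerValue)
{
    $parts = array_map('trim', explode(';', $headerValue));
    // The first pair is the cookie itself.
    list($name, $value) = array_pad(explode('=', array_shift($parts), 2), 2, '');
    $attributes = array();
    foreach ($parts as $part) {
        if ($part === '') {
            continue;
        }
        // Attributes such as "secure" carry no value; store true for them.
        $pair = array_pad(explode('=', $part, 2), 2, true);
        $attributes[strtolower($pair[0])] = $pair[1];
    }
    return array(
        'name' => $name,
        'value' => $value,
        'attributes' => $attributes,
        // No expires attribute means a session cookie.
        'session' => !isset($attributes['expires']),
    );
}

// Hypothetical helper: build the Cookie request header from parsed cookies.
function buildCookieHeader(array $cookies)
{
    $pairs = array();
    foreach ($cookies as $cookie) {
        $pairs[] = $cookie['name'] . '=' . $cookie['value'];
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

$cookie = parseSetCookie('sid=abc123; expires=Wed, 09 Jun 2021 10:18:14 GMT; path=/');
echo buildCookieHeader(array($cookie)), "\n"; // Cookie: sid=abc123
```

Only the name-value pair is sent back to the server; the attributes stay with the client and govern when and where the cookie is resent.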

Redirection

The Location header is used by the server to redirect the client to a URI. In this scenario, the response will most likely include a 3xx class status code (such as 302 Found), but may also include a 201 code to indicate the creation of a new resource. See subsection 14.30 of RFC 2616 for more information.

It is hypothetically possible for a malfunctioning application to cause the server to initiate an infinite series of redirections between itself and the client. For this reason, client libraries often limit the number of consecutive redirections they will process before assuming that the application being accessed is behaving inappropriately and terminating. Libraries generally implement a default limit, but allow you to override it with your own.
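The redirection limit can be sketched as a simple loop. This is a minimal illustration, not how any particular library does it: followRedirects is a hypothetical function, and $fetch stands in for whatever performs the actual request. It is assumed to return an array with a status key and a headers key holding lowercase header names, and the Location value is assumed to be an absolute URI.

```php
<?php
// Hypothetical sketch: $fetch is any callable that takes a URL and returns
// array('status' => int, 'headers' => array of lowercase name => value).
function followRedirects($fetch, $url, $maxRedirects = 5)
{
    for ($redirects = 0; $redirects <= $maxRedirects; $redirects++) {
        $response = call_user_func($fetch, $url);
        $isRedirect = $response['status'] >= 300 && $response['status'] < 400;
        if (!$isRedirect || !isset($response['headers']['location'])) {
            return $response; // Final response reached.
        }
        // Assumes the Location header holds an absolute URI.
        $url = $response['headers']['location'];
    }
    throw new RuntimeException("Gave up after $maxRedirects redirects");
}

// A stub that redirects to itself forever, standing in for a broken application.
$loop = function ($url) {
    return array('status' => 302, 'headers' => array('location' => $url));
};

try {
    followRedirects($loop, 'http://example.com/a', 3);
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n"; // Gave up after 3 redirects
}
```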

Referring URLs

It is possible for a requested resource to refer to other resources in some way. When this happens, clients traditionally include the URL of the referring resource in the Referer header. Yes, the header name is misspelled there and intentionally so. The commonality of that particular misspelling caused it to end up in the official HTTP specification, thereby becoming the standard industry spelling used when referring to that particular header.

There are multiple situations in which the specification of a referer can occur. A user may click on a hyperlink in a browser, in which case the full URL of the resource containing the hyperlink would be the referer. When a resource containing markup with embedded images is requested, subsequent requests for those images will contain the full URL of the page containing the images as the referer. A referer is also specified when redirection occurs, as described in the previous section.

This is relevant because some applications depend on the value of the Referer header by design, which is less than ideal because the header value can be spoofed. In any case, it is important to be aware that some applications may not function as expected if the provided header value is not consistent with what is sent when the application is used in a browser. See subsection 14.36 of RFC 2616 for more information.

Persistent Connections

The standard operating procedure for an HTTP request is as follows.

  • A client connects to a server.
  • The client sends a request over the established connection.
  • The server returns a response.
  • The connection is terminated.

When sending multiple consecutive requests to the same server, however, the first and fourth steps in that process can cause a significant amount of overhead. HTTP 1.0 established no solution for this; one connection per request was normal behavior. Between the releases of the HTTP 1.0 and 1.1 standards, a convention was informally established that involved the client including a Connection header with a value of Keep-Alive in the request to indicate to the server that a persistent connection was desired.

Later, 1.1 was released and changed the default behavior from one connection per request to persistent connections. For a non-persistent connection, the client could include a Connection header with a value of close to indicate that the server should terminate the connection after it sent the response. The difference between 1.0 and 1.1 is an important distinction and should be a point of examination when evaluating both client libraries and servers hosting target applications so that you are aware of how they will behave with respect to persistent connections. See subsection 8.1 of RFC 2616 for more information.

There is an alternative implementation, involving the use of a Keep-Alive header, that gained significantly less support in clients and servers. Technical issues with it are discussed in subsection 19.7.1 of RFC 2068, and explicit use of this header should be avoided. It is mentioned here simply to make you aware that it exists and is related to the matter of persistent connections.
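The difference in defaults between the two protocol versions can be summarized in a few lines. The function isPersistent below is a hypothetical helper written for illustration. It assumes headers are supplied as an array keyed by lowercase name, and it treats the Connection header as holding a single value, which is a simplification.

```php
<?php
// Hypothetical helper: decide whether a connection should stay open after a
// message, given its HTTP version and headers (lowercase name => value).
function isPersistent($httpVersion, array $headers)
{
    $connection = isset($headers['connection'])
        ? strtolower(trim($headers['connection']))
        : '';
    if ($httpVersion === '1.1') {
        // HTTP 1.1: persistent unless "Connection: close" is sent.
        return $connection !== 'close';
    }
    // HTTP 1.0: one connection per request unless the informal
    // "Connection: Keep-Alive" convention is used.
    return $connection === 'keep-alive';
}

var_dump(isPersistent('1.1', array()));                             // bool(true)
var_dump(isPersistent('1.1', array('connection' => 'close')));      // bool(false)
var_dump(isPersistent('1.0', array()));                             // bool(false)
var_dump(isPersistent('1.0', array('connection' => 'Keep-Alive'))); // bool(true)
```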

Content Caching

Two methods exist to allow clients to query servers in order to determine if resources have been updated since the client last accessed them. Subsections of RFC 2616 section 14 detail related headers.

The first method is time-based where the server returns a Last-Modified header (subsection 29) in its response and the client can send that value in an If-Modified-Since header (subsection 25) in a subsequent request for the same resource.

The other method is hash-based where the server sends a hash value in its response via the ETag header (subsection 19) and the client may send that value in an If-None-Match header (subsection 26) in a subsequent request for the same resource.

If the resource has not changed in either instance, the server simply returns a 304 Not Modified response. Aside from checking to ensure that a resource is still available (which will result in a 404 response if it is not), this is an appropriate situation in which to use a HEAD request.

Alternatively, the logic of the first method can be inverted by using an If-Unmodified-Since header (subsection 28), in which case the server will return a 412 Precondition Failed response if the resource has in fact been modified since the provided access time.
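A sketch of how a client might reuse the validators from an earlier response follows. The function names conditionalHeaders and isNotModified are hypothetical, and the sample date and ETag values are placeholders rather than values from a real server.

```php
<?php
// Hypothetical helper: build validation headers from values saved from an
// earlier response to the same resource. Either argument may be null.
function conditionalHeaders($lastModified = null, $etag = null)
{
    $headers = array();
    if ($lastModified !== null) {
        // Time-based method: echo back the Last-Modified value.
        $headers[] = 'If-Modified-Since: ' . $lastModified;
    }
    if ($etag !== null) {
        // Hash-based method: echo back the ETag value.
        $headers[] = 'If-None-Match: ' . $etag;
    }
    return $headers;
}

// Hypothetical helper: true when the previously retrieved copy can be reused.
function isNotModified($statusCode)
{
    return (int) $statusCode === 304;
}

print_r(conditionalHeaders('Sat, 29 Oct 1994 19:43:31 GMT', '"xyzzy"'));
```

When isNotModified returns true for the response status, the scraper can skip downloading and reprocessing the resource.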

User Agents

Clients are sometimes referred to as user agents. This refers to the fact that web browsers are agents that act on behalf of users in order to require minimal intervention on the user’s part. The User-Agent header enables the client to provide information about itself, such as its name and version number. Crawlers often use it to provide a URL for obtaining more information about the crawler or the e-mail address of the crawler’s operator. A simple search engine query should reveal a list of user agent strings for mainstream browsers. See subsection 14.43 of RFC 2616 for more information.
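For a crawler that identifies itself honestly, the header might be composed as in the sketch below. The function crawlerUserAgent, the ExampleBot name, and the example.com addresses are all placeholders, and the layout of the string is a common convention rather than a requirement of the specification.

```php
<?php
// Hypothetical helper: compose a User-Agent header that names the crawler
// and tells site operators where to learn more or whom to contact.
function crawlerUserAgent($name, $version, $infoUrl, $email)
{
    return sprintf('User-Agent: %s/%s (+%s; %s)', $name, $version, $infoUrl, $email);
}

echo crawlerUserAgent('ExampleBot', '1.0', 'http://example.com/bot', 'bot@example.com'), "\n";
// User-Agent: ExampleBot/1.0 (+http://example.com/bot; bot@example.com)
```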

Unfortunately, some applications will engage in a practice known as user agent sniffing or browser sniffing, in which they vary the responses they deliver based on the user agent string provided by the client. This can include completely disabling a primary site feature, such as an e-commerce checkout page that uses ActiveX (a technology specific to Windows and Internet Explorer).

One well-known application of this technique is the robots exclusion standard, which is used to explicitly instruct crawlers to avoid accessing individual resources or the entire web site. More information about this is available at http://www.robotstxt.org. The guidelines detailed there should definitely be accounted for when developing a web scraping application so as to prevent it from exhibiting behavior inconsistent with that of a normal user.

In some cases, a client practice called user agent spoofing involving the specification of a false user agent string is enough to circumvent user agent sniffing, but not always. An application may have platform-specific requirements that legitimately warrant it denying access to certain user agents. In any case, spoofing the user agent is a practice that should be avoided to the fullest extent possible.

Ranges

The Range request header allows the client to specify that the body of the server’s response should be limited to one or more specific byte ranges of what it would normally be. Originally intended to allow failed retrieval attempts to resume from their stopping points, this feature can allow you to minimize data transfer between your application and the server to reduce bandwidth consumption and runtime of your web scraping application.

This is applicable in cases where you have a good rough idea of where your target data is located within the document, especially if the document is fairly large and you only need a small subset of the data it contains. However, using it does add one more variable to the possibility of your application breaking if the target site changes and you should bear that in mind when electing to do so.

While the format of the header value is left open to allow for other range units, the only unit supported by HTTP/1.1 is bytes. The server may use the Accept-Ranges response header to indicate which units it supports. In a partial response, the server uses the Content-Range header to indicate (in a slightly different format) where the partial body is located within the full response body.

In the case of bytes, the beginning of the document is represented by 0. Ranges use inclusive bounds. For example, the first 500 bytes of a document would be specified as 0-499. To specify from a point to the end of the document, simply omit the upper bound. The portion of a document beginning at byte 500 and going to its end is represented as 500-.

If a range is specified that is valid with respect to the resource being requested, the server should return a 206 Partial Content response. Otherwise, it should return a 416 Requested Range Not Satisfiable response. See subsections 14.35 and 14.16 of RFC 2616 for more information on the Range and Content-Range headers respectively.
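The byte-range notation described above can be sketched as follows. Both rangeHeader and parseContentRange are hypothetical helpers that handle only a single byte range, and the sample total length of 1234 is a made-up value.

```php
<?php
// Hypothetical helper: build a Range header for one byte range. Bounds are
// inclusive; pass null as $last to request everything from $first onward.
function rangeHeader($first, $last = null)
{
    return 'Range: bytes=' . $first . '-' . ($last === null ? '' : $last);
}

// Hypothetical helper: parse a Content-Range value such as
// "bytes 0-499/1234" into its first byte, last byte, and total length.
// Returns null if the value does not have that shape.
function parseContentRange($value)
{
    if (!preg_match('#^bytes (\d+)-(\d+)/(\d+|\*)$#', $value, $m)) {
        return null;
    }
    return array(
        'first' => (int) $m[1],
        'last' => (int) $m[2],
        // A total of "*" means the full length is unknown.
        'total' => $m[3] === '*' ? null : (int) $m[3],
    );
}

echo rangeHeader(0, 499), "\n"; // Range: bytes=0-499
echo rangeHeader(500), "\n";    // Range: bytes=500-
print_r(parseContentRange('bytes 0-499/1234'));
```

Comparing the parsed first and last values against the range that was requested is a cheap way to confirm that the server actually honored the Range header rather than returning the whole document.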

Basic HTTP Authentication

Another less frequently used method of persisting identity between requests is HTTP authentication. Most third-party clients offer some form of native support for it. It’s not commonly used these days, but it’s good to be aware of how to derive the appropriate header values in cases where you must implement it yourself. For more information on HTTP authentication, see RFC 2617.

HTTP authentication comes in several “flavors,” the two most popular being Basic (credentials are only Base64-encoded, not encrypted) and Digest (credentials are sent as an MD5 hash rather than in the clear). Basic is the more common of the two, but the process for both goes like this.

  • A client sends a request without any authentication information.
  • The server sends a response with a 401 status code and a WWW-Authenticate header.
  • The client resends the original request, but this time includes an Authorization header containing the authentication credentials.
  • The server either sends a response indicating success or, if authentication failed, another response with a 401 status code (some applications respond with 403 instead).

In the case of Basic authentication, the value of the Authorization header will be the word Basic followed by a single space and then by a Base64-encoded sequence derived from the username-password pair separated by a colon. If, for example, the username is bigbadwolf and the password is letmein, then the value of the header would be Basic YmlnYmFkd29sZjpsZXRtZWlu, where the Base64-encoded version of the string bigbadwolf:letmein is what follows Basic.
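In PHP, the derivation described above amounts to a single call to base64_encode. The wrapper function basicAuthHeader is a hypothetical name used here for illustration.

```php
<?php
// Build the Authorization header for Basic authentication: the word Basic,
// a space, then the Base64 encoding of "username:password".
function basicAuthHeader($username, $password)
{
    return 'Authorization: Basic ' . base64_encode($username . ':' . $password);
}

echo basicAuthHeader('bigbadwolf', 'letmein'), "\n";
// Authorization: Basic YmlnYmFkd29sZjpsZXRtZWlu
```

Because Base64 is trivially reversible, this header exposes the password to anyone who can observe the request.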

Digest HTTP Authentication

Digest authentication is a bit more involved. The WWW-Authenticate header returned by the server will contain the word “Digest” followed by a single space and then by a number of key-value pairs in the format key="value" separated by commas. Below is an example of this header.

WWW-Authenticate: Digest realm="testrealm@host.com",
qop="auth,auth-int",
nonce="dcd98b7102dd2f0e8b11d0f600bfb0c093",
opaque="5ccc069c403ebaf9f0171e9517f40e41"
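Before deriving a response, the client needs the individual values from this header. One way to extract them is sketched below. The function parseDigestChallenge is a hypothetical helper, and the regular expression is a simplification that does not handle escaped quotes inside values.

```php
<?php
// Hypothetical helper: extract the key-value pairs from the value of a
// Digest WWW-Authenticate header into an associative array.
function parseDigestChallenge($headerValue)
{
    $params = array();
    // Drop the leading "Digest" scheme name.
    $body = preg_replace('/^Digest\s+/i', '', trim($headerValue));
    preg_match_all('/(\w+)=(?:"([^"]*)"|([^\s,]+))/', $body, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        // Group 2 holds a quoted value, group 3 an unquoted one.
        $params[$match[1]] = isset($match[3]) ? $match[3] : $match[2];
    }
    return $params;
}

$challenge = parseDigestChallenge(
    'Digest realm="testrealm@host.com", qop="auth,auth-int", '
    . 'nonce="dcd98b7102dd2f0e8b11d0f600bfb0c093", '
    . 'opaque="5ccc069c403ebaf9f0171e9517f40e41"'
);
print_r($challenge);
```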

The client must respond with a specific response value that the server will verify before it allows the client to proceed. To derive that value requires use of the MD5 hash algorithm, which in PHP can be accessed using the md5 or hash functions. Here is the process.

  • Concatenate the appropriate username, the value of the realm key provided by the server, and the appropriate password together separated by colons and take the MD5 hash of that string. We’ll call this HA1. It shouldn't change for the rest of the session.
    <?php
    $ha1 = md5($username . ':testrealm@host.com:' . $password);
    ?>
  • Concatenate the method and URI of the original request separated by a colon and take the MD5 hash of that string. We’ll call this HA2. This will obviously vary with your method or URI.

    <?php
    // $method and $uri come from the original request, e.g. 'GET' and '/wiki/Main_Page'.
    $ha2 = md5($method . ':' . $uri);
    ?>
    
  • Initialize a request counter that we’ll call nc with a value of 1. The value of this counter will need to be incremented and retransmitted with each subsequent request to uniquely identify it. Retransmitting a request counter value used in a previous request will result in the server rejecting it. Note that the value of this counter will need to be expressed to the server as a hexadecimal number. The dechex PHP function is useful for this.

    <?php
    $nc = 1;
    ?>
  • Generate a random hash using the aforementioned hashing functions that we’ll call the client nonce or cnonce. The time and rand functions may be useful here. This can (and probably should) be regenerated and resent with each request.

    <?php
    // mt_rand and microtime are used because $_SERVER['REMOTE_ADDR'] is not set when running from the command line.
    $cnonce = md5(mt_rand() . microtime(true));
    ?>
  • Take note of the value of the nonce key provided by the server, also known as the server nonce. We’ll refer to this as simply the nonce. This is randomly generated by the server and will expire after a certain period of time, at which point the server will respond with a 401 status code. The WWW-Authenticate header it returns in that case will differ in two noteworthy ways: 1) the key-value pair stale=TRUE will be added; 2) the nonce value will be changed. When this happens, simply rederive the response code as shown below with the new nonce value and resubmit the original request (not forgetting to increment the request counter).

  • Concatenate HA1, the server nonce (nonce), the current request counter (nc) value, the client nonce you generated (cnonce), an appropriate value (most likely “auth”) from the comma-separated list contained in the qop (quality of protection) key provided by the server, and HA2 together separated by colons and take the MD5 hash of that string. This is the final response code.

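    <?php
    // The counter is hashed in the same eight-digit hexadecimal form in which it is sent.
    $response = md5(implode(':', array(
        $ha1,
        $nonce,
        str_pad(dechex($nc), 8, '0', STR_PAD_LEFT),
        $cnonce,
        'auth',
        $ha2
    )));
    ?>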
  • Lastly, send everything the server originally sent in the WWW-Authenticate header, plus the response value and its constituents (except the password obviously), back to the server in the usual Authorization header.

    Authorization: Digest username="USERNAME",
    realm="testrealm@host.com",
    nonce="dcd98b7102dd2f0e8b11d0f600bfb0c093",
    uri="/wiki/Main_Page",
    qop="auth",
    nc=00000001,
    cnonce="0a4f113b",
    response="6629fae49393a05397450978507c4ef1",
    opaque="5ccc069c403ebaf9f0171e9517f40e41"
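The steps above can be gathered into one function, sketched here for the qop value auth only. The name digestAuthHeader and its parameter list are hypothetical and not part of any library. $challenge is assumed to be an array holding the realm, nonce, and opaque values sent by the server. The sample credentials and URI in the usage lines are the ones from the worked example in RFC 2617, which makes it possible to check the derived response value against the specification.

```php
<?php
// Hypothetical helper assembling the Digest steps for qop=auth.
// $challenge holds the realm, nonce, and opaque values from the server.
function digestAuthHeader($username, $password, $method, $uri, array $challenge, $nc, $cnonce)
{
    $ha1 = md5($username . ':' . $challenge['realm'] . ':' . $password);
    $ha2 = md5($method . ':' . $uri);
    // The request counter is sent as eight hexadecimal digits.
    $ncHex = str_pad(dechex($nc), 8, '0', STR_PAD_LEFT);
    $response = md5(implode(':', array(
        $ha1, $challenge['nonce'], $ncHex, $cnonce, 'auth', $ha2,
    )));
    return 'Authorization: Digest username="' . $username . '", '
        . 'realm="' . $challenge['realm'] . '", '
        . 'nonce="' . $challenge['nonce'] . '", '
        . 'uri="' . $uri . '", '
        . 'qop="auth", '
        . 'nc=' . $ncHex . ', '
        . 'cnonce="' . $cnonce . '", '
        . 'response="' . $response . '", '
        . 'opaque="' . $challenge['opaque'] . '"';
}

// Values taken from the example in RFC 2617.
$challenge = array(
    'realm' => 'testrealm@host.com',
    'nonce' => 'dcd98b7102dd2f0e8b11d0f600bfb0c093',
    'opaque' => '5ccc069c403ebaf9f0171e9517f40e41',
);
$header = digestAuthHeader('Mufasa', 'Circle Of Life', 'GET', '/dir/index.html', $challenge, 1, '0a4f113b');
echo $header, "\n";
```

With these inputs the response value in the output is 6629fae49393a05397450978507c4ef1, matching the RFC example. In real use, $nc would be incremented and $cnonce regenerated for each request.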

Some third-party clients implement this, some don't. Again, it's not commonly used, but it's good to be aware of how to derive the appropriate header value in cases where you must implement it yourself. For more information on HTTP authentication, see RFC 2617.

 


© HTTP — Web Scraping
Category: Article | Added by: Marsipan (30.08.2014)