Sending Requests

In addition to wrappers for specific protocols, the streams extension also offers socket transports for dealing with data at a lower level. One of these socket transports is for TCP (Transmission Control Protocol), a core internet protocol that ensures reliable delivery of an ordered sequence of bytes. The socket transport facilitates sending a raw data stream, in this case a manually constructed HTTP request, to a server.

<?php
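// Open a TCP connection to the web server.
$stream = stream_socket_client('tcp://localhost.example:80');
// "Connection: close" asks the server to close the connection after responding,
// so that stream_get_contents() returns promptly instead of waiting on keep-alive.
$request = "GET / HTTP/1.1\r\nHost: localhost.example\r\nConnection: close\r\n\r\n";
fwrite($stream, $request);
// Read the response until the server closes the connection.
echo stream_get_contents($stream);
fclose($stream);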

/*
Example output:
HTTP/1.1 200 OK
Date: Wed, 21 Jan 2009 03:16:43 GMT
Server: Apache/2.2.9 (Ubuntu) PHP/5.2.6-2ubuntu4 with Suhosin-Patch
X-Powered-By: PHP/5.2.6-2ubuntu4
Vary: Accept-Encoding
Content-Length: 12
Connection: close
Content-Type: text/html

Hello world!
*/
?>
  • The stream_socket_client function is used to establish a connection with the server, returning a connection handle resource assigned to $stream.
  • tcp:// specifies the transport to use.
  • localhost.example is the hostname of the server.
  • :80 specifies the port on which to connect to the server, in this case 80 because it is the standard port for HTTP. The port to use depends on the configuration of the web server.
  • $request contains the request to be sent to the server, where individual lines are separated with a CRLF sequence (see Chapter 2 “GET Requests”) and the request ends with a double CRLF sequence (effectively a blank line) to indicate to the server that the end of the request has been reached. Note that the request must contain the ending sequence or the web server will simply hang waiting for the rest of the request.
  • The fwrite function is used to transmit the request over the established connection represented by $stream.
  • The stream_get_contents function is used to read the remaining data from the connection until the server closes it, in this case the response to the request.
  • The fclose function is used to explicitly terminate the connection.
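
The steps listed above do not check for failures. As a minimal sketch, assuming a hypothetical helper named send_raw_request and an arbitrary five-second default timeout (neither appears in the example above), the same sequence could be wrapped with basic error handling like this:

```php
<?php
/**
 * Hypothetical helper: sends a raw request string over a socket transport
 * and returns the raw response. The name and default timeout are assumptions.
 */
function send_raw_request(string $address, string $request, float $timeout = 5.0): string
{
    // The @ suppresses the warning PHP emits on failure; the error details
    // are reported through $errno and $errstr instead.
    $stream = @stream_socket_client($address, $errno, $errstr, $timeout);
    if ($stream === false) {
        throw new RuntimeException("Could not connect to $address: $errstr ($errno)");
    }

    // Limit how long reads may block so a stalled server cannot hang the script.
    stream_set_timeout($stream, (int) ceil($timeout));

    try {
        if (fwrite($stream, $request) === false) {
            throw new RuntimeException("Could not write request to $address");
        }
        $response = stream_get_contents($stream);
        if ($response === false) {
            throw new RuntimeException("Could not read response from $address");
        }
        return $response;
    } finally {
        // Always release the connection, even when an exception is thrown.
        fclose($stream);
    }
}
```

It would be called as send_raw_request('tcp://localhost.example:80', $request), with the call wrapped in a try/catch block where a failed connection should not end the script.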

Depending on the nature and requirements of the project, not all facets of a request may be known at one time. In this situation, it is desirable to encapsulate request metadata in a data structure such as an associative array or an object. From this, a central unit of logic can be used to read that metadata and construct a request in the form of a string based on it.
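
One way this could look is sketched below. The function name build_request and the array keys (method, path, host, headers, body) are hypothetical choices made for illustration, not part of any PHP API.

```php
<?php
/**
 * Builds a raw HTTP/1.1 request string from an associative array of metadata.
 * The key names used here are assumptions made for this sketch.
 */
function build_request(array $meta): string
{
    $method  = strtoupper($meta['method'] ?? 'GET');
    $path    = $meta['path'] ?? '/';
    $headers = $meta['headers'] ?? [];
    $body    = $meta['body'] ?? '';

    // The Host header is mandatory in HTTP/1.1, so it gets its own key.
    $lines = ["$method $path HTTP/1.1", 'Host: ' . $meta['host']];
    foreach ($headers as $name => $value) {
        $lines[] = "$name: $value";
    }
    if ($body !== '') {
        $lines[] = 'Content-Length: ' . strlen($body);
    }

    // Lines are joined with CRLF; a blank line separates headers from the body.
    return implode("\r\n", $lines) . "\r\n\r\n" . $body;
}

$request = build_request([
    'host'    => 'localhost.example',
    'path'    => '/',
    'headers' => ['Connection' => 'close'],
]);
// $request is now:
// "GET / HTTP/1.1\r\nHost: localhost.example\r\nConnection: close\r\n\r\n"
```

Because the metadata lives in an array, different parts of an application can fill in the method, path, or headers at different times, and the string is only assembled once everything is known.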

Manually constructing requests within a string, as shown in the example above, is also not very readable. If exact requests are known ahead of time and do not vary, an alternative approach is storing them in a data source of some type, then retrieving them at runtime and sending them over the connection as they are. Whether it is possible to take this approach depends on the level of variance in requests going between the web scraping application and the target application.
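
As a rough sketch of the stored-request approach, assume each request is kept in its own text file and has no body (a GET request, for instance). The helper name load_stored_request is hypothetical. Since text editors often save files with bare LF line endings, the sketch normalizes them to CRLF and restores the terminating blank line before the request is sent.

```php
<?php
/**
 * Hypothetical helper: loads a stored, body-less request from a file
 * and returns it in a form that is ready to write to the connection.
 */
function load_stored_request(string $file): string
{
    if (!is_readable($file)) {
        throw new RuntimeException("Could not read stored request: $file");
    }
    $raw = file_get_contents($file);

    // Drop trailing whitespace, then convert every line break to CRLF.
    $normalized = preg_replace('/\r\n|\r|\n/', "\r\n", rtrim($raw));

    // Append the double CRLF that marks the end of the request.
    return $normalized . "\r\n\r\n";
}
```

The returned string can then be passed to fwrite in place of a request built by hand.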

If the need arises to manually build query strings or URL-encoded POST request bodies, the http_build_query function allows this to be done using associative arrays.
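
A short illustration follows. The /search path and the parameter names are made up for the example.

```php
<?php
// Hypothetical parameters for a search request.
$params = ['q' => 'web scraping', 'page' => 2];

// Passing '&' explicitly avoids depending on the arg_separator.output ini setting.
$query = http_build_query($params, '', '&');
// $query is "q=web+scraping&page=2"

// Used as a query string on a GET request line:
$get = "GET /search?$query HTTP/1.1\r\nHost: localhost.example\r\nConnection: close\r\n\r\n";

// Used as a URL-encoded POST body; Content-Length must match the body's byte length.
$post = "POST /search HTTP/1.1\r\n"
      . "Host: localhost.example\r\n"
      . "Content-Type: application/x-www-form-urlencoded\r\n"
      . 'Content-Length: ' . strlen($query) . "\r\n"
      . "Connection: close\r\n\r\n"
      . $query;
```

The function takes care of URL-encoding keys and values, so spaces and reserved characters in the array do not need to be escaped by hand.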


© Rolling Your Own — Web Scraping

Category: Article | Added by: Marsipan (01.09.2014)