Timing

Each time a web server receives a request, a separate line of execution in the form of a process or thread is created or reused to deal with that request. Left unchecked, a large request load could consume all of the resources on a server. As such, web servers generally restrict the number of requests they can handle concurrently. A request beyond this limit is blocked until the server completes an existing request for which it has already allocated resources. Requests left unserved too long eventually time out.

A client can overload a server with requests to the point where it consumes the available resource pool and thereby delays or prevents the processing of requests, potentially including the requests the client itself sent most recently. Throttling, meaning deliberately limiting the rate at which the client sends requests, prevents this. Overloading the server is worth avoiding for two reasons: 1) it can be construed as abuse and result in your IP address being banned from accessing the server; 2) it prevents the client from functioning consistently.
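As a minimal sketch of limiting request rate on the client side, the class below enforces a minimum delay between consecutive requests. The name Throttle, the min_interval parameter, and the injectable clock and sleep arguments are assumptions made for illustration; they are not part of any particular library.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between request starts
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # start time of the previous request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() immediately before each request keeps the client from sending requests faster than one per min_interval seconds. A suitable interval depends on the target server and has to be found by observation.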

Most web browsers limit themselves to a handful of concurrent connections per domain name (typically between four and six) when loading a given resource and its dependencies. As such, this is a good starting point for testing the load of the server hosting the target application. When possible, measure the response times of individual requests and compare them with the number of concurrent requests being sent to determine how much load the server can withstand.
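One way to relate response time to concurrency is sketched below. It assumes a caller-supplied fetch function that performs a single request for a URL; fetch, timed_call, mean_response_time, and profile are hypothetical names, and the concurrency levels are only a starting point.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def timed_call(fetch, url):
    """Return how long a single call to fetch(url) takes, in seconds."""
    start = time.perf_counter()
    fetch(url)
    return time.perf_counter() - start


def mean_response_time(fetch, urls, concurrency):
    """Send the requests with at most `concurrency` in flight; return the mean latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = list(pool.map(lambda u: timed_call(fetch, u), urls))
    return mean(durations)


def profile(fetch, urls, levels=(1, 2, 4)):
    """Map each concurrency level to the mean response time observed at that level."""
    return {n: mean_response_time(fetch, urls, n) for n in levels}
```

If the mean response time climbs sharply from one level to the next, the lower level is probably closer to what the server can comfortably sustain.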

Depending on the application, real-time interaction with the target application may not be necessary. If interaction with the target application and the data it handles will be limited to the user base of your web scraping application, it may be possible to retrieve data as necessary, cache it locally, store modifications locally, and push them to the target application in bulk during hours of non-peak usage. To discern what these hours are, observe response times with respect to the time of day at which requests are made and locate the time periods during which response times are consistently lowest.
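A rough sketch of picking out non-peak hours from recorded measurements follows. It assumes response times have already been logged as (hour of day, seconds) pairs and treats the hours with the lowest average response time as a proxy for non-peak usage; quietest_hours and its parameters are hypothetical names.

```python
from collections import defaultdict
from statistics import mean


def quietest_hours(samples, count=3):
    """Return the `count` hours of day with the lowest mean response time.

    `samples` is an iterable of (hour_of_day, response_seconds) pairs.
    """
    by_hour = defaultdict(list)
    for hour, seconds in samples:
        by_hour[hour].append(seconds)
    averages = {hour: mean(values) for hour, values in by_hour.items()}
    # Sort by mean latency, breaking ties by hour for a stable result.
    return sorted(averages, key=lambda h: (averages[h], h))[:count]
```

Measurements gathered over several days give a more reliable picture than a single day, since one unusually slow or fast request can skew an hour's average.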


© Rolling Your Own — Web Scraping

Category: Article | Added by: Marsipan (01.09.2014)