Batch Jobs

Web scraping applications intended for long-term use generally function in one of two ways: real-time or batch.

A web scraping application that uses the real-time approach receives a request from its client and then sends a request of its own to the target application in order to fulfill the original one. This has two advantages. First, any data pulled from the target application is current. Second, any data pushed to the target application appears on that site almost as quickly as it would if it had been pushed there directly. The disadvantage is a longer response time for the web scraping application, since the client essentially has to wait for two requests to complete for every one request that would normally be made to the target application.
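The real-time flow can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name `RealTimeScraper`, the `fetch_from_target` callable, and the `target_requests` counter are all hypothetical names introduced here, and the callable stands in for whatever HTTP client would actually contact the target application.

```python
from typing import Callable


class RealTimeScraper:
    """Hypothetical sketch: every client request triggers a request to the target."""

    def __init__(self, fetch_from_target: Callable[[str], str]) -> None:
        # fetch_from_target is an assumed stand-in for a real HTTP call.
        self._fetch = fetch_from_target
        # Counts outbound requests, to show the one-to-one relationship.
        self.target_requests = 0

    def handle(self, path: str) -> str:
        # The client waits here until the target application has responded,
        # which is the source of the added response time.
        self.target_requests += 1
        return self._fetch(path)
```

Because nothing is stored locally, a change on the target application is visible on the very next call to `handle`, at the cost of one outbound request per inbound request.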

The batch approach is based on synchronization. For read operations, data is updated at a regular interval. For write operations, changes are stored locally and then pushed out in batches (hence the name) to the target application, also at a regular interval. The pros and cons of this approach are the complement of those of the real-time approach: updates will not be real-time, but the web scraping application’s response time will not be increased. It is of course possible to use a batch approach with a relatively short interval in order to approximate real-time behavior while keeping the benefits of the batch approach.
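A minimal sketch of the batch flow is shown below, under several assumptions. The names `BatchScraper`, `fetch_all`, `push_batch`, and `run_forever` are hypothetical; the two callables stand in for real requests to the target application. Making a local write visible to local reads right away, and pushing pending writes before refreshing the local copy, are design choices made for this example, not something the text above prescribes.

```python
import time
from typing import Callable, Dict, List, Tuple


class BatchScraper:
    """Hypothetical sketch: serve reads locally, synchronize at an interval."""

    def __init__(
        self,
        fetch_all: Callable[[], Dict[str, str]],
        push_batch: Callable[[List[Tuple[str, str]]], None],
    ) -> None:
        # Both callables are assumed stand-ins for requests to the target.
        self._fetch_all = fetch_all
        self._push_batch = push_batch
        self._cache: Dict[str, str] = {}
        self._pending: List[Tuple[str, str]] = []

    def read(self, key: str):
        # Answered from the local copy, so no request to the target is made.
        return self._cache.get(key)

    def write(self, key: str, value: str) -> None:
        # Stored locally and queued; the target does not see it yet.
        self._cache[key] = value
        self._pending.append((key, value))

    def synchronize(self) -> None:
        # Push queued writes first so the refresh below does not discard them.
        if self._pending:
            self._push_batch(list(self._pending))
            self._pending.clear()
        self._cache = dict(self._fetch_all())


def run_forever(scraper: BatchScraper, interval_seconds: float) -> None:
    # A shorter interval approximates real-time at the cost of more requests.
    while True:
        scraper.synchronize()
        time.sleep(interval_seconds)
```

In practice the periodic call to `synchronize` might be driven by a scheduler such as cron instead of a loop like `run_forever`; either way, client-facing reads and writes never wait on the target application.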

The selection of an approach depends on the requirements of the web scraping application. In general, if real-time updates on either the web scraping application or target application are not required, the batch approach is preferred to maintain a high level of performance.


© Tips and Tricks — Web Scraping

Category: Article | Added by: Marsipan (03.09.2014)