Testing

As important as unit testing is to quality assurance for any application, it’s all the more important to a web scraping application, because such an application relies on its target remaining unchanged. Queries of markup documents must be checked to assert that they produce predictable results, and data extracted from those documents must be validated to ensure that it is consistent with expectations. In the case of real-time applications, HTTP responses must also be checked to ensure that the target application is accessible and has not changed so drastically that resources are no longer available at their original locations.
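
As a rough illustration of the first two checks, the sketch below runs an XPath query against a small piece of markup and then validates what it extracted. The markup, the query, and the function names extractPrices() and pricesAreValid() are all hypothetical stand-ins; only PHP's bundled DOM extension is assumed.

```php
<?php
// Hypothetical markup standing in for a saved copy of a target page.
$html = '<html><body><ul id="products">'
      . '<li class="product"><span class="price">19.99</span></li>'
      . '<li class="product"><span class="price">5.00</span></li>'
      . '</ul></body></html>';

/**
 * Runs an XPath query against the markup and returns the text of each match.
 */
function extractPrices(string $html): array
{
    // Collect parser warnings internally; real-world markup is rarely clean.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $prices = [];
    foreach ($xpath->query('//li[@class="product"]/span[@class="price"]') as $node) {
        $prices[] = trim($node->textContent);
    }
    return $prices;
}

/**
 * Returns true only if at least one value was found and every value
 * looks like a decimal price with two fractional digits.
 */
function pricesAreValid(array $prices): bool
{
    if (count($prices) === 0) {
        return false; // An empty result usually means the markup has changed.
    }
    foreach ($prices as $price) {
        if (preg_match('/^\d+\.\d{2}$/', $price) !== 1) {
            return false;
        }
    }
    return true;
}

$prices = extractPrices($html);
```

In a real suite, each of these checks would typically become an assertion inside a test method, so that a change in the target's markup shows up as a named test failure rather than as silently bad data.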

During development, it’s advisable to download local copies of the documents to be scraped and include them in the unit test suite as it’s developed. Additionally, the test suite should support two working modes: local and remote. The former performs tests on the local document copies, while the latter downloads the documents from the target site in real time. If any area of the web scraping application stops functioning as expected, contrasting the results of these two modes can be very helpful in determining the cause: a test that passes locally but fails remotely suggests that the target has changed rather than the application.
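
One way the two working modes might be wired up is sketched below. The function name fetchDocument(), the mode strings, and the SCRAPER_TEST_MODE environment variable are assumptions made for illustration, not part of any framework.

```php
<?php
/**
 * Returns the markup for a page, either from a saved local copy or from
 * the live target site. $mode must be 'local' or 'remote'.
 */
function fetchDocument(string $mode, string $localPath, string $remoteUrl): string
{
    if ($mode !== 'local' && $mode !== 'remote') {
        throw new InvalidArgumentException("Unknown mode: $mode");
    }

    $source = ($mode === 'local') ? $localPath : $remoteUrl;

    // file_get_contents() reads local paths and, when allow_url_fopen
    // is enabled, HTTP URLs as well.
    $contents = @file_get_contents($source);
    if ($contents === false) {
        throw new RuntimeException("Unable to read $source");
    }
    return $contents;
}

// Hypothetical switch: default to local copies unless told otherwise.
$mode = getenv('SCRAPER_TEST_MODE') ?: 'local';
```

Because both modes return the same kind of string, the same queries and validations can run against either one, which is what makes their results directly comparable.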

PHPUnit is among the most popular PHP unit testing frameworks; see http://phpunit.de for more information. Among its many features is the option to output test results to a file in XML format. This feature, and similar ones in PHPUnit and other unit testing frameworks, is very useful for producing results that can be ported to another data medium and made accessible to the web scraping application itself. That makes it possible to temporarily restrict or disable functionality in the application when the tests relevant to that functionality fail.
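
A minimal sketch of that idea follows, assuming the results were written as a JUnit-style XML log (the format PHPUnit's --log-junit option produces). The sample log, the test names, and the failedTests() helper are invented for illustration; actual logs contain more attributes and may nest suites more deeply.

```php
<?php
/**
 * Returns "Class::method" names for every test case in a JUnit-style
 * XML log that contains a <failure> or <error> element.
 */
function failedTests(string $xml): array
{
    $log = simplexml_load_string($xml);
    if ($log === false) {
        throw new RuntimeException('Test log is not valid XML');
    }

    $failed = [];
    // The // axis finds test cases at any nesting depth.
    foreach ($log->xpath('//testcase[failure or error]') as $case) {
        $failed[] = (string) $case['class'] . '::' . (string) $case['name'];
    }
    return $failed;
}

// Hypothetical log contents; normally these would be read from the log file.
$xml = <<<XML
<testsuites>
  <testsuite name="ScraperTest" tests="2" failures="1" errors="0">
    <testcase name="testLoginFormExists" class="ScraperTest"/>
    <testcase name="testPriceQuery" class="ScraperTest">
      <failure type="AssertionFailed">Expected 2 prices, found 0.</failure>
    </testcase>
  </testsuite>
</testsuites>
XML;

$failed = failedTests($xml);

// The application can consult the list before using a feature.
$priceFeatureEnabled = !in_array('ScraperTest::testPriceQuery', $failed, true);
```

How test names map to application features is a design decision left to the application; a simple lookup table from test name to feature flag is one possibility.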

The bottom line is this: debugging a web scraping application is like trying to kill a moving housefly with a pea shooter. It’s important to make locating the cause of an issue as easy as possible, so as to minimize the turnaround time required to update the application to accommodate it. Test failures should alert developers and lock down any sensitive application areas to prevent erroneous transmission, corruption, or deletion of data.


© Tips and Tricks — Web Scraping
Category: Article | Added by: Marsipan (03.09.2014)