Web Scraping Defined

Web scraping is the process of retrieving a semi-structured document from the internet, generally a web page written in a markup language such as HTML or XHTML, and analyzing that document in order to extract specific data from it for use in another context. It is commonly (though not entirely accurately) also known as screen scraping. Web scraping does not technically fall within the field of data mining, because data mining implies an attempt to discern semantic patterns or trends in large data sets that have already been obtained. Web scraping applications (also called intelligent, automated, or autonomous agents) are concerned only with obtaining the data itself through retrieval and extraction, and the data sets involved can vary significantly in size.
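The two steps named above, retrieval and extraction, can be sketched in a few lines. This is a minimal illustration rather than a recommended implementation: it assumes Python's standard library, and the names LinkExtractor, retrieve, extract_links, and SAMPLE are hypothetical, chosen only for this example. The extraction step is run against an inline string so that nothing here touches the network.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href and the visible text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []     # (href, text) pairs found so far
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def retrieve(url):
    """Retrieval step: download the document as text (makes a network request)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Extraction step: pull (href, text) pairs out of the markup."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page; note the unclosed <p>, which the parser tolerates.
SAMPLE = '<p>See <a href="/docs">the docs</a> and <a href="/faq">FAQ</a>'
print(extract_links(SAMPLE))  # [('/docs', 'the docs'), ('/faq', 'FAQ')]
```

In a real application the string passed to extract_links would come from retrieve, and the extracted pairs would then be handed to whatever other context needs them.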

You might be saying to yourself that web scraping sounds a lot like acting as a client for a web service. The difference lies in the intended audience of the document and, by extension, in the document’s format and structure. Web services are meant to be consumed by machines, so they are inherently bound by the requirement to generate valid markup in order to remain useful. They must maintain consistent, standard formatting for machines to be capable of parsing their output.
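Why valid markup matters to a machine client can be shown with a small experiment. The sketch below assumes Python's standard library; the sample strings and the function names strict_items and lenient_text are hypothetical. A strict XML parser, of the kind a web service client might rely on, rejects a document with missing closing tags, while a tolerant HTML parser still recovers the text.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VALID = "<items><item>tea</item><item>coffee</item></items>"
SLOPPY = "<items><item>tea<item>coffee</items>"  # closing </item> tags missing


def strict_items(markup):
    """Parse with a strict XML parser; return None if the markup is not well formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return [item.text for item in root.iter("item")]


class TextCollector(HTMLParser):
    """Tolerant parser that simply gathers non-empty text nodes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def lenient_text(markup):
    """Parse with a forgiving HTML parser and return the text it found."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


print(strict_items(VALID))
print(strict_items(SLOPPY))
print(lenient_text(SLOPPY))
```

The strict parser gives up on the sloppy input entirely, which is the failure a web service avoids by emitting valid markup. The lenient parser returns the text but knows nothing about the intended structure, so the scraping application has to reconstruct it.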

Web browsers, on the other hand, are generally a lot more forgiving about rendering a document when its markup is not valid. Web browsers are also intended for human use, and the ways in which they consume information do not always parallel the way a machine would consume it when using an equivalent web service. This can make development of web scraping applications difficult in some instances. Like the obligation of a web service to generate valid markup, a web browser has certain responsibilities. These include respecting server requests to not index certain pages and keeping the number of requests sent to servers at a reasonable level.
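The two responsibilities just mentioned can be sketched as follows. This is an illustration under stated assumptions: it assumes the server expresses its wishes through a robots.txt file (one common mechanism, though not the only one), it uses urllib.robotparser from Python's standard library, and the robots.txt content, the user agent name ExampleBot, and the helpers make_rules and Throttle are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_rules(robots_txt):
    """Build a rule set from the text of a robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class Throttle:
    """Enforce a minimum delay, in seconds, between consecutive requests."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if there was one

    def wait(self):
        """Sleep just long enough to honor the minimum interval."""
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


rules = make_rules(ROBOTS_TXT)
throttle = Throttle(min_interval=0.1)

for url in ("https://example.com/index.html",
            "https://example.com/private/report.html"):
    if not rules.can_fetch("ExampleBot", url):
        print("skipping (disallowed):", url)
        continue
    throttle.wait()  # pause before each permitted request
    print("would fetch:", url)
```

What counts as a reasonable interval depends on the server; the 0.1 seconds used here only keeps the demonstration quick and is not a recommendation.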

In short, web scraping is the subset of a web browser’s functionality necessary to obtain and render data in a manner conducive to how that data will be used.


© Introduction — Web Scraping

Category: Article | Added by: Marsipan (30.08.2014)