Crawlers

Some web scraping applications are intended to serve as crawlers that index content from web sites. Like all other web scraping applications, the work they perform can be divided into two categories: retrieval and analysis. The parallel processing approach is applicable here because each category of work serves to populate the work queue of the other.

The retrieval process is given one or more initial documents to retrieve. Each time a document is retrieved, it becomes a job for the analysis process, which scrapes the markup in search of links (a elements) to other documents; the links that are followed may be restricted by one or more relevancy factors. Once analysis of a document is complete, the addresses of any documents that have not yet been retrieved are fed back to the retrieval process.
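
As a concrete illustration, the sketch below implements that loop with two work queues and two small pools of worker threads, one performing retrieval and one performing analysis, using only the Python standard library. The seed address, the same-host relevancy check, and the worker counts are illustrative assumptions rather than anything prescribed here.

import queue
import threading
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

retrieval_queue = queue.Queue()  # addresses waiting to be fetched
analysis_queue = queue.Queue()   # (url, markup) pairs waiting to be scraped
seen = set()                     # addresses already queued at least once
seen_lock = threading.Lock()

class LinkExtractor(HTMLParser):
    """Collects the href values of all a elements in a document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def is_relevant(url, seed):
    # Illustrative relevancy factor: only follow links on the seed's host.
    return urlparse(url).netloc == urlparse(seed).netloc

def retrieval_worker():
    while True:
        url = retrieval_queue.get()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                markup = response.read().decode("utf-8", errors="replace")
            # Each retrieved document becomes a job for the analysis process.
            analysis_queue.put((url, markup))
        except Exception:
            pass  # unreachable or malformed documents are skipped in this sketch
        finally:
            retrieval_queue.task_done()

def analysis_worker(seed):
    while True:
        url, markup = analysis_queue.get()
        try:
            extractor = LinkExtractor()
            extractor.feed(markup)
            for href in extractor.links:
                address = urljoin(url, href)
                if not is_relevant(address, seed):
                    continue
                with seen_lock:
                    if address in seen:
                        continue
                    seen.add(address)
                # Unretrieved documents feed back into the retrieval process.
                retrieval_queue.put(address)
        except Exception:
            pass  # documents that cannot be parsed are skipped in this sketch
        finally:
            analysis_queue.task_done()

if __name__ == "__main__":
    seed = "https://example.com/"  # hypothetical starting document
    seen.add(seed)
    retrieval_queue.put(seed)
    for _ in range(4):
        threading.Thread(target=retrieval_worker, daemon=True).start()
        threading.Thread(target=analysis_worker, args=(seed,), daemon=True).start()
    # Draining one queue can refill the other, so join both until they stay empty;
    # a production crawler would track outstanding work more carefully than this.
    while True:
        retrieval_queue.join()
        analysis_queue.join()
        if retrieval_queue.empty() and analysis_queue.empty():
            break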

This cycle of mutual supply is sustained, at least in theory, until no documents remain that are both unindexed and considered relevant. At that point, the process can be restarted, with the retrieval process using appropriate request headers to check for document updates and feeding documents to the analysis process wherever updates are found.
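
The update check is typically done with conditional requests: if the Last-Modified or ETag response headers were stored during the previous crawl, they can be sent back as If-Modified-Since and If-None-Match, and a 304 Not Modified response means the document has not changed. The helper below is a minimal sketch of that check; the function name and the stored header values are hypothetical.

import urllib.error
import urllib.request

def check_for_update(url, last_modified=None, etag=None):
    """Return new markup if the document changed, or None if it has not."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified  # e.g. "Wed, 03 Sep 2014 10:00:00 GMT"
    if etag:
        headers["If-None-Match"] = etag
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: the indexed copy is still current
            return None
        raise

# Hypothetical usage: only changed documents are fed back to the analysis process.
# markup = check_for_update("https://example.com/page", last_modified=stored_last_modified)
# if markup is not None:
#     analysis_queue.put(("https://example.com/page", markup))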

