Topics Covered

You’re obviously reading chapter 1 now, which provides a brief introduction to web scraping, answers common questions, and leads into the meat of the book.

  • Chapter 2 covers the relevant details of HTTP, since HTTP clients are used to retrieve documents. This includes how requests and responses are structured and the headers each uses to implement features such as cookies, HTTP authentication, and redirection.

  • Chapters 3-7 cover specific PHP HTTP client libraries, including the features, usage, advantages, and disadvantages of each.

  • Chapter 8 covers developing a custom client library, along with common concerns when using any library, including prevention of throttling, access randomization, agent scheduling, and side effects of client-side scripts.

  • Chapter 9 details the use of the tidy extension to correct issues with retrieved markup before other extensions are used to analyze it.

  • Chapters 10-12 review various XML extensions for PHP, compare and contrast the two classes of XML parsers, and provide a brief introduction to XPath.

  • Chapter 13 studies CSS selectors, compares them with XPath expressions, and surveys the libraries available for using them to query markup documents.

  • Chapter 14 explores regular expressions using the PCRE extension, which can be useful in validating scraped data to ensure the stability of the web scraping application.

  • Chapter 15 outlines several general high-level strategies and best practices for designing and developing your web scraping applications. 


© Introduction — Web Scraping

Category: Article | Added by: Marsipan (30.08.2014)