Types of Parsers

Before going much further, you should be aware that there are two types of XML parsers: tree parsers and pull parsers. Tree parsers load the entire document into memory and allow you to access any part of it at any time as well as manipulate it. Pull parsers read the document a piece at a time and limit you to working with the current piece being read.
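To make the distinction concrete, here is a minimal sketch that reads the same document both ways. It assumes PHP's DOMDocument class as the tree parser and the XMLReader class as one example of a pull parser; the catalog XML and the variable names are hypothetical and exist only for illustration.

```php
<?php
// Hypothetical sample document, used only for illustration.
$xml = '<catalog><book id="1"><title>First</title></book>'
     . '<book id="2"><title>Second</title></book></catalog>';

// Tree parser: the whole document is loaded into memory first,
// so any node can be reached in any order.
$doc = new DOMDocument();
$doc->loadXML($xml);
$books = $doc->getElementsByTagName('book');
$lastTitle = $books->item(1)->getElementsByTagName('title')->item(0)->textContent;
$firstId = $books->item(0)->getAttribute('id'); // going back to an earlier node is fine

// Pull parser: the cursor moves forward one node at a time,
// and only the current node is available.
$reader = new XMLReader();
$reader->XML($xml);
$titles = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        $titles[] = $reader->readString(); // text content of the current element
    }
}
$reader->close();
```

With the tree parser, the second book is read before the first without any extra work. With the pull parser, anything needed later has to be saved as it passes by, which is what the $titles array does here.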

The two types of parsers share a relationship similar to that between the file_get_contents and fgets functions: the former lets you work with the entire document at once and uses as much memory as is needed to store it, while the latter lets you work with one piece of the document at a time and uses less memory in the process.
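The analogy can be sketched with the two functions themselves. The temporary file and its three lines are hypothetical, created only so that the example is self-contained.

```php
<?php
// Hypothetical sample file, created only so the sketch can run on its own.
$path = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($path, "first line\nsecond line\nthird line\n");

// Whole-file approach: the entire contents are held in a single string.
$whole = file_get_contents($path);
$wholeLineCount = substr_count($whole, "\n");

// Piecewise approach: only the current line is held in memory at any one time.
$handle = fopen($path, 'r');
$streamedLineCount = 0;
while (($line = fgets($handle)) !== false) {
    $streamedLineCount++;
}
fclose($handle);

// Remove the temporary file.
unlink($path);
```

Both approaches count the same lines, but the memory held by $whole grows with the file size, while $line never holds more than one line.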

When working with fairly large documents, lower memory usage is generally preferable. Attempting to load a huge document into memory all at once has the same effect on the local system as a client that does not throttle its requests has on a web server: in both cases, resources are consumed and performance degrades until the system eventually locks up or crashes under the stress.

The DOM extension is a tree parser. In general, web scraping does not require the ability to access all parts of the document simultaneously. However, the type of data extraction involved in web scraping can be rather laborious to implement using a pull parser. The appropriateness of one type of extension over the other depends on the size and complexity of the document.


© DOM Extension — Web Scraping

Category: Article | Added by: Marsipan (01.09.2014)