Libraries

At this point, CSS selectors have been covered in enough depth to explain all, or at least a subset, of those that a given library supports. This section reviews some of the available library implementations: where to find them, which features they support, and the advantages and disadvantages of using them.

PHP Simple HTML DOM Parser

The major distinguishing trait of this library is its requirements: PHP 5 and the PCRE extension (which is pretty standard in most PHP distributions). It has no external dependencies on or associations with other libraries or extensions, not even the standard XML extensions in PHP.

The implication of this is that all parsing is handled in PHP itself, so performance is likely to be worse than that of libraries built on a PHP extension. However, in the rare environments where XML extensions (in particular the DOM extension) are not available, this library may be a good option. It offers basic retrieval support using PHP’s filesystem functions (which require the configuration setting allow_url_fopen to be enabled in order to access remote documents).
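The dependence on allow_url_fopen can be checked before attempting retrieval. The following is a minimal sketch using only built-in PHP functions; the helper name fetchDocument is hypothetical and is not part of PHP Simple HTML DOM Parser.

```php
<?php
// Hypothetical helper for illustration; not part of any library named here.
// Reads a local or remote document into a string with PHP's filesystem
// functions, failing early if remote access is disabled.
function fetchDocument($location)
{
    $isRemote = (bool) preg_match('#^https?://#i', $location);

    // Remote reads through the filesystem functions need allow_url_fopen.
    $urlFopen = filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
    if ($isRemote && !$urlFopen) {
        throw new RuntimeException('allow_url_fopen is disabled');
    }

    $contents = @file_get_contents($location);
    if ($contents === false) {
        throw new RuntimeException('Unable to read ' . $location);
    }

    return $contents;
}

// Usage with a local temporary file, so no network access is needed.
$tmp = tempnam(sys_get_temp_dir(), 'doc');
file_put_contents($tmp, '<p>hi</p>');
$body = fetchDocument($tmp);
unlink($tmp);
```

The string returned here could then be handed to whichever parsing library is in use.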

The documentation for this library is fairly good and can be found at http://simplehtmldom.sourceforge.net/manual.htm. Its main web site, which includes a link to download the library, is available at http://simplehtmldom.sourceforge.net. It is licensed under the MIT License.

Zend_Dom_Query

One of the components of Zend Framework, this library was originally created to provide a means for integration testing of applications based on the framework. However, it can function independently and apart from the framework and provides the functionality needed in the analysis phase of web scraping. At the time of this writing, Zend Framework 1.10.1 requires PHP 5.2.4 or higher.

Zend_Dom_Query makes extensive use of the DOM extension. It supports XPath through use of the DOM extension’s DOMXPath class and handles CSS expressions by transforming them into equivalent XPath expressions. Note that only CSS 2 is supported, which excludes non-attribute filters.
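To make the idea of transforming CSS into XPath concrete, here is a rough sketch that uses the DOM extension directly. The XPath expression is a hand-written equivalent of the CSS selector "div.content a"; it is an assumption for illustration and not necessarily the expression Zend_Dom_Query generates. The sample markup and variable names are likewise made up.

```php
<?php
// Sample markup (an assumption for this sketch).
$html = '<html><body>'
      . '<div class="content main"><p><a href="/one">One</a></p></div>'
      . '<div class="sidebar"><a href="/two">Two</a></div>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hand-written XPath equivalent of the CSS selector "div.content a".
// The concat/normalize-space idiom matches "content" as a whole class
// name rather than as a substring of another class.
$expression = "//div[contains(concat(' ', normalize-space(@class), ' '), ' content ')]//a";

$hrefs = array();
foreach ($xpath->query($expression) as $link) {
    $hrefs[] = $link->getAttribute('href');
}
```

With this markup, only the link inside the div carrying the content class is collected; the link in the sidebar div is skipped.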

It’s also worth noting that Zend_Dom_Query offers no retrieval functionality. All methods for introducing documents into it require that those documents be in string form beforehand. If you are already using Zend Framework, a readily available option for retrieval is Zend_Http_Client, which is also discussed in this book.

Documentation for Zend_Dom_Query can be found at http://framework.zend.com/manual/en/zend.dom.query.html. At this time, there is no officially supported method of downloading only the Zend_Dom package. The entire framework can be downloaded from http://framework.zend.com/download/current/ and the directory for the Zend_Dom package can be extracted from it. An unofficial method of downloading individual packages can be found at http://epic.codeutopia.net/pack/. Zend Framework components are licensed under the New BSD License.

phpQuery

phpQuery is heavily influenced by jQuery and stays as close to it as a server-side (rather than client-side) runtime environment allows. It requires PHP 5.2 and the DOM extension as well as the Zend_Http_Client and Zend_Json components from Zend Framework, which are bundled but can be substituted with the same components from a local Zend Framework installation.

CSS support is limited to a subset of CSS3. Most jQuery features are supported, including plugin support, and ports of multiple jQuery plugins are planned. Other components include a CLI utility that makes functionality from the phpQuery library available from the command line and a server component that integrates with jQuery by handling calls made from it on the client side. Retrieval support is included in the form of integration with Zend_Http_Client.

Documentation and download links are available from http://code.google.com/p/phpquery/. It is licensed under the MIT License.

DOMQuery

This library is actually a project of my own. While still in alpha at the time of this writing, it is fairly functional and includes a full unit test suite. Like some of the other libraries mentioned in this chapter, it requires PHP 5 and makes heavy use of the DOM extension.

Unlike the others, however, it does not implement a CSS selector parser in order to offer related functionality. Instead, it does so programmatically through its API. For example, rather than passing the name of an element (say div) to a central query method, an element() method accepts the name of an element for which to query. Though this makes it a bit less concise than other libraries, it also makes it more expressive and requires only a basic knowledge of DOM concepts to use.
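The general idea of querying through method calls rather than selector strings can be sketched with the DOM extension alone. This is not DOMQuery's API: the function name elementIds and the sample markup are assumptions made purely for illustration.

```php
<?php
// Hypothetical helper, NOT part of DOMQuery: looks up elements by name
// through a DOM method call instead of parsing a selector string.
function elementIds(DOMDocument $doc, $name)
{
    $ids = array();
    foreach ($doc->getElementsByTagName($name) as $node) {
        $ids[] = $node->getAttribute('id');
    }
    return $ids;
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML('<html><body><div id="a"></div><p id="x"></p><div id="b"></div></body></html>');
libxml_clear_errors();

// Programmatic counterpart of the CSS selector "div".
$divIds = elementIds($doc, 'div');
```

The call names the element directly, so no selector syntax has to be parsed, at the cost of one call per condition.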

It can be downloaded at http://github.com/elazar/domquery/tree/master. The central class DOMQuery is documented using phpDoc-compatible API docblocks and the unit test suite offers use cases for each of the available methods.

