Elements and Attributes

Attribute nodes can have values. When the iterator points to an attribute node, the value property will be populated with the node’s value and the hasValue property can be used to check for its presence.

Element nodes can have attributes. When the iterator points to an element node, the hasAttributes property indicates the presence of attributes and the getAttribute() method can be used to obtain an attribute value in the form of a string.
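As a minimal sketch of these properties in isolation, the snippet below reads a single element and collects its attributes in two ways: by name with getAttribute(), and by moving the iterator onto each attribute node and reading its value property. The sample markup and the variable names $href and $attributes are assumptions made for illustration only.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$xml = '<a href="/home" title="Home">Start</a>';

$doc = new XMLReader();
$doc->XML($xml);

$href = null;
$attributes = array();

while ($doc->read()) {
    if ($doc->nodeType == XMLReader::ELEMENT && $doc->hasAttributes) {
        // Direct lookup by name; the value comes back as a string.
        $href = $doc->getAttribute('href');

        // Alternatively, point the iterator at each attribute node in turn.
        // From an element, moveToNextAttribute() moves to its first attribute.
        while ($doc->moveToNextAttribute()) {
            if ($doc->hasValue) {
                $attributes[$doc->name] = $doc->value;
            }
        }

        // Return the iterator to the element that owns the attributes.
        $doc->moveToElement();
    }
}
$doc->close();
```

Looking a value up by name is usually enough when the attribute of interest is known in advance; walking the attribute nodes is more useful when all of them are needed.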

The example below uses both of these together to parse data from an HTML table.

<?php
// $doc is assumed to be an XMLReader instance that already has a document open.
$inTable = false;
$tableData = array();
while ($doc->read()) {
    switch ($doc->nodeType) {
        case XMLReader::ELEMENT:
            if ($doc->localName == 'table'
                && $doc->getAttribute('id') == 'thetable') {
                // Entering the table of interest.
                $inTable = true;
            } elseif ($doc->localName == 'tr' && $inTable) {
                // Start a new row and remember its key.
                $row = count($tableData);
                $tableData[$row] = array();
            } elseif ($doc->localName == 'td' && $inTable) {
                // Append the cell's text content to the current row.
                $tableData[$row][] = $doc->readString();
            }
            break;
        case XMLReader::END_ELEMENT:
            if ($doc->localName == 'table' && $inTable) {
                // Leaving the table: stop collecting data.
                $inTable = false;
            }
            break;
    }
}
?>

This showcases the main difference between pull parsers and tree parsers: the former have no concept of hierarchical context, only of the node to which the iterator is currently pointing. As such, you must create your own indicators of context where they are needed.
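A flag is the simplest such indicator. Where more context is needed, one possible approach (a sketch, not something the example above requires) is to keep a stack of the names of the elements that are currently open. The markup and the variables $path and $cellPath are hypothetical.

```php
<?php
// Hypothetical markup, used only for illustration.
$xml = '<html><body><table id="thetable"><tr><td>A</td></tr></table></body></html>';

$doc = new XMLReader();
$doc->XML($xml);

$path = array();   // names of the elements that are currently open
$cellPath = null;  // where the first table cell was found

while ($doc->read()) {
    if ($doc->nodeType == XMLReader::ELEMENT) {
        $path[] = $doc->localName;
        if ($doc->localName == 'td' && $cellPath === null) {
            $cellPath = implode('/', $path);
        }
        // Empty elements such as <br/> produce no END_ELEMENT node,
        // so they have to be popped immediately.
        if ($doc->isEmptyElement) {
            array_pop($path);
        }
    } elseif ($doc->nodeType == XMLReader::END_ELEMENT) {
        array_pop($path);
    }
}
$doc->close();
```

With this markup, $cellPath ends up as 'html/body/table/tr/td', which is the kind of hierarchical information a tree parser would provide without extra bookkeeping.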

In this example, the node type is checked as nodes are read, and any node that is neither an opening nor a closing element is ignored. When an opening element is encountered, its name ($doc->localName) is checked to confirm that it is a table, and its id attribute ($doc->getAttribute('id')) is checked to confirm that it has the value 'thetable'. If both checks pass, the flag variable $inTable is set to true. The later if branches use this flag to tell whether the iterator points to a node within the desired table.

The next if branch is entered when a table row element within the table is encountered. It checks both the node name and the previously set $inTable flag. When the branch is entered, a new element in the $tableData array is initialized to an empty array. This array will later store data from cells in that row. The key associated with the row in $tableData is stored in the $row variable.

Finally, the last if branch executes when a table cell element is encountered. Like the row branch, this branch checks the node name and the $inTable flag. If the check passes, it appends the cell's text content, obtained with readString(), to the array associated with the current table row.

Here's where the XMLReader::END_ELEMENT node type comes into play. Once the end of the table is reached, no further data should be read into the array. So, if the ending element has the name 'table' and the $inTable flag currently indicates that the iterator points to a node within the desired table, the flag is set to false. Since id values are supposed to be unique within a document, no other table should match, and no if branches will execute in subsequent while loop iterations.

If this table were the only one of interest in the document, it would be prudent to replace the $inTable = false; statement with a break 2; statement. Because PHP counts the enclosing switch as one level for break, break 2; exits both the switch and the while loop used to read nodes from the document as soon as the end of the table is encountered, preventing any further unnecessary read operations.
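The early exit might look like the following sketch. The two-table markup and the $cells variable are assumptions made for illustration; the second table is present only to show that it is never reached.

```php
<?php
// Hypothetical markup with a second table that should never be read.
$xml = '<div>'
     . '<table id="thetable"><tr><td>1</td></tr></table>'
     . '<table id="other"><tr><td>2</td></tr></table>'
     . '</div>';

$doc = new XMLReader();
$doc->XML($xml);

$inTable = false;
$cells = array();

while ($doc->read()) {
    switch ($doc->nodeType) {
        case XMLReader::ELEMENT:
            if ($doc->localName == 'table'
                && $doc->getAttribute('id') == 'thetable') {
                $inTable = true;
            } elseif ($doc->localName == 'td' && $inTable) {
                $cells[] = $doc->readString();
            }
            break;
        case XMLReader::END_ELEMENT:
            if ($doc->localName == 'table' && $inTable) {
                // Leave the switch and the while loop in one step.
                break 2;
            }
            break;
    }
}
$doc->close();
```

Note that $inTable is still true after the loop ends in this variant, so it should not be relied on afterwards.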

readString() Availability

As its entry in the PHP manual notes, the readString() method used in the above example is only present when the XMLReader extension is compiled against certain versions of the underlying libxml library.

If this method is unavailable in your environment, the example can be adapted to work without it. Add checks for opening and closing table cell elements that toggle a flag of their own ($inCell, for example). Then add switch cases for the TEXT and CDATA node types that check this flag and, when it is true, add the contents of the XMLReader instance's value property to the $tableData array.
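One way this alternative could be sketched is shown below. The sample markup is hypothetical. The choice to concatenate multiple text nodes within one cell is also an assumption, made so that a cell holding both text and CDATA still yields a single string.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$xml = '<table id="thetable">'
     . '<tr><td>a</td><td><![CDATA[b & c]]></td></tr>'
     . '<tr><td>d</td><td>e</td></tr>'
     . '</table>';

$doc = new XMLReader();
$doc->XML($xml);

$inTable = false;
$inCell = false;
$tableData = array();

while ($doc->read()) {
    switch ($doc->nodeType) {
        case XMLReader::ELEMENT:
            if ($doc->localName == 'table'
                && $doc->getAttribute('id') == 'thetable') {
                $inTable = true;
            } elseif ($doc->localName == 'tr' && $inTable) {
                $row = count($tableData);
                $tableData[$row] = array();
            } elseif ($doc->localName == 'td' && $inTable) {
                // Start the cell as an empty string; text is appended below.
                $tableData[$row][] = '';
                // An empty <td/> has no closing node, so leave the flag off.
                $inCell = !$doc->isEmptyElement;
            }
            break;
        case XMLReader::TEXT:
        case XMLReader::CDATA:
            if ($inCell) {
                // Append to the most recently added cell in the current row.
                $cell = count($tableData[$row]) - 1;
                $tableData[$row][$cell] .= $doc->value;
            }
            break;
        case XMLReader::END_ELEMENT:
            if ($doc->localName == 'td') {
                $inCell = false;
            } elseif ($doc->localName == 'table' && $inTable) {
                $inTable = false;
            }
            break;
    }
}
$doc->close();
```

For this input, $tableData should hold two rows, the first containing 'a' and 'b & c' and the second containing 'd' and 'e'.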

 

