Debugging

As good a job as it does, tidy may not always be able to clean documents. When using tidy to repair a document, it’s generally a good idea to check which issues it encountered.

There are two types of issues to check for when using tidy for web scraping analysis: warnings and errors. Like their PHP counterparts, warnings are non-fatal and generally have some sort of automated response that tidy executes to handle them. Errors are not necessarily fatal, but do indicate that tidy may have no way to handle a particular issue.

All issues are stored in an error buffer regardless of their type. Accessing information in and about this buffer is one area in which the procedural and object-oriented APIs for the tidy extension differ.

<?php
// Procedural
$issues = tidy_get_error_buffer($tidy);
// Object-oriented
$issues = $tidy->errorBuffer;
?>

Note that errorBuffer is a property of the $tidy object, not a method. Also note that the procedural function name (tidy_get_error_buffer) and the property name (errorBuffer) do not correspond as closely as names do in most other areas of the two APIs.
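The snippet above assumes that $tidy already exists. As a minimal sketch of the full round trip, the following parses a small, made-up HTML fragment through both APIs and returns the two error buffers. The function name tidy_issue_buffers, the sample markup, and the utf8 encoding are illustrative assumptions, and the function returns null when the tidy extension is not loaded.

```php
<?php
/**
 * Hypothetical helper: fetch the error buffer for $html through both APIs.
 * Returns null when the tidy extension is not available.
 */
function tidy_issue_buffers(string $html): ?array
{
    if (!extension_loaded('tidy')) {
        return null;
    }

    // Procedural: parse the string, then ask for the buffer.
    $tidy = tidy_parse_string($html, [], 'utf8');
    $procedural = (string) tidy_get_error_buffer($tidy);

    // Object-oriented: parse the string, then read the property.
    $tidy = new tidy();
    $tidy->parseString($html, [], 'utf8');
    $objectOriented = (string) $tidy->errorBuffer;

    return [
        'procedural'      => $procedural,
        'object_oriented' => $objectOriented,
    ];
}

// Made-up input with an unclosed <b> element.
$buffers = tidy_issue_buffers('<p>Some <b>bold text</p>');
if ($buffers !== null) {
    echo $buffers['object_oriented'], "\n";
}
```

Both buffers are cast to strings so that callers do not have to handle a missing buffer separately from an empty one.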

As a single string, the error buffer is of limited use on its own. The code sample below is derived from a user-contributed comment on the PHP manual page for the tidy_get_error_buffer function. It parses the individual components of each issue into arrays where they are more easily accessible.

<?php
// Split each line of the error buffer into line, column, type, and message.
preg_match_all(
    '/^(?:line (?P<line>\d+) column (?P<column>\d+) - )?' .
    '(?P<type>\S+): (?:\[(?:\d+\.?){4}]:)?(?P<message>.*)?$/m',
    $tidy->errorBuffer, // or tidy_get_error_buffer($tidy)
    $issues,
    PREG_SET_ORDER
);
print_r($issues);
/*
Example output:
Array
(
    [0] => Array
        (
            [0] => line 12 column 1 - Warning: <meta> element not empty or not closed
            [line] => 12
            [1] => 12
            [column] => 1
            [2] => 1
            [type] => Warning
            [3] => Warning
            [message] => <meta> element not empty or not closed
            [4] => <meta> element not empty or not closed
        )
)
*/
?>
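Because preg_match_all returns every named group twice (once by name, once by number), it can be convenient to wrap the parsing in a function that keeps only the named parts. The following is a minimal sketch of that idea using a pattern like the one above, anchored at the start of each line. The function name parse_tidy_issues and the sample buffer are made up for illustration, so the sketch runs without a live tidy object.

```php
<?php
/**
 * Hypothetical helper: turn a tidy error buffer into a list of issues.
 * Each issue has the keys line, column, type, and message.
 */
function parse_tidy_issues(string $buffer): array
{
    $pattern = '/^(?:line (?P<line>\d+) column (?P<column>\d+) - )?'
        . '(?P<type>\S+): (?:\[(?:\d+\.?){4}]:)?(?P<message>.*)?$/m';
    preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);

    $issues = [];
    foreach ($matches as $match) {
        $issues[] = [
            // Line and column are empty for issues not tied to a position.
            'line'    => $match['line'] !== '' ? (int) $match['line'] : null,
            'column'  => $match['column'] !== '' ? (int) $match['column'] : null,
            'type'    => $match['type'],
            'message' => trim($match['message'] ?? ''),
        ];
    }
    return $issues;
}

// Made-up sample buffer, used here instead of a live tidy object.
$buffer = "line 12 column 1 - Warning: <meta> element not empty or not closed\n"
    . "line 3 column 5 - Error: <foo> is not recognized!";

foreach (parse_tidy_issues($buffer) as $issue) {
    printf("%s at line %d: %s\n", $issue['type'], $issue['line'], $issue['message']);
}
```

Casting line and column to integers, and using null where tidy reports no position, makes the result easier to sort or filter than the raw match arrays.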

The tidy extension also provides a way to get the number of warnings and errors encountered without manually parsing the error buffer. Unfortunately, and rather oddly, this is only supported in the procedural API. However, adding it to the object-oriented API by subclassing the tidy class is fairly simple. Examples of both are shown below.

<?php
// Procedural
$warnings = tidy_warning_count($tidy);
$errors = tidy_error_count($tidy);

// Object-oriented: wrap the procedural functions in a subclass
class mytidy extends tidy {
    public function warningCount() {
        return tidy_warning_count($this);
    }

    public function errorCount() {
        return tidy_error_count($this);
    }
}
?>
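One way to put these counts to use is to refuse repaired output whenever tidy reports errors. The sketch below assumes the tidy extension is loaded. The class name CountingTidy (the same idea as the subclass above, under a different name), the function repair_markup, and the policy of throwing on any error are all illustrative choices, not part of the extension.

```php
<?php
// The guard keeps this file loadable when the tidy extension is missing.
if (extension_loaded('tidy')) {
    // Hypothetical subclass exposing the procedural counters as methods.
    class CountingTidy extends tidy
    {
        public function warningCount(): int
        {
            return tidy_warning_count($this);
        }

        public function errorCount(): int
        {
            return tidy_error_count($this);
        }
    }
}

/**
 * Hypothetical helper: repair $html, but give up if tidy reports errors.
 */
function repair_markup(string $html): string
{
    $tidy = new CountingTidy();
    $tidy->parseString($html, [], 'utf8');

    if ($tidy->errorCount() > 0) {
        throw new RuntimeException(
            'tidy reported ' . $tidy->errorCount() . " error(s):\n" . $tidy->errorBuffer
        );
    }

    // Warnings are tolerated; tidy has already decided how to handle them.
    $tidy->cleanRepair();
    return tidy_get_output($tidy);
}
```

Whether warnings should also block processing depends on the scraping task. Here they are deliberately ignored, in line with the distinction between warnings and errors drawn earlier.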

© Tidy Extension — Web Scraping

Category: Article | Added by: Marsipan (01.09.2014)