Tree Terminology
Once a document is loaded, the next natural step is to extract the desired data from it. Doing so, however, requires a bit more knowledge about how the DOM is structured. Recall the earlier mention of tree parsers. If you have any computer science background, you will be glad to know that the term “tree” in the context of tree parsers does in fact refer to the data structure of the same name. If not, here is a brief rundown of the related concepts.

A tree is a hierarchical structure (think family tree) composed of nodes, which the DOM extension represents with the DOMNode class. Nodes are to trees what elements are to arrays: just items that exist within the data structure. Each individual node can have zero or more child nodes, which are collectively exposed through the childNodes property of the DOMNode class. childNodes is an instance of the class DOMNodeList, which is exactly what it sounds like: a list of DOMNode instances. Other related properties include firstChild and lastChild. Leaf nodes are nodes that have no children; you can check for this using the hasChildNodes method of DOMNode, which returns false for a leaf node.

All nodes in a tree have a single parent node, with one exception: the root node, from which all other nodes in the tree stem. If two nodes share the same parent, they are appropriately referred to as sibling nodes. This relationship is exposed through the previousSibling and nextSibling properties of DOMNode.

Lastly, the child nodes of a node, the child nodes of those child nodes, and so on are collectively known as descendant nodes. Likewise, the parent node of a node, that parent node’s parent node, and so on are collectively known as ancestor nodes.

An example may help to showcase this terminology.

    <html>
      <body>
        <ul id="thelist">
          <li>Foo</li>
          <li>Bar</li>
        </ul>
      </body>
    </html>
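The terminology above can be tried out directly against the sample markup. What follows is only a minimal sketch, not a definitive approach: it assumes the markup is loaded from a string with DOMDocument::loadHTML, and it writes the markup without whitespace between tags so that no extra whitespace text nodes appear among the children. The helper functions ancestorNames and countDescendants are hypothetical names introduced here for illustration; they are not part of the DOM extension.

```php
<?php
// Sample markup from the example, compacted so that whitespace between
// tags does not produce additional text nodes (an assumption of this sketch).
$html = '<html><body><ul id="thelist"><li>Foo</li><li>Bar</li></ul></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// The root element of the document; every other element descends from it.
$root = $doc->documentElement;

// Locate the list element to use as a reference point.
$ul = $doc->getElementsByTagName('ul')->item(0);

// Child nodes: the two li elements.
$childCount = $ul->childNodes->length;   // 2
$first = $ul->firstChild;                // <li>Foo</li>
$last = $ul->lastChild;                  // <li>Bar</li>

// Sibling nodes: the two li elements share the same parent.
$areSiblings = $first->nextSibling->isSameNode($last)
    && $last->previousSibling->isSameNode($first);

// Leaf node: the text node inside the first li has no children.
$textNode = $first->firstChild;
$isLeaf = !$textNode->hasChildNodes();

// Hypothetical helper: collect the names of all ancestor elements.
function ancestorNames(DOMNode $node): array
{
    $names = [];
    for ($p = $node->parentNode; $p instanceof DOMElement; $p = $p->parentNode) {
        $names[] = $p->nodeName;
    }
    return $names;
}

// Hypothetical helper: count all descendant nodes recursively.
function countDescendants(DOMNode $node): int
{
    $count = 0;
    foreach ($node->childNodes as $child) {
        $count += 1 + countDescendants($child);
    }
    return $count;
}

$ancestors = ancestorNames($first);      // nearest ancestor first
$descendants = countDescendants($ul);    // li elements plus their text nodes

echo $root->nodeName, "\n";
echo $first->textContent, ' ', $last->textContent, "\n";
echo implode(' > ', $ancestors), "\n";
```

In this sketch the ul element is the parent of both li elements, the li elements are siblings of each other, and the text nodes holding Foo and Bar are leaf nodes. Note that the ancestor walk stops at the root element; the DOMDocument object itself sits one level above it.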