Tree Terminology

Once a document is loaded, the next natural step is to extract the desired data from it. Doing so, however, requires a bit more knowledge about how the DOM is structured. Recall the earlier mention of tree parsers. If you have any computer science background, you will be glad to know that the term “tree” in the context of tree parsers does in fact refer to the data structure of the same name. If not, here is a brief rundown of the related concepts.

A tree is a hierarchical structure (think family tree) composed of nodes, which exist in the DOM extension as the DOMNode class. Nodes are to trees what elements are to arrays: just items that exist within the data structure.

Each individual node can have zero or more child nodes, which are collectively exposed through the childNodes property of the DOMNode class. childNodes is an instance of the class DOMNodeList, an ordered list of DOMNode objects. Related properties include firstChild and lastChild. Leaf nodes are nodes that have no children; the hasChildNodes method of DOMNode returns false for them.
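As a minimal sketch of these properties, the snippet below builds a small list and inspects its children. The markup and variable names are hypothetical, and loadXML with no whitespace between tags is an assumption made here so that the only child nodes are the ones written out.

```php
<?php
// Hypothetical markup; written without whitespace between tags so the
// parser does not add whitespace-only text nodes to the tree.
$doc = new DOMDocument();
$doc->loadXML('<ul id="thelist"><li>Foo</li><li>Bar</li></ul>');

$ul = $doc->documentElement;

// childNodes is a DOMNodeList; length gives the number of children.
$childCount = $ul->childNodes->length;

// firstChild and lastChild are shortcuts to the ends of that list.
$first = $ul->firstChild->textContent;
$last = $ul->lastChild->textContent;

// An li holds a text node, so it is not a leaf; the text node itself is.
$liIsLeaf = !$ul->firstChild->hasChildNodes();
$textIsLeaf = !$ul->firstChild->firstChild->hasChildNodes();

echo "$childCount children, first: $first, last: $last", PHP_EOL;
```

Note that the text inside each li counts as a node of its own, which is why the li elements are not leaf nodes here.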

All nodes in a tree have a single parent node, available through the parentNode property of DOMNode, with one exception: the root node from which all other nodes in the tree stem. If two nodes share the same parent, they are appropriately referred to as sibling nodes. This relationship is exposed through the previousSibling and nextSibling properties of DOMNode.
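The parent and sibling relationships can be checked with a short sketch like the following. The markup is again hypothetical and loaded with loadXML as an assumption; isSameNode is used to compare nodes rather than relying on object identity.

```php
<?php
// Hypothetical two-item list, written without whitespace between tags.
$doc = new DOMDocument();
$doc->loadXML('<ul><li>Foo</li><li>Bar</li></ul>');

$ul = $doc->documentElement;
$foo = $ul->firstChild;
$bar = $ul->lastChild;

// Both li nodes have the same parent, which makes them siblings.
$sameParent = $foo->parentNode->isSameNode($bar->parentNode);

// Siblings are linked to each other in both directions.
$fooThenBar = $foo->nextSibling->isSameNode($bar);
$barAfterFoo = $bar->previousSibling->isSameNode($foo);

// At either end of the sibling list the property is null.
$fooIsFirst = $foo->previousSibling === null;
$barIsLast = $bar->nextSibling === null;

var_dump($sameParent, $fooThenBar, $barAfterFoo, $fooIsFirst, $barIsLast);
```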

Lastly, child nodes of a node, child nodes of those child nodes, and so on are collectively known as descendant nodes. Likewise, the parent node of a node, that parent node’s parent node, and so on are collectively known as ancestor nodes.
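To make the two definitions concrete, here is a rough sketch that collects ancestors by following parentNode upward and descendants by recursing through childNodes. The helper names ancestorNames and descendantNames are hypothetical and not part of the DOM extension; both deliberately consider element nodes only, so text nodes and the document node itself are left out.

```php
<?php
// Hypothetical helper: names of all element ancestors, nearest first.
function ancestorNames(DOMNode $node): array
{
    $names = [];
    for ($p = $node->parentNode; $p instanceof DOMElement; $p = $p->parentNode) {
        $names[] = $p->nodeName;
    }
    return $names;
}

// Hypothetical helper: names of all element descendants, in document order.
function descendantNames(DOMElement $node): array
{
    $names = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $names[] = $child->nodeName;
            $names = array_merge($names, descendantNames($child));
        }
    }
    return $names;
}

// Hypothetical markup, loaded with loadXML and no whitespace between tags.
$doc = new DOMDocument();
$doc->loadXML('<html><body><ul><li>Foo</li><li>Bar</li></ul></body></html>');

$li = $doc->getElementsByTagName('li')->item(0);
$body = $doc->getElementsByTagName('body')->item(0);

$ancestors = ancestorNames($li);       // ul, body, html
$descendants = descendantNames($body); // ul, li, li

echo implode(', ', $ancestors), PHP_EOL;
echo implode(', ', $descendants), PHP_EOL;
```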

An example may help to showcase this terminology.

<html>
<body>
<ul id="thelist">
<li>Foo</li>
<li>Bar</li>
</ul>
</body>
</html>
  • html is the root node.
  • body is the first (and only) child of html.
  • ul is the first (and only) child of body.
  • The li nodes containing Foo and Bar are the first and last child nodes of ul respectively.
  • The li node containing Bar is the next sibling of the li node containing Foo.
  • The li node containing Foo is likewise the previous sibling of the li node containing Bar.
  • The ul and li nodes are descendants of the body node. 
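
The relationships listed above can be verified with a short sketch such as the one below. It assumes the example markup is loaded with loadXML and written without line breaks between tags; with the line breaks kept, the parser may add whitespace-only text nodes, and firstChild or nextSibling could then return one of those instead of an element. The $checks array and its labels are purely illustrative.

```php
<?php
// The example document, compacted so every child node is an element
// (apart from the text inside each li).
$doc = new DOMDocument();
$doc->loadXML(
    '<html><body><ul id="thelist"><li>Foo</li><li>Bar</li></ul></body></html>'
);

$html = $doc->documentElement;
$body = $html->firstChild;
$ul = $body->firstChild;
$foo = $ul->firstChild;
$bar = $ul->lastChild;

// Each entry pairs one statement from the list with a check on the tree.
$checks = [
    'html is the root element' => $html->nodeName === 'html',
    'body is the only child of html' => $body->nodeName === 'body'
        && $html->childNodes->length === 1,
    'ul is the only child of body' => $ul->nodeName === 'ul'
        && $body->childNodes->length === 1,
    'Foo and Bar are the first and last children of ul' =>
        $foo->textContent === 'Foo' && $bar->textContent === 'Bar',
    'Bar is the next sibling of Foo' => $foo->nextSibling->isSameNode($bar),
    'Foo is the previous sibling of Bar' => $bar->previousSibling->isSameNode($foo),
    'ul is a descendant of body' => $ul->parentNode->isSameNode($body),
];

foreach ($checks as $label => $ok) {
    echo ($ok ? 'ok   ' : 'FAIL ') . $label . PHP_EOL;
}
```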

© DOM Extension — Web Scraping

Category: Article | Added by: Marsipan (01.09.2014)