Forms

Some web scraping applications must push data to the target application. This is generally accomplished using HTTP POST requests that simulate the submission of HTML forms. Before such requests can be sent, however, a few things usually have to happen. First, if the web scraping application is intended to be presented to a user, a form at least somewhat similar to the one in the target application must be presented to that user. Next, data that the user submits via that form should be validated to ensure that it is likely to be accepted by the target application.

One technique that can help here is to scrape the markup of the form in the target application and use the scraped data to generate something like a metadata file or PHP source code file that can be dropped directly into the web scraping application project. Its applicability will vary by project depending on requirements and how forms are structured. It can expedite development for target applications that have multiple forms or complex forms for which POST requests must be simulated.
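As a rough illustration of scraping form markup into metadata, the sketch below walks a form with the HTML parser from Python's standard library and records each named control. The sample markup, the `FormFieldParser` class, the `extract_fields` helper, and the shape of the metadata dictionaries are all assumptions made for this example. A real project would fetch the markup from the target application and might emit a PHP source file instead of JSON.

```python
from html.parser import HTMLParser
import json

# Hypothetical markup standing in for a form scraped from a target application.
SAMPLE = """
<form action="/search" method="post">
  <input type="text" name="q" maxlength="50">
  <input type="hidden" name="subscribe" value="0">
  <input type="checkbox" name="subscribe" value="1">
  <select name="tags" multiple>
    <option value="php">PHP</option>
    <option value="http">HTTP</option>
  </select>
  <textarea name="notes"></textarea>
  <input type="submit" value="Go">
</form>
"""


class FormFieldParser(HTMLParser):
    """Collects named form controls into a list of metadata dicts."""

    CONTROLS = {"input", "select", "textarea", "button"}

    def __init__(self):
        super().__init__()
        self.fields = []
        self._select = None  # metadata of the select element currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.CONTROLS and "name" in attrs:
            field = {"tag": tag, "name": attrs["name"]}
            if tag == "input":
                field["type"] = attrs.get("type", "text")
                maxlength = attrs.get("maxlength") or ""
                if maxlength.isdigit():
                    field["maxlength"] = int(maxlength)
            if tag == "select":
                # A boolean attribute such as "multiple" appears as a key
                # whose value is None.
                field["multiple"] = "multiple" in attrs
                field["options"] = []
                self._select = field
            self.fields.append(field)
        elif tag == "option" and self._select is not None and "value" in attrs:
            # Options without a value attribute are skipped for brevity.
            self._select["options"].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def extract_fields(html):
    """Return metadata for every control that has a name attribute."""
    parser = FormFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    # The unnamed submit button is left out because it has no name attribute.
    print(json.dumps(extract_fields(SAMPLE), indent=2))
```

The resulting list could be written to a file and later used both to render a similar form and to validate what users submit through it.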

For the purposes of formulating a POST request, you will want to query for elements with the names input, select, textarea, or possibly button that have a name attribute. Beyond that, here are a few element-specific considerations to take into account.

  • input elements with a type attribute value of checkbox or radio that are not checked when the form on the web scraping application side is submitted should not have their value attribute values included when the POST request is eventually made to the target application. A common practice that works around this is placing another input element with a type attribute value of hidden and the same name as the checkbox element before that element in the document, so that the value of the hidden element is submitted if the checkbox is not checked.
  • select elements may be capable of having multiple values depending on whether or not the multiple attribute is set. How this is expressed in the POST request can depend on the platform on which the target application is running. The best way to determine this is to submit the form on the target application via a client that can show you the underlying POST request being made.
  • input elements that have a maxlength attribute are restricted to values of that length or less. Likewise, select elements are restricted to values in the value attributes of their contained option child elements. Both should be considered when validating user-submitted data.
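The considerations above can be sketched as two small functions: one that validates user input against the scraped constraints and one that builds the name/value pairs for the POST body. The `FIELDS` metadata, the function names, and the field names are hypothetical. The sketch also assumes the target application expects a multi-value select as a repeated field name, which, as noted above, varies by platform and should be confirmed against a real request.

```python
from urllib.parse import urlencode

# Hypothetical metadata for a target form, listed in document order.
FIELDS = [
    {"name": "q", "kind": "text", "maxlength": 10},
    {"name": "subscribe", "kind": "hidden", "value": "0"},
    {"name": "subscribe", "kind": "checkbox", "value": "1"},
    {"name": "tags", "kind": "select", "multiple": True,
     "options": ["php", "http", "html"]},
]


def as_list(value):
    """Normalize a single value or a list of values to a list."""
    return value if isinstance(value, list) else [value]


def validate(fields, submitted):
    """Return a list of error messages; an empty list means no problems found."""
    errors = []
    for field in fields:
        value = submitted.get(field["name"])
        if value is None:
            continue
        if "maxlength" in field and len(value) > field["maxlength"]:
            errors.append(f"{field['name']}: longer than {field['maxlength']}")
        if field["kind"] == "select":
            chosen = as_list(value)
            if len(chosen) > 1 and not field.get("multiple"):
                errors.append(f"{field['name']}: only one value allowed")
            for item in chosen:
                if item not in field["options"]:
                    errors.append(f"{field['name']}: {item!r} is not an option")
    return errors


def build_pairs(fields, submitted):
    """Return (name, value) pairs for the POST body, in document order."""
    pairs = []
    for field in fields:
        name, kind = field["name"], field["kind"]
        if kind == "hidden":
            pairs.append((name, field["value"]))
        elif kind in ("checkbox", "radio"):
            # Unchecked controls contribute nothing to the request.
            if submitted.get(name) == field["value"]:
                pairs.append((name, field["value"]))
        elif kind == "select":
            # Assumption: multiple values are sent as a repeated field name.
            for item in as_list(submitted.get(name, [])):
                pairs.append((name, item))
        else:
            pairs.append((name, submitted.get(name, "")))
    return pairs


if __name__ == "__main__":
    checked = {"q": "php", "subscribe": "1", "tags": ["php", "http"]}
    print(validate(FIELDS, checked))  # []
    print(urlencode(build_pairs(FIELDS, checked)))
    # q=php&subscribe=0&subscribe=1&tags=php&tags=http
    print(urlencode(build_pairs(FIELDS, {"q": "php"})))
    # q=php&subscribe=0
```

When the checkbox is checked, both the hidden value and the checkbox value are sent, and a platform that keeps the last value for a repeated name sees the checkbox value. When it is unchecked, only the hidden value is sent.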

© Tips and Tricks — Web Scraping

Category: Article | Added by: Marsipan (03.09.2014)