Legality of Web Scraping

The legality of web scraping is a complicated question, mainly because of copyright and other intellectual property laws. Unfortunately, there is no easy, cut-and-dried answer, particularly because these laws vary between countries. There are, however, a few common points to examine when reviewing a prospective web scraping target.

First, web sites often have documents known as Terms of Service (TOS), Terms or Conditions of Use, or User Agreements (hereafter referred to simply as TOS documents). These are generally linked from an out-of-the-way location such as the site footer or a Legal Documents or Help section. Documents of this type are more common on larger and better-known web sites. Below are segments of several such documents from web sites that explicitly prohibit web scraping of their content.

  • “You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers)...” - Google Terms of Service, section 5.3 as of 2/14/10
  • “You will not collect users’ content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our permission.” - Facebook Statement of Rights and Responsibilities, Safety section as of 2/14/10
  • “Amazon grants you a limited license to access and make personal use of this site ... This license does not include ... any use of data mining, robots, or similar data gathering and extraction tools.” - Amazon Conditions of Use, LICENSE AND SITE ACCESS section as of 2/14/10
  • “You agree that you will not use any robot, spider, scraper or other automated means to access the Sites for any purpose without our express written permission.” - eBay User Agreement, Access and Interference section as of 2/14/10
  • “... you agree not to: ... access, monitor or copy any content or information of this Website using any robot, spider, scraper or other automated means or any manual process for any purpose without our express written permission; ...” - Expedia, Inc. Web Site Terms, Conditions, and Notices, PROHIBITED ACTIVITIES section as of 2/14/10
  • “The foregoing licenses do not include any rights to: ... use any robot, spider, data miner, scraper or other automated means to access the Barnes & Noble.com Site or its systems, the Content or any portion or derivative thereof for any purpose; ...” - Barnes & Noble Terms of Use, Section I LICENSES AND RESTRICTIONS as of 2/14/10

Determining whether the web site in question has a TOS document is the first step. If you find one, look for clauses using language similar to that of the above examples. Also look for any broad “blanket” clauses of prohibited activities under which web scraping may fall.
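As a rough aid to that review, the following Python sketch flags sentences in a TOS document that use wording similar to the clauses quoted above. The keyword list and the function name find_scraping_clauses are assumptions made for illustration, and the sentence splitting is deliberately naive. It narrows down what to read; it does not replace reading the document itself.

```python
import re

# Hypothetical keyword list drawn from the clauses quoted above; extend as needed.
SCRAPING_TERMS = [
    "automated means", "robot", "spider", "scraper", "crawler",
    "data mining", "data miner", "harvesting",
]

def find_scraping_clauses(tos_text, terms=SCRAPING_TERMS):
    """Return the sentences of tos_text that mention any of the given terms."""
    # Naive sentence split: break after ., ;, ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.;!?])\s+", tos_text)
    # Case-insensitive substring match, so "robot" also matches "robots".
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return [s.strip() for s in sentences if pattern.search(s)]

if __name__ == "__main__":
    sample = (
        "You may browse the site. "
        "You will not use any robot, spider, or scraper to access the Sites. "
        "Contact us with questions."
    )
    for clause in find_scraping_clauses(sample):
        print(clause)
```

An empty result only means that none of the listed terms appear. A broad blanket clause may still cover web scraping without using any of them.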

If you find a TOS document and it does not expressly forbid web scraping, the next step is to contact representatives who have authority to speak on behalf of the organization that owns the web site. Some organizations may allow web scraping provided that you secure permission from the appropriate authorities beforehand. When obtaining this permission, it is best to get it in writing, on official letterhead that clearly indicates it originated from the organization in question. This has the greatest chance of mitigating any legal issues that may arise.

Suppose intellectual property-related allegations are brought against an individual over the use of an automated agent, or of information acquired by one. Assuming the individual did not violate any TOS agreement imposed by the web site's owner or any related computer use laws, a court decision will likely boil down to whether the use of that information is interpreted as “fair use” under the copyright laws of the jurisdiction in which the alleged offense took place.

Please note that these statements are very general and are not intended to replace consultation with an attorney. If the TOS agreement (or lack thereof) and communications with the web site owner prove inconclusive, it is highly advisable to seek legal counsel before attempting to launch an automated agent on a web site. This is another reason why web scraping is a less-than-ideal approach to solving the problem of data acquisition and why it should be considered only in the absence of alternatives.

Some sites use license agreements to grant open or mildly restricted usage rights for their content. Common licenses to this end include the GNU Free Documentation License and the Creative Commons licenses. When it does not matter which particular source the data comes from, sources that use licenses like these should be preferred over those that do not, as legal issues are significantly less likely to arise.
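Some pages declare such a license in their markup with a rel="license" attribute on a link, a convention that Creative Commons license markup commonly follows. The sketch below, which uses only the Python standard library, collects those links from a page's HTML. The names LicenseLinkParser and find_license_links are illustrative, and it is an assumption that a given site marks its license this way. Many sites state their license only in prose, so an empty result proves nothing.

```python
from html.parser import HTMLParser

class LicenseLinkParser(HTMLParser):
    """Collect href values of <a> and <link> tags marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "license" in rel_values and attr_map.get("href"):
            self.license_urls.append(attr_map["href"])

def find_license_links(html):
    """Return the license URLs declared in the given HTML string."""
    parser = LicenseLinkParser()
    parser.feed(html)
    return parser.license_urls

if __name__ == "__main__":
    page = (
        '<a rel="license" href="http://creativecommons.org/licenses/by/4.0/">'
        'CC BY</a> <a href="/about">About</a>'
    )
    print(find_license_links(page))
```

Any URL found this way is only a pointer to the license text, which still has to be read to see what uses it actually permits.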

The second point of inspection is the legitimacy of the web site as the originating source of the data to be harvested. Even large companies with substantial legal resources, such as Google, have run into issues when their automated agents acquired content from sites illegally syndicating other sites. In some cases, sites will attribute their sources, but in many cases they will not.

For textual content, entering direct quotations from the site that are likely to be unique into major search engines is one method that can help to determine whether the site in question originated the data. It may also provide some indication as to whether or not syndicating that data is legal.
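A minimal Python sketch of that quotation check follows. It picks the longest sentences from a passage as the quotations most likely to be unique and wraps each in double quotes to form an exact-phrase query. The base URL https://search.example/search and the q parameter are placeholders rather than any real search engine's interface, and both function names are hypothetical. The resulting queries are meant to be pasted into a search engine by hand.

```python
import re
from urllib.parse import urlencode

def distinctive_sentences(text, count=3, min_words=8):
    """Pick up to `count` long sentences, which are less likely to match by chance."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    candidates = [s for s in sentences if len(s.split()) >= min_words]
    return sorted(candidates, key=len, reverse=True)[:count]

def exact_phrase_query_url(sentence, base_url="https://search.example/search"):
    """Build a query URL that searches for the sentence as an exact phrase."""
    # Drop trailing punctuation and inner double quotes so the phrase stays intact.
    phrase = sentence.strip().rstrip(".!?").replace('"', "")
    # base_url and the "q" parameter are placeholders, not a real engine's API.
    return base_url + "?" + urlencode({"q": '"%s"' % phrase})

if __name__ == "__main__":
    passage = (
        "Short one. "
        "This sentence has quite a few more words than the other. "
        "Tiny."
    )
    for sentence in distinctive_sentences(passage):
        print(exact_phrase_query_url(sentence))
```

If the same quotation turns up on several other sites, the earliest or best-attributed copy is a reasonable candidate for the original source.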

For non-textual data, make educated guesses as to keywords that correspond to the subject and try using a search engine specific to that particular data format. Searches like this are not intended to be extensive or definitive indications, but merely a quick way of ruling out an obvious syndication of an original data source.


© Tips and Tricks — Web Scraping

Category: Article | Added by: Marsipan (03.09.2014)