Web Scraping with Bots: The Good and Bad
Web scraping uses bots to collect large amounts of data from websites. Web-scraping bots began as helpful tools, but they have also been misused in some ways.
October 2, 2018
Every time you ask Siri a question or research statistics on Google, you are directing bots to search the internet for information relating to a query. You expect instant search results, and bots make it all possible. However, have you ever considered what’s powering those requests or how they work? The answer is a technology known as web scraping. It’s a mostly straightforward process, but there are also nuances to it, which we explore here.
What is web scraping?
Web scraping is the practice of using programmed software (bots) to collect large amounts of information from websites. Bots operate continuously, combing through the HTML source code of sites. They are scripted to look for specific data, like prices or customer names. Companies use bots to keep tabs on their competition and to accurately target particular demographics.
How web scraping works
Scraping a website with a bot involves a few steps. First, the bot downloads a web page (a process known as fetching). The content of the page is then extracted, reorganized, or stored for various purposes, such as audience segmentation analysis. In the case of a search engine bot, the fetched data is used to populate a search engine results page (SERP). Sometimes, especially on Google, the data appears as a featured snippet.
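To make the fetch-and-extract cycle concrete, here is a minimal sketch in Python that uses only the standard library. The target URL and the choice to pull out page titles are illustrative assumptions; a real scraping bot would be scripted for whatever data it needs.

```python
# Minimal fetch-and-extract sketch using only the Python standard library.
# The URL and the data being extracted (page titles) are illustrative.
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleExtractor(HTMLParser):
    """Collects the text inside the page's <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


def fetch_and_extract(url):
    # Step 1: fetch the page (download the raw HTML).
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")

    # Step 2: extract the data the bot was scripted to look for.
    parser = TitleExtractor()
    parser.feed(html)
    return parser.titles


if __name__ == "__main__":
    print(fetch_and_extract("https://example.com"))
```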
There are other types of scraping, such as screen scraping, which copies the pixels of an image displayed on a site. Bots crawl the images, title tags, and metadata to determine the purpose and relevance of an image, which is one of the ways that search engines review websites to rank content.
The pros of web-scraping bots
Web scraping is generally used to quickly and systematically retrieve, review, and sort through massive amounts of data, and it has several beneficial uses. Search engines, like Google and Bing, rely heavily on bots to scan sites and rank content. Bots can be automated to analyze volumes of information that would take humans countless hours to process. For example, weather and traffic update apps use bots to scour the internet for real-time data.
Bots also retrieve data from a website's application programming interface (API). Website APIs house and share large amounts of data. API scraping is extremely useful, particularly for commercial purposes such as ticket price comparison or hotel booking availability. Travel comparison sites, like Hotels.com and priceline.com, use bots to auto-populate data. When you interact with an airline ticket search function, you are instructing a bot to return information that fulfills your specifications. A bot scrapes the information that you requested from all over the internet and delivers it to you in the form of a comparison website.
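As a rough illustration of how a comparison site might aggregate fares, the sketch below queries a hypothetical fare API and sorts the offers by price. The endpoint, query parameters, and response fields are invented for the example; real travel APIs differ in authentication and schema.

```python
# Hedged sketch of API-based price aggregation. The endpoint and the
# response shape ("offers" with "price" and "carrier") are hypothetical.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

HYPOTHETICAL_ENDPOINT = "https://api.example-travel.com/v1/flights"


def fetch_fares(origin, destination, date):
    """Query one (hypothetical) fare API and return its offers."""
    query = urlencode({"from": origin, "to": destination, "date": date})
    with urlopen(f"{HYPOTHETICAL_ENDPOINT}?{query}") as response:
        return json.load(response)["offers"]  # assumed response shape


def cheapest_first(offers):
    """Sort aggregated offers so a comparison page can show the best price first."""
    return sorted(offers, key=lambda offer: offer["price"])


if __name__ == "__main__":
    offers = fetch_fares("SFO", "JFK", "2018-11-01")
    for offer in cheapest_first(offers):
        print(offer["carrier"], offer["price"])
```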
Web-scraping bots are looking for data to:
- Answer search engine queries
- Help with analytics
- Improve site UX
- Gain insight into consumer actions
- Aggregate data for RSS feeds
Good bots versus bad bots
Bots scrape data from sites every day, and they typically display an identifying header that includes the name of either a parent company or an individual responsible for the bot. Conversely, malicious bots—those intent on stealing content or misappropriating data—often create false headers or impersonate legitimate bots. To tell the difference between a good bot and a malicious bot, the user must understand the bot’s intent. Good bots compile data or return search queries; bad bots strip content or seek to undercut competitors.
To establish guidelines, website operators specify which pages of their sites may be scraped and which pages are off limits. These preferences are outlined in a website's robots.txt file, which is available to any visitor, human or bot. However, malicious web-scraping bots crawl sites regardless of what the robots.txt file permits; on its own, a robots.txt file cannot offer a website any protection from bots.
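For a sense of how a well-behaved bot honors these guidelines, the sketch below uses Python's standard-library robots.txt parser to check whether a page may be fetched before requesting it. The site URL and user-agent name are illustrative.

```python
# A well-behaved bot checks robots.txt before fetching a page.
# A malicious bot simply skips this step, which is why robots.txt
# offers guidance rather than protection.
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "https://example.com/robots.txt"
USER_AGENT = "ExampleScraperBot"  # an honest, identifying user-agent name

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # downloads and parses the robots.txt file

target = "https://example.com/private/reports.html"
if parser.can_fetch(USER_AGENT, target):
    print("Allowed to fetch", target)
else:
    print("robots.txt asks bots to skip", target)
```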
The cons of web-scraping bots
Although web-scraping bots started as helpful tools, they have been misused in some ways. Some bots are programmed to scrape content from one site and then post that content on another site, without crediting the source or including backlinks. Unless you are constantly searching the web for similar-looking content, you probably will not notice that your content has been copied. For retail websites, this can mean fewer site visits, lower sales, and an adverse effect on page rankings.
Fake websites
Bots can copy the HTML source code of a website and then replicate the entire site. This creates significant issues for e-commerce sites, since it puts them at risk for fraud. In some cases, a bot will scrape an entire site, replicate it, and convince users to interact with the phony site. Unsuspecting shoppers then willingly give out valuable information, such as credit card numbers, on an unsecured website. Web scraping can also reveal essential data, like names, email addresses, and frequency of visits. Bots can even be programmed to take customers' information and then create fraudulent user profiles from the scraped data.
Intellectual property theft
In addition to content theft, malicious bots steal proprietary information, like industry or trade secrets. Some businesses employ bots to scrape valuable product information and pricing models from a competitor. That information is then used to undercut the competition on price or to drive sales toward a competing product. This price-cutting strategy can have a lasting impact on sites that rely on bots to deliver comparison data. After the price of a product or service has been undercut, the value and ranking of that particular item in the market are thrown off.
Analytic issues
The proliferation of bots on the internet also affects web traffic metrics. If hundreds of bots are interacting with your site, they throw off tracking metrics, since it can be difficult to determine which traffic came from real users and which came from bots. Making marketing decisions based on flawed data is a waste of time and money.
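One rough way to reduce the noise is to separate self-identified bots from human visitors before computing metrics, as in the sketch below. The log format and the bot signatures are simplified assumptions, and bots that fake their headers will slip through.

```python
# Rough sketch: split self-identified bot traffic from human traffic in
# an access log before computing metrics. The log format and signature
# list are simplified assumptions; header-spoofing bots are not caught.
BOT_SIGNATURES = ("bot", "crawler", "spider")  # substrings common in bot user agents


def split_traffic(log_lines):
    human, bot = [], []
    for line in log_lines:
        # Assume each log line ends with the quoted user-agent string.
        user_agent = line.rsplit('"', 2)[-2].lower()
        if any(sig in user_agent for sig in BOT_SIGNATURES):
            bot.append(line)
        else:
            human.append(line)
    return human, bot


if __name__ == "__main__":
    sample = [
        '1.2.3.4 - - [02/Oct/2018] "GET / HTTP/1.1" 200 "Mozilla/5.0 (Windows NT 10.0)"',
        '5.6.7.8 - - [02/Oct/2018] "GET / HTTP/1.1" 200 "ExampleBot/1.0 (+https://example.com/bot)"',
    ]
    humans, bots = split_traffic(sample)
    print(len(humans), "human requests,", len(bots), "bot requests")
```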
Defending against bots
One of the most dangerous things about scraping bots is that they can target any site and even circumvent security measures. Although there is no way to prevent every bot-scraping method, there are ways to protect your website from bots. The trick, however, is to guard your website while still making sure that users can access the information they need.
One precaution to take is site monitoring. Develop the habit of regularly checking your site logs for signs of suspicious or unusual activity, such as a large number of requests originating from the same IP address. Heavy activity from a single IP address could indicate scraping or a denial-of-service attempt. If you suspect that any actions are the result of a bot, you can block access to your site based on the IP address, or limit the actions permitted to a specific user. Another option is to use tracking scripts to monitor bot-scraping activity and alert the site owner when bot activity is suspected.
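As a starting point, a monitoring script can be as simple as counting requests per IP address and flagging heavy hitters, as sketched below. The log path, log format, and threshold are illustrative assumptions; a real site would tune them and combine them with other signals.

```python
# Minimal monitoring sketch: count requests per IP in an access log and
# flag addresses whose volume looks suspicious. The log path, the
# assumption that the IP is the first field, and the threshold are all
# illustrative.
from collections import Counter

THRESHOLD = 1000  # assumed cutoff: flag any IP with more than 1000 requests


def suspicious_ips(log_path, threshold=THRESHOLD):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            ip = line.split(" ", 1)[0]  # assume the IP is the first field
            counts[ip] += 1
    return [(ip, n) for ip, n in counts.most_common() if n > threshold]


if __name__ == "__main__":
    for ip, n in suspicious_ips("access.log"):
        print(f"{ip} made {n} requests -- consider blocking or rate limiting")
```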
If your website requires a significant amount of user interaction, consider using a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA), which requires humans to check a box indicating that they are not a robot before they gain access to certain parts of a website.
Over the past couple of years, bots have changed the way that we interact with each other online. There’s even a case to be made that bots can improve our society. With any new technology comes misuse, and bots are no exception. Not all bots are bad, but awareness of their potential for harm can help protect your website and your online interactions.