Web scraping has been around almost as long as there has been a web to scrape. The technology forms the cornerstone of search services like Google and Bing and can extract large amounts of data.
Data collection on the web tends to be at the mercy of how it is presented, and many sites actively discourage web scraping. However, developers can create web scraping applications in languages such as Python or Java to help bring data into a variety of AI applications. It is crucial for developers to carefully think about pipelines they use for acquiring their data. Each step of this process -- getting the right data, cleaning it and then organizing it into the appropriate format for their needs -- must be reviewed.
These pipelines are a continual work in progress. The perfect web scraping pipeline for today may have to be completely revamped for tomorrow. Knowing this, there are a variety of tools and best practices that can help automate and refine these pipelines and keep organizations on the right path.
Web scraping applications and AI
Web scraping involves writing a software robot that can automatically collect data from various webpages. Simple bots might get the job done, but more sophisticated bots use AI to find the appropriate data on a page and copy it to the appropriate data field to be processed by an analytics application.
AI web scraping-based use cases include e-commerce, labor research, supply chain analytics, enterprise data capture and market research, said Sarah Petrova, co-founder at Techtestreport. These kinds of applications rely heavily on data and the syndication of data from different parties. Commercial applications use web scraping to do sentiment analysis about new product launches, curate structured data sets about companies and products, simplify business process integration and predictively gather data.
Specific web scraping projects include curating language data for non-English natural language processing (NLP) models and capturing sports statistics for building new AI models for fantasy sports analysis. Burak Özdemir, a web developer based in Turkey, used web scraping to build a neural network model for NLP tasks in Turkish.
"Although there are so many pretrained models that can be found online for English, it's much harder to find a decent data set for other languages," Özdemir said. He has been experimenting with scraping Wikipedia and other platforms that have structured text to train and test his models -- and his work can provide a framework for others looking to develop and train NLP in non-English languages.
The tools of web scraping
There is a variety of tools and libraries that developers can use to jumpstart their web scraping projects. Python in particular has web scraping capabilities readily available through its libraries.
Python plays a significant role in AI development with a focus on web scraping, Petrova said. She recommended considering libraries like Beautiful Soup, lxml, MechanicalSoup, Python Requests, Scrapy, Selenium and urllib.
Each tool has its own strengths, and they can often complement one another. For example, Scrapy is an open source and collaborative framework for extracting data that is useful for data mining, monitoring and automated testing. Beautiful Soup is a Python library for pulling data out of HTML and XML files. Petrova said she deploys it for modeling scrape scripts as the library provides simple methods and Pythonic idioms for navigating, searching and modifying a parse tree.
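To make the idea of pulling fields out of a parse tree concrete, here is a minimal sketch that uses only Python's built-in html.parser module as a stand-in for a library such as Beautiful Soup. The sample markup and the class names "product", "name" and "price" are assumptions made for illustration; real pages will use their own structure.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; real markup differs from site to site.
SAMPLE_HTML = """
<div class="product">
  <h2 class="name">Espresso Maker</h2>
  <span class="price">$129.99</span>
</div>
<div class="product">
  <h2 class="name">Milk Frother</h2>
  <span class="price">$24.50</span>
</div>
"""


class ProductParser(HTMLParser):
    """Collects the text of elements whose class is 'name' or 'price'."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per product block found so far
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "product" in classes:
            self.products.append({})  # start a new record
        for field in ("name", "price"):
            if field in classes:
                self._field = field

    def handle_data(self, data):
        # Store non-empty text only while inside a tracked field.
        if self._field and self.products and data.strip():
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


def extract_products(html):
    parser = ProductParser()
    parser.feed(html)
    return parser.products


products = extract_products(SAMPLE_HTML)
```

A dedicated library would typically replace the hand-written parser class with a few selector calls, but the flow is the same: locate the elements that hold each field, then copy their text into a structured record.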
Augmenting data with web scraping
AI algorithms are often developed on the front end to learn which sections of a webpage contain fields such as product data, reviews or prices. Petrova noted that by combining web scraping with AI, the process of data augmentation can become more efficient.
"Web scraping, especially smart, AI-driven, data extraction, cleansing, normalization and aggregation solutions, can significantly reduce the amount of time and resources organizations have to invest in data gathering and preparation relative to solution development and delivery," said Julia Wiedmann, machine learning research engineer, at Diffbot, a structured web search service.
Petrova said common data augmentation techniques include:
- extrapolation (relevant fields are updated or provided with values);
- tagging (common records are tagged to a group, making it easier to understand and differentiate for the group);
- aggregation (using mathematical values of averages and means -- values are estimated for relevant fields, if needed); and
- probability techniques (based on heuristics and analytical statistics -- values are populated based on the probability of events).
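Two of the techniques in the list above, tagging and aggregation, can be sketched in a few lines. This is an illustrative example rather than a description of any particular product: the records, the field names and the choice of a per-group mean are all assumptions.

```python
from statistics import mean

# Hypothetical scraped records; None marks a field the scraper could not fill.
records = [
    {"sku": "A1", "category": "kettle", "price": 30.0},
    {"sku": "A2", "category": "kettle", "price": None},
    {"sku": "A3", "category": "kettle", "price": 50.0},
    {"sku": "B1", "category": "toaster", "price": 20.0},
]


def tag_by(records, key):
    """Tagging: group records that share a value so they can be handled together."""
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record)
    return groups


def fill_with_group_mean(records, key, field):
    """Aggregation: estimate a missing field from the mean of its group."""
    filled = []
    for group in tag_by(records, key).values():
        known = [r[field] for r in group if r[field] is not None]
        estimate = mean(known) if known else None
        for r in group:
            value = r[field] if r[field] is not None else estimate
            # Flag estimated values so downstream users can tell them apart.
            filled.append({**r, field: value, "estimated": r[field] is None})
    return filled


augmented = fill_with_group_mean(records, "category", "price")
```

Marking estimated values explicitly, as the "estimated" flag does here, keeps augmented data distinguishable from data that was actually observed on a page.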
Using AI for resilient scraping
Websites are built to be human-readable, not machine-readable, which makes it hard to extract data at scale and across different page layouts. Anyone who has tried to aggregate and maintain data knows what a difficult task this can be -- whether it be a manually compiled database with typos, missing fields, and duplicates or the variability of online content publication practices, Wiedmann said.
Her team has developed AI algorithms that use the same cues as a human to detect the information that should be scraped. She has also found that it is important to integrate outputs into applied research or test environments first. There can be hidden variability tied to the publication practices of the sources. Data quality assurance routines can help minimize the manual data maintenance.
"Designing systems that minimize the amount of manual maintenance will reduce errors and data misuse," Wiedmann said.
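A data quality assurance routine of the kind mentioned above can start very simply. The sketch below is one possible shape, not Diffbot's method: it rejects records with missing fields and records whose URL has already been seen. The required field names and the sample records are hypothetical.

```python
REQUIRED_FIELDS = ("name", "url", "price")  # hypothetical schema


def quality_report(records):
    """Split records into clean rows and rejected rows with a reason."""
    seen_urls = set()
    clean, rejected = [], []
    for record in records:
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        # Light normalization so trivially different URLs compare equal.
        url = (record.get("url") or "").strip().rstrip("/")
        if missing:
            rejected.append((record, "missing: " + ", ".join(missing)))
        elif url in seen_urls:
            rejected.append((record, "duplicate url"))
        else:
            seen_urls.add(url)
            clean.append(record)
    return clean, rejected


scraped = [
    {"name": "Kettle", "url": "https://example.com/kettle", "price": "30.00"},
    {"name": "Kettle", "url": "https://example.com/kettle/", "price": "30.00"},
    {"name": "Toaster", "url": "https://example.com/toaster", "price": ""},
]
clean, rejected = quality_report(scraped)
```

Keeping the rejection reason next to each rejected record makes it easier to spot when a source has changed its publication practices, since a sudden jump in one reason usually points to one site.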
Improving data structure
AI can also structure data collected with web scraping to improve the way it can be used by other applications.
"Though web scraping has existed for a long time, the use of AI for web extraction has become a game changer," said Sayid Shabeer, chief product officer at HighRadius, an AI software company.
Traditional web scraping can't automatically extract structured data from unstructured documents, but recent advancements have produced AI algorithms that extract data in a fashion similar to humans and that continue to learn over time. Shabeer's team used these types of bots for extracting remittance information for cash applications from retail partners. The web aggregation engine regularly logs into retailer websites and looks for remittance information. Once the information becomes available, the virtual agents automatically capture the remittance data and provide it in a digital format.
From there, a set of rules can be applied to further enhance the quality of data and bundle it with the payment information. AI models allow the bots to master a variety of tasks rather than have them focus on just one process.
To build these bots, Shabeer's team collated the common class names and HTML tags that are used on various retailers' websites and fed these into the AI engine. This was used as training data to ensure that the AI engine could handle any new retailer portals that were being added with minimal to no manual intervention. Over time, the engine became more and more capable of extracting data without any intervention.
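The article does not describe how HighRadius encodes this training data, so the following is only a guess at what a first step could look like: counting the tags and class names on a page so they can be passed to a model as features. The sample markup and its class names are invented for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class MarkupFeatureCounter(HTMLParser):
    """Counts tag names and class names as simple bag-of-markup features."""

    def __init__(self):
        super().__init__()
        self.features = Counter()

    def handle_starttag(self, tag, attrs):
        self.features["tag=" + tag] += 1
        for cls in (dict(attrs).get("class") or "").split():
            self.features["class=" + cls.lower()] += 1


def markup_features(html):
    counter = MarkupFeatureCounter()
    counter.feed(html)
    return counter.features


# Hypothetical fragment of a retailer portal page.
PORTAL_HTML = (
    '<table class="remit-table"><tr>'
    '<td class="invoice-no">1001</td><td class="amount">250.00</td>'
    "</tr></table>"
)
features = markup_features(PORTAL_HTML)
```

Feature counts like these could be combined with labels (for example, which element held the invoice number) to train a classifier that generalizes to portals it has not seen before.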
Limitations of web scraping
In a recent case in which LinkedIn tried to prevent HiQ Labs from scraping its data for analytics purposes, the U.S. Court of Appeals for the Ninth Circuit ruled that scraping publicly available data for analytics and AI can be legal. However, there are still a variety of ways that websites may intentionally or accidentally break web scraping applications.
Petrova said that some of the common limitations she has encountered include:
- Scraping at scale. Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the codebase, collecting data and maintaining a data warehouse.
- Pattern changes. Each website periodically changes its user interface.
- Anti-scraping technologies. Some websites use anti-scraping technologies.
- Honeypot traps. Some website designers put honeypot traps inside websites to detect web spiders and deliver false information. This can involve generating links that normal users can't see but crawlers can.
- Quality of data. Records that do not meet the quality guidelines will affect the overall integrity of the data.
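As an illustration of the honeypot item above, a crawler can at least avoid links that are hidden through inline markup. The sketch below is a simplified heuristic under stated assumptions: it checks only the hidden attribute and inline style values, so links hidden through external stylesheets or scripts would not be caught. The sample page is hypothetical.

```python
from html.parser import HTMLParser

# Inline style fragments that hide an element from human visitors.
HIDING_STYLES = ("display:none", "visibility:hidden")


class VisibleLinkCollector(HTMLParser):
    """Separates links a user could see from links hidden via inline markup."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.suspect = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        style = (attr.get("style") or "").replace(" ", "").lower()
        hidden = "hidden" in attr or any(s in style for s in HIDING_STYLES)
        (self.suspect if hidden else self.visible).append(href)


# Hypothetical page with one normal link and two hidden ones.
PAGE = (
    '<a href="/products">Products</a>'
    '<a href="/trap" style="display: none">x</a>'
    '<a href="/trap2" hidden>y</a>'
)
collector = VisibleLinkCollector()
collector.feed(PAGE)
```

A crawler would follow only the links in collector.visible and could log the suspect ones for review.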
Browser vs. back end
Web scraping is generally done by a headless browser that can scour webpages independent of any human activity. However, there are AI chatbot add-ons that scrape data as a background process running in the browser that can help users find new information. These front-end programs use AI to decide how to communicate the appropriate information to a user.
Marc Sloan, co-founder and CEO at Scout, an AI web scraping chatbot, said they originally did this by using a headless browser in Python that pulled webpage content via a network of proxies. Information was extracted from the content using a variety of techniques. Sloan and his team used spaCy to extract entities and relations from unstructured text into knowledge graphs using Neo4j. Convolutional networks were used to identify features such as session type, session similarity and session endpoints.