Sept. 28 2020 09:12 AM

How bots will soon scrape web data for you


I suppose we could also title this article the “Rise of the Robotic Curator,” since finding, collecting and grouping information is what these bots do, and that is essentially what curation is.

What is interesting about this concept is that artificial intelligence (AI)-enabled bots go beyond Robotic Process Automation (RPA), which we have been hearing about for several years as a way to automate internal processes, and apply similar automation to inbound information. In the case of web bots, or robotic curators as I like to call them, the target could be internal or external websites.

Put simply, every time you run a web search in a browser, you are relying on web scraping and bots to find what you are looking for and present the results to you. The same holds true of Siri, Alexa, Google and other voice-activated search tools. You ask, a search is conducted based on your request, and the results are presented to you. It sounds simple, but it takes a lot behind the scenes to get there.

Purposeful Robotic Curators
Bots are built to extract data, in many cases through an Application Programming Interface (API). This allows them to look at websites, scrape the information from the relevant sites and present it back to you. Internally, organizations may combine robotic curators with AI to search, collect and analyze information, then present the results to the requestor to help identify trends, assess customer experience and clarify other business elements for decision making.
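To make the "look at a page and pull out what matters" step concrete, here is a minimal sketch using only Python's standard library. The page content, the choice of `<h2>` headlines as the target and the names `HeadlineExtractor` and `extract_headlines` are all assumptions for illustration. A real bot would fetch the page over HTTP or call the site's API instead of reading a hard-coded string.

```python
from html.parser import HTMLParser


class HeadlineExtractor(HTMLParser):
    """Collects the text of every <h2> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.headlines.append("")  # start a new headline

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) still belongs to the headline.
        if self._in_h2:
            self.headlines[-1] += data


def extract_headlines(html):
    """Return the cleaned text of all <h2> headlines found in the HTML."""
    parser = HeadlineExtractor()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headlines if h.strip()]


# Hypothetical page content, hard-coded so the sketch runs offline.
SAMPLE_PAGE = """
<html><body>
  <h1>Acme News</h1>
  <h2>Quarterly results released</h2>
  <p>Some body text.</p>
  <h2>New <em>product</em> line announced</h2>
</body></html>
"""

if __name__ == "__main__":
    for headline in extract_headlines(SAMPLE_PAGE):
        print(headline)
```

The same pattern extends to links, prices or any other element a curator is asked to collect.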

Externally, robotic curators can be used to search competitors' sites, collect competitive information, analyze what is gathered and develop plans to address the competition head on. When used in this scenario, the target sites typically publish guidelines stating what is acceptable to search and what is off limits. Publishing such guidelines should be part of the organization’s governance plan when setting up its own sites, and the guidelines of target sites should be provided to information managers who conduct searches.
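One common form these guidelines take is a site's robots.txt file. The article does not name a specific mechanism, so treat that as an assumption. The sketch below uses Python's standard `urllib.robotparser` to check whether a bot may visit a URL before scraping it. The rules, the bot name `curator-bot` and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real bot would download the file
# from the target site before crawling it.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a policy object that can be queried."""
    policy = RobotFileParser()
    policy.parse(robots_txt.splitlines())
    return policy


def is_allowed(policy, user_agent, url):
    """Return True if the site's rules permit this bot to fetch the URL."""
    return policy.can_fetch(user_agent, url)


if __name__ == "__main__":
    policy = build_policy(SAMPLE_ROBOTS_TXT)
    for url in (
        "https://example.com/public/pricing.html",
        "https://example.com/private/roadmap.html",
    ):
        verdict = "allowed" if is_allowed(policy, "curator-bot", url) else "off limits"
        print(url, "->", verdict)
```

A governance plan could require every internal curator to pass a check like this before it collects anything from an external site.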

The reason this is important, and must also be part of your security plan, is that web bots have been deployed to scrape information from a site, place it on a fake site and then redirect visitors to the fake site. This is not new, but the use of bots has simplified and enhanced how it can be done. The last thing you want is for intellectual property to be scraped, used against you, or even made public.

While there is no way to completely protect a site from bot scraping, you can monitor site logs for unusual activity or implement a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA), which requires visitors to check a box or type in a random passcode to indicate that they are not a robot. This has become a very common step for many retail, business and government sites.
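As a rough illustration of what monitoring site logs for unusual activity can mean in practice, the sketch below counts requests per client address and flags any client that exceeds a threshold. The log format (client address as the first field), the sample lines and the threshold are assumptions. Real monitoring would also look at time windows, user agents and request patterns.

```python
from collections import Counter


def flag_heavy_clients(log_lines, max_requests):
    """Return the client addresses that made more than max_requests requests.

    Assumes the client address is the first whitespace-separated field
    on each log line, as in the common access-log layout.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return sorted(ip for ip, n in counts.items() if n > max_requests)


# Hypothetical log excerpt using documentation-only IP addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [28/Sep/2020:09:12:01 +0000] "GET /page1 HTTP/1.1" 200 512',
    '192.0.2.10 - - [28/Sep/2020:09:12:01 +0000] "GET /page2 HTTP/1.1" 200 498',
    '203.0.113.5 - - [28/Sep/2020:09:12:02 +0000] "GET / HTTP/1.1" 200 1024',
    '192.0.2.10 - - [28/Sep/2020:09:12:02 +0000] "GET /page3 HTTP/1.1" 200 530',
    '',
    '198.51.100.7 - - [28/Sep/2020:09:12:03 +0000] "GET /about HTTP/1.1" 200 256',
    '192.0.2.10 - - [28/Sep/2020:09:12:03 +0000] "GET /page4 HTTP/1.1" 200 505',
]

if __name__ == "__main__":
    print(flag_heavy_clients(SAMPLE_LOG, max_requests=3))
```

Flagged addresses are candidates for review or for a CAPTCHA challenge, not proof of scraping on their own.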

The fact is, web bots have been in use for many years. While they may not have been as sophisticated as today's AI-enabled versions, the search engines of days past were the beginning of web bots. As time passed, technology became stronger and capabilities grew. Today’s bots can be programmed to find exact matches, determine whether something else may also be a good match, analyze and rank the findings and present the results to the requestor.

Today, a profile, perhaps a case file, can be developed specifying criteria for vital information related to the case. The robotic curator, using that profile, then searches internal sources and external websites to find the requested information, analyzes it, ranks it, places it into the designated case file and notifies the requestor that new information has been added. Additionally, if several cases require the same or similar information, it can be shared across the cases.
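The profile-driven flow described above can be sketched in a few lines. This is a simplified illustration, not how any particular product works: the `CaseProfile` structure, the keyword-count scoring and the sample documents are all assumptions, and a real curator would search live sources rather than a dictionary.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CaseProfile:
    """Hypothetical case file: search criteria plus the findings filed so far."""
    name: str
    keywords: list
    min_score: int = 1
    findings: list = field(default_factory=list)


def score(text, keywords):
    """Count how many profile keywords appear as whole words in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(1 for keyword in keywords if keyword.lower() in words)


def curate(profile, documents):
    """Rank matching documents, file the new ones and return their titles.

    The returned list stands in for the notification sent to the requestor.
    """
    scored = [(score(text, profile.keywords), title) for title, text in documents.items()]
    matches = [pair for pair in scored if pair[0] >= profile.min_score]
    matches.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    added = [title for _, title in matches if title not in profile.findings]
    profile.findings.extend(added)
    return added


# Hypothetical documents standing in for internal and external sources.
SAMPLE_DOCUMENTS = {
    "Vendor memo": "New supplier onboarding checklist.",
    "Lunch menu": "Soup and salad today.",
    "Supplier audit": "The audit found a contract dispute with the supplier.",
}

if __name__ == "__main__":
    case = CaseProfile(name="Case A", keywords=["contract", "supplier", "audit"])
    print("New findings:", curate(case, SAMPLE_DOCUMENTS))
```

Running `curate` a second time on the same documents returns an empty list, so the requestor is only notified when something new is filed.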

As information is accepted as relevant or rejected as irrelevant, the AI element is adjusted to reflect these decisions, learning what is acceptable and what is not. Over time this provides better results and perhaps minimizes the amount of information collected.
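A very simple stand-in for this feedback loop is a set of keyword weights that are nudged up when a result is accepted and down when it is rejected. This is an illustrative assumption, not a description of how any particular AI system learns. The function names, the step size and the sample text are hypothetical.

```python
import re


def tokens(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def update_weights(weights, text, relevant, step=0.5):
    """Return new weights after one accept (relevant=True) or reject decision.

    Only terms already tracked in the weights are adjusted, and no weight
    is allowed to drop below zero.
    """
    updated = dict(weights)
    delta = step if relevant else -step
    for term in tokens(text):
        if term in updated:
            updated[term] = max(0.0, updated[term] + delta)
    return updated


def relevance(weights, text):
    """Score a piece of text as the sum of the weights of the terms it contains."""
    return sum(weights.get(term, 0.0) for term in tokens(text))


if __name__ == "__main__":
    weights = {"merger": 1.0, "recipe": 1.0}
    weights = update_weights(weights, "Merger talks resume", relevant=True)
    weights = update_weights(weights, "Holiday recipe ideas", relevant=False)
    print(weights)
    print(relevance(weights, "Merger rumors and a recipe"))
```

After a few decisions, text about accepted topics scores higher and text about rejected topics scores lower, which is the narrowing effect the paragraph describes.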

In My View
Web scraping, or robotic curation, is alive and well today. I have personally been using bots to monitor topical areas of interest and provide me with daily updates as part of my research efforts. The question is whether business professionals use it regularly. I contend they already do, unwittingly, and organizations will soon formalize it as part of their information ecosystems.

The real challenge lies in governing the use of robotic curation and defining its purpose. Questions to ask include:
  • What tools are used?
  • Who is authorized to use these tools?
  • What information is collected and for what reason?
  • Who will this be shared with, especially if it is of a sensitive nature?
  • How do you protect your web site from being scraped?

There are many benefits to be gained from robotic curation of websites. In my opinion, those who embrace it early, by design and with a specific purpose will gain the most and take the lead against their competitors. As with all other technologies, this should be treated as an ongoing practice: once a project is complete, it becomes the beginning of a continuous improvement effort that regularly assesses results, identifies additional opportunities for improvement and builds upon the foundation put in place.