The Center for Data Innovation spoke with Julius Černiauskas, CEO of Oxylabs. Oxylabs is a leading proxy network and data-gathering solutions provider that collects alternative data through web scraping. Černiauskas spoke about the potential uses of alternative data, web scraping as a tool for e-commerce businesses, and how generative AI may change data collection and processing.
Becca Trate: What is alternative data, and what are its benefits?
Julius Černiauskas: The answer lies in the term itself—alternative data is everything that is not traditional data. The latter is familiar to most of us and includes official government statistics, company financial statements, public filings, press releases, datasets provided by NGOs or business organizations, and so on. It is often published at regular intervals and is bound by specific regulations.
By contrast, alternative data is scattered all over the internet and comes in multiple formats. It is usually unstructured and needs to be extracted with the help of scripts. The most common examples of alternative data are satellite imagery, credit card transaction information, and public web data acquired by scraping tools. It can be used for creative research by both businesses and government institutions. For example, the Bank of Japan used recreation and retail trends based on credit card spending to assess economic activity in certain areas.
Today, the alternative data industry is worth almost $7 billion. The main driver behind the alternative data ‘revolution’ is the digitalization of business and advancements in web scraping technologies. Big data has been a hot topic for years, but it is web scraping that unlocked its power, making alternative data usable for competitive business insights, investigative research, science, and other purposes.
Numerous benefits can be gained from alternative data. First, it can be extracted in real time, unlike traditional data, which is updated slowly and usually paints a picture of events long past. This feature of alternative data is particularly important for financial services companies and investors. Furthermore, alternative data can provide completely novel insights. It opens new ways for businesses to gain a competitive advantage through a more comprehensive market view and better-informed decisions.
Alternative data use cases are best understood through examples. I have already mentioned the Bank of Japan, but plenty of others exist. Data on empty parking lot spaces might help predict retailer performance. Investor sentiment analysis offers a glimpse of a market-moving signal. Mobility data can be used to assess economic activity, and so on. A recent survey showed that financial organizations rank web scraping, an alternative data acquisition method, as one of the most impactful methods for revenue generation.
Trate: How can e-commerce businesses use web scraping to improve the shopping experience?
Černiauskas: The e-commerce industry is one of the heaviest users of web scraping. Recent Oxylabs research revealed that over 82 percent of e-commerce organizations use web scraping to gather external data for decision-making.
The e-commerce industry uses alternative data for market research, competitor analysis, price benchmarking, and more. Extracting public web data allows these companies to understand consumer sentiment, come up with creative personalization tactics, and optimize their assortment. Web scraping opens up a colossal amount of information and makes real-time data streams possible, meaning organizations can extract information the second it appears online.
As for the shopping experience, one way web scraping can improve it is by optimizing the assortment. By scraping major marketplaces and competitor sites, e-commerce companies can determine which products are trending or going out of stock and which, on the other hand, are less popular. They can also get ideas for expanding their assortment if certain goods are sold only by competitors.
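To make that concrete, here is a minimal Python sketch of this kind of stock monitoring. The marketplace URL and CSS selectors are hypothetical placeholders; a production scraper would add proxy rotation, error handling, and compliance checks for the target site's terms.

```python
# Minimal sketch of marketplace stock monitoring.
# The URL and CSS selectors below are hypothetical placeholders;
# a real site needs its own selectors, and scraping should respect
# its robots.txt and terms of service.
import requests
from bs4 import BeautifulSoup

MARKETPLACE_URL = "https://marketplace.example.com/category/sneakers"  # placeholder

def check_stock(url: str) -> list[dict]:
    """Return each product on the page with a simple availability flag."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select("div.product-card"):  # hypothetical selector
        name = card.select_one("span.title").get_text(strip=True)
        stock = card.select_one("span.stock").get_text(strip=True)
        products.append({"name": name, "in_stock": "out of stock" not in stock.lower()})
    return products

if __name__ == "__main__":
    for product in check_stock(MARKETPLACE_URL):
        if not product["in_stock"]:
            print(f"Restock or substitution candidate: {product['name']}")
```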
Web scraping can also power sentiment analysis. By scraping public reviews, comments, and brand mentions, retail companies can understand customers’ tastes, needs, and pain points, and adapt their assortment and marketing strategy accordingly. Moreover, sentiment analysis can show what is trending with certain audiences, allowing the company to validate new commercial ideas or get insights into how certain decisions impacted consumers’ feelings about the brand. Of course, when doing this, companies must first consult with legal practitioners and thoroughly adhere to all privacy and personal data regulations.
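As an illustration of the sentiment-scoring step, the sketch below runs already-scraped review texts through NLTK's off-the-shelf VADER analyzer. The reviews are invented sample data standing in for scraped content; real systems would typically use purpose-trained models.

```python
# Sketch: sentiment scoring of scraped review texts with NLTK's VADER.
# The reviews are invented sample data standing in for scraped content.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

reviews = [
    "Love these sneakers, super comfortable and they look great.",
    "Sole fell apart after two weeks. Very disappointed.",
    "Delivery was fast, product is okay for the price.",
]

sia = SentimentIntensityAnalyzer()
for review in reviews:
    compound = sia.polarity_scores(review)["compound"]  # -1 (negative) .. +1 (positive)
    label = "positive" if compound > 0.05 else "negative" if compound < -0.05 else "neutral"
    print(f"{label:>8}  {compound:+.2f}  {review}")
```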
To sum up, if utilized properly, alternative data gathered online can translate into better business-to-consumer relationships and personalization tactics that take into account not only historical behavioral trends (which do not necessarily show what the consumer is interested in now) but also the broader context in which consumers’ purchasing decisions happen.
Trate: How can web scraping expedite and address copyright infringement or counterfeiting claims in online marketplaces?
Černiauskas: Web scraping is already widely utilized to combat various types of online fraud, from infringement of intellectual property rights to counterfeiting. As illegal sellers and makers of fake goods proliferate rapidly, it is no longer possible to monitor and find them manually. Scraping software, on the other hand, can handle thousands of requests per second, allowing companies to continuously monitor online marketplaces, search engines, and other sites. Instead of tracking unauthorized traders individually, companies can now monitor a brand’s online presence at scale and in real time.
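That scale comes from concurrency. A toy sketch of concurrent listing monitoring with Python's asyncio and aiohttp might look like the following; the URLs are placeholders, and real monitoring adds proxy rotation, retries, and per-site rate limiting.

```python
# Toy sketch of concurrent marketplace monitoring with asyncio + aiohttp.
# The URLs are placeholders; production scrapers add proxy rotation,
# retries, and per-site rate limiting.
import asyncio
import aiohttp

LISTING_URLS = [
    f"https://marketplace.example.com/search?q=brandname&page={page}"
    for page in range(1, 101)
]

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
        return await resp.text()

async def monitor(urls: list[str]) -> list[str]:
    # One session reuses connections; gather() runs all requests concurrently.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

pages = asyncio.run(monitor(LISTING_URLS))
print(f"Fetched {len(pages)} listing pages")
```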
As most counterfeiters use specific descriptive keywords, web scrapers crawl thousands of pages and product listings, searching for various keyword combinations. Usually, it is the brand or product name together with descriptors such as “cheap,” “wholesale,” “dealer,” “like original,” and so on. Images can also be used alongside keywords to identify illegal goods. After finding illegitimate listings and automatically retrieving the evidence, brands can file copyright complaints and request that the marketplace, search engine, or other site remove the illegal items.
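The keyword matching itself can be quite simple. The sketch below flags listings whose titles pair a hypothetical brand name with suspicious descriptors; the listings are invented examples, and real pipelines also compare images and seller metadata.

```python
# Sketch of keyword-based counterfeit flagging over scraped listing titles.
# Brand name and listings are invented examples; real pipelines also use
# image matching and seller metadata.
BRAND = "acmekicks"  # hypothetical protected brand name
SUSPECT_TERMS = {"cheap", "wholesale", "dealer", "like original", "replica"}

listings = [
    {"id": 101, "title": "AcmeKicks runner, official store"},
    {"id": 102, "title": "Cheap AcmeKicks wholesale, like original quality"},
    {"id": 103, "title": "Generic running shoes"},
]

def flag_suspicious(listing: dict) -> bool:
    """Flag titles that combine the brand name with a suspect descriptor."""
    title = listing["title"].lower()
    return BRAND in title and any(term in title for term in SUSPECT_TERMS)

for item in (l for l in listings if flag_suspicious(l)):
    print(f"Listing {item['id']} flagged for takedown review: {item['title']!r}")
```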
Trate: What are the main challenges of using alternative data?
Černiauskas: The first and most prominent challenge is the extraction of alternative data itself. It is scattered across different sources and formats and is often case-specific and granular. Therefore, one-size-fits-all scrapers cannot do the job properly. Not every data-as-a-service (DaaS) company is capable of providing custom scrapers and parsers, so some companies have to build in-house data extraction teams.
Companies that gather massive amounts of internal and external data might also run into issues when trying to scale and integrate their data operations. As the volume of data increases, managing, processing, and analyzing it becomes more challenging and thus might require machine learning technology. Otherwise, the company will struggle to assemble a meaningful business or customer overview, ending up with data silos and fragmented information.
Yet another important thing to note is that signals derived from alternative data can be weak compared to traditional data sources. Most alternative data captures only short time windows and isn’t suitable for long-term forecasting. Take the aforementioned sentiment and brand mention analysis: emotional statements change quickly and can be impacted by many factors. As such, alternative data is mostly useful for generating short-term (up to 5 years) and highly specific insights. On the other hand, such insights are often the key to winning against the competition.
Trate: How can generative AI tools, like ChatGPT, improve data collection, analysis, or processing?
Černiauskas: ChatGPT is a trained language model—based on natural language processing, it can generate almost human-like text, understand textual requests, do translations, and (up to a point) analyze textual data. Basically, it is a massive information summarization machine. As such, it can hardly improve data collection efforts. However, it can aid (again, to some extent) in data processing and analysis.
For example, if you have a simple text-based dataset, you can ask such an AI for a statistical summary or forecasts based on that data. You can also ask it for specific KPIs and the SQL code or mathematical formulas for those KPIs. However, you will only get general-purpose examples drawn from the information the algorithm has been fed. They won’t be use case- and data-specific. For now, unfortunately, ChatGPT is not designed for data analysis tasks. Of course, this might change in the future, as OpenAI claims to have quite ambitious goals for its chatbot.
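For instance, querying the model for a KPI's SQL definition through the openai Python package might look like the sketch below. The model name and prompt are illustrative, and, as noted above, the returned SQL is generic boilerplate until a human adapts it to the actual schema.

```python
# Sketch: asking a generative model for a KPI's SQL definition.
# Model name and prompt are illustrative; the answer is generic boilerplate
# that must be reviewed and adapted to the real schema before use.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write a SQL query that computes the monthly repeat-purchase rate "
    "from a table orders(customer_id, order_id, order_date)."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # SQL to review, not run blindly
```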
So, however tempting, generative AI such as ChatGPT should still be used with caution. First, it makes a lot of simple mistakes. ML models depend on the data they are trained on, and they can miss new data or fail to generalize it well. They can also suffer from biases and data gaps due to cognitive bias and human errors in the training set.
Second, generative AI lacks the ability to interact with changing values or dashboards in real time or connect to distributed data sources. As such, it is good for strategic planning, summarizing textual data, or getting examples and inspiration, but it won’t replace data-gathering software and data analysts. Extracting and processing large volumes of data in real time requires specialized infrastructure and knowledge.