The primary objective of this experiment is to apply advanced NLP techniques to identify similar products in a retail dataset. To do this, we combine multiple text fields with product metadata to compute a similarity score.
We also explore several approaches to visualising and inspecting text data, and to building scalable similarity scoring methods.
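To make the idea of a similarity score over several text fields concrete, here is a minimal sketch. It is not the method used in the experiment: it swaps the learned embeddings for plain token counts and scores pairs with cosine similarity. The field names (`title`, `brand`, `description`), the helper names and the sample records are all hypothetical.

```python
import math
import re
from collections import Counter


def product_text(product, fields=("title", "brand", "description")):
    """Concatenate the chosen text fields of a product record into one string."""
    # Field names are assumptions made for this sketch.
    return " ".join(str(product.get(field, "")) for field in fields)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty record is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity(p1, p2):
    """Score two product records by the overlap of their combined text."""
    v1 = Counter(tokenize(product_text(p1)))
    v2 = Counter(tokenize(product_text(p2)))
    return cosine_similarity(v1, v2)


# Hypothetical records, invented purely for illustration.
green_tea_a = {"title": "Organic Green Tea Bags", "brand": "Acme",
               "description": "100 organic green tea bags"}
green_tea_b = {"title": "Green Tea, 50 Tea Bags", "brand": "Leafy",
               "description": "Pack of 50 green tea bags"}
pasta = {"title": "Whole Wheat Spaghetti", "brand": "Nonna",
         "description": "Whole wheat pasta, 500 g"}

print(similarity(green_tea_a, green_tea_b))  # the two teas share several tokens
print(similarity(green_tea_a, pasta))        # no shared tokens
```

Replacing the token counts with dense embedding vectors leaves the scoring step unchanged, which is why cosine similarity is a common choice for this kind of matching.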
Business use case
This experiment has several direct and indirect applications, including:
- Retail: find same/similar products across competitors for pricing/promo/range analysis
- CPG: track your product listings across retailers for price/promo/brand message, etc.
- Ecommerce marketplace: compare across portfolio/sellers to optimise for range
- Multi-sector: input into recommendation systems
The key principles from this experiment also apply to any text matching problem, so they have applications across multiple industries and beyond product matching.
We are using Amazon Review Data (2018) to accomplish this task.
This dataset is an updated version of the Amazon review dataset released in 2014. It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
The dataset contains products from many categories, but to keep the experiment manageable we use only the Grocery and Gourmet Food category data.
The Dataset can be found here
- Python: for documentation, exploratory data analysis and preprocessing, using the reticulate package and Python regular expressions.
- For training: AWS Deep Learning AMI (instance type: p3.2xlarge); given the heavy compute load required to generate embeddings, we used GPU instances to keep training time manageable.
- For inference: AWS Deep Learning AMI (instance type: t2.large); loading the trained model and using it for similarity scoring.
Across the approaches tested in this experiment, we found similar products within our dataset with an overall accuracy of ~90% (based on category matches) at 100% coverage.
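The text does not define these two metrics precisely, so the sketch below rests on one plausible reading. It assumes that "accuracy based on category matches" is the share of matched products whose top match has the same category, and that "coverage" is the share of products that received any match. The function name and the toy data are hypothetical.

```python
def category_match_metrics(categories, top_match):
    """Return (accuracy, coverage) under the assumed definitions.

    categories: dict mapping product id -> category label
    top_match:  dict mapping product id -> id of its most similar product,
                or None when no match was returned
    """
    if not categories:
        return 0.0, 0.0
    # Keep only products for which a match was actually produced.
    matched = {pid: m for pid, m in top_match.items() if m is not None}
    coverage = len(matched) / len(categories)
    # A hit is a match whose category equals the query product's category.
    hits = sum(categories[pid] == categories[m] for pid, m in matched.items())
    accuracy = hits / len(matched) if matched else 0.0
    return accuracy, coverage


# Toy data, invented purely for illustration.
categories = {"a": "Tea", "b": "Tea", "c": "Pasta", "d": "Pasta"}
top_match = {"a": "b", "b": "a", "c": "d", "d": "a"}

accuracy, coverage = category_match_metrics(categories, top_match)
print(f"accuracy={accuracy:.2f}, coverage={coverage:.2f}")
```

In the toy data every product has a match and three of the four matches share a category, so the function returns an accuracy of 0.75 at a coverage of 1.0.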