5 Papers on Product Classification Every Data Scientist Should Read

Product categorization/product classification is the organization of products into their respective departments or categories. A large part of the process is the design of the product taxonomy as a whole. 

Product categorization was initially a text classification task that analyzed the product’s title to choose the appropriate category. However, numerous methods have been developed which take into account the product title, description, images, and other available metadata. 

The following papers on product categorization represent essential reading in the field and offer novel approaches to product classification tasks.

1. Don’t Classify, Translate

In this paper, researchers from the National University of Singapore and the Rakuten Institute of Technology propose and explain a novel machine translation approach to product categorization. The experiment uses the Rakuten Data Challenge and Rakuten Ichiba datasets. 

Their method translates or converts a product’s description into a sequence of tokens which represent a root-to-leaf path to the correct category. Using this method, they are also able to propose meaningful new paths in the taxonomy.

The researchers state that their method outperforms many of the existing classification algorithms commonly used in machine learning today.
  • Published/Last Updated – Dec. 14, 2018
  • Authors and Contributors – Maggie Yundi Li (National University of Singapore), Stanley Kok (National University of Singapore), and Liling Tan (Rakuten Institute of Technology)

2. Large-Scale Categorization of Japanese Product Titles Using Neural Attention Models

The authors of this paper propose attention convolutional neural network (ACNN) models over baseline convolutional neural network (CNN) models and gradient boosted tree (GBT) classifiers. 

The study uses Japanese product titles taken from Rakuten Ichiba as training data. Using this data, the authors compare the performance of the three methods (ACNN, CNN, and GBT) for large-scale product categorization. 

While differences in inaccuracy can be less than 5%, even minor improvements in accuracy can result in millions of additional correct categorizations. 

Lastly, the authors explain how an ensemble of ACNN and GBT models can further minimize false categorizations.


  • Published/Last Updated – April, 2017 for EACL 2017
  • Authors and Contributors – From the Rakuten Institute of Technology: Yandi Xia, Aaron Levine, Pradipto Das Giuseppe Di Fabbrizio, Keiji Shinzato and Ankur Datta 

3. Atlas: A Dataset and Benchmark for Ecommerce Clothing Product Classification

Researchers at the University of Colorado and Ericsson Research (Chennai, India) have created a large product dataset known as Atlas. In this paper, the team presents its dataset which includes over 186,000 images of clothing products along with their product titles.