In January 2016, Home Depot challenged data scientists to rank the relevance of roughly 170 thousand (search term, product) pairs. The training data consisted of manually scored (search term, product) pairs and a host of product descriptions and information. The competition was hosted by Kaggle and included 2,125 teams.
I placed 120th out of the 2,125 competing teams (top 6%). My best model was a random forest that took into account a set of features I engineered from the text data. The outline of my work was essentially:
1. Clean up the search terms as much as possible (e.g. spell correction, word stemming, removing special characters, and standardizing units like “ft”, “foot”, and “feet”)
2. Count the frequency of each term and each bigram in the search terms and product descriptions
3. Generate various scores for each (search term, product) pair like “How many characters/terms/bigrams are in the search term and product description?”, “What percent of terms/bigrams in the search term are in the product description?”, “What is the cosine similarity of the search term and product description?”, etc.
4. Train a random forest model using the features generated in (3) and the manual scores in the training data to optimize RMSE.
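To make steps 1 through 3 concrete, here is a minimal sketch of how a few of these features could be computed for a single (search term, product) pair. It is not my actual competition code: the names `UNIT_PATTERNS`, `tokenize`, `bigrams`, `overlap_fraction`, `cosine_similarity`, and `pair_features` are hypothetical, the unit rules are illustrative only, and spell correction and stemming are left out to keep it short.

```python
import math
import re
from collections import Counter

# Hypothetical unit rules for illustration; the real cleanup covered many more cases.
UNIT_PATTERNS = [
    (re.compile(r"\b(feet|foot|ft)\b\.?"), "ft"),
    (re.compile(r"\b(inches|inch)\b\.?"), "in"),
]


def tokenize(text):
    """Lowercase, standardize units, drop special characters, split into terms."""
    text = text.lower()
    for pattern, unit in UNIT_PATTERNS:
        text = pattern.sub(unit, text)
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.split()


def bigrams(tokens):
    """Return adjacent term pairs, e.g. ['a', 'b', 'c'] -> [('a', 'b'), ('b', 'c')]."""
    return list(zip(tokens, tokens[1:]))


def overlap_fraction(query_items, doc_items):
    """Fraction of query items (terms or bigrams) that also appear in the document."""
    if not query_items:
        return 0.0
    doc_set = set(doc_items)
    return sum(1 for item in query_items if item in doc_set) / len(query_items)


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * counts_b[term] for term, count in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_features(search_term, description):
    """Build a small feature dict for one (search term, product description) pair."""
    q_tokens = tokenize(search_term)
    d_tokens = tokenize(description)
    return {
        "query_chars": len(search_term),
        "query_terms": len(q_tokens),
        "desc_terms": len(d_tokens),
        "term_overlap": overlap_fraction(q_tokens, d_tokens),
        "bigram_overlap": overlap_fraction(bigrams(q_tokens), bigrams(d_tokens)),
        "cosine": cosine_similarity(Counter(q_tokens), Counter(d_tokens)),
    }


features = pair_features("10 Feet wood fence", "Cedar wood fence panel, 6 ft. x 10 ft.")
```

Each pair yields one row of numeric features. Stacking those rows into a matrix gives the input for step 4, for example a random forest regressor such as scikit-learn's `RandomForestRegressor`, fit against the manual relevance scores.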