Using Latent Dirichlet Allocation to assign categories to Amazon Fine Food Reviews. Data citation: Image-based recommendations on styles and substitutes J. McAuley, C. Targett, J. Shi, A. van den Hengel SIGIR, 2015
First we load, parse, and clean the data. I have only take 100 entries because the LDA Python algorithm is relatively slow and topic names eventually are applied manually
data=list(parse('/users/davidt/documents/cse_190/datasets2/finefoods.txt.gz')) [:100]
data[0]
{'product/productId': 'B001E4KFG0',
'review/helpfulness': '1/1',
'review/profileName': 'delmartian',
'review/score': '5.0',
'review/summary': 'Good Quality Dog Food',
'review/text': 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.',
'review/time': '1303862400',
'review/userId': 'A3SGXH7AUHU8GW'}
documents = [b['review/text'] for b in data]
from nltk.corpus import stopwords
import string
StopWords = stopwords.words('english')
def removeNonAscii(datum):
return "".join(i for i in s if ord(i)<128)
def removeStopWords(datum):
return [word for word in datum if word not in StopWords]
def tokenize(message):
import re
message = message.lower()
all_words = re.findall("[a-z0-9']+", message)
return set(all_words)
documents = [removeNonAscii(b) for b in documents]
documents = [tokenize(b) for b in documents]
documents = [removeStopwords(b) for b in documents]
documents[0]
['labrador',
'product',
'looks',
'quality',
'smells',
'better',
'several',
'good',
'food',
'appreciates',
'vitality',
'stew',
'finicky',
'meat',
'canned',
'like',
'dog',
'products',
'processed',
'found',
'bought']
for k, word_counts in enumerate(topic_counts):
for word, count in word_counts.most_common()[:10]:
if count > 0: print k, word, count
For example topic 0 looks like this:
Total observations in the dataset: 568,454. The dataset was split into training and test datasets, random 50-50 split. In order to utilize Latent Dirichlet Allocation for topic modeling feature vectors needed to be created from the text reviews. All the reviews were converted to a bag of words and stop words removed. Quantitatively evaluating the model is done through a ‘perplexity’ score, which measures the log-likelihood of the held-out test set.Lower perplexities are better. However model-fit in regards to topic-modeling does not seem to be an intuitive way to measure whether the topics chosen are ‘accurate’ from a human perspective. Indeed in ‘Reading Teat Leaves: How Humans Interpret Topic Models,’ Sean Gerrish, Chong Wang, and David Blei find that traditional metrics do not capture whether topics are coherent, human measures of interpretability are negatively correlated with traditional metrics to measure the fit of topic models. This assumption will be tested when labels are assigned to the topics created with model. The perplexity metric is used to choose the final model that will be interpreted.
Topics | Perplexity |
---|---|
10 | 2439.96 |
15 | 2459.95 |
20 | 2478 |
25 | 2487 |
30 | 2434.61 |
50 | 2573.42 |
Some of these are easy to categorize, topic 6 looks pet related, topic 19, candy. Some are vague, topic 29 which contains words such as: “husband, beans, easy, spicy, and disappointed.” Manually labeling these topics seems fraught with difficulty. However to test the predictions of the model several categories relatively easy topics are labeled.
Now that categories are assigned we can track user behavior over time.