I recently purchased a subscription to DataCamp. If you haven’t tried their online courses, I highly suggest you do so. I end up using DataCamp even on days when I don’t have enough time for a full course, since they offer neat short quizzes on topics such as intro/intermediate R, data cleaning in R, SQL, etc. These quizzes keep you sharp by refreshing your memory on aspects of R you haven’t used in quite a while. I recently took a course on introductory text analysis in R, and I thought I’d apply some of the analytical methods I learned there, along with others I have seen applied in articles.
I’ll stop there so I don’t sound like a DataCamp shill, and introduce the dataset we’re working with today. This data may have been obtained in a legally gray manner, since I know Yelp’s EULA doesn’t allow web scraping, but I found this dataset on data.world, so I’ll assume it was collected legitimately. You can find a link to the data here. Let’s import the data and take a look at the structure:
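# Packages assumed throughout this post (a minimal sketch; your setup may differ)
library(magrittr)  # %>% and %<>% pipes
library(dplyr)     # data wrangling verbs
library(forcats)   # fct_reorder
library(tibble)    # tribble
library(stringr)   # str_split_fixed
library(tidytext)  # unnest_tokens, stop_words, get_sentiments
library(ggplot2)   # plotting
library(plotly)    # ggplotly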
AustYelp <- read.csv("../../../raw_yelp_review_data.csv", header = T)
str(AustYelp)
## 'data.frame': 7616 obs. of 3 variables:
## $ coffee_shop_name: Factor w/ 79 levels "Alta’s Cafe ",..: 69 69 69 69 69 69 69 69 69 69 ...
## $ full_review_text: Factor w/ 6915 levels " 1/1/2012 1 check-in I had a quad espresso and my boo had a latte. Friendly, knowledgable staff. No wait at all"| __truncated__,..: 1550 1956 1668 1558 2083 1448 920 1422 867 1228 ...
## $ star_rating : Factor w/ 5 levels " 1.0 star rating ",..: 5 4 4 2 4 4 4 5 3 5 ...
First, there are a couple of things we have to address regarding the structure of the data. We should split the review date from the review text column, trim the whitespace from the coffee shop names (converting them back to a factor afterwards), and remove the " star rating" portion of the rating string.
AustYelp$coffee_shop_name %<>% trimws(which = "both") %>% as.factor()
AustYelp$star_rating %<>% gsub(pattern = " star rating ", replacement = "") %>% trimws(which = "both")
AustYelpSplit <- AustYelp$full_review_text %>% trimws(which = "both") %>% stringr::str_split_fixed(pattern = " ", n = 2) %>% as.data.frame()
names(AustYelpSplit) <- c("Date", "Reviewtext")
AustYelp %<>% cbind( AustYelpSplit)
AustYelp <- AustYelp[,-2]
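As a quick sanity check, we can glance at a couple of the cleaned review strings (output not shown here):
head(as.character(AustYelp$Reviewtext), 2)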
Taking a quick look at the data, we see that some reviews contain the number of “check-ins”. Let’s eliminate that part of the string using gsub again, and let’s also make the review score a numeric variable:
# Use \\d+ so multi-digit check-in counts are removed as well
AustYelp$Reviewtext %<>% gsub(pattern = "check-ins", replacement = "check-in") %>% gsub(pattern = "\\d+ check-in ", replacement = "")
AustYelp$star_rating %<>% as.numeric()
Let’s take a look at the number of occurrences of each 1-5 star rating, as well as the distribution of the average rating for each business.
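A quick tally gives us the exact counts before we plot (a minimal check):
AustYelp %>% count(star_rating)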
AustYelpRev <- AustYelp %>% ggplot(aes(x = star_rating, fill = as.factor(star_rating))) + geom_histogram(show.legend = FALSE, binwidth = 1)
AustYelpRev %>% ggplotly()
We can see that coffee houses in Austin have an awfully high proportion of 5 star reviews, probably more than we would expect. Only around 1,500 of the 7,616 reviews are 3 stars or lower, which leaves roughly 6,000 reviews at 4 stars or more. Let’s get a good look at the distribution of the average score per coffee house.
histplotmean <- AustYelp %>% group_by(coffee_shop_name) %>% summarize(mean = mean(star_rating)) %>% ggplot(aes(mean)) + geom_histogram(binwidth = 0.1)
histplotmean %>% ggplotly()
Let’s get a distribution of the number of reviews per cafe:
ReviewNumGroup <- AustYelp %>% group_by(coffee_shop_name) %>% count(coffee_shop_name) %>% ggplot(aes(x = n)) + geom_histogram(binwidth = 35)
ReviewNumGroup %>% ggplotly()
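We can also check exactly how many shops clear 100 reviews (a quick verification reusing the same grouped counts):
AustYelp %>% count(coffee_shop_name) %>% filter(n >= 100) %>% nrow()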
Out of the 79 coffee houses in our data set, 60 have 100 reviews or more, which is a decent sample size per shop. Let’s see how the number of reviews trends over time. While the plot is considerably noisy, it doesn’t obscure the fact that the rate of reviews began to increase around 2015.
AusDateNum <- AustYelp %>% group_by(Date) %>% count(Date)
AusDateNum$Date <- as.Date(AusDateNum$Date, "%m/%d/%Y")
AusDatePlot <- ggplot(AusDateNum, aes(Date, n)) + geom_line()
AusDatePlot %>% ggplotly()
Now to begin the objective of this post. Let’s take a good look at the most commonly used words across the entire aggregate of reviews. To do this, we first have to tokenize the text using unnest_tokens, which requires a name for the new column that will contain the individual words, and the column to tokenize. This function converts all words to lower case by default, but if you wish to keep the capitalization of words, you can set to_lower = F. This might be appropriate if you wish to pick out words that are emphasized in reviews, and/or wish to analyze the 20 latest click-bait titles that will BLOW YOUR MIND.
AustYelpWC <- AustYelp %>% mutate(id = row_number()) %>% unnest_tokens(word, Reviewtext) %>% anti_join(stop_words, by = "word")
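# Note: unnest_tokens lowercases by default; the capitalization-preserving
# variant mentioned above would look like this (not used further here):
# AustYelp %>% unnest_tokens(word, Reviewtext, to_lower = FALSE)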
## Count the words
AustYelpWC %<>% count(word) %>% mutate(word2 = fct_reorder(word,desc(n))) %>% top_n(25,n)
## Plot using ggplot2
Pwc <- AustYelpWC %>% ggplot(aes(word2, n)) + geom_col() + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))
ggplotly(Pwc)
The first two words are not of much use to us. The most frequently used word is “coffee”, followed by “austin”, which makes sense since we’re reviewing coffee houses in Austin, but it’s not very informative. There are some other words that may not help us out here, such as “love”, “sweet”, and “time”, but you can eliminate certain stop words at your own discretion. Let’s just get rid of “coffee” and “austin” for now and then replot.
custom_stop <- tribble(~word, ~ lexicon, "coffee", "CUSTOM", "austin", "CUSTOM" )
custom_stop <- stop_words %>% bind_rows(custom_stop)
AustYelpWC <- AustYelp %>% mutate(id = row_number()) %>% unnest_tokens(word, Reviewtext) %>% anti_join(custom_stop, by = "word")
## Count the words
AustYelpWC %<>% count(word) %>% mutate(word2 = fct_reorder(word,desc(n))) %>% top_n(30,n)
## Plot using ggplot2
Pwc <- AustYelpWC %>% ggplot(aes(word2, n)) + geom_col() + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))
ggplotly(Pwc)
Now we’re able to discern more relevant words. We can assume “love”, “nice”, and “friendly” are words that commonly appear in reviews offering praise, while some other words like “seating”, “service”, and “food” remain ambiguous.
We can try to suppress some of that ambiguity by looking at reviews with lower scores and higher scores separately. To do this, we’ll take only the 1, 2 & 5 star reviews. We can argue that there is not much difference between the 1 and 2 star reviews, and combining them gives us a sample size comparable to the 5 star reviews. First we unnest the tokens, filter out stop words, and perform a word count for each unique word grouped by the review’s star rating (i.e., count the number of times an ambiguous word like “coffee” appears in each group of star ratings).
BestWorst <- AustYelp %>% mutate(id = row_number()) %>% unnest_tokens(word, Reviewtext) %>% anti_join(custom_stop, by = "word") %>% group_by(star_rating) %>% filter(star_rating %in% c(1,2,5)) %>% count(word)
BestWorst$star_rating[BestWorst$star_rating %in% c(1,2)] <- 1
# Re-sum the counts so words from the merged 1 & 2 star groups aren't duplicated
BestWorst %<>% group_by(star_rating, word) %>% summarize(n = sum(n)) %>% ungroup()
BestWorst$word %<>% as.factor()
BestWorstFilt <- BestWorst %>% group_by(star_rating) %>% mutate(word2 = fct_reorder(word, desc(n))) %>% top_n(20, n)
BWWCP <-BestWorstFilt %>% ggplot(aes(word2, n)) + geom_col() + facet_wrap(~star_rating, scales = "free")+ theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5)) + labs(title = "Best & Worst Review Word Count", x = "", y = "Count")
ggplotly(BWWCP)
We can tell that major sources of complaints in the 1 & 2 star reviews are “time”, “service”, “drinks”, and “food”, while the 5 star reviews feature “service”, “staff”, “atmosphere”, and “gelato” as sources of praise. Next, let’s get a better sense of how key words appearing in reviews tie into the overall star rating received by each coffee house.
To get a better sense of the correlation between key sentiment words contained in reviews and the average star rating, we must first familiarize ourselves with sentiment dictionaries. There are several sentiment dictionaries available through R; the most common are “bing”, “afinn”, and “nrc”. Bing and NRC map categorical labels to each word: for Bing, it’s simply “positive” or “negative”, while NRC’s sentiments are much more nuanced and include categories such as “trust”, “fear”, and “anger”. Afinn maps each word to a numerical value from -5 to +5 that indicates valence (bad vs. good). Here’s a quick look at afinn:
get_sentiments("afinn")[c(1,30,606,900,1300),]
## # A tibble: 5 x 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 accidental -2
## 3 delights 3
## 4 exempt -1
## 5 influential 2
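For comparison, here’s a quick way to peek at the categories the NRC lexicon maps (newer tidytext versions may prompt you to download the lexicon via the textdata package):
get_sentiments("nrc") %>% count(sentiment)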
Let’s visualize the change in NRC sentiment terms between 1, 3 & 5 star reviews. We can do this by filtering out the stop words after applying unnest_tokens, and then applying an inner join with the NRC sentiment dictionary. We assign arbitrary valence values to each sentiment category so we can color the bars from negative to positive.
#Group by coffee shop name, tokenize each word, filter out stop words, and join nrc sentiment dictionary.
AustSent <- AustYelp %>% group_by(coffee_shop_name) %>% unnest_tokens(word, Reviewtext) %>% anti_join(custom_stop, by = "word") %>% inner_join(get_sentiments("nrc"), by = "word")
#Select only the star rating, word and sentiment of the word, group by the star rating and count the sentiment.
AustSent %<>% ungroup() %>% select(star_rating, word, sentiment) %>% group_by(star_rating) %>% count(sentiment)
# group by the star rating and find the percentage of each sentiment term. Filter out 2 & 4
AustSent %<>% group_by(star_rating) %>% mutate(SentPerc = n/sum(n) * 100) %>% filter(star_rating %in% c(1,3,5))
# Apply arbitrary values to valence for color fill
AustSent[AustSent$sentiment %in% c("anger", "disgust", "fear", "negative", "sadness"),"valence"] <- -1
AustSent[AustSent$sentiment %in% c("anticipation", "joy", "positive", "trust"),"valence"] <- 1
AustSent[AustSent$sentiment %in% c("surprise"),"valence"] <- 0
# plot
sentplot <- ggplot(AustSent, aes(sentiment, SentPerc, fill = valence)) + facet_wrap(~star_rating) + geom_col() +scale_fill_gradient(low='#d6634d', high='#3961a0') + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))
ggplotly(sentplot)
We can see in the plot above that the share of negatively valenced terms decreases as the review score increases. Comparing the one and five star reviews, the percentage of words that describe anger, disgust, and fear is roughly halved, while the percentage of words with positive connotations roughly doubles. While the changes in the makeup of positive and negative terms across these three review scores aren’t striking, they are definitely discernible.
Another interesting aspect of our data to analyze is the correlation between the average Yelp scores assigned by reviewers and the average sentiment of their reviews, using the “afinn” sentiment dictionary. To do this, we (again) need to tokenize each word, grouping by coffee shop name. After that we anti-join the stop words to avoid any distortion in our sentiment analysis caused by irrelevant words. Then we just need to join the sentiment dictionary and find the average value of the corresponding “afinn” valence. After visualizing with ggplot and running a basic linear model, we end up with the following scatterplot:
AustSent <- AustYelp %>% group_by(coffee_shop_name) %>% unnest_tokens(word, Reviewtext) %>% anti_join(custom_stop, by = "word") %>% inner_join(get_sentiments("afinn"), by = "word")
AustSent %<>% group_by(coffee_shop_name) %>% mutate(meanstar = mean(star_rating), meanscore = mean(score)) %>% select(coffee_shop_name, meanstar, meanscore) %>%unique()
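# The intercept and slope hard-coded in geom_abline() below presumably come
# from a simple linear model of mean sentiment on mean star rating; a sketch:
sentfit <- lm(meanscore ~ meanstar, data = AustSent)
coef(sentfit)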
linfit <- ggplot(AustSent,aes(meanstar, meanscore, word = coffee_shop_name, size = 0.1, color = meanscore/meanstar)) + geom_point() + geom_abline(intercept = -1.1571, slope = 0.608, color = "red")
ggplotly(linfit)
Now, while there are some outliers, we can tell that there’s an obvious correlation between the average aggregate sentiment of reviews and the average star rating assigned by users. To figure out the Pearson correlation coefficient, and from it the coefficient of determination (R^2), we can use the “cor” function.
cor(AustSent$meanstar, AustSent$meanscore)
## [1] 0.7057827
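Squaring this Pearson coefficient gives the coefficient of determination, which works out to roughly 0.50:
cor(AustSent$meanstar, AustSent$meanscore)^2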
Sentiment analysis has a multitude of practical applications, not only in hospitality, but in areas such as journalism, financial analysis, and product reviews. Recently, Kaggle ran a competition in collaboration with Two Sigma that aimed to tie the effect of news on stocks to market movement. Sentiment analysis would be a great component of such a project.
I hope this blog post was as informative to read as it was a pleasure to write! Until next time.