I have a very large dataset of TripAdvisor reviews, and I am using the grepl function to count the occurrences of a word. I also want to calculate a conditional probability.
I want to count the number of times the word 'but' appears in the data frame and, in addition, how many of the reviews that contain 'but' are also helpful reviews. A helpful review is any review with a number of votes > 0. The code I have so far states which reviews are helpful and counts the number of times 'but' appears, but I want it to output a specific number.
Desired output: 'but' appears 1800 times; in 900 of those 1800 cases, the review is helpful.
Example of two reviews containing 'but':
"what a fantastic experience! this hotel has everything, amazing staff, gorgeous pool, super cool atmosphere, beautiful rooms and just a hop and a skip away from the busy kamari strip. it's about 5 minute walk from the main strip but you'll be glad not to be in the thick of things, this place is truly a sanctuary and the best!!"
"we stayed here for a week at the beginning of may. this was the start of the season, but the staff were full of enthusiasm and very friendly and helpful. despite the fact our thomson rep came to the hotel every morning we still went to the hotel staff for advice and recommendations for the island. they booked reservations for restaurants we visited and also arranged car hire for us. the hotel itself is very clean and tidy, you are welcomed into a courtyard area with palm trees and rustic seating areas. the reception is nice and bright and clean with a massive bookshelf full of books that you can borrow during your stay. breakfast area is nicely organised and the food is very good."
dfRev <- read.csv("reviews_final.csv", row.names = 1, stringsAsFactors = FALSE)
dfRev$review_body <- tolower(dfRev$review_body)
View(dfRev)
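# Label each review and keep the result instead of only printing it
dfRev$helpful <- ifelse(dfRev$helpful_votes > 0, "Review helpful", "Review not helpful")
# TRUE for every review whose text contains "but" (review_body is already lower case)
dfRev$but <- grepl("but", dfRev$review_body)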
Answer:
You can count the number of TRUE values using sum(). For example:
sum(TRUE)
[1] 1
sum(c(TRUE, TRUE))
[1] 2
sum(c(TRUE, TRUE, FALSE, FALSE, TRUE))
[1] 3
You can compare two vectors of logicals and return a count of pairs that are both TRUE:
a <- c(TRUE, FALSE, TRUE)
b <- c(FALSE, TRUE, TRUE)
sum(a & b)
[1] 1
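Since the question also asks for a conditional probability, the same idea can be taken one step further: dividing the count of pairs that are both TRUE by the count of TRUE values in the conditioning vector gives the share of b among the cases where a holds. A minimal sketch, reusing the toy vectors above (the name p_b_given_a is only illustrative):

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(FALSE, TRUE, TRUE)

# Estimated P(b | a): pairs where both are TRUE, divided by the cases where a is TRUE
p_b_given_a <- sum(a & b) / sum(a)
p_b_given_a
# [1] 0.5

# Equivalent: keep only the positions where a is TRUE and take the mean of b there
mean(b[a])
# [1] 0.5
```

If a contains no TRUE values at all, the division is 0 / 0 and the result is NaN, so that case may need a separate check.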
With your criteria:
data.frame(
  n_but = sum(grepl("but", dfRev$review_body)),
  n_helpful = sum(dfRev$helpful_votes > 0),
  n_but_helpful = sum(grepl("but", dfRev$review_body) & dfRev$helpful_votes > 0)
)
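Two caveats may be worth checking against your data. grepl("but", ...) flags any review that contains the letters "but" anywhere, so words such as "butter" also match, and it counts reviews rather than individual occurrences. The sketch below uses a small made-up data frame (toy, with the column names from the question) and the pattern \\bbut\\b to match the whole word only, then prints the counts as a sentence. The names toy, has_but, is_helpful and msg are illustrative and not part of your code.

```r
# Made-up stand-in for dfRev, using the column names from the question
toy <- data.frame(
  review_body = c("nice pool but noisy", "butter was great",
                  "small but clean", "lovely stay"),
  helpful_votes = c(2, 1, 0, 3),
  stringsAsFactors = FALSE
)

# \\b marks a word boundary, so "butter" does not count as "but"
has_but <- grepl("\\bbut\\b", toy$review_body)
is_helpful <- toy$helpful_votes > 0

n_but <- sum(has_but)
n_but_helpful <- sum(has_but & is_helpful)

# Build the sentence from the two counts and print it
msg <- sprintf("'but' appears in %d reviews; the review is helpful in %d of those %d",
               n_but, n_but_helpful, n_but)
cat(msg, "\n")
```

If you need the total number of occurrences rather than the number of reviews, something along the lines of sum(lengths(regmatches(x, gregexpr(pattern, x)))) should work, though that is a different quantity from the one computed above.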