I'm trying to port this code from Python to R:
# Sort by user overall rating first
reviews = reviews.sort_values('review_overall', ascending=False)
# Keep the highest rating from each user and drop the rest
reviews = reviews.drop_duplicates(subset=['review_profilename', 'beer_name'], keep='first')
and I've written this equivalent in R:
reviews_df <- df[order(-df$review_overall), ]
library(dplyr)
df_clean <- distinct(reviews_df, review_profilename, beer_name, .keep_all = TRUE)
The problem is that Python gives me 1496263 records, while R gives me 1496596 records.
link to dataset: dataset
Can someone help me find my mistake?
CodePudding user response:
Without some sample data it's difficult to say for sure, but you might be looking for:
library(tidyverse)
df_clean <- reviews_df %>%
  arrange(desc(review_overall)) %>%
  distinct(across(c(review_profilename, beer_name)), .keep_all = TRUE)
This code sorts by review_overall in descending order, then keeps the first row (i.e. the one with the highest review_overall) for each combination of review_profilename and beer_name.
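For comparison, here is a minimal sketch of the same sort-then-deduplicate step in pandas on a toy data frame. The rows are made up for illustration and are not taken from the real dataset. It also shows that the number of rows left over equals the number of distinct (review_profilename, beer_name) pairs, so the sort order only decides which row survives, not how many. If the two counts differ between Python and R, it may be worth checking whether the key columns are read in identically in both (for example how empty or missing profile names are parsed), though I can't confirm that without the data.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the real dataset
reviews = pd.DataFrame({
    "review_profilename": ["ann", "ann", "bob", "bob", "ann"],
    "beer_name": ["IPA", "IPA", "IPA", "Stout", "Stout"],
    "review_overall": [3.5, 4.5, 4.0, 2.0, 5.0],
})

keys = ["review_profilename", "beer_name"]

# Sort so the highest rating comes first.
# mergesort is stable, so tied ratings keep their original order.
sorted_reviews = reviews.sort_values(
    "review_overall", ascending=False, kind="mergesort"
)

# Keep the first (highest-rated) row for each user/beer pair
deduped = sorted_reviews.drop_duplicates(subset=keys, keep="first")

# Number of distinct key pairs, computed without any sorting
n_pairs = reviews[keys].drop_duplicates().shape[0]

print(len(deduped), n_pairs)
```

Both numbers printed at the end should match, whatever the ratings are.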