How to replicate Python code in R to find duplicates?

I'm trying to port this Python code to R:

# Sort by user overall rating first
reviews = reviews.sort_values('review_overall', ascending=False)

# Keep the highest rating from each user and drop the rest
reviews = reviews.drop_duplicates(subset=['review_profilename', 'beer_name'], keep='first')

This is what I've written in R:

reviews_df <- df[order(-df$review_overall), ]

library(dplyr)
df_clean <- distinct(reviews_df, review_profilename, beer_name, .keep_all = TRUE)

The problem is that Python gives me 1496263 records, while R gives me 1496596.

link to dataset: dataset

Can someone help me find my mistake?

CodePudding user response:

Without sample data it's difficult to say for sure, but you might be looking for:

library(tidyverse)
df_clean <- reviews_df %>%
  arrange(desc(review_overall)) %>%
  distinct(across(c(review_profilename, beer_name)), .keep_all = TRUE)

This code sorts the rows by review_overall in descending order and then, for each combination of review_profilename and beer_name, keeps only the first row (i.e. the one with the highest review_overall).
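One thing worth checking: the number of rows left after deduplication equals the number of distinct (review_profilename, beer_name) pairs, and the sort order only decides which row of each pair survives. So a difference in row counts probably comes from how the two tools read or compare the key columns rather than from the sorting, though I can't confirm that without the data. Here is a minimal pandas sketch on made-up toy values (not taken from the real dataset) that shows this:

```python
import pandas as pd

# Toy data: hypothetical values, only for illustration.
reviews = pd.DataFrame({
    "review_profilename": ["ann", "ann", "bob", "bob", "ann"],
    "beer_name": ["IPA", "IPA", "Stout", "Lager", "Stout"],
    "review_overall": [3.5, 4.5, 4.0, 2.0, 4.0],
})

keys = ["review_profilename", "beer_name"]

# Sort so the best rating comes first; mergesort is stable,
# so tied ratings keep their original relative order.
ranked = reviews.sort_values("review_overall", ascending=False, kind="mergesort")

# Keep the first (highest-rated) row for each key pair.
deduped = ranked.drop_duplicates(subset=keys, keep="first")

# Count distinct key pairs directly, without sorting at all.
n_pairs = len(reviews[keys].drop_duplicates())

print(len(deduped), n_pairs)
```

If you run the same count of distinct key pairs on the full data in both Python and R, before any sorting, the two numbers should already differ by the same amount. In that case, comparing how each side loaded review_profilename and beer_name would be the next step.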