Eliminate duplicated rows based on another column in R

Time:04-25

I really need your help!

I did this dataframe to help you understand my problem:

df <- data.frame(code = c('A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', 'E', 'E'), Value = c('high', 'low', 'medium', 'low', 'high', 'low', 'medium', 'low', 'low', 'medium'))

What I need to do is:

  • eliminate duplicates in the first column based on the second column: if a code appears more than once, keep the row whose value is 'high'; if no 'high' is present, look for 'medium', and finally 'low'.

In the end I would like to obtain the following dataframe:

final_df <- data.frame(code = c('A', 'B', 'C', 'D', 'E'),
                       Value = c('high', 'high', 'low', 'medium', 'medium'))

I don't know how to do it. I tried nested for loops plus several ifs, but it does not work properly. Thanks in advance to whoever is going to help.

CodePudding user response:

library(dplyr)

df %>%
   group_by(code) %>%
   slice_max(factor(Value, c('low', 'medium', 'high')))

# A tibble: 5 x 2
# Groups:   code [5]
  code  Value 
  <chr> <chr> 
1 A     high  
2 B     high  
3 C     low   
4 D     medium
5 E     medium
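Note that slice_max() keeps every row that ties for the maximum, so a group with two 'high' rows would return both. A sketch of the same pipeline with with_ties = FALSE, which keeps exactly one row per code:

```r
library(dplyr)

df <- data.frame(code = c('A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', 'E', 'E'),
                 Value = c('high', 'low', 'medium', 'low', 'high', 'low', 'medium', 'low', 'low', 'medium'))

df %>%
  group_by(code) %>%
  # with_ties = FALSE guarantees a single row per group even if the top Value repeats
  slice_max(factor(Value, c('low', 'medium', 'high')), with_ties = FALSE) %>%
  ungroup()
```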

CodePudding user response:

We can arrange the rows and then use distinct on the 'code' column

library(dplyr)
df %>% 
 arrange(code, match(Value, c("high", "medium", "low"))) %>% 
 distinct(code, .keep_all = TRUE)

-output

  code  Value
1    A   high
2    B   high
3    C    low
4    D medium
5    E medium

If we want a base R solution

df1 <- df[with(df, order(code, match(Value, c("high", "medium", "low")))),]
df1[!duplicated(df1$code),]
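Another base R option (a sketch, assuming 'low', 'medium', 'high' are the only possible values): encode Value as an ordered factor so that max() returns the highest level within each group, then aggregate by code.

```r
df <- data.frame(code = c('A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', 'E', 'E'),
                 Value = c('high', 'low', 'medium', 'low', 'high', 'low', 'medium', 'low', 'low', 'medium'))

# Ordered factor: low < medium < high, so max() picks the best Value per code
df$Value <- factor(df$Value, levels = c('low', 'medium', 'high'), ordered = TRUE)
aggregate(Value ~ code, data = df, FUN = max)
```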