I really need your help!
I did this dataframe to help you understand my problem:
df <- data.frame(code = c('A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', 'E', 'E'), Value = c('high', 'low', 'medium', 'low', 'high', 'low', 'medium', 'low', 'low', 'medium'))
What I need to do is:
- eliminate duplicates from the first column based on the second column, so if we have a duplicate and between those rows is present the value 'high' pick that one, otherwise look for 'medium' and in the end 'low'.
In the end I would like to obtain the following dataframe:
final_df <- data.frame(code = c('A', 'B', 'C', 'D', 'E');
Value = c('high', 'high', 'low', 'medium', 'medium))
I don't know how to do it, I tried with nested for loops plus several if's but it does not work properly.. Thanks in advance to who is going to help.
CodePudding user response:
df %>%
group_by(code) %>%
slice_max(factor(Value, c('low', 'medium', 'high')))
# A tibble: 5 x 2
# Groups: code [5]
code Value
<chr> <chr>
1 A high
2 B high
3 C low
4 D medium
5 E medium
CodePudding user response:
We can arrange
the columns and then do distinct
on the 'code' column
library(dplyr)
df %>%
arrange(code, match(Value, c("high", "medium", "low"))) %>%
distinct(code, .keep_all = TRUE)
-output
code Value
1 A high
2 B high
3 C low
4 D medium
5 E medium
If we want a base R
solution
df1 <- df[with(df, order(code, match(Value, c("high", "medium", "low")))),]
df1[!duplicated(df1$code),]