How to select unique rows from data frame subject to conditions using dplyr or base R?

Time:09-20

I have the following data frame called test:

> test
  concat grpRnk
1    1.1      1
2    1.2      1
3    2.1      3
4    2.1      2
5    2.2      3
6    2.2      2
7    3.1      4
8    3.2      4

When I run the dplyr code test %>% distinct(concat, .keep_all = TRUE), I get the following output, which keeps one row for each unique value in the concat column:

> test %>% distinct(concat, .keep_all = TRUE)
  concat grpRnk
1    1.1      1
2    1.2      1
3    2.1      3
4    2.2      3
5    3.1      4
6    3.2      4

How do I modify this code to instead remove rows 3 and 5 of the original test data frame, where grpRnk is 3 in both? The current code removes the duplicates where grpRnk = 2. A base R solution is fine too!

Here's the code for generating test data frame:

test <- data.frame(concat = c(1.1,1.2,2.1,2.1,2.2,2.2,3.1,3.2),
                   grpRnk = c(1,1,3,2,3,2,4,4))

CodePudding user response:

distinct() keeps the first row it encounters for each value of concat. So sort by grpRnk first; the row with the lowest grpRnk for each concat value is then the one that is kept.

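library(dplyr)

# Sort by grpRnk so the lowest rank per concat comes first, then keep that first row
test %>%
  arrange(grpRnk) %>%
  distinct(concat, .keep_all = TRUE)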
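
Since a base R solution was also asked for, here is a minimal sketch of the same sort-then-deduplicate idea using order() and duplicated(). The names sorted and result are only illustrative, and it assumes the row to keep is always the one with the lowest grpRnk.

```r
test <- data.frame(concat = c(1.1, 1.2, 2.1, 2.1, 2.2, 2.2, 3.1, 3.2),
                   grpRnk = c(1, 1, 3, 2, 3, 2, 4, 4))

# Sort by grpRnk so the lowest rank for each concat value comes first
sorted <- test[order(test$grpRnk), ]

# duplicated() flags every occurrence after the first,
# so negating it keeps only the first row per concat value
result <- sorted[!duplicated(sorted$concat), ]
result
```

The original row names are preserved, so you can see that rows 3 and 5 of test are the ones dropped.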

If, as you write, which row to drop depends on the values in other columns, it may be safer to take an intermediate step and create a new variable that flags every row whose concat value occurs more than once. This gives you more control, and you can delete the rows in a separate step.

test %>% 
  mutate(dup = duplicated(concat, fromLast = TRUE) | duplicated(concat))
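
The separate deletion step could then look something like the sketch below. It assumes, going by the question, that the rows to drop are the flagged ones where grpRnk is 3; the names flagged and cleaned are only illustrative, and the condition should be adjusted to your real rule.

```r
library(dplyr)

test <- data.frame(concat = c(1.1, 1.2, 2.1, 2.1, 2.2, 2.2, 3.1, 3.2),
                   grpRnk = c(1, 1, 3, 2, 3, 2, 4, 4))

# Flag every row whose concat value occurs more than once
flagged <- test %>%
  mutate(dup = duplicated(concat, fromLast = TRUE) | duplicated(concat))

# Drop only rows that are flagged AND have grpRnk == 3 (assumed condition),
# then remove the helper column
cleaned <- flagged %>%
  filter(!(dup & grpRnk == 3)) %>%
  select(-dup)
cleaned
```

Unlike the arrange() approach, this never touches rows whose concat value is unique, whatever their grpRnk.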