Selection of unique lines with an additional condition

Time:08-24

Sample dataset

df <- data.frame(co = c(11.5,1.3,7.8,2.3,2.3,3.1,5.7,5.7,9.3),
                 factor = c(NA,NA,NA,3,NA,5,NA,6,0.3),
                 condition = c(NA,NA,NA,12.3,NA,13.5,NA,18.7,NA))

I want to remove duplicate rows with respect to the variable co.

library(dplyr)
df.2 <- distinct(df, co, .keep_all = TRUE)

I get the following result:

    co factor condition
1 11.5     NA        NA
2  1.3     NA        NA
3  7.8     NA        NA
4  2.3    3.0      12.3
5  3.1    5.0      13.5
6  5.7     NA        NA
7  9.3    0.3        NA

I would like the end result to be as follows:

    co factor condition
1 11.5     NA        NA
2  1.3     NA        NA
3  7.8     NA        NA
4  2.3    3.0      12.3
5  3.1    5.0      13.5
6  5.7    6.0      18.7
7  9.3    0.3        NA

That is, among rows that share the same value of co, I want to keep the row with the greatest value of factor. In this example the other row for co = 5.7 happens to have factor = NA, but that is a coincidence: if that row were co = 5.7, factor = 5.5, condition = 11.2, I would still want to get 5.7; 6; 18.7.

CodePudding user response:

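You can first sort your data so that, within each value of co, the rows with NA come last, and then apply distinct(), which keeps the first row it finds for each co.

Edit: Since you've updated your question, I also updated my answer. Sorting with arrange(desc(factor)) puts the row with the highest factor first within each co, and arrange() places NA last even under desc(), so distinct() keeps the row you want. Note that the result is sorted by co rather than kept in the original row order.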

library(dplyr)

df %>% 
  arrange(co, desc(factor), desc(condition)) %>% 
  distinct(co, .keep_all = TRUE)

    co factor condition
1  1.3     NA        NA
2  2.3    3.0      12.3
3  3.1    5.0      13.5
4  5.7    6.0      18.7
5  7.8     NA        NA
6  9.3    0.3        NA
7 11.5     NA        NA
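
The output above is ordered by co, while the desired result in the question keeps each co in the order it first appears. A minimal base R sketch of one way to get that ordering follows; the helper names first_seen, ord, sorted and result are made up for illustration, and the approach relies on order() placing NA last by default. It is an alternative to the dplyr pipeline, not a replacement for it.

```r
# Sample data from the question
df <- data.frame(co = c(11.5, 1.3, 7.8, 2.3, 2.3, 3.1, 5.7, 5.7, 9.3),
                 factor = c(NA, NA, NA, 3, NA, 5, NA, 6, 0.3),
                 condition = c(NA, NA, NA, 12.3, NA, 13.5, NA, 18.7, NA))

# Position at which each co value first appears (hypothetical helper)
first_seen <- match(df$co, unique(df$co))

# Sort by first appearance, then by factor descending;
# order() puts NA last by default, so non-missing factors win ties
ord <- order(first_seen, -df$factor)
sorted <- df[ord, ]

# Keep the first row of each co, i.e. the one with the largest factor
result <- sorted[!duplicated(sorted$co), ]
rownames(result) <- NULL
result
```

On the sample data this keeps 5.7; 6; 18.7 for co = 5.7 and leaves the remaining rows in their original order. If the other 5.7 row had factor = 5.5 instead of NA, the row with factor = 6 would still sort first and be kept.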