I have a dataframe that has duplicates based on their identifying ID, but some of the columns are different. I'd like to keep the rows (or the duplicates) that have the extra bit of info. The structure of the df is as such.
id <- c("3235453", "3235453", "21354315", "21354315", "2121421")
Plan_name<- c("angers", "strasbourg", "Benzema", "angers", "montpellier")
service_line<- c("", "AMRS", "", "Therapy", "")
treatment<-c("", "MH", "", "MH", "")
df <- data.frame (id, Plan_name, treatment, service_line)
As you can see, the ID row has duplicates, but I'd like to keep the second duplicate where there is more info in treatment
and service_line
.
I have tried using
df[duplicated(df[,c(1,3)]),]
but it doesn't work as an empty df is returned. Any suggestions?
CodePudding user response:
Maybe you want something like this:
First we replace all blank with NA, then we arrange be Section.B
and finally slice()
first row from group:
library(dplyr)
df %>%
mutate(across(-c(id, Plan_name),~ifelse(.=="", NA, .))) %>%
group_by(id) %>%
arrange(Section.B, .by_group = TRUE) %>%
slice(1)
id Plan_name Section.B Section.C
<chr> <chr> <chr> <chr>
1 2121421 montpellier NA NA
2 21354315 angers MH Therapy
3 3235453 strasbourg MH AMRS
CodePudding user response:
Try with
library(dplyr)
df %>%
filter(if_all(treatment:service_line, ~ .x != ""))
-output
id Plan_name Section.B Section.C
1 3235453 strasbourg MH AMRS
2 21354315 angers MH Therapy
If we need ids with blanks and not duplicated as well
df %>%
group_by(id) %>%
filter(n() == 1|if_all(treatment:service_line, ~ .x != "")) %>%
ungroup
-output
# A tibble: 3 × 4
id Plan_name treatment service_line
<chr> <chr> <chr> <chr>
1 3235453 strasbourg "MH" "AMRS"
2 21354315 angers "MH" "Therapy"
3 2121421 montpellier "" ""