I have a dataset with duplicated rows in the columns cell line, pathway, drug but the activity column has a different output. For example in the first two rows of the data frame below everything is the same from cell, drug to pathway except the activity, the first row has RESISTANT in the activity column and the second row has SENSITIVE in the activity column. I want to retain the second row, which has an output of SENSITIVE in the activity.
Could you please assist on how I can go about to do that. I want to do this for all rows in the data frame that have an output like that, I want to retain the second duplicated row.
**cell** **drug** **pathway** **activity**
AU656 5-FLORO OTHER RESISTANT
AU656 5-FLORO OTHER SENSITIVE
AU656 ALISERTIB MITOSIS INTERMEDIATE
AU656 ALISERTIB MITOSIS RESISTANT
AU656 AFITINIB EGFR SENSITIVE
AU656 AZD6482 PI3K INTERMEDIATE
AU656 DORAMAPIMOD JNK INTERMEDIATE
AU656 DORAMAPIMOD JNK SENSITIVE
CodePudding user response:
We group by cell, drug, pathway and slice
the 2nd row (if exists) by taking the min
imum of 2 and group size (n()
), thus for group sizes of 1, it returns the first row
library(dplyr)
df1 %>%
group_by(cell, drug, pathway) %>%
slice(min(2, n())) %>%
ungroup
-output
# A tibble: 5 × 4
cell drug pathway activity
<chr> <chr> <chr> <chr>
1 AU656 5-FLORO OTHER SENSITIVE
2 AU656 AFITINIB EGFR SENSITIVE
3 AU656 ALISERTIB MITOSIS RESISTANT
4 AU656 AZD6482 PI3K INTERMEDIATE
5 AU656 DORAMAPIMOD JNK SENSITIVE
data
df1 <- structure(list(cell = c("AU656", "AU656", "AU656", "AU656", "AU656",
"AU656", "AU656", "AU656"), drug = c("5-FLORO", "5-FLORO", "ALISERTIB",
"ALISERTIB", "AFITINIB", "AZD6482", "DORAMAPIMOD", "DORAMAPIMOD"
), pathway = c("OTHER", "OTHER", "MITOSIS", "MITOSIS", "EGFR",
"PI3K", "JNK", "JNK"), activity = c("RESISTANT", "SENSITIVE",
"INTERMEDIATE", "RESISTANT", "SENSITIVE", "INTERMEDIATE", "INTERMEDIATE",
"SENSITIVE")), class = "data.frame", row.names = c(NA, -8L))