I want to match two data frames based on their IDs and append the repeated selection into a single list. Examples of the data frames I have are as below:
df1 = data.frame(ID = c("A", "B", "C", "D"), count = c(4, 6, 7, 8))
df2 = data.frame(ID = c("A", "B", "C", "D"), direction = c("up", "down", "up", "up"), reference = c(38, 39, 40, 40))
The two data frames look something like this:
ID | count |
---|---|
A | 4 |
B | 6 |
C | 7 |
D | 8 |
ID | direction | reference |
---|---|---|
A | up | 38 |
A | down | 39 |
C | up | 40 |
D | up | 40 |
Essentially, I want the final result to look like this, where the information from the df2 can be merged with df1 but with details concatenated into a single list and if multiple corresponding entries were found in df2, they can be stored as a list of lists:
ID | count | Info |
---|---|---|
A | 4 | c(up = 38, down = 39) |
B | 6 | NA |
C | 7 | c(up =40) |
D | 8 | c(up =40) |
I have tried merging two data frames, but this will result in multiple entries repeated as a new row. I'm considering trying to parse through the second data frame for a match as I did with the database environment previously, but I would appreciate it if you can let me know how to do such using data frame instead of environment. Of course, if there's a more efficient approach to this, that would be much appreciated too! Example of what I did to match df1 onto a database environment:
i=1
for (row in 1:nrow(df)){
tmp <- paste(df$ID[i])
tmp2 <- as.list(mget(tmp, mirbaseID2ACC, ifnotfound = NA))
if (is.na(tmp2) == TRUE) {
tmp <- paste(df$simpleID[i])
tmp2 <- as.list(mget(tmp, mirbaseID2ACC, ifnotfound = NA))
}
if (i<(nrow(df) 1)) {
df$ACCESSION[i] <- tmp2
i=i 1
}
}
*Note that the "simple" column is there for the actual data frame because the samples have a very specific ID (ie A-ab-2) but for the purpose of this example, we can assume that the ID itself is simple enough to be detected with the data frame I'm trying to
CodePudding user response:
You can group_by
the ID
column, then summarise
the direction
and reference
columns into a list (use df3$info
to look at the list) and left_join
with df1
.
library(dplyr)
df3 <- left_join(df1,
df2 %>%
group_by(ID) %>%
summarize(info = list(paste(direction, "=", reference))),
by = "ID")
ID count info
1 A 4 up = 38, down = 39
2 B 6 NULL
3 C 7 up = 40
4 D 8 up = 40
df3$info
[[1]]
[1] "up = 38" "down = 39"
[[2]]
NULL
[[3]]
[1] "up = 40"
[[4]]
[1] "up = 40"
CodePudding user response:
Using `==`
in outer
then apply
which
as substitute for match
which only matches once. Then subset df2
on the found matches and cobble together using paste
and toString
.
r1 <- cbind(df1, info=sapply(apply(outer(df1$ID, df2$ID, `==`), 1, which), \(x) {
if (!length(x) == 0)
toString(Reduce(\(y, z) paste0(y, '=', z), df2[x, -1]))
else NA
}))
r1
# ID count info
# 1 A 4 up=38, down=39
# 2 B 6 <NA>
# 3 C 7 up=40
# 4 D 8 up=40
Or, to get the desired list column:
r2 <- cbind(df1, info=I(lapply(apply(outer(df1$ID, df2$ID, `==`), 1, which), \(x) {
if (!length(x) == 0)
Reduce(\(y, z) paste0(y, '=', z), df2[x, -1])
else NA
})))
r2
# ID count info
# 1 A 4 up=38, d....
# 2 B 6 NA
# 3 C 7 up=40
# 4 D 8 up=40
where
r2$info
# [[1]]
# [1] "up=38" "down=39"
#
# [[2]]
# [1] NA
#
# [[3]]
# [1] "up=40"
#
# [[4]]
# [1] "up=40"
Note: R >= 4.1 used.
Data:
df1 <- structure(list(ID = c("A", "B", "C", "D"), count = c(4, 6, 7,
8)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(ID = c("A", "A", "C", "D"), direction = c("up",
"down", "up", "up"), reference = c(38, 39, 40, 40)), class = "data.frame", row.names = c(NA,
-4L))