Matching two data frames with repeated keys in R-CodePudding

I want to match two data frames based on their IDs and append the repeated selection into a single list. Examples of the data frames I have are as below:

df1 = data.frame(ID = c("A", "B", "C", "D"), count = c(4, 6, 7, 8))
df2 = data.frame(ID = c("A", "B", "C", "D"), direction = c("up", "down", "up", "up"), reference = c(38, 39, 40, 40))

The two data frames look something like this:

ID	count
A	4
B	6
C	7
D	8

ID	direction	reference
A	up	38
A	down	39
C	up	40
D	up	40

Essentially, I want the final result to look like this, where the information from the df2 can be merged with df1 but with details concatenated into a single list and if multiple corresponding entries were found in df2, they can be stored as a list of lists:

ID	count	Info
A	4	c(up = 38, down = 39)
B	6	NA
C	7	c(up =40)
D	8	c(up =40)

I have tried merging two data frames, but this will result in multiple entries repeated as a new row. I'm considering trying to parse through the second data frame for a match as I did with the database environment previously, but I would appreciate it if you can let me know how to do such using data frame instead of environment. Of course, if there's a more efficient approach to this, that would be much appreciated too! Example of what I did to match df1 onto a database environment:

i=1
for (row in 1:nrow(df)){
  tmp <- paste(df$ID[i])
  tmp2 <- as.list(mget(tmp, mirbaseID2ACC, ifnotfound = NA))
  if (is.na(tmp2) == TRUE) {
    tmp <- paste(df$simpleID[i])
    tmp2 <- as.list(mget(tmp, mirbaseID2ACC, ifnotfound = NA))
  }
  if (i<(nrow(df) 1)) {
    df$ACCESSION[i] <- tmp2
    i=i 1
  }
}

*Note that the "simple" column is there for the actual data frame because the samples have a very specific ID (ie A-ab-2) but for the purpose of this example, we can assume that the ID itself is simple enough to be detected with the data frame I'm trying to

CodePudding user response：

You can group_by the ID column, then summarise the direction and reference columns into a list (use df3$info to look at the list) and left_join with df1.

library(dplyr)

df3 <- left_join(df1, 
          df2 %>% 
            group_by(ID) %>% 
            summarize(info = list(paste(direction, "=", reference))), 
          by = "ID")

  ID count               info
1  A     4 up = 38, down = 39
2  B     6               NULL
3  C     7            up = 40
4  D     8            up = 40

df3$info
[[1]]
[1] "up = 38"   "down = 39"

[[2]]
NULL

[[3]]
[1] "up = 40"

[[4]]
[1] "up = 40"

CodePudding user response：

Using `==` in outer then apply which as substitute for match which only matches once. Then subset df2 on the found matches and cobble together using paste and toString.

r1 <- cbind(df1, info=sapply(apply(outer(df1$ID, df2$ID, `==`), 1, which), \(x) {
  if (!length(x) == 0)
    toString(Reduce(\(y, z) paste0(y, '=', z), df2[x, -1]))
  else NA
}))
r1
#   ID count           info
# 1  A     4 up=38, down=39
# 2  B     6           <NA>
# 3  C     7          up=40
# 4  D     8          up=40

Or, to get the desired list column:

r2 <- cbind(df1, info=I(lapply(apply(outer(df1$ID, df2$ID, `==`), 1, which), \(x) {
  if (!length(x) == 0)
    Reduce(\(y, z) paste0(y, '=', z), df2[x, -1])
  else NA
})))
r2
#   ID count         info
# 1  A     4 up=38, d....
# 2  B     6           NA
# 3  C     7        up=40
# 4  D     8        up=40

where

r2$info
# [[1]]
# [1] "up=38"   "down=39"
# 
# [[2]]
# [1] NA
# 
# [[3]]
# [1] "up=40"
# 
# [[4]]
# [1] "up=40"

Note: R >= 4.1 used.

Data:

df1 <- structure(list(ID = c("A", "B", "C", "D"), count = c(4, 6, 7, 
8)), class = "data.frame", row.names = c(NA, -4L))

df2 <- structure(list(ID = c("A", "A", "C", "D"), direction = c("up", 
"down", "up", "up"), reference = c(38, 39, 40, 40)), class = "data.frame", row.names = c(NA, 
-4L))