Home > Software engineering >  Matching two data frames with repeated keys in R
Matching two data frames with repeated keys in R

Time:03-22

I want to match two data frames based on their IDs and append the repeated selection into a single list. Examples of the data frames I have are as below:

df1 = data.frame(ID = c("A", "B", "C", "D"), count = c(4, 6, 7, 8))
df2 = data.frame(ID = c("A", "B", "C", "D"), direction = c("up", "down", "up", "up"), reference = c(38, 39, 40, 40))

The two data frames look something like this:

ID count
A 4
B 6
C 7
D 8
ID direction reference
A up 38
A down 39
C up 40
D up 40

Essentially, I want the final result to look like this, where the information from the df2 can be merged with df1 but with details concatenated into a single list and if multiple corresponding entries were found in df2, they can be stored as a list of lists:

ID count Info
A 4 c(up = 38, down = 39)
B 6 NA
C 7 c(up =40)
D 8 c(up =40)

I have tried merging two data frames, but this will result in multiple entries repeated as a new row. I'm considering trying to parse through the second data frame for a match as I did with the database environment previously, but I would appreciate it if you can let me know how to do such using data frame instead of environment. Of course, if there's a more efficient approach to this, that would be much appreciated too! Example of what I did to match df1 onto a database environment:

i=1
for (row in 1:nrow(df)){
  tmp <- paste(df$ID[i])
  tmp2 <- as.list(mget(tmp, mirbaseID2ACC, ifnotfound = NA))
  if (is.na(tmp2) == TRUE) {
    tmp <- paste(df$simpleID[i])
    tmp2 <- as.list(mget(tmp, mirbaseID2ACC, ifnotfound = NA))
  }
  if (i<(nrow(df) 1)) {
    df$ACCESSION[i] <- tmp2
    i=i 1
  }
}

*Note that the "simple" column is there for the actual data frame because the samples have a very specific ID (ie A-ab-2) but for the purpose of this example, we can assume that the ID itself is simple enough to be detected with the data frame I'm trying to

CodePudding user response:

You can group_by the ID column, then summarise the direction and reference columns into a list (use df3$info to look at the list) and left_join with df1.

library(dplyr)

df3 <- left_join(df1, 
          df2 %>% 
            group_by(ID) %>% 
            summarize(info = list(paste(direction, "=", reference))), 
          by = "ID")

  ID count               info
1  A     4 up = 38, down = 39
2  B     6               NULL
3  C     7            up = 40
4  D     8            up = 40

df3$info
[[1]]
[1] "up = 38"   "down = 39"

[[2]]
NULL

[[3]]
[1] "up = 40"

[[4]]
[1] "up = 40"

CodePudding user response:

Using `==` in outer then apply which as substitute for match which only matches once. Then subset df2 on the found matches and cobble together using paste and toString.

r1 <- cbind(df1, info=sapply(apply(outer(df1$ID, df2$ID, `==`), 1, which), \(x) {
  if (!length(x) == 0)
    toString(Reduce(\(y, z) paste0(y, '=', z), df2[x, -1]))
  else NA
}))
r1
#   ID count           info
# 1  A     4 up=38, down=39
# 2  B     6           <NA>
# 3  C     7          up=40
# 4  D     8          up=40

Or, to get the desired list column:

r2 <- cbind(df1, info=I(lapply(apply(outer(df1$ID, df2$ID, `==`), 1, which), \(x) {
  if (!length(x) == 0)
    Reduce(\(y, z) paste0(y, '=', z), df2[x, -1])
  else NA
})))
r2
#   ID count         info
# 1  A     4 up=38, d....
# 2  B     6           NA
# 3  C     7        up=40
# 4  D     8        up=40

where

r2$info
# [[1]]
# [1] "up=38"   "down=39"
# 
# [[2]]
# [1] NA
# 
# [[3]]
# [1] "up=40"
# 
# [[4]]
# [1] "up=40"

Note: R >= 4.1 used.


Data:

df1 <- structure(list(ID = c("A", "B", "C", "D"), count = c(4, 6, 7, 
8)), class = "data.frame", row.names = c(NA, -4L))

df2 <- structure(list(ID = c("A", "A", "C", "D"), direction = c("up", 
"down", "up", "up"), reference = c(38, 39, 40, 40)), class = "data.frame", row.names = c(NA, 
-4L))
  • Related