Identifying list elements contained in another list that are both elements of a data frame-CodePudding

I have two data frames, DF1, DF2, each with two columns (a, b). One column (a) is a unique identifier the other is a column (b) with elements that contain a list. The list contains label names. I would like to search DF2$b elements to see if they are contained inside DF1$b, if so I'd like to create a new column, DF2$c, which takes the identifier in DF1a. The tricky part, is that I'd only like to take the unique identifier if it's the smallest union that exists in the data frame. As some background, this data is from a phylogenetic tree. The DF2 is a subsample of DF1. All tips in DF2 are contained in DF1. I want to compare the nodes of DF2 to DF1 (the node names are different), but I can identify the nodes from the tips that are descendents from each.

It would be easier if I explain with an example:

df1 <- data.frame(a = c(1486, 1485, 1484, 1483, 1482, 1481, 1480, 1479))
df1$b = list(c("KC792204", "KF150733", "KC792205"), c("KC792204", "KF150733", "KC792205", "JX987740", "KX148108", "JX987724"), c("KC792204", "KF150733", "KC792205", "KC791848"), c("KJ201900", "KJ201899", "KF535207"), c("KJ201900", "KJ201899", "KF535207", "AB817119", "AB817100"), c("GU731662", "GU731661", "KP319229", "KY428876"), c("GU731662", "GU731661", "MT826960"), c("GU731662", "GU731661", "MT826960", "AM689535", "GU731663"))

df2 <- data.frame(a = c(8645, 1247, 5879, 1548, 2487, 1245, 1247, 3695))
df2$b = list(c("KC792204", "KF150733"), c("KC792204", "KC792205", "KC791848"), c("KJ201900", "KF535207"), c("KC792204", "JX987740", "KX148108", "JX987724"), c("GU731662", "GU731661", "MT826960", "GU731663"), c("KJ201900", "KJ201899", "AB817119", "AB817100"), c("GU731661", "KP319229", "KY428876"), c("GU731662", "MT826960"))

I'd like to create a new column in df2, df2$c, which identifies the smallest list (or node) in df1 that contains df2$b. This new column is made by df1$a (the unique identifier). In the example, df2$c (in order would be)

c("1486,1484,1483,1485,1479,1482,1481,1480")

To take the first two as an example:

df2$a is c("KC792204", "KF150733")

This can be found in df1$b[1], df1$b[2], df1$b[3], or 1486, 1485, or 1484. Since I am looking for the smallest length list, the result is 1486. 1486 is the smallest length list that contains all labels that are searched. The next list in df2$b is c("KC792204", "KF150733", "KC791848"). This result is 1484, since only list 1484 in df$1b contains those three labels.

I have tried:

df2$c <- ifelse(df2$b %in% df1$b, df1$a, 'other')

But I am instead comparing the lists as a whole rather than the elements inside each list. I also need to find the smallest of the lists that contain the searched labels.

CodePudding user response：

Here is one option:

library(data.table) # for %chin%

df1 <- data.frame(a = c(1486, 1485, 1484, 1483, 1482, 1481, 1480, 1479))
df1$b = list(c("KC792204", "KF150733", "KC792205"), c("KC792204", "KF150733", "KC792205", "JX987740", "KX148108", "JX987724"), c("KC792204", "KF150733", "KC792205", "KC791848"), c("KJ201900", "KJ201899", "KF535207"), c("KJ201900", "KJ201899", "KF535207", "AB817119", "AB817100"), c("GU731662", "GU731661", "KP319229", "KY428876"), c("GU731662", "GU731661", "MT826960"), c("GU731662", "GU731661", "MT826960", "AM689535", "GU731663"))

df2 <- data.frame(a = c(8645, 1247, 5879, 1548, 2487, 1245, 1247, 3695))
df2$b = list(c("KC792204", "KF150733"), c("KC792204", "KC792205", "KC791848"), c("KJ201900", "KF535207"), c("KC792204", "JX987740", "KX148108", "JX987724"), c("GU731662", "GU731661", "MT826960", "GU731663"), c("KJ201900", "KJ201899", "AB817119", "AB817100"), c("GU731661", "KP319229", "KY428876"), c("GU731662", "MT826960"))

df2$c <- df1$a[
  Rfast::colMaxs(
    outer(
      seq_along(df1$b),
      seq_along(df2$b),
      function(i, j) mapply(
        function(x, y) all(y %chin% x),
        df1$b[i],
        df2$b[j]
      )
    )/lengths(df1$b)
  )
]
df2$c
#> [1] 1486 1484 1483 1485 1479 1482 1481 1480

If it's possible for a row to have no match, then the above should be modified:

m <- outer(
  seq_along(df1$b),
  seq_along(df2$b),
  function(i, j) mapply(
    function(x, y) all(y %chin% x),
    df1$b[i],
    df2$b[j]
  )
)
df2$c <- ifelse(colSums(m) == 0L, NA, df1$a[Rfast::colMaxs(m/lengths(df1$b))])

CodePudding user response：

Here is an approach, using data.table, and a helper function

library(data.table)
setDT(df1)[, l:=sapply(b,length)]
f <- function(k) df1[sapply(df1$b,\(i) all(k %chin% i))][l==min(l),a]
setDT(df2)[, c:=sapply(b,f)]

Output (df2)

       a                                   b     c
   <num>                              <list> <num>
1:  8645                   KC792204,KF150733  1486
2:  1247          KC792204,KC792205,KC791848  1484
3:  5879                   KJ201900,KF535207  1483
4:  1548 KC792204,JX987740,KX148108,JX987724  1485
5:  2487 GU731662,GU731661,MT826960,GU731663  1479
6:  1245 KJ201900,KJ201899,AB817119,AB817100  1482
7:  1247          GU731661,KP319229,KY428876  1481
8:  3695                   GU731662,MT826960  1480

Explanation:

Line 1: load library
Line 2: adds a column to df1 that indicates the length (l) of the vector in b
Line 3: defines helper function (f) that receives a character vector (k), checks to limit the rows in df1 to those for which all elements of k are found in b, and of these rows, returns the a value for which l is minimized
Line 4: applies f to each value of b in df2, assigning the result to c