I have two data frames, DF1, DF2, each with two columns (a, b). One column (a) is a unique identifier the other is a column (b) with elements that contain a list. The list contains label names. I would like to search DF2$b elements to see if they are contained inside DF1$b, if so I'd like to create a new column, DF2$c, which takes the identifier in DF1a. The tricky part, is that I'd only like to take the unique identifier if it's the smallest union that exists in the data frame. As some background, this data is from a phylogenetic tree. The DF2 is a subsample of DF1. All tips in DF2 are contained in DF1. I want to compare the nodes of DF2 to DF1 (the node names are different), but I can identify the nodes from the tips that are descendents from each.
It would be easier if I explain with an example:
df1 <- data.frame(a = c(1486, 1485, 1484, 1483, 1482, 1481, 1480, 1479))
df1$b = list(c("KC792204", "KF150733", "KC792205"), c("KC792204", "KF150733", "KC792205", "JX987740", "KX148108", "JX987724"), c("KC792204", "KF150733", "KC792205", "KC791848"), c("KJ201900", "KJ201899", "KF535207"), c("KJ201900", "KJ201899", "KF535207", "AB817119", "AB817100"), c("GU731662", "GU731661", "KP319229", "KY428876"), c("GU731662", "GU731661", "MT826960"), c("GU731662", "GU731661", "MT826960", "AM689535", "GU731663"))
df2 <- data.frame(a = c(8645, 1247, 5879, 1548, 2487, 1245, 1247, 3695))
df2$b = list(c("KC792204", "KF150733"), c("KC792204", "KC792205", "KC791848"), c("KJ201900", "KF535207"), c("KC792204", "JX987740", "KX148108", "JX987724"), c("GU731662", "GU731661", "MT826960", "GU731663"), c("KJ201900", "KJ201899", "AB817119", "AB817100"), c("GU731661", "KP319229", "KY428876"), c("GU731662", "MT826960"))
I'd like to create a new column in df2, df2$c, which identifies the smallest list (or node) in df1 that contains df2$b. This new column is made by df1$a (the unique identifier). In the example, df2$c (in order would be)
c("1486,1484,1483,1485,1479,1482,1481,1480")
To take the first two as an example:
df2$a is c("KC792204", "KF150733")
This can be found in df1$b[1], df1$b[2], df1$b[3], or 1486, 1485, or 1484.
Since I am looking for the smallest length list, the result is 1486. 1486 is the smallest length list that contains all labels that are searched. The next list in df2$b is c("KC792204", "KF150733", "KC791848")
. This result is 1484, since only list 1484 in df$1b contains those three labels.
I have tried:
df2$c <- ifelse(df2$b %in% df1$b, df1$a, 'other')
But I am instead comparing the lists as a whole rather than the elements inside each list. I also need to find the smallest of the lists that contain the searched labels.
CodePudding user response:
Here is one option:
library(data.table) # for %chin%
df1 <- data.frame(a = c(1486, 1485, 1484, 1483, 1482, 1481, 1480, 1479))
df1$b = list(c("KC792204", "KF150733", "KC792205"), c("KC792204", "KF150733", "KC792205", "JX987740", "KX148108", "JX987724"), c("KC792204", "KF150733", "KC792205", "KC791848"), c("KJ201900", "KJ201899", "KF535207"), c("KJ201900", "KJ201899", "KF535207", "AB817119", "AB817100"), c("GU731662", "GU731661", "KP319229", "KY428876"), c("GU731662", "GU731661", "MT826960"), c("GU731662", "GU731661", "MT826960", "AM689535", "GU731663"))
df2 <- data.frame(a = c(8645, 1247, 5879, 1548, 2487, 1245, 1247, 3695))
df2$b = list(c("KC792204", "KF150733"), c("KC792204", "KC792205", "KC791848"), c("KJ201900", "KF535207"), c("KC792204", "JX987740", "KX148108", "JX987724"), c("GU731662", "GU731661", "MT826960", "GU731663"), c("KJ201900", "KJ201899", "AB817119", "AB817100"), c("GU731661", "KP319229", "KY428876"), c("GU731662", "MT826960"))
df2$c <- df1$a[
Rfast::colMaxs(
outer(
seq_along(df1$b),
seq_along(df2$b),
function(i, j) mapply(
function(x, y) all(y %chin% x),
df1$b[i],
df2$b[j]
)
)/lengths(df1$b)
)
]
df2$c
#> [1] 1486 1484 1483 1485 1479 1482 1481 1480
If it's possible for a row to have no match, then the above should be modified:
m <- outer(
seq_along(df1$b),
seq_along(df2$b),
function(i, j) mapply(
function(x, y) all(y %chin% x),
df1$b[i],
df2$b[j]
)
)
df2$c <- ifelse(colSums(m) == 0L, NA, df1$a[Rfast::colMaxs(m/lengths(df1$b))])
CodePudding user response:
Here is an approach, using data.table
, and a helper function
library(data.table)
setDT(df1)[, l:=sapply(b,length)]
f <- function(k) df1[sapply(df1$b,\(i) all(k %chin% i))][l==min(l),a]
setDT(df2)[, c:=sapply(b,f)]
Output (df2)
a b c
<num> <list> <num>
1: 8645 KC792204,KF150733 1486
2: 1247 KC792204,KC792205,KC791848 1484
3: 5879 KJ201900,KF535207 1483
4: 1548 KC792204,JX987740,KX148108,JX987724 1485
5: 2487 GU731662,GU731661,MT826960,GU731663 1479
6: 1245 KJ201900,KJ201899,AB817119,AB817100 1482
7: 1247 GU731661,KP319229,KY428876 1481
8: 3695 GU731662,MT826960 1480
Explanation:
- Line 1: load library
- Line 2: adds a column to
df1
that indicates the length (l
) of the vector inb
- Line 3: defines helper function (
f
) that receives a character vector (k
), checks to limit the rows indf1
to those for which all elements ofk
are found inb
, and of these rows, returns thea
value for whichl
is minimized - Line 4: applies
f
to each value ofb
indf2
, assigning the result toc