How to check if a data frame contains a value defined in another data frame-CodePudding

I am trying to solve a problem: to create a new variable with values 0 or 1 to define if a data frame contains a value stored in one column. I asked this question previously (Check if subset contains said value) and I got a tip to use setDT() function. It works, but I have trouble using it for two conditions. To explain briefly:

I have two data frames:

###df 
sub_ID <- c("sub01", "sub01", "sub01", "sub01", "sub01", "sub01") 
condition <- c("A", "A", "A", "B", "B", "B") 
number <- c(1,2,3,1,2,3) 
df <- data.frame(sub_ID, condition, number) 

###test_df 
test_sub_ID <- c("sub01", "sub01", "sub01", "sub01", "sub01", "sub01") 
test_condition <- c("A", "A", "B", "B", "C", "C") 
test_number <- c(2,7,1,4,8,9) 
test_df <- data.frame(test_sub_ID, test_condition, test_number)

I want to check row-wise, if each number from df is present in test_df, according to two features. So - if in df I have the first row, the number is 1, I want to check if in test_df there was 1 for sub01, in condition A, together.
I got the answer that I should use this function: setDT(df)[test_df, new := number %in% test_number, on = .(sub_ID = test_sub_ID, condition = test_condition)]

Thank you, @akrun!

The problem is I want to have two conditions included at the same time, meanwhile, it seems not to work as I want it to.

Do you have any ideas on how to solve this problem?

CodePudding user response：

You could loop over the rows of df using subset in apply and count the nrow.

df$test <- apply(df, 1, \(x) nrow(subset(test_df, test_sub_ID == x[1] & 
                                           test_condition == x[2] & 
                                           test_number == x[3])))

df
#   sub_ID condition number test
# 1  sub01         A      1    0
# 2  sub01         A      2    1
# 3  sub01         A      3    0
# 4  sub01         B      1    1
# 5  sub01         B      2    0
# 6  sub01         B      3    0

This is similar to this for loop which should help you to understand the code and how to write such code by yourself. Use the apply above though because it is much faster!

df$test <- NA
for (i in seq_len(nrow(df))) {
  df$test[i] <- nrow(subset(test_df, test_sub_ID == df[i, 1] & 
                              test_condition == df[i, 2] & 
                              test_number == df[i, 3]))
}

CodePudding user response：

You can extend your data.table join approach:

setnames(setDT(test_df), new=names(df))
test_df[setDT(df), j=.N, by=.EACHI, on=.(sub_ID, condition, number)]

Output:

   sub_ID condition number test
1:  sub01         A      1    0
2:  sub01         A      2    1
3:  sub01         A      3    0
4:  sub01         B      1    1
5:  sub01         B      2    0
6:  sub01         B      3    0

Note: I use setnames() as a convenience only: Making the names of test_df the same as the names of df facilitates a more concise on= clause.