Home > Enterprise >  Matching values in different datasets by groups in R
Matching values in different datasets by groups in R

Time:04-20

I have the following two datasets:

df1 <- data.frame(
  "group" = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5), 
  "numbers" = c(55, 75, 60, 55, 75, 60,  55, 75, 60,  55, 75, 60,  55, 75, 60))
df2 <- data.frame(
  "group" = c(1, 1, 2, 2, 2, 3, 3, 4, 5), 
  "P1" = c(55, NA, 60, 55, 75, 75, 55, 55, 60),
  "P2" = c(55, 75, 55, 60, NA, 75, 55, NA, 60),
  "P3" = c(75, 55, 60, 75, NA, 75, 60, 55, 60))

In df1 each group has the same three numbers (in reality there are around 500 numbers).

I want to check whether the values in the column "numbers" in df1 are contained in the columns P1, P2, and P3 of df2. There are two problems I am stuck with. 1. the values in the numbers column of df1 can occur in different groups in df2 (defined by the group column in df1 and df2). 2. the datasets have different lengths. Is there a way to merge both datasets and have the following dataset:

df3 <- data.frame(
  "group"    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5), 
  "numbers"  = c(55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60,),
  "P1new"    = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1),
  "P2new"    = c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1),
  "P3new"    = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1))

where P1new (P2new and P3new respectively) contain the value 1 if df2$P1 contains the value in df1$numbers within the correct group (as I said numbers can reoccur in different groups). For example, P3 has the value 75 in group 1 but not in group 5. So in group 1 P3new would have a 1 and in group 5 P3new would have a 0. This question is similar to Find matching values in different datasets by groups in R but I could not adapt the code according to my objectives. So, I would really appreciate any help.

CodePudding user response:

Interesting question. Here's a way with dplyr functions:

library(dplyr)
df2 %>% 
  group_by(group) %>% 
  summarise(across(P1:P3, ~ list(unique(na.omit(.x))))) %>% 
  inner_join(df1, .) %>% 
  rowwise() %>% 
  mutate(across(P1:P3, ~  (numbers %in% .x)))

   group numbers    P1    P2    P3
   <dbl>   <dbl> <int> <int> <int>
 1     1      55     1     1     1
 2     1      75     0     1     1
 3     1      60     0     0     0
 4     2      55     1     1     0
 5     2      75     1     0     1
 6     2      60     1     1     1
 7     3      55     1     1     0
 8     3      75     1     1     1
 9     3      60     0     0     1
10     4      55     1     0     1
11     4      75     0     0     0
12     4      60     0     0     0
13     5      55     0     0     0
14     5      75     0     0     0
15     5      60     1     1     1
  • Related