Home > Back-end >  Using by argument programmatically in join functions dplyr
Using by argument programmatically in join functions dplyr

Time:07-21

I want to use the by argument of left_join function of dplyr package, passing objects with the string names which corresponds to column names of two different tables to by joined.

here is the code to build the two tables that I want to join:

library(tidyverse)
v1<-letters[1:3] %>% factor
v2<-letters[5:7] %>% factor
yname<-"W"

names_vars<-c("sex","st")

all_comb<-expand.grid(v1,v2)
colnames(all_comb)<-names_vars

f1<-paste0(names_vars,collapse = "*") %>% {paste0("~ ",.)} %>% formula

mat <- model.matrix(f1, all_comb)
int_terms<- colnames(mat)[-1] %>% str_subset(":")
terms<-setdiff(colnames(mat)[-1],int_terms)
subindx<-terms %>% str_remove_all(paste0(names_vars,collapse = "|"))

nterms<-data.frame(var=names_vars,
                   nterms=all_comb %>% apply(2,function(x)length(unique(x))-1))

terms_df<-data.frame(var=rep(nterms$var,nterms$nterms),terms=subindx,type="s")
terms_df$xs<-paste0("x_",1:nrow(terms_df))

int_terms_df<-int_terms %>% str_split_fixed(":",2) %>% data.frame()
colnames(int_terms_df)<-names_vars
int_terms_df<-int_terms_df %>% 
  apply(1:2,function(x)str_remove_all(x,paste0(names_vars,collapse = "|"))) %>% 
  data.frame()

(Sorry for the lengthy code to produce two tables). The two tables to be joined are int_terms_df and terms_df

int_terms_df
  sex st
1   b  f
2   c  f
3   b  g
4   c  g

terms_df[,c("terms","xs")]
  terms  xs
1     b x_1
2     c x_2
3     f x_3
4     g x_4

the join that I want to do is this:

left_join(int_terms_df,terms_df[,c("terms","xs")],by=c("sex"="terms")) %>% 
  left_join(terms_df[,c("terms","xs")],by=c("st"="terms")) %>% 
  mutate(xs=paste0(xs.x," ",xs.y),.keep ="unused")

Producing this table:

  sex st      xs
1   b  f x_1 x_3
2   c  f x_2 x_3
3   b  g x_1 x_4
4   c  g x_2 x_4

The problem is that this code will be the body of a function and I will need to operate over the by argument passing objects with the string name of the corresponding columns to the left_join function.

I have reviewed these resources (r1,r2) but the solutions provided are not working in my case, for example from r2:

xtemp<-"terms"
left_join(int_terms_df,terms_df[,c("terms","xs")],by=setNames(names_vars[1], xtemp))

produces the next error message:

Error in `left_join()`:
! Join columns must be present in data.
✖ Problem with `terms`.
Run `rlang::last_error()` to see where the error occurred.

I do note that examples presented in r1 and r2 are joins on the same data frame, in this case the tables to be joined are different data frames, I believe that this could be the source of failure for my intent to implement this in my case.

I appreciate any comments on this.

CodePudding user response:

The naming should be reversed in setNames i.e. whatever is the name of the variable, it should match the first dataset by variable

library(dplyr)
left_join(int_terms_df,terms_df[,c("terms","xs")],
      by=setNames(xtemp, names_vars[1]))

-output

 sex st  xs
1   b  f x_1
2   c  f x_2
3   b  g x_1
4   c  g x_2

It is also TRUE for right_join

> right_join(int_terms_df,terms_df[,c("terms","xs")],
       by=setNames(xtemp, names_vars[1]))
  sex   st  xs
1   b    f x_1
2   c    f x_2
3   b    g x_1
4   c    g x_2
5   f <NA> x_3
6   g <NA> x_4
  • Related