I want to use the by
argument of left_join
function of dplyr
package, passing objects with the string names which corresponds to column names of two different tables to by joined.
here is the code to build the two tables that I want to join:
library(tidyverse)
v1<-letters[1:3] %>% factor
v2<-letters[5:7] %>% factor
yname<-"W"
names_vars<-c("sex","st")
all_comb<-expand.grid(v1,v2)
colnames(all_comb)<-names_vars
f1<-paste0(names_vars,collapse = "*") %>% {paste0("~ ",.)} %>% formula
mat <- model.matrix(f1, all_comb)
int_terms<- colnames(mat)[-1] %>% str_subset(":")
terms<-setdiff(colnames(mat)[-1],int_terms)
subindx<-terms %>% str_remove_all(paste0(names_vars,collapse = "|"))
nterms<-data.frame(var=names_vars,
nterms=all_comb %>% apply(2,function(x)length(unique(x))-1))
terms_df<-data.frame(var=rep(nterms$var,nterms$nterms),terms=subindx,type="s")
terms_df$xs<-paste0("x_",1:nrow(terms_df))
int_terms_df<-int_terms %>% str_split_fixed(":",2) %>% data.frame()
colnames(int_terms_df)<-names_vars
int_terms_df<-int_terms_df %>%
apply(1:2,function(x)str_remove_all(x,paste0(names_vars,collapse = "|"))) %>%
data.frame()
(Sorry for the lengthy code to produce two tables). The two tables to be joined are int_terms_df
and terms_df
int_terms_df
sex st
1 b f
2 c f
3 b g
4 c g
terms_df[,c("terms","xs")]
terms xs
1 b x_1
2 c x_2
3 f x_3
4 g x_4
the join that I want to do is this:
left_join(int_terms_df,terms_df[,c("terms","xs")],by=c("sex"="terms")) %>%
left_join(terms_df[,c("terms","xs")],by=c("st"="terms")) %>%
mutate(xs=paste0(xs.x," ",xs.y),.keep ="unused")
Producing this table:
sex st xs
1 b f x_1 x_3
2 c f x_2 x_3
3 b g x_1 x_4
4 c g x_2 x_4
The problem is that this code will be the body of a function and I will need to operate over the by
argument passing objects with the string name of the corresponding columns to the left_join
function.
I have reviewed these resources (r1,r2) but the solutions provided are not working in my case, for example from r2:
xtemp<-"terms"
left_join(int_terms_df,terms_df[,c("terms","xs")],by=setNames(names_vars[1], xtemp))
produces the next error message:
Error in `left_join()`:
! Join columns must be present in data.
✖ Problem with `terms`.
Run `rlang::last_error()` to see where the error occurred.
I do note that examples presented in r1 and r2 are joins on the same data frame, in this case the tables to be joined are different data frames, I believe that this could be the source of failure for my intent to implement this in my case.
I appreciate any comments on this.
CodePudding user response:
The naming should be reversed in setNames
i.e. whatever is the name of the variable, it should match the first dataset by
variable
library(dplyr)
left_join(int_terms_df,terms_df[,c("terms","xs")],
by=setNames(xtemp, names_vars[1]))
-output
sex st xs
1 b f x_1
2 c f x_2
3 b g x_1
4 c g x_2
It is also TRUE for right_join
> right_join(int_terms_df,terms_df[,c("terms","xs")],
by=setNames(xtemp, names_vars[1]))
sex st xs
1 b f x_1
2 c f x_2
3 b g x_1
4 c g x_2
5 f <NA> x_3
6 g <NA> x_4