How do I create interaction columns for all columns in tidyverse?

Time:02-24

I am trying to create interaction variables for all 20 variables in a data frame, so I expect to end up with 20 base variables and 380 interaction variables in total. For any single variable, I can create a data frame of its 19 interaction variables using:

in_sample[3:22] %>%
transmute(across(.cols = -c(frpm_frac_s), .fns = function(x){x*frpm_frac_s}))

But I am unable to iterate across the columns. I tried to use map over a vector of column names but am unable to get the function inside map to read as.symbol(character). Here is a sample of my data from dput:

structure(list(frpm_frac_s = c(0.870400011539459, 0.904699981212616, 
0.98089998960495, 0.838800013065338, 0.919900000095367, 0.837700009346008, 
0.84799998998642, 0.925999999046326, 0.963900029659271, 0.887899994850159
), enrollment_s = c(364, 608, 571, 705, 566, 838, 421, 757, 693, 
535), ell_frac_s = c(0.46000000834465, 0.334000021219254, 0.300999999046326, 
0.209999993443489, 0.706999957561493, 0.552999973297119, 0.412999987602234, 
0.359000027179718, 0.726000010967255, 0.646999955177307), edi_s = c(8, 
38, 39, 37, 11, 35, 15, 39, 9, 4), te_fte_s = c(23, 22, 20, 25, 
24.5, 36, 18, 30.2999992370605, 24.3999996185303, 19)), row.names = c(NA, 
10L), class = "data.frame")

When using:

 in_sample[3:22] %>%
    transmute(across(.cols = -c(frpm_frac_s), .fns = function(x){x*frpm_frac_s}))

I get:

structure(list(enrollment_s = c(316.825604200363, 550.057588577271, 
560.093894064426, 591.354009211063, 520.663400053978, 701.992607831955, 
357.007995784283, 700.981999278069, 667.982720553875, 475.026497244835
), ell_frac_s = c(0.400384012571335, 0.302169812922072, 0.295250895935631, 
0.17614799724412, 0.650369261028242, 0.463248082799339, 0.350223985351086, 
0.33243402482605, 0.699791432103968, 0.574471256869984), edi_s = c(6.96320009231567, 
34.3785992860794, 38.255099594593, 31.0356004834175, 10.118900001049, 
29.3195003271103, 12.7199998497963, 36.1139999628067, 8.67510026693344, 
3.55159997940063), te_fte_s = c(20.0192002654076, 19.9033995866776, 
19.617999792099, 20.9700003266335, 22.5375500023365, 30.1572003364563, 
15.2639998197556, 28.0577992646217, 23.5191603559875, 16.870099902153
)), row.names = c(NA, 10L), class = "data.frame")

I would like to do this for all variables and then cbind them together. Thank you for your help.

Answer:

You can use model.matrix to create interaction terms. (This is what's done under the hood in most modeling functions.)

m = model.matrix(~ .^2 - . + 0, data = df)
m
#    frpm_frac_s:enrollment_s frpm_frac_s:ell_frac_s frpm_frac_s:edi_s frpm_frac_s:te_fte_s
# 1                  316.8256              0.4003840            6.9632             20.01920
# 2                  550.0576              0.3021698           34.3786             19.90340
# 3                  560.0939              0.2952509           38.2551             19.61800
# 4                  591.3540              0.1761480           31.0356             20.97000
# 5                  520.6634              0.6503693           10.1189             22.53755
# 6                  701.9926              0.4632481           29.3195             30.15720
# 7                  357.0080              0.3502240           12.7200             15.26400
# 8                  700.9820              0.3324340           36.1140             28.05780
# 9                  667.9827              0.6997914            8.6751             23.51916
# 10                 475.0265              0.5744713            3.5516             16.87010
#    enrollment_s:ell_frac_s enrollment_s:edi_s enrollment_s:te_fte_s ell_frac_s:edi_s
# 1                  167.440               2912                8372.0            3.680
# 2                  203.072              23104               13376.0           12.692
# 3                  171.871              22269               11420.0           11.739
# 4                  148.050              26085               17625.0            7.770
# 5                  400.162               6226               13867.0            7.777
# 6                  463.414              29330               30168.0           19.355
# 7                  173.873               6315                7578.0            6.195
# 8                  271.763              29523               22937.1           14.001
# 9                  503.118               6237               16909.2            6.534
# 10                 346.145               2140               10165.0            2.588
#    ell_frac_s:te_fte_s edi_s:te_fte_s
# 1              10.5800          184.0
# 2               7.3480          836.0
# 3               6.0200          780.0
# 4               5.2500          925.0
# 5              17.3215          269.5
# 6              19.9080         1260.0
# 7               7.4340          270.0
# 8              10.8777         1181.7
# 9              17.7144          219.6
# 10             12.2930           76.0
# attr(,"assign")
#  [1]  1  2  3  4  5  6  7  8  9 10
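
Since the question asks to cbind the results back onto the original data, here is a minimal sketch of that step. It assumes a small stand-in data frame `df` with three rounded columns from the question; the `_x_` separator and the names `interactions` and `combined` are illustrative choices, not part of the original answer.

```r
# Stand-in data: three columns from the question, values rounded.
df <- data.frame(
  frpm_frac_s  = c(0.8704, 0.9047, 0.9809),
  enrollment_s = c(364, 608, 571),
  edi_s        = c(8, 38, 39)
)

# Interaction-only design matrix, built the same way as `m` above.
m <- model.matrix(~ .^2 - . + 0, data = df)

# model.matrix() returns a numeric matrix, so convert it before binding.
interactions <- as.data.frame(m)

# Optional: replace ":" in the names so the columns are easy to access with `$`.
names(interactions) <- gsub(":", "_x_", names(interactions), fixed = TRUE)

# Base columns followed by the pairwise interaction columns.
combined <- cbind(df, interactions)
names(combined)
```

With the tidyverse loaded, `dplyr::bind_cols(df, interactions)` should work in place of `cbind()`.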

Your math is a little off: because order doesn't matter in multiplication, there are n * (n - 1) / 2 possible pairs (the same as n choose 2), so you should expect 190 interaction columns for 20 input columns.
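
The count can be checked directly. This is a rough sketch using simulated data: `wide` is a made-up data frame of 20 random numeric columns, standing in for the real `in_sample[3:22]`.

```r
n <- 20

# Unordered pairs: x*y and y*x are the same column.
n_pairs <- choose(n, 2)
n_pairs          # 190

# Ordered pairs, which is where 380 comes from.
n * (n - 1)      # 380

# Simulated stand-in for the real data: 5 rows, 20 numeric columns (V1..V20).
set.seed(1)
wide <- as.data.frame(matrix(rnorm(5 * n), nrow = 5))

# The interaction-only model matrix has one column per unordered pair.
n_cols <- ncol(model.matrix(~ .^2 - . + 0, data = wide))
n_cols           # 190
```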

I wrote the formula to include only the interaction terms. You can use ~ .^2 + 0 to include the first-order terms too, or ~ .^2 to also include an intercept.
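
To see how the three formulas differ, it may help to compare the column names each one produces. The sketch below uses a made-up three-column data frame `toy` (columns `x`, `y`, `z`), not the data from the question.

```r
# Made-up data for illustration only.
toy <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6), z = c(7, 8, 10))

# Interaction terms only: no intercept, no first-order terms.
only_int <- colnames(model.matrix(~ .^2 - . + 0, data = toy))
only_int

# First-order terms plus interactions, no intercept.
with_main <- colnames(model.matrix(~ .^2 + 0, data = toy))
with_main

# Intercept, first-order terms, and interactions.
with_icpt <- colnames(model.matrix(~ .^2, data = toy))
with_icpt
```

For numeric columns like these, the first call should return only `x:y`, `x:z`, and `y:z`; the second adds `x`, `y`, and `z` in front; the third also adds `(Intercept)` as the first column.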
