Programming a linear regression R model formula for 100 features to have an interaction with one

Time:02-01

I need to train a regression model that will have 100 features, and I want to look for interaction effects between each of those 100 features and one other feature. I would like to do this programmatically, since the analysis is going to be recurring and I don't want to write a new formula by hand each time it runs. So how can I get a model like this:

Y ~ a*b + a*c + ... + a*z

But for 100 terms? How do I get the R formula to do this? Note that I will be using statsmodels in Python, but I think the syntax is the same.

CodePudding user response:

lm(Y ~ a * ., df)

e.g.:

lm(Sepal.Width ~ Sepal.Length * ., iris)

Call:
lm(formula = Sepal.Width ~ Sepal.Length * ., data = iris)

Coefficients:
                   (Intercept)                    Sepal.Length                    Petal.Length                     Petal.Width  
                      -0.91350                         0.82954                         0.29569                         0.85334  
             Speciesversicolor                Speciesvirginica       Sepal.Length:Petal.Length        Sepal.Length:Petal.Width  
                       0.05894                        -0.89244                        -0.05394                        -0.04654  
Sepal.Length:Speciesversicolor   Sepal.Length:Speciesvirginica  
                      -0.32823                        -0.21910  
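Since the question mentions statsmodels: as far as I know, patsy (the formula parser statsmodels has traditionally used) has no `.` shorthand, so the R one-liner above may not carry over directly. A minimal sketch of a workaround is to expand the "everything else" part yourself from the column names. The function name `dot_style_formula` and its parameters are made up for illustration, and the column list stands in for something like `df.columns`.

```python
def dot_style_formula(response, focal, columns):
    """Build 'response ~ focal * (x1 + x2 + ...)' from column names.

    Hypothetical helper: the name and parameters are illustrative,
    not part of statsmodels or patsy.
    """
    # Every column except the response and the focal feature plays
    # the role that '.' plays in the R formula.
    others = [c for c in columns if c not in (response, focal)]
    if not others:
        raise ValueError("no columns left to interact with the focal feature")
    # focal * (b + c) expands to focal + b + c + focal:b + focal:c
    return f"{response} ~ {focal} * ({' + '.join(others)})"


columns = ["Y", "a", "b", "c", "d"]  # stand-in for list(df.columns)
formula = dot_style_formula("Y", "a", columns)
print(formula)  # Y ~ a * (b + c + d)
```

The resulting string could then be passed to something like `statsmodels.formula.api.ols(formula, data=df).fit()`, assuming `df` is a DataFrame containing those columns.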

CodePudding user response:

Here is an example of how to construct the required string and then convert it to a formula:

paste("a", letters[2:26], sep = "*")  |>
    paste(collapse = " + ") |>
    sprintf(fmt = "Y ~ %s") |>
    as.formula()
    
##> Y ~ a * b + a * c + a * d + a * e + a * f + a * g + a * h + a * 
##>     i + a * j + a * k + a * l + a * m + a * n + a * o + a * p + 
##>     a * q + a * r + a * s + a * t + a * u + a * v + a * w + a * 
##>     x + a * y + a * z

CodePudding user response:

A solution in Python that builds the formula string from the column names:

# These would be the columns of a DataFrame
effects_list = ['regressor_col', 'A', 'B', 'C', 'D', 'E', 'F']
interaction = effects_list[3]  # the feature every other feature interacts with
regressor = effects_list[0]    # the response, i.e. the left-hand side of the formula

# Pair every remaining column with the interaction term, skipping
# the response and the interaction term itself
terms = [
    effect + '*' + interaction
    for effect in effects_list
    if effect != interaction and effect != regressor
]
formula = regressor + ' ~ ' + ' + '.join(terms)

print(formula)
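One caveat for an automated run over 100 real columns: building the formula by string concatenation assumes every column name is a valid Python identifier. Names such as `Sepal.Length` or `unit price` are not, and patsy provides `Q("...")` for quoting them. Below is a rough sketch, assuming statsmodels is parsing the formula with patsy; `quote_name` and `interaction_formula` are hypothetical helper names, and the column names are invented for the example.

```python
import keyword


def quote_name(name):
    """Wrap a column name in Q("...") unless it is a plain identifier."""
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    # Escape backslashes and double quotes so the name survives
    # inside the Q("...") string literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'Q("{escaped}")'


def interaction_formula(response, focal, columns):
    """Build 'response ~ x1*focal + x2*focal + ...' with quoting."""
    others = [c for c in columns if c not in (response, focal)]
    terms = [f"{quote_name(c)}*{quote_name(focal)}" for c in others]
    return f"{quote_name(response)} ~ " + " + ".join(terms)


# Invented column names, standing in for list(df.columns)
cols = ["y", "A", "unit price", "dose", "2019"]
print(interaction_formula("y", "dose", cols))
# y ~ A*dose + Q("unit price")*dose + Q("2019")*dose
```

Wrapping the logic in a function like this means a recurring job only has to pass in the current column list rather than edit the formula.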