I need to train a regression model that will have 100 features, and I want to look for interaction effects between each of those 100 features and one other feature. I would like to do this programmatically, since this analysis is going to be recurring and I don't want to hand-write a new formula each time it is run. I want it to be automated. So how can I get a model like this:
Y ~ a*b + a*c + ... + a*z
But for 100 terms? How do I write the R formula to do this? Note that I will be using statsmodels in Python, but I think the syntax is the same.
CodePudding user response:
lm(Y ~ a * ., df)
For example:
lm(Sepal.Width ~ Sepal.Length * ., iris)
Call:
lm(formula = Sepal.Width ~ Sepal.Length * ., data = iris)

Coefficients:
                   (Intercept)                    Sepal.Length                    Petal.Length                     Petal.Width
                      -0.91350                         0.82954                         0.29569                         0.85334
             Speciesversicolor                Speciesvirginica       Sepal.Length:Petal.Length        Sepal.Length:Petal.Width
                       0.05894                        -0.89244                        -0.05394                        -0.04654
Sepal.Length:Speciesversicolor   Sepal.Length:Speciesvirginica
                      -0.32823                        -0.21910
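Since the question mentions statsmodels: as far as I know, patsy (the formula engine statsmodels uses) does not support R's `.` shorthand, so the right-hand side has to be spelled out from the column names. A minimal sketch of that, assuming the column names are plain identifiers; `interaction_formula` and the example column names are hypothetical, not part of any library. It relies on `a * (b + c)` expanding to the same terms as `a*b + a*c`.

```python
def interaction_formula(response, focal, columns):
    """Build 'response ~ focal * (x1 + x2 + ...)' from a list of column names."""
    # every column except the response and the focal feature gets interacted
    others = [c for c in columns if c not in (response, focal)]
    if not others:
        raise ValueError("need at least one feature besides the focal one")
    return f"{response} ~ {focal} * ({' + '.join(others)})"


# hypothetical column names, e.g. taken from list(df.columns)
columns = ["Y", "a", "b", "c", "d"]
formula = interaction_formula("Y", "a", columns)
print(formula)  # Y ~ a * (b + c + d)
```

Because the formula is rebuilt from whatever columns are present, nothing needs to be rewritten when the feature set changes between runs.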
CodePudding user response:
Here is an example of how to construct the desired string and then convert it to a formula:
paste("a", letters[2:26], sep = "*") |>
  paste(collapse = " + ") |>
  sprintf(fmt = "Y ~ %s") |>
  as.formula()
##> Y ~ a * b + a * c + a * d + a * e + a * f + a * g + a * h + a *
##>     i + a * j + a * k + a * l + a * m + a * n + a * o + a * p +
##>     a * q + a * r + a * s + a * t + a * u + a * v + a * w + a *
##>     x + a * y + a * z
CodePudding user response:
A solution in Python that builds the formula string with a loop:
# these would be the columns of a dataframe
effects_list = ['regressor_col', 'A', 'B', 'C', 'D', 'E', 'F']
interaction = effects_list[3]  # the feature to interact with every other feature
regressor = effects_list[0]    # the response column (left-hand side of the formula)
terms = []
for effect in effects_list:
    # skip the interaction feature itself and the response column
    if effect != interaction and effect != regressor:
        terms.append(effect + '*' + interaction)
formula = regressor + ' ~ ' + ' + '.join(terms)
print(formula)
# regressor_col ~ A*C + B*C + D*C + E*C + F*C
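One thing that can bite when the formula is generated from real dataframe columns: names containing spaces or starting with a digit are not valid in a patsy formula as written. patsy provides `Q("...")` for quoting such names. A rough sketch of a generator that quotes only when needed; `quote_term`, `interaction_formula_quoted`, and the column names are made up for illustration.

```python
import keyword


def quote_term(name):
    """Return the name as-is if it is a plain identifier, else wrap it in patsy's Q()."""
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    # repr() yields a valid Python string literal, including any escaping
    return f"Q({name!r})"


def interaction_formula_quoted(response, focal, columns):
    """Build 'response ~ focal * (x1 + x2 + ...)', quoting awkward column names."""
    others = [quote_term(c) for c in columns if c not in (response, focal)]
    return f"{quote_term(response)} ~ {quote_term(focal)} * ({' + '.join(others)})"


# hypothetical column names; two of them are not valid identifiers
columns = ["y", "dose", "age", "price (usd)", "2nd_floor"]
formula = interaction_formula_quoted("y", "dose", columns)
print(formula)  # y ~ dose * (age + Q('price (usd)') + Q('2nd_floor'))
```

The resulting string can then be passed to `statsmodels.formula.api.ols(formula, data=df)` in the usual way.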