I have 1 target variable and hundreds of predictor variables. I am trying to run a linear regression on one predictor variable at a time and then create a dataframe that saves all the univariate regression results (namely the variable name and p-value) using a for loop.
Here is my regression code in Python (X_data has all the predictor variables and y_data has the target variable):
import statsmodels.api as sm

for column in X_data:
    exog = sm.add_constant(X_data[column], prepend=False)
    mod = sm.OLS(y_data, exog)
    res = mod.fit()
    print(column, ' ', res.pvalues[column])
The printed results look like:
variable1 0.003
variable2 0.3
...
How can I create a pandas dataframe to save all the p_value results?
CodePudding user response:
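You can use apply with a lambda function for this. The model has to be fitted inside the lambda so that the p-value can be read from the result; apply then returns one p-value per column.
pvals = X_data.apply(lambda x: sm.OLS(y_data, sm.add_constant(x, prepend=False)).fit().pvalues[x.name])
df = pvals.to_frame('pval')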
CodePudding user response:
You can initialize an empty container, say a dict, before the loop, then populate it and construct the DataFrame afterwards.
import pandas as pd

d = {}
for column in X_data:
    exog = sm.add_constant(X_data[column], prepend=False)
    mod = sm.OLS(y_data, exog)
    res = mod.fit()
    d[column] = res.pvalues[column]

df = pd.DataFrame.from_dict(d, orient='index', columns=['pval'])
#             pval
# variable1  0.003
# variable2  0.300
If you need to store multiple pieces of information (coefficients, confidence intervals, standard errors...) then your dict can store a dict of attributes for each key.
d = {}
for column in X_data:
    ...
    d[column] = {'pval': res.pvalues[column], 'other_feature': ...}

print(d)
# {'variable1': {'pval': 0.003, 'other_feature': 'XX'},
#  'variable2': {'pval': 0.300, 'other_feature': 'YY'}}

df = pd.DataFrame.from_dict(d, orient='index')
#             pval other_feature
# variable1  0.003            XX
# variable2  0.300            YY