How to put linear regression results (variable name, p_value) into a dataframe using a for loop?

I have 1 target variable and hundreds of predictor variables. I am trying to run a linear regression on one predictor variable at a time and then create a dataframe that stores all the univariate regression results (namely the variable name and p_value) using a for loop.

Here is my regression code in Python (X_data has all the predictor variables and y_data has the target variable):

import statsmodels.api as sm
for column in X_data:
    exog = sm.add_constant(X_data[column], prepend=False)
    mod = sm.OLS(y_data, exog)
    res = mod.fit()
    print(column, ' ', res.pvalues[column])

The printed results look like:

variable1 0.003
variable2 0.3

...

How can I create a pandas dataframe to save all the p_value results?

CodePudding user response：

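You can use apply with a lambda function for this. The model has to be fitted inside the lambda and the predictor's p-value pulled out; apply then returns a Series indexed by variable name, which to_frame turns into a DataFrame.

pvals = X_data.apply(lambda x: sm.OLS(y_data, sm.add_constant(x, prepend=False)).fit().pvalues[x.name])
df = pvals.to_frame('pval')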

CodePudding user response：

You can initialize an empty container, say a dict, before the loop, then populate it inside the loop and construct the DataFrame afterwards.

import pandas as pd

d = {}
for column in X_data:
    exog = sm.add_constant(X_data[column], prepend=False)
    mod = sm.OLS(y_data, exog)
    res = mod.fit()
    d[column] = res.pvalues[column]  # p-value of the predictor, not of the constant

df = pd.DataFrame.from_dict(d, orient='index', columns=['pval'])
#            pval
#variable1  0.003
#variable2  0.300

If you need to store multiple pieces of information (coefficients, confidence intervals, standard errors...) then your dict can store a dict of attributes for each key.

d = {}
for column in X_data:
    ...
    d[column] = {'pval': res.pvalues[column], 'other_feature': ...}

print(d)
#{'variable1': {'pval': 0.003, 'other_feature': 'XX'}, 
# 'variable2': {'pval': 0.300, 'other_feature': 'YY'}}

df = pd.DataFrame.from_dict(d, orient='index')
#            pval  other_feature
#variable1  0.003             XX
#variable2  0.300             YY