pandas: finding a relationship between multiple columns of a dataset

I have a dataset named "covid" with the following shape and head:

number of instances:  19345
number of attributes:  7
  submission_date state  new_case  new_death    density   latitude   longitude
0      2020-06-01    KS       292          9  71.401302  39.011902  -98.484246
1      2020-06-01    WA       271          6  96.704458  47.751074 -120.740139
2      2020-06-01    MT         8          0   6.837955  46.879682 -110.362566
3      2020-06-01    IA       146         15  54.642103  41.878003  -93.097702
4      2020-06-01    KY       136          6        NaN  37.839333  -84.270018

Each row holds one jurisdiction's daily COVID data along with some information about the jurisdiction. There are 365 rows per jurisdiction (states and some territories).

How can I find a relationship between the submission_date, longitude, and latitude columns as independent variables and the new_case column as the dependent variable? I guess this would be a multiple regression, but I am new to the field and pandas, and have never created a multiple regression.

CodePudding user response：

There are many model types and packages that you could use. I'll show an example using catboost:

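import pandas as pd
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split

# Initialize data (df is the DataFrame holding the covid data)
# Convert the date to a numeric feature: days since the earliest date
dates = pd.to_datetime(df['submission_date'])
df['submission_date_feature'] = (dates - dates.min()).dt.days
train_cols = ['submission_date_feature', 'longitude', 'latitude']
label_col = 'new_case'

# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(df[train_cols], df[label_col], test_size=0.2)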

train_data = Pool(X_train, y_train)
eval_data = Pool(X_test, y_test)

# Initialize CatBoostRegressor
model = CatBoostRegressor(iterations=10,
                          learning_rate=1,
                          depth=3)
# Fit model
model.fit(train_data, eval_set=eval_data)
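
The date handling is the step most likely to trip up a newcomer, since a regressor needs numbers rather than date strings. A minimal sketch of one way to do it, assuming submission_date holds ISO-formatted strings like those in the head shown in the question; the column name days_since_start and the tiny sample frame are made up for illustration:

```python
import pandas as pd

# Tiny stand-in for the covid frame (values are illustrative only)
df = pd.DataFrame({
    "submission_date": ["2020-06-01", "2020-06-02", "2020-06-10"],
    "new_case": [292, 271, 8],
})

# Parse the strings into datetimes, then count whole days from the earliest date
dates = pd.to_datetime(df["submission_date"])
df["days_since_start"] = (dates - dates.min()).dt.days

print(df["days_since_start"].tolist())
```

Counting days keeps the feature on a small scale. Tree models such as catboost do not care about the scale, but linear models behave better with it than with raw nanosecond timestamps.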

CodePudding user response：

As a benchmark, you can run an OLS regression:

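import pandas as pd
import statsmodels.api as sm

Y = df['new_case'].values
# Convert the date to a numeric feature: days since the earliest date
dates = pd.to_datetime(df['submission_date'])
df['submission_date_int'] = (dates - dates.min()).dt.days
X = df[['submission_date_int', 'longitude', 'latitude']].values
X = sm.add_constant(X)  # add the intercept column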
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())

Or use sklearn.linear_model.LinearRegression.
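
A minimal sketch of the scikit-learn route, assuming the three predictors are already numeric. The data here is synthetic (generated from known coefficients so the fit can be checked by eye), and the column name days_since_start is a hypothetical stand-in for a numeric date feature:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the covid frame; not real data
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "days_since_start": rng.integers(0, 365, n),
    "longitude": rng.uniform(-125, -67, n),
    "latitude": rng.uniform(25, 49, n),
})
# new_case is built from known coefficients plus a little noise
df["new_case"] = (
    2.0 * df["days_since_start"]
    - 1.5 * df["longitude"]
    + 3.0 * df["latitude"]
    + rng.normal(0, 1, n)
)

X = df[["days_since_start", "longitude", "latitude"]]
y = df["new_case"]

model = LinearRegression()  # fits an intercept by default
model.fit(X, y)

# One coefficient per predictor, plus the R^2 on the training data
print(dict(zip(X.columns, model.coef_.round(2))))
print("R^2:", model.score(X, y))
```

Unlike the statsmodels summary, LinearRegression reports only the coefficients and a score, so it suits prediction better than inspecting p-values.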