I have a dataset named "covid" with the following shape and head:
number of instances: 19345
number of attributes: 7
  submission_date state  new_case  new_death    density   latitude   longitude
0      2020-06-01    KS       292          9  71.401302  39.011902  -98.484246
1      2020-06-01    WA       271          6  96.704458  47.751074 -120.740139
2      2020-06-01    MT         8          0   6.837955  46.879682 -110.362566
3      2020-06-01    IA       146         15  54.642103  41.878003  -93.097702
4      2020-06-01    KY       136          6        NaN  37.839333  -84.270018
Each row holds one jurisdiction's daily COVID data, along with some information about the jurisdiction; there are 365 rows per jurisdiction (states and some territories).
How can I find a relationship between the submission_date, longitude, and latitude columns as independent variables and the new_case column as the dependent variable? I guess this would be a multiple regression, but I am new to the field and pandas, and have never created a multiple regression.
CodePudding user response:
There are many model types and packages that you could use. I'll show an example using catboost:
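import pandas as pd
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split
# Initialize data (df is the "covid" DataFrame from the question)
# Encode the date as the number of days since the first date in the data
dates = pd.to_datetime(df['submission_date'])
df['submission_date_feature'] = (dates - dates.min()).dt.days
train_cols = ['submission_date_feature', 'longitude', 'latitude']
label_col = 'new_case'
# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(df[train_cols], df[label_col], test_size=0.2)
train_data = Pool(X_train, y_train)
eval_data = Pool(X_test, y_test)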
# Initialize CatBoostRegressor
model = CatBoostRegressor(iterations=10,
                          learning_rate=1,
                          depth=3)
# Fit model
model.fit(train_data, eval_set=eval_data)
CodePudding user response:
As a benchmark, you can run an OLS regression:
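import pandas as pd
import statsmodels.api as sm
# Encode the date as days since the first date so it can be used as a numeric regressor
dates = pd.to_datetime(df['submission_date'])
df['submission_date_int'] = (dates - dates.min()).dt.days
Y = df['new_case']
# Keep X as a DataFrame so the summary shows the column names
X = df[['submission_date_int', 'longitude', 'latitude']]
# Add the intercept column
X = sm.add_constant(X)
model = sm.OLS(Y, X)
results = model.fit()
print(results.summary())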
Or use sklearn.linear_model.LinearRegression.
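A minimal sketch of the scikit-learn route, for comparison with the OLS benchmark. The DataFrame below is synthetic, made-up data that only mimics the column names from the question, and the `submission_date_int` encoding (days since the first date) is an assumption about how you might turn the date into a number; swap in your own "covid" DataFrame in place of `df`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for the "covid" DataFrame (values are illustrative only)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "submission_date": pd.date_range("2020-06-01", periods=n, freq="D").strftime("%Y-%m-%d"),
    "longitude": rng.uniform(-125, -70, n),
    "latitude": rng.uniform(25, 48, n),
    "new_case": rng.integers(0, 500, n),
})

# Encode the date as days since the first date in the data
dates = pd.to_datetime(df["submission_date"])
df["submission_date_int"] = (dates - dates.min()).dt.days

feature_cols = ["submission_date_int", "longitude", "latitude"]
X = df[feature_cols]
y = df["new_case"]

# Fit ordinary least squares; LinearRegression includes an intercept by default
model = LinearRegression().fit(X, y)

coefficients = dict(zip(feature_cols, model.coef_))
print("intercept:", model.intercept_)
print("coefficients:", coefficients)
print("R^2 on the training data:", model.score(X, y))
```

Unlike statsmodels, LinearRegression does not print a summary table with standard errors and p-values, so it is the more convenient choice when you mainly want predictions and the statsmodels version is more convenient when you want to inspect the fit.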