I'm trying to perform Grid-Search on a dataframe having a size around 1 GB.
My code is:
print("[INFO] tuning hyperparameters...")
params = {"C": [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}
model = GridSearchCV(LogisticRegression(), params, cv=3,
n_jobs=args["jobs"])
model.fit(df.iloc[:, 1:], df.iloc[:, 0]) # df is the dataframe
print("[INFO] best hyperparameters: {}".format(model.best_params_))
# evaluation
print("[INFO] evaluating...")
preds = model.predict(df.iloc[:, 1:])
print(classification_report(df.iloc[:, 0], preds))
The problem I'm facing is that it takes too much time to run. I googled and found a possible solution that reads the CSV in chunks, which looks like this:
for df in pd.read_csv('1gb.csv', iterator=True, chunksize=1000):
    print("[INFO] tuning hyperparameters...")
    params = {"C": [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}
    model = GridSearchCV(LogisticRegression(), params, cv=3,
                         n_jobs=args["jobs"])
    model.fit(df.iloc[:, 1:], df.iloc[:, 0])  # df is the current chunk
    print("[INFO] best hyperparameters: {}".format(model.best_params_))

    # evaluation
    print("[INFO] evaluating...")
    preds = model.predict(df.iloc[:, 1:])
    print(classification_report(df.iloc[:, 0], preds))
It seems to work but then:
- I'm getting separate hyperparameters and prediction results for each chunk of the dataframe;
- And I'm getting a perfect accuracy of 1.0 in the classification report (which I think may be due to the small chunksize, which makes the model generalize very well).
Is there a solution to this?
CodePudding user response:
Grid search is exhaustive and therefore compute-intensive; it is not an efficient choice here. Other methods such as random search and Bayesian optimization would be better, and there are libraries for them: for example scikit-learn (RandomizedSearchCV), scikit-optimize, and Optuna.
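As a rough illustration, here is a minimal sketch of swapping GridSearchCV for scikit-learn's RandomizedSearchCV; the dataframe layout (label in column 0, features in the remaining columns) is assumed to match the question, and n_iter and the distribution bounds are placeholders you would tune:

# Minimal sketch (assumes df is laid out as in the question:
# label in column 0, features in the remaining columns).
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Sample C from a log-uniform distribution instead of trying every grid value,
# and cap the total number of fits with n_iter.
param_dist = {"C": loguniform(1e-2, 1e4)}
search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_dist,
                            n_iter=10, cv=3, n_jobs=-1, random_state=42)
search.fit(df.iloc[:, 1:], df.iloc[:, 0])
print(search.best_params_)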
CodePudding user response:
If all you want to do is search over the regularization strength C in a logistic regression, LogisticRegressionCV will be faster, since it can use the coefficient estimates from one fit to warm-start the others.
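For example, a minimal sketch (again assuming the question's dataframe layout, with the label in column 0):

from sklearn.linear_model import LogisticRegressionCV

# Searches the same C values as the question's grid inside a single estimator,
# reusing coefficients between fits along the regularization path where possible.
clf = LogisticRegressionCV(Cs=[0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
                           cv=3, n_jobs=-1, max_iter=1000)
clf.fit(df.iloc[:, 1:], df.iloc[:, 0])
print(clf.C_)  # chosen C (one value per class)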