I'm trying to perform Grid-Search on a dataframe having a size around 1 GB.
My code is:
print("[INFO] tuning hyperparameters...")
params = {"C": [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}
model = GridSearchCV(LogisticRegression(), params, cv=3,
n_jobs=args["jobs"])
model.fit(df.iloc[:, 1:], df.iloc[:, 0]) # df is the dataframe
print("[INFO] best hyperparameters: {}".format(model.best_params_))
# evaluation
print("[INFO] evaluating...")
preds = model.predict(df.iloc[:, 1:])
print(classification_report(df.iloc[:, 0], preds))
The problem I'm facing is that it takes too much time to run. I googled and found a possible solution that reads the CSV in chunks, which looks like this:
for df in pd.read_csv('1gb.csv', iterator=True, chunksize=1000):
    print("[INFO] tuning hyperparameters...")
    params = {"C": [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}
    model = GridSearchCV(LogisticRegression(), params, cv=3,
                         n_jobs=args["jobs"])
    model.fit(df.iloc[:, 1:], df.iloc[:, 0])  # df is the current chunk
    print("[INFO] best hyperparameters: {}".format(model.best_params_))

    # evaluation
    print("[INFO] evaluating...")
    preds = model.predict(df.iloc[:, 1:])
    print(classification_report(df.iloc[:, 0], preds))
It seems to work but then:
- I'm getting separate hyperparameters and prediction results for each chunk of the dataframe;
- And I'm getting a perfect accuracy of 1.0 in the classification report (which I think may be due to the small chunksize, which makes the model generalize very well).
Is there a solution to this?
CodePudding user response:
Grid search is exhaustive and therefore compute-intensive; it is not an efficient choice here. Other methods such as random search and Bayesian optimization would be better, and there are libraries for them: for example scikit-learn (RandomizedSearchCV), scikit-optimize, and Optuna.
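As a rough illustration, here is a minimal sketch of swapping GridSearchCV for scikit-learn's RandomizedSearchCV; the dataframe layout (label in column 0, features in the remaining columns) is assumed to match the question, and n_iter and the distribution bounds are placeholders you would tune:

# Minimal sketch (assumes df is laid out as in the question:
# label in column 0, features in the remaining columns).
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Sample C from a log-uniform distribution instead of trying every grid value,
# and cap the total number of fits with n_iter.
param_dist = {"C": loguniform(1e-2, 1e4)}
search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_dist,
                            n_iter=10, cv=3, n_jobs=-1, random_state=42)
search.fit(df.iloc[:, 1:], df.iloc[:, 0])
print(search.best_params_)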
CodePudding user response:
If all you want to do is search over the regularization strength C in a logistic regression, LogisticRegressionCV will be faster, since it can use the coefficient estimates from one fit to warm-start the others.
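For example, a minimal sketch (again assuming the question's dataframe layout, with the label in column 0):

from sklearn.linear_model import LogisticRegressionCV

# Searches the same C values as the question's grid inside a single estimator,
# reusing coefficients between fits along the regularization path where possible.
clf = LogisticRegressionCV(Cs=[0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
                           cv=3, n_jobs=-1, max_iter=1000)
clf.fit(df.iloc[:, 1:], df.iloc[:, 0])
print(clf.C_)  # chosen C (one value per class)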