GridSearch on very large dataframe

Time:08-03

I'm trying to perform Grid-Search on a dataframe having a size around 1 GB.

My code is:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

print("[INFO] tuning hyperparameters...")
params = {"C": [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}
model = GridSearchCV(LogisticRegression(), params, cv=3,
                     n_jobs=args["jobs"])
model.fit(df.iloc[:, 1:], df.iloc[:, 0])  # df is the dataframe; first column is the label
print("[INFO] best hyperparameters: {}".format(model.best_params_))

# evaluation
print("[INFO] evaluating...")
preds = model.predict(df.iloc[:, 1:])
print(classification_report(df.iloc[:, 0], preds))

The problem I'm facing is that it takes far too long. I googled and found a suggested workaround that reads the CSV in chunks:

for df in pd.read_csv('1gb.csv', iterator=True, chunksize=1000):

    print("[INFO] tuning hyperparameters...")
    params = {"C": [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}
    model = GridSearchCV(LogisticRegression(), params, cv=3,
                         n_jobs=args["jobs"])
    model.fit(df.iloc[:, 1:], df.iloc[:, 0])  # df is now a 1000-row chunk
    print("[INFO] best hyperparameters: {}".format(model.best_params_))

    # evaluation
    print("[INFO] evaluating...")
    preds = model.predict(df.iloc[:, 1:])
    print(classification_report(df.iloc[:, 0], preds))

It seems to work, but then:

  • I'm getting separate best parameters and prediction results for each chunk, not one overall result;
  • I'm getting a perfect accuracy of 1.0 in the classification report, which I suspect is because each model is evaluated on the same small chunk it was just trained on, so it simply memorizes the chunk.

Is there a solution to this?

CodePudding user response:

Grid search is exhaustive, so it is compute-intensive by design: it fits one model per parameter combination per CV fold. Smarter strategies such as random search or Bayesian optimization usually find comparable hyperparameters with far fewer fits. Libraries that implement them include scikit-learn (RandomizedSearchCV), scikit-optimize, and Optuna.
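As a sketch of the random-search idea: the snippet below swaps GridSearchCV for scikit-learn's RandomizedSearchCV, sampling C from a log-uniform distribution instead of fitting all seven grid points. The toy data here is a placeholder; in the question's code you would pass df.iloc[:, 1:] and df.iloc[:, 0] instead, and the choice of 10 candidates and the log-uniform prior are assumptions, not anything from the original post.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Toy stand-in for the 1 GB dataframe: 200 rows, 5 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X.sum(axis=1) > 0).astype(int)

# Sample 10 candidate values of C from a log-uniform prior over the same
# range the question's grid covers (1e-2 .. 1e4), 3-fold CV each.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-2, 1e4)},
    n_iter=10,
    cv=3,
    n_jobs=-1,
    random_state=0,
)
search.fit(X, y)
print("best C:", search.best_params_["C"])
```

With n_iter=10 this does 10 × 3 = 30 fits, versus 7 × 3 = 21 for the original grid; the point is that random search scales to much larger/continuous search spaces at a fixed budget, and n_iter can be tuned to whatever your runtime allows.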

CodePudding user response:

If all you want to do is search over the regularization strength C in a logistic regression, LogisticRegressionCV will be faster, since it can use coefficient estimates from one fitting to warm-start the others.
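A minimal sketch of that suggestion, using the same grid of C values as the question (the toy data is an assumption; substitute df.iloc[:, 1:] and df.iloc[:, 0]):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Toy stand-in for the real dataframe.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X.sum(axis=1) > 0).astype(int)

# One call searches the whole regularization path; each fit along the
# path is warm-started from the previous coefficient estimates.
clf = LogisticRegressionCV(
    Cs=[0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
    cv=3,
    max_iter=1000,
    n_jobs=-1,
)
clf.fit(X, y)
print("chosen C:", clf.C_[0])  # one chosen C per class; binary gives one value
```

After fitting, clf is itself a ready-to-use classifier refit at the chosen C, so there is no separate best-estimator step as with GridSearchCV.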
