For a research project, I am analyzing correlations using various machine learning algorithms. To that end, I run the following code (simplified for demonstration):
# Imports for this simplified example
import pandas as pd
from scipy.stats import pearsonr
from tqdm import tqdm
from sklearn.model_selection import cross_val_score

# Make a custom scorer for Pearson's r (from scipy)
scorer = lambda regressor, X, y: pearsonr(regressor.predict(X), y)[0]
# Create a progress bar (288 datasets * 50 pipelines = 14400 iterations)
progress_bar = tqdm(total=14400)
# Initialize a dataframe to store scores
df = pd.DataFrame(columns=["data", "pipeline", "r"])
# Loop over datasets
for data in datasets:  # 288 datasets
    X_train = data.X_train
    X_test = data.X_test
    y_train = data.y_train
    y_test = data.y_test
    # Loop over pipelines
    for pipeline in pipelines:  # 50 pipelines
        # Roughly three samples per test fold
        scores = cross_val_score(pipeline, X_train, y_train, cv=int(len(X_train) / 3), scoring=scorer)
        r = scores.mean()
        # Append a new row with the results
        df.loc[(df.last_valid_index() or 0) + 1] = {"data": data.name, "pipeline": pipeline, "r": r}
        progress_bar.update(1)
progress_bar.close()
X_train is a pandas DataFrame with shape (20, 34).
X_test is a pandas DataFrame with shape (9, 34).
y_train is a pandas Series of length 20.
y_test is a pandas Series of length 9.
An example of a pipeline is:
Pipeline(steps=[('scaler', StandardScaler()),
                ('poly', PolynomialFeatures(degree=9)),
                ('regressor', LinearRegression())])
However, after approximately 8700 iterations in total, I get the following error (a ValueError raised by sklearn that wraps the underlying MemoryError):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-9ff48105b8ff> in <module>
40 y = targets[label]
41 #Finally, we can test the correlation
---> 42 scores = cross_val_score(regressor, X_train, y.loc[train_indices], cv=int(len(X_train)/3), scoring=lambda regressor, X, y: pearsonr(regressor.predict(X), y)[0]) #Three samples per test set, as that seems like the logical minimum for Pearson
43 r = scores.mean()
44 # print(f"{regressor} was able to predict {label} based on the {band} band of the {network} network with a Pearson's r of {r} of the data that could be explained.\n")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
513 scorer = check_scoring(estimator, scoring=scoring)
514
--> 515 cv_results = cross_validate(
516 estimator=estimator,
517 X=X,
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
283 )
284
--> 285 _warn_or_raise_about_fit_failures(results, error_score)
286
287 # For callabe scoring, the return type is only know after calling. If the
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _warn_or_raise_about_fit_failures(results, error_score)
365 f"Below are more details about the failures:\n{fit_errors_summary}"
366 )
--> 367 raise ValueError(all_fits_failed_message)
368
369 else:
ValueError:
All the 6 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 382, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 692, in fit
X, y, X_offset, y_offset, X_scale = _preprocess_data(
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 262, in _preprocess_data
X = check_array(X, copy=copy, accept_sparse=["csr", "csc"], dtype=FLOAT_DTYPES)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 925, in check_array
array = np.array(array, dtype=dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 41.8 GiB for an array with shape (16, 350343565) and data type float64
--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 382, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 692, in fit
X, y, X_offset, y_offset, X_scale = _preprocess_data(
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 262, in _preprocess_data
X = check_array(X, copy=copy, accept_sparse=["csr", "csc"], dtype=FLOAT_DTYPES)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 925, in check_array
array = np.array(array, dtype=dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 44.4 GiB for an array with shape (17, 350343565) and data type float64
What can I do to prevent this error, and how did it originate in the first place? I tried using sklearn's clone function on the pipeline that was still in memory and then calling fit, but I got the same error. However, when I created a new pipeline (still in the same session) and called fit on it, it did work.
CodePudding user response:
The problem is the ginormous basis expansion you're doing. PolynomialFeatures with degree=9 on 34 input features produces comb(34 + 9, 9) = 563,921,995 output columns, and the 350,343,565 columns in your traceback are exactly comb(32 + 9, 9), i.e. that particular dataset apparently had 32 input columns. Even though you only have a handful of samples, it's no wonder you're running out of memory.
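You can check the arithmetic with math.comb (Python 3.8+). With include_bias=True (the default), PolynomialFeatures on n inputs at degree d emits one column per monomial of degree at most d, which is comb(n + d, d):
>>> from math import comb
>>> comb(34 + 9, 9)  # degree-9 expansion of 34 features
563921995
>>> comb(32 + 9, 9)  # matches the traceback's 350,343,565 columns
350343565
>>> round(16 * comb(32 + 9, 9) * 8 / 2**30, 1)  # float64 bytes for the (16, 350343565) fold, in GiB
41.8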
Just look at what a 2nd-degree PolynomialFeatures gives you for a mere 4 features:
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> arr = np.random.random(size=(10, 4))
>>> poly = PolynomialFeatures(degree=2).fit(arr)
>>> poly.get_feature_names()
This results in:
['1',
'x0',
'x1',
'x2',
'x3',
'x0^2',
'x0 x1',
'x0 x2',
'x0 x3',
'x1^2',
'x1 x2',
'x1 x3',
'x2^2',
'x2 x3',
'x3^2']
With only 20 training instances, even a few dozen features puts you well into overfitting territory: ordinary least squares can interpolate 20 points exactly once it has 20 or more columns. Even degree-2 polynomials on your data give you comb(36, 2) = 630 features, which is far too many. I would use inspection (e.g. pair plots), feature importance, and maybe PCA to reduce the dimensionality, and ditch the basis expansion until you know what direction things are going; a sketch follows below.
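As a rough sketch of that direction (the component count and the switch to Ridge are illustrative assumptions, not tuned choices), the pipeline could cap dimensionality before the regressor and add some regularization for the tiny sample size:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),    # assumption: keep well under the ~17 samples per CV training fold
    ('regressor', Ridge(alpha=1.0)),  # assumption: regularized stand-in for plain LinearRegression
])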
CodePudding user response:
A MemoryError means the Python interpreter has run out of RAM and swap space while allocating new memory. The usual remedies are: 1) work with a smaller dataset, 2) get a machine with more RAM, 3) check that your code does not leak memory.
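For point 3, the standard library's tracemalloc module can show whether allocations keep growing across iterations; a minimal sketch (the loop body is a stand-in for the real dataset/pipeline work):

import tracemalloc

tracemalloc.start()
for i in range(3):          # stand-in for the dataset/pipeline loop
    data = [0] * 1_000_000  # stand-in for the real work
    current, peak = tracemalloc.get_traced_memory()  # bytes allocated now / at peak
    print(f"iteration {i}: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
tracemalloc.stop()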