For a research project, I am analyzing correlations using various machine learning algorithms. To that end, I run the following code (simplified for demonstration):
# Imports for this simplified example
import pandas as pd
from scipy.stats import pearsonr
from tqdm import tqdm
from sklearn.model_selection import cross_val_score

# Make a custom scorer for Pearson's r (from scipy)
scorer = lambda regressor, X, y: pearsonr(regressor.predict(X), y)[0]
# Create a progress bar (288 datasets * 50 pipelines = 14400 iterations)
progress_bar = tqdm(total=14400)
# Initialize a dataframe to store scores
df = pd.DataFrame(columns=["data", "pipeline", "r"])
# Loop over datasets
for data in datasets:  # 288 datasets
    X_train = data.X_train
    X_test = data.X_test
    y_train = data.y_train
    y_test = data.y_test
    # Loop over pipelines
    for pipeline in pipelines:  # 50 pipelines
        # Roughly three samples per test fold
        scores = cross_val_score(pipeline, X_train, y_train, cv=int(len(X_train) / 3), scoring=scorer)
        r = scores.mean()
        # Append a new row with the results
        df.loc[(df.last_valid_index() or 0) + 1] = {"data": data.name, "pipeline": pipeline, "r": r}
        progress_bar.update(1)
progress_bar.close()
X_train is a pandas DataFrame with shape (20, 34).
X_test is a pandas DataFrame with shape (9, 34).
y_train is a pandas Series of length 20.
y_test is a pandas Series of length 9.
An example of a pipeline is:
Pipeline(steps=[('scaler', StandardScaler()),
                ('poly', PolynomialFeatures(degree=9)),
                ('regressor', LinearRegression())])
However, after approximately 8700 iterations in total, I get the following error (a ValueError raised by sklearn that wraps the underlying MemoryError):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-9ff48105b8ff> in <module>
40 y = targets[label]
41 #Finally, we can test the correlation
---> 42 scores = cross_val_score(regressor, X_train, y.loc[train_indices], cv=int(len(X_train)/3), scoring=lambda regressor, X, y: pearsonr(regressor.predict(X), y)[0]) #Three samples per test set, as that seems like the logical minimum for Pearson
43 r = scores.mean()
44 # print(f"{regressor} was able to predict {label} based on the {band} band of the {network} network with a Pearson's r of {r} of the data that could be explained.\n")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
513 scorer = check_scoring(estimator, scoring=scoring)
514
--> 515 cv_results = cross_validate(
516 estimator=estimator,
517 X=X,
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
283 )
284
--> 285 _warn_or_raise_about_fit_failures(results, error_score)
286
287 # For callabe scoring, the return type is only know after calling. If the
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _warn_or_raise_about_fit_failures(results, error_score)
365 f"Below are more details about the failures:\n{fit_errors_summary}"
366 )
--> 367 raise ValueError(all_fits_failed_message)
368
369 else:
ValueError:
All the 6 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 382, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 692, in fit
X, y, X_offset, y_offset, X_scale = _preprocess_data(
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 262, in _preprocess_data
X = check_array(X, copy=copy, accept_sparse=["csr", "csc"], dtype=FLOAT_DTYPES)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 925, in check_array
array = np.array(array, dtype=dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 41.8 GiB for an array with shape (16, 350343565) and data type float64
--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 382, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 692, in fit
X, y, X_offset, y_offset, X_scale = _preprocess_data(
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 262, in _preprocess_data
X = check_array(X, copy=copy, accept_sparse=["csr", "csc"], dtype=FLOAT_DTYPES)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 925, in check_array
array = np.array(array, dtype=dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 44.4 GiB for an array with shape (17, 350343565) and data type float64
What can I do to prevent this error, and how did it originate in the first place? I tried using sklearn's clone function on the pipeline that was still in memory and then calling fit, but I got the same error. However, when I created a new pipeline (still in the same session) and called fit on it, it did work.
CodePudding user response:
The problem is the ginormous basis expansion you're doing. PolynomialFeatures with degree=9 on 34 input features produces comb(34 + 9, 9) = 563,921,995 output columns, and the 350,343,565 columns in your traceback are exactly comb(32 + 9, 9), i.e. that particular dataset apparently had 32 input columns. Even though you only have a handful of samples, it's no wonder you're running out of memory.
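You can check the arithmetic with math.comb (Python 3.8+). With include_bias=True (the default), PolynomialFeatures on n inputs at degree d emits one column per monomial of degree at most d, which is comb(n + d, d):
>>> from math import comb
>>> comb(34 + 9, 9)  # degree-9 expansion of 34 features
563921995
>>> comb(32 + 9, 9)  # matches the traceback's 350,343,565 columns
350343565
>>> round(16 * comb(32 + 9, 9) * 8 / 2**30, 1)  # float64 bytes for the (16, 350343565) fold, in GiB
41.8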
Just look at what a 2nd-degree PolynomialFeatures gives you for a mere 4 features:
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> arr = np.random.random(size=(10, 4))
>>> poly = PolynomialFeatures(degree=2).fit(arr)
>>> poly.get_feature_names()
This results in:
['1',
'x0',
'x1',
'x2',
'x3',
'x0^2',
'x0 x1',
'x0 x2',
'x0 x3',
'x1^2',
'x1 x2',
'x1 x3',
'x2^2',
'x2 x3',
'x3^2']
With only 20 training instances, even a few dozen features puts you well into overfitting territory: ordinary least squares can interpolate 20 points exactly once it has 20 or more columns. Even degree-2 polynomials on your data give you comb(36, 2) = 630 features, which is far too many. I would use inspection (e.g. pair plots), feature importance, and maybe PCA to reduce the dimensionality, and ditch the basis expansion until you know what direction things are going; a sketch follows below.
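As a rough sketch of that direction (the component count and the switch to Ridge are illustrative assumptions, not tuned choices), the pipeline could cap dimensionality before the regressor and add some regularization for the tiny sample size:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),    # assumption: keep well under the ~17 samples per CV training fold
    ('regressor', Ridge(alpha=1.0)),  # assumption: regularized stand-in for plain LinearRegression
])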
CodePudding user response:
A MemoryError means the Python interpreter has run out of RAM and swap space while allocating new memory. The usual remedies are: 1) work with a smaller dataset, 2) get a machine with more RAM, 3) check that your code does not leak memory.
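For point 3, the standard library's tracemalloc module can show whether allocations keep growing across iterations; a minimal sketch (the loop body is a stand-in for the real dataset/pipeline work):

import tracemalloc

tracemalloc.start()
for i in range(3):          # stand-in for the dataset/pipeline loop
    data = [0] * 1_000_000  # stand-in for the real work
    current, peak = tracemalloc.get_traced_memory()  # bytes allocated now / at peak
    print(f"iteration {i}: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
tracemalloc.stop()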