I got a MemoryError while using sklearn.naive_bayes.MultinomialNB for Named Entity Recognition on a large dataset, where train.shape = (416330, 97896).
import csv
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

data_train = pd.read_csv(path[0] + "train_SENTENCED.tsv", encoding="utf-8", sep='\t', quoting=csv.QUOTE_NONE)
data_test = pd.read_csv(path[0] + "test_SENTENCED.tsv", encoding="utf-8", sep='\t', quoting=csv.QUOTE_NONE)
print('TRAIN_DATA:\n', data_train.tail(5))
# FIT TRANSFORM
X_TRAIN = data_train.drop('Tag', axis=1)
X_TEST = data_test.drop('Tag', axis=1)
v = DictVectorizer(sparse=False)
X_train = v.fit_transform(X_TRAIN.to_dict('records'))
X_test = v.transform(X_TEST.to_dict('records'))
y_train = data_train.Tag.values
y_test = data_test.Tag.values
classes = np.unique(y_train)
classes = classes.tolist()
nb = MultinomialNB(alpha=0.01)
nb.partial_fit(X_train, y_train, classes)
new_classes = classes.copy()
new_classes.pop()
predictions = nb.predict(X_test)
The error output is the following:
Traceback (most recent call last):
File "naive_bayes_classifier/main.py", line 107, in <module>
X_train = v.fit_transform(X_TRAIN.to_dict('records'))
File "lib/python3.8/site-packages/sklearn/feature_extraction/_dict_vectorizer.py", line 313, in fit_transform
return self._transform(X, fitting=True)
File "lib/python3.8/site-packages/sklearn/feature_extraction/_dict_vectorizer.py", line 282, in _transform
result_matrix = result_matrix.toarray()
File "lib/python3.8/site-packages/scipy/sparse/compressed.py", line 1031, in toarray
out = self._process_toarray_args(order, out)
File "lib/python3.8/site-packages/scipy/sparse/base.py", line 1202, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions.MemoryError: Unable to allocate 304. GiB for an array with shape (416330, 97896) and data type float64
Although I tried passing DictVectorizer(sparse=False, dtype=np.short), the code then raises an error on the nb.partial_fit(X_train, y_train, classes) line instead.
How can I prevent this MemoryError? Is there a proper way to solve it? I thought about splitting the train set, but since the vectorizer is fitted on the corresponding dataset, would that be a correct solution?
CodePudding user response:
Sadly, the memory blows up because with sparse=False the DictVectorizer materializes the whole design matrix as a dense float64 array, which for shape (416330, 97896) is exactly the ~304 GiB allocation your traceback shows.
.partial_fit()
is a good way to go, but I would recommend chunking your data into much smaller pieces and then partial-fitting the classifier on each chunk. Try splitting your dataset into small pieces and see if that works; if you still get the same error, try even smaller chunks.
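A minimal sketch of that idea, reusing X_TRAIN, X_TEST and data_train from your code. It assumes the DictVectorizer is fitted once with its default sparse output (so the full dense matrix is never allocated) and the classifier is then trained chunk by chunk; chunk_size is an arbitrary placeholder to tune to your RAM:

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

chunk_size = 10_000  # hypothetical value; shrink it if memory is still tight

# Fit the vectorizer once over all records; sparse output keeps this cheap.
v = DictVectorizer(sparse=True)
X_train = v.fit_transform(X_TRAIN.to_dict('records'))
X_test = v.transform(X_TEST.to_dict('records'))

y_train = data_train.Tag.values
classes = np.unique(y_train).tolist()

nb = MultinomialNB(alpha=0.01)
for start in range(0, X_train.shape[0], chunk_size):
    end = start + chunk_size
    # Each call sees only one small slice, so the per-call footprint stays bounded.
    nb.partial_fit(X_train[start:end], y_train[start:end], classes=classes)

predictions = nb.predict(X_test)

Since MultinomialNB accepts sparse input directly, keeping the matrix sparse already avoids most of the allocation; the chunked partial_fit calls just keep each training step small.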