I got a MemoryError while using sklearn.naive_bayes.MultinomialNB for Named Entity Recognition on a large dataset, where train.shape = (416330, 97896).
import csv
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

data_train = pd.read_csv(path[0] + "train_SENTENCED.tsv", encoding="utf-8", sep='\t', quoting=csv.QUOTE_NONE)
data_test = pd.read_csv(path[0] + "test_SENTENCED.tsv", encoding="utf-8", sep='\t', quoting=csv.QUOTE_NONE)
print('TRAIN_DATA:\n', data_train.tail(5))
# FIT TRANSFORM
X_TRAIN = data_train.drop('Tag', axis=1)
X_TEST = data_test.drop('Tag', axis=1)
v = DictVectorizer(sparse=False)
X_train = v.fit_transform(X_TRAIN.to_dict('records'))
X_test = v.transform(X_TEST.to_dict('records'))
y_train = data_train.Tag.values
y_test = data_test.Tag.values
classes = np.unique(y_train)
classes = classes.tolist()
nb = MultinomialNB(alpha=0.01)
nb.partial_fit(X_train, y_train, classes)
new_classes = classes.copy()
new_classes.pop()
predictions = nb.predict(X_test)
The error output is the following:
Traceback (most recent call last):
File "naive_bayes_classifier/main.py", line 107, in <module>
X_train = v.fit_transform(X_TRAIN.to_dict('records'))
File "lib/python3.8/site-packages/sklearn/feature_extraction/_dict_vectorizer.py", line 313, in fit_transform
return self._transform(X, fitting=True)
File "lib/python3.8/site-packages/sklearn/feature_extraction/_dict_vectorizer.py", line 282, in _transform
result_matrix = result_matrix.toarray()
File "lib/python3.8/site-packages/scipy/sparse/compressed.py", line 1031, in toarray
out = self._process_toarray_args(order, out)
File "lib/python3.8/site-packages/scipy/sparse/base.py", line 1202, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
numpy.core._exceptions.MemoryError: Unable to allocate 304. GiB for an array with shape (416330, 97896) and data type float64
Although I tried passing DictVectorizer(sparse=False, dtype=np.short), the code then raises an error on the nb.partial_fit(X_train, y_train, classes) line instead.
How can I prevent this MemoryError? Is there a proper way to solve it? I thought about splitting the train set, but since the vectorizer is fitted on the corresponding dataset, would that be a correct solution?
CodePudding user response:
Sadly, the memory blows up because with sparse=False the DictVectorizer materializes the whole design matrix as a dense float64 array, which for shape (416330, 97896) is exactly the ~304 GiB allocation your traceback shows.
.partial_fit()
is a good way to go, but I would recommend chunking your data into much smaller pieces and then partial-fitting the classifier on each chunk. Try splitting your dataset into small pieces and see if that works; if you still get the same error, try even smaller chunks.
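A minimal sketch of that idea, reusing X_TRAIN, X_TEST and data_train from your code. It assumes the DictVectorizer is fitted once with its default sparse output (so the full dense matrix is never allocated) and the classifier is then trained chunk by chunk; chunk_size is an arbitrary placeholder to tune to your RAM:

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

chunk_size = 10_000  # hypothetical value; shrink it if memory is still tight

# Fit the vectorizer once over all records; sparse output keeps this cheap.
v = DictVectorizer(sparse=True)
X_train = v.fit_transform(X_TRAIN.to_dict('records'))
X_test = v.transform(X_TEST.to_dict('records'))

y_train = data_train.Tag.values
classes = np.unique(y_train).tolist()

nb = MultinomialNB(alpha=0.01)
for start in range(0, X_train.shape[0], chunk_size):
    end = start + chunk_size
    # Each call sees only one small slice, so the per-call footprint stays bounded.
    nb.partial_fit(X_train[start:end], y_train[start:end], classes=classes)

predictions = nb.predict(X_test)

Since MultinomialNB accepts sparse input directly, keeping the matrix sparse already avoids most of the allocation; the chunked partial_fit calls just keep each training step small.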