I've got a huge dataset of 300,000 articles and I want to use spaCy's en_core_web_sm to do tokenization, POS tagging, lemmatization, syntactic dependency parsing and NER. However, my PC keeps running out of RAM. Is there a way I can change my code to process the data in chunks?
This is the dataset: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ULHLCB
This is what I'm using:
import pandas as pd
import spacy

df_2018 = pd.read_csv("2018_articles.csv")
nlp_spacy_core_web_sm = spacy.load("en_core_web_sm")
df_2018["spacy_sm"] = df_2018["content"].apply(lambda x: nlp_spacy_core_web_sm(x))
After about 30 minutes I get an out-of-memory error.
CodePudding user response:
The problem is that you can't keep all 300,000 Docs (spaCy's output) in memory at the same time, so you can't just put the output in a column of a DataFrame. Also note this is not a spaCy issue; it's a general memory-management problem.
You need to write a for loop and put your processing in it:
for text in texts:
    doc = nlp(text)
    # ... do something with the doc ...
If you do this, each doc can be garbage-collected on the next iteration of the loop, so the Docs won't pile up in memory. The key is to pull out only the information you need from each doc as plain Python data, rather than keeping the Doc objects themselves.
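A minimal sketch of that extraction step, assuming you want tokens, lemmas, POS tags, dependencies and entities (the extract_features name and the dictionary layout are just illustrative; keep whatever attributes you actually need downstream):

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_features(doc):
    # Keep only plain Python data, not the Doc object itself
    return {
        "tokens": [t.text for t in doc],
        "lemmas": [t.lemma_ for t in doc],
        "pos": [t.pos_ for t in doc],
        "deps": [(t.text, t.dep_, t.head.text) for t in doc],
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    }

results = []
for text in texts:
    doc = nlp(text)                        # the Doc only lives inside this iteration
    results.append(extract_features(doc))  # only lightweight data is kept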
You may also want to look at the spaCy speed FAQ.
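To address the "in chunks" part of your question directly: pandas can read the CSV in chunks and spaCy's nlp.pipe batches the texts, so you never hold more than one chunk of texts (and one batch of Docs) in memory at a time. A rough sketch, assuming the file and column names from your question; the chunk size, batch size, output file names and the extract_features helper above are illustrative choices rather than fixed requirements:

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")
chunk_size = 1000  # tune this to whatever fits comfortably in RAM

for i, chunk in enumerate(pd.read_csv("2018_articles.csv", chunksize=chunk_size)):
    texts = chunk["content"].astype(str).tolist()
    # nlp.pipe processes the texts in batches, which is faster than calling nlp() one at a time
    processed = [extract_features(doc) for doc in nlp.pipe(texts, batch_size=50)]
    # Persist each chunk's results instead of accumulating everything in memory
    pd.DataFrame(processed).to_json(f"processed_chunk_{i}.jsonl", orient="records", lines=True)

If speed becomes the bottleneck, nlp.pipe also accepts an n_process argument for multiprocessing; the speed FAQ covers the trade-offs.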