(Update) As I noted in the comment, I somehow got the code to run successfully, but it still takes far too long, around 30 minutes. I would really appreciate help finding a more efficient way to write it.
I am trying to run code that analyzes the description column of a dataframe, but every time I run it I get a runtime timeout error. This is probably because the dataframe has more than 200,000 rows and the code below is not efficient. Could someone help me understand what is wrong with it?
list = df["description"].tolist()
new_list = []
for li in list:
    tagger = MeCab.Tagger(ipadic.MECAB_ARGS)
    node = tagger.parseToNode(li)
    keywords = []
    while node:
        if node.feature.split(",")[2] == "組織":  # 組織 means organization
            keywords.append(node.surface)
        node = node.next
    new_list.append(keywords)
df["description"] = new_list
CodePudding user response:
Create the Tagger outside the loop.
tagger = MeCab.Tagger(ipadic.MECAB_ARGS)
for li in list:
    ...
The Tagger has a startup cost. It's small for a single call, but paying it once per row across more than 200,000 rows adds up. The object is designed to be reused and shouldn't be recreated inside a loop like that.
Please see this post about using MeCab in Python, which has a section specifically on this problem.
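For reference, here is a minimal sketch of the whole loop with the Tagger created once and passed in. The names extract_organizations and build_tagger are hypothetical helpers introduced for illustration, not part of MeCab; the only MeCab calls used are the ones already in the question (MeCab.Tagger, parseToNode, node.feature, node.surface, node.next). I have not timed this on your data, so treat it as a starting point rather than a guaranteed fix.

```python
def extract_organizations(texts, tagger):
    """For each text, collect the surfaces of nodes whose third feature field is 組織."""
    results = []
    for text in texts:
        node = tagger.parseToNode(text)  # reuse the same tagger for every row
        keywords = []
        while node:
            features = node.feature.split(",")
            # Guard against feature strings with fewer than three fields.
            if len(features) > 2 and features[2] == "組織":  # 組織 means organization
                keywords.append(node.surface)
            node = node.next
        results.append(keywords)
    return results


def build_tagger():
    """Hypothetical helper: build the tagger once, with the same arguments as the question."""
    # Imported here so extract_organizations can be tried without MeCab installed.
    import MeCab
    import ipadic
    return MeCab.Tagger(ipadic.MECAB_ARGS)
```

With the dataframe from the question, this would be called once as df["description"] = extract_organizations(df["description"].tolist(), build_tagger()). It also avoids naming a variable list, which shadows the Python builtin.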