Error while converting corpora of saved tokens in a dataframe column into a gensim dictionary


I am saving a list of data tokenized with NLTK into a CSV file with just one column. Later, I have to retrieve the tokens and create a dictionary of keywords using:

       dictionary = gensim.corpora.Dictionary(column)

The problem is that when I save the tokenized data to a CSV, the tokens are stored as single-quoted strings, and when I retrieve them and pass the dataframe column to the gensim method, it raises an error saying that the dictionary needs an array of tokens, not a single string. The following steps are performed:
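(For reference, gensim's corpora.Dictionary expects an iterable of documents, each of which is a list of token strings; a minimal sketch of the input it accepts:)

      import gensim

      # Each document must be a list of string tokens, not one long string
      docs = [['russian', 'vehicles', 'seen'], ['finnish', 'pm', 'says']]
      dictionary = gensim.corpora.Dictionary(docs)
      print(dictionary)  # Dictionary(6 unique tokens: ...)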

Steps:

The CSV file with one column is:

        Description                  
 0      Key moments included the DOJ description of the FBI affidavit used for the 
        search, warnings about chilling witnesses and inaction from Trump's attorney.
 1      Russian vehicles seen inside turbine hall at Ukraine nuclear plant.
 2      Finnish PM says videos of her partying shouldn't have been made public.
        ...

I read the CSV file and then tokenize the data using the following function:

           import pandas as pd
           import nltk

           df = pd.read_csv('news_csv.csv', encoding='latin-1')

           def tokenize(column):
               # word_tokenize returns a list of token strings
               return nltk.word_tokenize(column)
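For example, assuming NLTK's punkt tokenizer data has been downloaded, the function returns a plain list of tokens:

      print(tokenize("Russian vehicles seen inside turbine hall."))
      # ['Russian', 'vehicles', 'seen', 'inside', 'turbine', 'hall', '.']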
      

Now, I tokenize the dataframe column and save it to a CSV again.

      processedData = df['Description'].astype(str).map(tokenize)
      processedData.to_csv('news_tokenized.csv', header=True, index=False)

Now, I read the column back and pass it to the gensim method to create a dictionary of keywords:

      df1 = pd.read_csv('news_tokenized.csv', encoding='latin-1')
      dictionary = gensim.corpora.Dictionary(df1)

When I run it, I get the following error:

  TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Why am I getting this error? I believe the data is saved in the column as comma-separated, single-quoted tokens. Is the single quotation the problem? If I pass processedData directly to the gensim method without saving it, the dictionary is created fine, but not when I save it to a CSV and retrieve it again.

Important note: I have to save the data to a CSV file because the dataset is very large. The Colab session crashed from exhausting its resources, and the keyword-removal and lemmatization steps take almost all the RAM, so I cannot proceed further in one session. That is why I have to save the data to CSV files and then start a new session to complete the task.

CodePudding user response:

You are getting that error because saving the token lists to the .csv and then reading them back results in the lists being represented as strings.

For example, your first token list in processedData looks like this:

['Key','moments','included','the','DOJ','description','of','the','FBI','affidavit','used','for','the','search',',','warnings','about','chilling','witnesses','and','inaction','from','Trump',"'s",'attorney','.']

However, after storing it in the .csv and reading it again, it has changed:

array(['[\'Key\', \'moments\', \'included\', \'the\', \'DOJ\', \'description\', \'of\', \'the\', \'FBI\', \'affidavit\', \'used\', \'for\', \'the\', \'search\', \',\', \'warnings\', \'about\', \'chilling\', \'witnesses\', \'and\', \'inaction\', \'from\', \'Trump\', "\'s", \'attorney\', \'.\']'],
      dtype=object)

It is no longer a list of strings, but an array with one element, which is a string (containing the repr of the original list):

print(type(df1.iloc[0].values))
print(len(df1.iloc[0].values))
print(type(df1.iloc[0].values[0]))

Output:

<class 'numpy.ndarray'>
1
<class 'str'>
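This is easy to reproduce in isolation; a minimal round-trip sketch (demo.csv is just a hypothetical file name):

      import pandas as pd

      s = pd.Series([['a', 'b'], ['c', 'd']])
      s.to_csv('demo.csv', index=False)

      back = pd.read_csv('demo.csv')
      print(back.iloc[0, 0])        # ['a', 'b'] -- but as a string (the list's repr)
      print(type(back.iloc[0, 0]))  # <class 'str'>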

The simplest solution would be to not store the data in the .csv in the first place and to use it directly via dictionary = gensim.corpora.Dictionary(processedData). But since you have to store it due to the problems with the Colab session, you have to parse each row's string back into a list, e.g. with ast.literal_eval:

import ast

list_of_rows = []

for row in df1["Description"]:
    list_of_rows.append(ast.literal_eval(row))

# Put it into a pandas dataframe only for the visualization here:
pd.DataFrame(list_of_rows)

Output:

index  tokens (columns 0-19 shown)
0      Key moments included the DOJ description of the FBI affidavit used for the search , warnings about chilling witnesses and
1      Russian vehicles seen inside turbine hall at Ukraine nuclear plant .
2      Finnish PM says videos of her partying should n't have been made public .

Now each row is represented as a list of tokens again, and you can build your gensim dictionary:

dictionary = gensim.corpora.Dictionary(list_of_rows)

Test:

print(dictionary)

Output:

Dictionary(46 unique tokens: ["'s", ',', '.', 'DOJ', 'FBI']...)
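As an alternative to the explicit loop, pandas can also apply the parsing while reading the file, via the converters argument of read_csv; a sketch using the file and column names from the question:

      import ast
      import pandas as pd
      import gensim

      # ast.literal_eval is applied to every cell of 'Description' during parsing
      df1 = pd.read_csv('news_tokenized.csv', encoding='latin-1',
                        converters={'Description': ast.literal_eval})
      dictionary = gensim.corpora.Dictionary(df1['Description'])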