How to convert a pandas text column to an NLTK Text object

I have the following DataFrame in pandas:

publish_date    headline_text
20030219        aba decides against community broadcasting 
20030219        act fire witnesses must be aware of defamation
20030219        a g calls for infrastructure protection summit
20030219        air nz staff in aust strike for pay rise
20030219        air nz strike to affect australian travellers

I want to convert the headline_text column to an NLTK Text object so that I can apply NLTK methods to it.

I tried the following, but it does not seem to work:

headline_text = nlp_df['headline_text'].apply(lambda x: ''.join(x))

Answer:

You can do:

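import nltk

# split() with no argument splits on any whitespace and drops empty strings,
# so a trailing space in a headline does not produce an empty token
nltk_col = df.headline_text.apply(lambda row: nltk.Text(row.split()))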

To assign this column to the dataframe, you can then do:

df = df.assign(nltk_texts=nltk_col)

Then we can check the type of the first row in the new nltk_texts column:

print(type(df.nltk_texts.loc[0]))  # outputs: <class 'nltk.text.Text'>
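
Put together, the per-row approach can be sketched as a self-contained script. This is only an illustration: the DataFrame is built by hand from a few of the headlines in the question (the real data would presumably be loaded from a file), and the variable name first is an assumption introduced for the example.

```python
import nltk
import pandas as pd

# Sample rows copied from the question; built by hand for illustration only.
df = pd.DataFrame({
    "publish_date": [20030219, 20030219, 20030219],
    "headline_text": [
        "aba decides against community broadcasting ",
        "act fire witnesses must be aware of defamation",
        "air nz strike to affect australian travellers",
    ],
})

# One nltk.Text per row. split() without an argument splits on any
# whitespace and discards empty strings (note the trailing space above).
nltk_col = df["headline_text"].apply(lambda row: nltk.Text(row.split()))
df = df.assign(nltk_texts=nltk_col)

first = df["nltk_texts"].loc[0]
print(type(first))
print(first.tokens)
```

Each cell of nltk_texts then holds its own Text object, so NLTK methods are called per headline, not on the column as a whole.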

To unify all rows into a single NLTK Text object, you can do:

single = nltk.Text([word for row in df.headline_text for word in row.split()])

Then print(type(single)) will output <class 'nltk.text.Text'>.
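
Once all headlines are in one Text object, the usual NLTK Text methods apply to the whole corpus. The following is a minimal sketch, assuming a hand-built DataFrame with two of the headlines from the question; the names tokens and strike_count are introduced here for illustration.

```python
import nltk
import pandas as pd

# Two sample headlines from the question, built by hand for illustration.
df = pd.DataFrame({
    "headline_text": [
        "air nz staff in aust strike for pay rise",
        "air nz strike to affect australian travellers",
    ],
})

# Flatten every headline into one list of whitespace-separated tokens.
tokens = [word for row in df["headline_text"] for word in row.split()]
single = nltk.Text(tokens)

# Text.count returns how many times a token occurs in the text.
strike_count = single.count("strike")
print(strike_count)

# Text.vocab returns a frequency distribution over the tokens.
print(single.vocab().most_common(3))
```

Simple whitespace splitting is used here to match the answer above; a proper tokenizer such as nltk.word_tokenize could be substituted, though it needs additional NLTK data to be downloaded first.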