I'm following this tutorial series. Given a dataframe like this one, how can I count how often specific words, e.g. 'phase' or 'idea', appear in the 'lemmatized' column? I've tried value_counts and a bunch of other suggestions, but with no success. Ideally I'd also like to restrict the count to a given 'rating' score. The tutorial only deals with overall top frequencies, so this would be really helpful. Many thanks in advance.
CodePudding user response:
You can try something like this: explode the list column so that each word gets its own row, then filter and group as needed.
>>> import pandas as pd
>>> pdf = pd.DataFrame([
...     [1, ["hello", "how", "are", "you"]],
...     [2, ["hello", "I", "am", "fine", "you"]]
... ], columns=["rating", "lemmatized"])
>>> pdf
   rating                 lemmatized
0       1     [hello, how, are, you]
1       2  [hello, I, am, fine, you]
>>> pdf_new = pdf.explode("lemmatized").reset_index(drop=True)
>>> pdf_new
   rating lemmatized
0       1      hello
1       1        how
2       1        are
3       1        you
4       2      hello
5       2          I
6       2         am
7       2       fine
8       2        you
>>> pdf_new.groupby("lemmatized").size()
lemmatized
I        1
am       1
are      1
fine     1
hello    2
how      1
you      2
dtype: int64
>>> chosen_words = ["hello", "I"]
>>> pdf_new[(pdf_new.lemmatized.isin(chosen_words)) & (pdf_new.rating==2)].groupby("lemmatized").size()
lemmatized
I        1
hello    1
dtype: int64
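If you need this more than once, the steps above can be wrapped in a small helper. This is only a sketch: the function name count_words and its parameters are made up for illustration, and it assumes the 'lemmatized' column holds Python lists of words as in the toy dataframe above. Unlike the groupby version, it reindexes the result so that words that never occur (like 'phase' here) show up with a count of 0 instead of being dropped.

```python
import pandas as pd


def count_words(df, words, rating=None, col="lemmatized"):
    """Count how often each of `words` occurs in a column of word lists.

    Hypothetical helper: if `rating` is given, only rows with that
    rating are considered.
    """
    if rating is not None:
        df = df[df["rating"] == rating]
    # One row per word instead of one list per row.
    tokens = df[col].explode()
    counts = tokens[tokens.isin(words)].value_counts()
    # Words that never occur are reported as 0 rather than left out.
    return counts.reindex(words, fill_value=0)


# Same toy data as in the answer above.
pdf = pd.DataFrame(
    [
        [1, ["hello", "how", "are", "you"]],
        [2, ["hello", "I", "am", "fine", "you"]],
    ],
    columns=["rating", "lemmatized"],
)

overall = count_words(pdf, ["hello", "I", "phase"])
rating_2 = count_words(pdf, ["hello", "I", "phase"], rating=2)
print(overall)
print(rating_2)
```

On your own data you would call it as count_words(df, ["phase", "idea"], rating=5), assuming your dataframe is named df and uses the same column names.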