Getting counts on string values in Python dataframe

I'm following this tutorial series. Given a dataframe such as this one, how can I count how often specific words, e.g. 'phase' or 'idea', appear in the 'lemmatized' column? I've tried value_counts and a number of other suggestions, but with no success. I'd also like to get those counts for a given 'rating' score only. The tutorial only covers overall top frequencies, so this would be really helpful. Many thanks in advance.

[Image: the dataframe in question]

CodePudding user response:

You can try something like this:

import pandas as pd

# Small stand-in for the real data: one list of lemmatized words per row
pdf = pd.DataFrame([
    [1, ["hello", "how", "are", "you"]],
    [2, ["hello", "I", "am", "fine", "you"]]
], columns=["rating", "lemmatized"])

>>> pdf
    rating  lemmatized
0   1      [hello, how, are, you]
1   2      [hello, I, am, fine, you]

# explode() gives every list element its own row and repeats the rating
pdf_new = pdf.explode("lemmatized").reset_index(drop=True)

>>> pdf_new
    rating  lemmatized
0   1   hello
1   1   how
2   1   are
3   1   you
4   2   hello
5   2   I
6   2   am
7   2   fine
8   2   you

>>> pdf_new.groupby("lemmatized").size()

lemmatized
I        1
am       1
are      1
fine     1
hello    2
how      1
you      2
dtype: int64
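
If only a handful of words matter, the same tally can be looked up directly. The following is a minimal sketch rather than part of the original answer: it rebuilds the sample frame, counts with `value_counts()` instead of `groupby(...).size()`, and uses `reindex` so that words which never occur show up as 0. The word "phase" is taken from the question and is assumed not to be present in this toy data.

```python
import pandas as pd

# Same toy frame as in the answer
pdf = pd.DataFrame(
    [
        [1, ["hello", "how", "are", "you"]],
        [2, ["hello", "I", "am", "fine", "you"]],
    ],
    columns=["rating", "lemmatized"],
)

# One row per word, then tally; value_counts() sorts by descending count
word_counts = pdf["lemmatized"].explode().value_counts()

# Look up specific words; reindex fills words that never occur with 0
chosen = word_counts.reindex(["hello", "fine", "phase"], fill_value=0)
print(chosen)
```

Without the `reindex` step, asking for a missing word with `word_counts["phase"]` would raise a `KeyError`.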

>>> chosen_words = ["hello", "I"]
>>> pdf_new[(pdf_new.lemmatized.isin(chosen_words)) & (pdf_new.rating==2)].groupby("lemmatized").size()
lemmatized
I        1
hello    1
dtype: int64
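
To see the chosen words for every rating at once instead of filtering one rating at a time, one option is to group on both columns and pivot the result. This is a sketch under the same toy data; the names `exploded`, `subset` and `per_rating` are only illustrative.

```python
import pandas as pd

# Same toy frame as in the answer
pdf = pd.DataFrame(
    [
        [1, ["hello", "how", "are", "you"]],
        [2, ["hello", "I", "am", "fine", "you"]],
    ],
    columns=["rating", "lemmatized"],
)

exploded = pdf.explode("lemmatized")
chosen_words = ["hello", "I"]

# Keep only the words of interest, across all ratings
subset = exploded[exploded["lemmatized"].isin(chosen_words)]

# Count per (rating, word) pair, then pivot the words into columns;
# fill_value=0 marks combinations that never occur
per_rating = subset.groupby(["rating", "lemmatized"]).size().unstack(fill_value=0)
print(per_rating)
```

Each row of `per_rating` is a rating and each column a word, so `per_rating.loc[2, "hello"]` gives the count of "hello" among the rows rated 2.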