Getting counts on string values in Python dataframe

Time:09-03

I'm following this tutorial series. With a dataframe such as this, how can I count how often specific words, e.g. 'phase' or 'idea', appear in the 'lemmatized' column? I've tried value_counts and a bunch of other suggestions, but with no success. I'd also like to get the count only for a given 'rating' score. The tutorial only deals with overall top frequencies, so this would be really helpful. Many thanks in advance.

[Image: screenshot of the dataframe]

CodePudding user response:

You can try something like this: explode the list column so that each word gets its own row, then group by word and count.

import pandas as pd

# Small example frame: each row holds a rating and a list of lemmatized tokens
pdf = pd.DataFrame([
    [1, ["hello", "how", "are", "you"]],
    [2, ["hello", "I", "am", "fine", "you"]]
], columns=["rating", "lemmatized"])

>>> pdf
    rating  lemmatized
0   1      [hello, how, are, you]
1   2      [hello, I, am, fine, you]

pdf_new = pdf.explode("lemmatized").reset_index(drop=True)

>>> pdf_new
    rating  lemmatized
0   1   hello
1   1   how
2   1   are
3   1   you
4   2   hello
5   2   I
6   2   am
7   2   fine
8   2   you

>>> pdf_new.groupby("lemmatized").size()

lemmatized
I        1
am       1
are      1
fine     1
hello    2
how      1
you      2
dtype: int64
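
If you want the counts split by rating in one go, grouping on both columns may be enough. The following is only a sketch built on the same toy data as above; the name `counts_by_rating` is made up for illustration, and your real column names may differ.

```python
import pandas as pd

# Same toy data as in the answer above
pdf = pd.DataFrame(
    [
        [1, ["hello", "how", "are", "you"]],
        [2, ["hello", "I", "am", "fine", "you"]],
    ],
    columns=["rating", "lemmatized"],
)

# One row per word, keeping the rating next to each word
pdf_new = pdf.explode("lemmatized").reset_index(drop=True)

# Count every (rating, word) pair, then pivot the words into columns.
# fill_value=0 puts a zero where a word never occurs for a rating.
counts_by_rating = (
    pdf_new.groupby(["rating", "lemmatized"])
    .size()
    .unstack(fill_value=0)
)

print(counts_by_rating)
```

A single count can then be looked up with `counts_by_rating.loc[2, "hello"]`, i.e. rating first, word second.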

>>> chosen_words = ["hello", "I"]
>>> pdf_new[(pdf_new.lemmatized.isin(chosen_words)) & (pdf_new.rating==2)].groupby("lemmatized").size()
lemmatized
I        1
hello    1
dtype: int64
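
If you need this more than once, the filter and the count could be wrapped in a small helper. This is a rough sketch rather than a definitive solution: `count_words` is a hypothetical name, and it assumes the columns are called 'rating' and 'lemmatized' and that 'lemmatized' holds lists of strings, as in the example above. Unlike the groupby version, it also reports 0 for words that never occur.

```python
import pandas as pd


def count_words(df, words, rating=None):
    """Count how often each word in `words` occurs in df['lemmatized'].

    If `rating` is given, only rows with that rating are considered.
    """
    if rating is not None:
        df = df[df["rating"] == rating]
    # One entry per word instead of one list per row
    tokens = df["lemmatized"].explode()
    counts = tokens[tokens.isin(words)].value_counts()
    # Keep the requested order and report 0 for words that never occur
    return counts.reindex(words, fill_value=0)


# Same toy data as in the answer above
pdf = pd.DataFrame(
    [
        [1, ["hello", "how", "are", "you"]],
        [2, ["hello", "I", "am", "fine", "you"]],
    ],
    columns=["rating", "lemmatized"],
)

print(count_words(pdf, ["hello", "I"], rating=2))
print(count_words(pdf, ["hello", "phase"]))
```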