Pandas group by year with counts of occurrences in column?-CodePudding

I have a pandas dataframe with year in one column, and a text field in another.

I have a series of keywords.

I wish to return the amount of rows containing each keyword grouped by year.

How can I do this?

e.g.

import pandas as pd

txt = """2019  'an example sentence'
2019  'another sentence'
2020  'fox trot'
2020  'this sentence has some new words'
2020  'new and old example words here'
2021  'the fox jumps for example'
2021  'a different example sentence'"""

df = pd.DataFrame([x.split('  ') for x in txt.split('\n')], columns=['year','text'])

keywords = ['example', 'sentence', 'fox']

Output:

year  example  sentence  fox 
2019  1        2         0
2020  0        1         1
2021  2        0         1

CodePudding user response：

You could use Series.str.findall to find the keywords in your list and assign back to a new column which we can explode(). Then perform a pivot_table():

df['key'] = df['text'].str.findall('|'.join(keywords))
df = df.drop('text',axis=1).explode('key')
df.pivot_table(index='year',columns='key',aggfunc='size',fill_value=0)

Results in:

key   example  fox  sentence
year                        
2019        1    0         2
2020        1    1         1
2021        2    1         1

Update:

keywords = ['fox|this', 'example', 'sentence']
df['key'] = df['text'].str.findall('|'.join(keywords))
df = df.drop('text',axis=1).explode('key')
df.pivot_table(index='year',columns='key',aggfunc='size',fill_value=0).reset_index()

key   example  fox  sentence  this
year                              
2019        1    0         2     0
2020        1    1         1     1
2021        2    1         1     0

In case you need it, put together in 1:

(
 df.assign(key=df['text'].str.findall('|'.join(keywords))))\
    .drop('text',axis=1).explode('key')\
        .pivot_table(index='year',columns='key',aggfunc='size',fill_value=0
                     )

CodePudding user response：

This will do the trick. If a word appears twice in a sentence it is only counted once. It is case sensitive.

def make_series_for_keyword(keyword):
    return df.groupby(["year"]).apply(lambda d: d["text"].str.contains(keyword).sum()).rename(keyword)

output = pd.concat([make_series_for_keyword(keyword) for keyword in keywords], axis=1)

CodePudding user response：

A oneliner:

df.groupby('year').agg(example=("text",lambda x: x.str.contains('example').sum()),sentence=("text",lambda x: x.str.contains('sentence').sum()),fox=("text",lambda x: x.str.contains('fox').sum()))

year	example	sentence	fox
2019	1	2	0
2020	1	1	1
2021	2	1	1

CodePudding user response：

Use str.split:

>>> df[['year']].join(df['text'].str[1:-1].str.split()
                     .explode().rename('keywords')).assign(count=1) \
                .pivot_table('count', 'year', 'keywords', aggfunc='size')[keywords] \
                .fillna(0).astype(int)

keywords  example  sentence  fox
year                            
2019            1         2    0
2020            1         1    1
2021            2         1    1