I have a pandas dataframe with year in one column, and a text field in another.
I have a series of keywords.
I wish to return the amount of rows containing each keyword grouped by year.
How can I do this?
e.g.
import pandas as pd
txt = """2019 'an example sentence'
2019 'another sentence'
2020 'fox trot'
2020 'this sentence has some new words'
2020 'new and old example words here'
2021 'the fox jumps for example'
2021 'a different example sentence'"""
df = pd.DataFrame([x.split(' ') for x in txt.split('\n')], columns=['year','text'])
keywords = ['example', 'sentence', 'fox']
Output:
year example sentence fox
2019 1 2 0
2020 0 1 1
2021 2 0 1
CodePudding user response:
You could use Series.str.findall
to find the keywords in your list and assign back to a new column which we can explode()
. Then perform a pivot_table()
:
df['key'] = df['text'].str.findall('|'.join(keywords))
df = df.drop('text',axis=1).explode('key')
df.pivot_table(index='year',columns='key',aggfunc='size',fill_value=0)
Results in:
key example fox sentence
year
2019 1 0 2
2020 1 1 1
2021 2 1 1
Update:
keywords = ['fox|this', 'example', 'sentence']
df['key'] = df['text'].str.findall('|'.join(keywords))
df = df.drop('text',axis=1).explode('key')
df.pivot_table(index='year',columns='key',aggfunc='size',fill_value=0).reset_index()
key example fox sentence this
year
2019 1 0 2 0
2020 1 1 1 1
2021 2 1 1 0
In case you need it, put together in 1:
(
df.assign(key=df['text'].str.findall('|'.join(keywords))))\
.drop('text',axis=1).explode('key')\
.pivot_table(index='year',columns='key',aggfunc='size',fill_value=0
)
CodePudding user response:
This will do the trick. If a word appears twice in a sentence it is only counted once. It is case sensitive.
def make_series_for_keyword(keyword):
return df.groupby(["year"]).apply(lambda d: d["text"].str.contains(keyword).sum()).rename(keyword)
output = pd.concat([make_series_for_keyword(keyword) for keyword in keywords], axis=1)
CodePudding user response:
A oneliner:
df.groupby('year').agg(example=("text",lambda x: x.str.contains('example').sum()),sentence=("text",lambda x: x.str.contains('sentence').sum()),fox=("text",lambda x: x.str.contains('fox').sum()))
year | example | sentence | fox |
---|---|---|---|
2019 | 1 | 2 | 0 |
2020 | 1 | 1 | 1 |
2021 | 2 | 1 | 1 |
CodePudding user response:
Use str.split
:
>>> df[['year']].join(df['text'].str[1:-1].str.split()
.explode().rename('keywords')).assign(count=1) \
.pivot_table('count', 'year', 'keywords', aggfunc='size')[keywords] \
.fillna(0).astype(int)
keywords example sentence fox
year
2019 1 2 0
2020 1 1 1
2021 2 1 1