I'm trying to aggregate text fields based on date and category columns.
Below is what the initial dataset looks like:
created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Score put on by England batting,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Olympic is to held in Japan,sports
7/29/2021,A terrorist attack have killed 10 people,crime
7/29/2021,An election is to be kept next year,politics
8/29/2021,Srilanka have lost the T20 series,sports
8/29/2021,Australia have won the series,sports
8/29/2021,Minister have given up his role last monday,politics
8/29/2021,President is challenging the opposite leader,politics
The expected output is:
created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan,sports
7/29/2021,President Made a clear statement An election is to be kept next year,politics
7/29/2021,A terrorist attack have killed 10 people,crime
8/29/2021,Srilanka have lost the T20 series Australia have won the series,sports
8/29/2021,Minister have given up his role last monday President is challenging the opposite leader,politics
As the example shows, I want to aggregate the tweet text by both date and category. Below is how I currently aggregate by date only, without considering category; I need the aggregation to produce the output above. It would be very helpful if anyone could answer this.
import pandas as pd

def aggregated():
    tweets = pd.read_csv(r'data_set.csv')
    df = pd.DataFrame(tweets, columns=['created_at', 'tweet'])
    df['created_at'] = pd.to_datetime(df['created_at'])
    df['tweet'] = df['tweet'].apply(lambda x: str(x))
    pd.set_option('display.max_colwidth', 0)
    df = df.groupby(pd.Grouper(key='created_at', freq='1D')).agg(lambda x: ' '.join(set(x)))
    return df

# Driver code
if __name__ == '__main__':
    print(aggregated())
    aggregated().to_csv(r'agg-1.csv', index=True, header=True)
CodePudding user response:
You can use:
out = df.groupby(['created_at', 'category'], sort=False, as_index=False)['tweet'] \
.apply(lambda x: ' '.join(x))[df.columns]
print(out)
Output:
>>> out
created_at tweet category
0 7/29/2021 Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan sports
1 7/29/2021 President Made a clear statement An election is to be kept next year politics
2 7/29/2021 A terrorist attack have killed 10 people crime
3 8/29/2021 Srilanka have lost the T20 series Australia have won the series sports
4 8/29/2021 Minister have given up his role last monday President is challenging the opposite leader politics
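For anyone who wants to reproduce this without a CSV file on disk, here is a minimal self-contained sketch of the same grouping. It assumes the sample rows from the question are loaded through `io.StringIO` (my substitution for `data_set.csv`), and it uses `.agg(" ".join)` instead of `.apply(lambda ...)`, which should be equivalent for this purpose.

```python
import io

import pandas as pd

# Sample rows copied from the question; StringIO stands in for data_set.csv.
CSV = """created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Score put on by England batting,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Olympic is to held in Japan,sports
7/29/2021,A terrorist attack have killed 10 people,crime
7/29/2021,An election is to be kept next year,politics
8/29/2021,Srilanka have lost the T20 series,sports
8/29/2021,Australia have won the series,sports
8/29/2021,Minister have given up his role last monday,politics
8/29/2021,President is challenging the opposite leader,politics
"""

df = pd.read_csv(io.StringIO(CSV))

# Group on both keys; sort=False keeps groups in order of first appearance,
# as_index=False keeps the keys as ordinary columns.
out = df.groupby(["created_at", "category"], sort=False, as_index=False)["tweet"].agg(" ".join)

# Restore the original column order: created_at, tweet, category.
out = out[df.columns]
print(out)
```

Dropping `sort=False` sorts the groups by key instead, which is why the second answer lists `crime` first.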
CodePudding user response:
Assuming df is your example data: first collect the tweet column into a list per group with groupby, then join each list with apply.
df = df.groupby(["created_at", "category"], as_index=False)["tweet"].agg(lambda x: list(x))
df["tweet"] = df["tweet"].apply(lambda x: " ".join(x))
df = df.reindex(columns=["created_at", "tweet", "category"])
df
output:
created_at tweet category
0 7/29/2021 A terrorist attack have killed 10 people crime
1 7/29/2021 President Made a clear statement An election i... politics
2 7/29/2021 Great Sunny day for Cricket at London Great Sc... sports
3 8/29/2021 Minister have given up his role last monday Pr... politics
4 8/29/2021 Srilanka have lost the T20 series Australia ha... sports
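One more note: the code in the question also parsed `created_at` as a datetime and dropped duplicate tweets with `set(x)`, which does not preserve tweet order. If you want to keep both behaviours while grouping by category, something like the sketch below might work. The helper name `join_unique` and the small sample (including the duplicated row) are made up for illustration, and the `%m/%d/%Y` format is an assumption based on the dates shown in the question.

```python
import io

import pandas as pd

# Hypothetical mini dataset with one duplicated tweet, for illustration only.
CSV = """created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Olympic is to held in Japan,sports
8/29/2021,Australia have won the series,sports
"""


def join_unique(texts):
    # dict.fromkeys drops duplicates but keeps first-seen order, unlike set().
    return " ".join(dict.fromkeys(texts))


df = pd.read_csv(io.StringIO(CSV))

# Parse the dates (assumed month/day/year) and truncate to whole days.
df["created_at"] = pd.to_datetime(df["created_at"], format="%m/%d/%Y").dt.normalize()
df["tweet"] = df["tweet"].astype(str)

out = df.groupby(["created_at", "category"], sort=False, as_index=False)["tweet"].agg(join_unique)
out = out[["created_at", "tweet", "category"]]
print(out)
```

Grouping on the normalized date column rather than `pd.Grouper(freq='1D')` is a deliberate simplification here: it only produces rows for dates that actually occur in the data.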