How to Aggregate data based on multiple columns in Python-CodePudding

I'm trying to aggregate text fields based on date and category columns.
And below is how the initial dataset looks like

created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London,sports
7/29/2021,Great Score put on by England batting,sports
7/29/2021,President Made a clear statement,politics
7/29/2021,Olympic is to held in Japan,sports
7/29/2021,A terrorist attack have killed 10 people,crime
7/29/2021,An election is to be kept next year,politics
8/29/2021,Srilanka have lost the T20 series,sports
8/29/2021,Australia have won the series,sports
8/29/2021,Minister have given up his role last monday,politics
8/29/2021,President is challenging the opposite leader,politics

So expected output that I want to get is the below

created_at,tweet,category
7/29/2021,Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan,sports
7/29/2021,President Made a clear statement An election is to be kept next year,politics
7/29/2021,A terrorist attack have killed 10 people,crime
8/29/2021,Srilanka have lost the T20 series Australia have won the series,sports
8/29/2021,Minister have given up his role last monday President is challenging the opposite leader,politics

As per the example I actually want to aggregate tweet text based on date and category. Below is how I used to aggregate without considering category, where I'm in need of aggregation as per the output above.It would be very helpful if anyone can answer this

import pandas as pd

def aggregated():
    tweets = pd.read_csv(r'data_set.csv')
    df = pd.DataFrame(tweets, columns=['created_at', 'tweet'])
    df['created_at'] = pd.to_datetime(df['created_at'])
    df['tweet'] = df['tweet'].apply(lambda x: str(x))
    pd.set_option('display.max_colwidth', 0)
    df = df.groupby(pd.Grouper(key='created_at', freq='1D')).agg(lambda x: ' '.join(set(x)))
    return df


# Driver code
if __name__ == '__main__':
    print(aggregated())
    aggregated().to_csv(r'agg-1.csv',index = True, header=True)

CodePudding user response：

You can use:

out = df.groupby(['created_at', 'category'], sort=False, as_index=False)['tweet'] \
        .apply(lambda x: ' '.join(x))[df.columns]
print(out)

Output:

>>> out
  created_at                                                                                                    tweet  category
0  7/29/2021  Great Sunny day for Cricket at London Great Score put on by England batting Olympic is to held in Japan    sports
1  7/29/2021                                     President Made a clear statement An election is to be kept next year  politics
2  7/29/2021                                                                 A terrorist attack have killed 10 people     crime
3  8/29/2021                                          Srilanka have lost the T20 series Australia have won the series    sports
4  8/29/2021                 Minister have given up his role last monday President is challenging the opposite leader  politics

CodePudding user response：

df is your example at first tweet column make list with groupby and join list by apply

df = df.groupby(["created_at", "category"], as_index=False)["tweet"].agg(lambda x: list(x))
df["tweet"] = df1["tweet"].apply(lambda x:" ".join(x))
df = df.reindex(columns=["created_at", "tweet", "category"])
df

output:

    created_at  tweet   category
0   7/29/2021   A terrorist attack have killed 10 people    crime
1   7/29/2021   President Made a clear statement An election i...   politics
2   7/29/2021   Great Sunny day for Cricket at London Great Sc...   sports
3   8/29/2021   Minister have given up his role last monday Pr...   politics
4   8/29/2021   Srilanka have lost the T20 series Australia ha...   sports