I have a chat dataframe in which the same user often sends several messages in a row.
I'm trying to merge each run of consecutive messages from the same user into a single row, like this:
author_username  Content
Denise           I want to die so bad. I don’t feel the need to do anything but with an exam coming up, she threw me away like trash. With all the pressure, I don’t want to live.
Kenton           Please stay strong, I can feel you. My test just ended next week, back then i feel i don't have hope, and when pandemic first started. I lost contact With all my friends.
Denise           Oh
Kenton           But look at me now
Denise           I cant see you
Kenton           ? wdym?
Denise           I can't see you
Kenton           I know. That is a sentence that people use to make example of themself. So I use that sentence
Denise           Ok sry
I have tried
df[['content','author_username']].groupby(['author_username'])['content'].transform(lambda x: ','.join(x))
and this too
for index, row in df.iterrows():
    if index > 0 and row['author_username'] == df.loc[index - 1]['author_username']:
        df.loc[index]['concatenated_message'] = df.loc[index - 1]['concatenated_message'] + ' ' + row['content']
    else:
        df.loc[index]['concatenated_message'] = row['content']
but it doesn't produce the expected result.
I also tried
df3 = df2.groupby(['author_username']).agg({'content': ' '.join})
but it just groups all of each user's chats under one name instead of keeping them in order.
CodePudding user response:
Setup
print(df)
author_username content
0 user1 x
1 user1 y
2 user2 z
3 user3 p
4 user3 q
5 user3 r
6 user2 s
7 user2 t
To merge the chats in order of appearance, compare each row of the author_username column with the previous row. This gives a boolean mask that is True wherever the author changes. Taking the cumsum of this mask assigns one id to each block of consecutive messages from the same user. Then group the dataframe on these block ids and aggregate:
m = df['author_username'] != df['author_username'].shift()
df.groupby(m.cumsum(), as_index=False).agg({'author_username': 'first', 'content': ' '.join})
Result
author_username content
0 user1 x y
1 user2 z
2 user3 p q r
3 user2 s t
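To see why this works, it can help to print the intermediate values. The following is a minimal, self-contained sketch that rebuilds the sample dataframe from the Setup above. The names block_id and merged are only illustrative, and it uses reset_index(drop=True) instead of as_index=False, which is a small variation on the code above rather than the exact same call.

```python
import pandas as pd

# Same sample data as in the Setup section
df = pd.DataFrame({
    "author_username": ["user1", "user1", "user2", "user3",
                        "user3", "user3", "user2", "user2"],
    "content": ["x", "y", "z", "p", "q", "r", "s", "t"],
})

# True wherever the author differs from the previous row.
# The first row is always True because shift() leaves NaN there.
m = df["author_username"] != df["author_username"].shift()

# Running count of the True values: one id per block of consecutive rows.
# Renamed to "block" (an illustrative name) so it does not share a name
# with the author_username column.
block_id = m.cumsum().rename("block")

# Group on the block ids, keep the first author of each block and
# join that block's messages with a space.
merged = (
    df.groupby(block_id, sort=False)
      .agg({"author_username": "first", "content": " ".join})
      .reset_index(drop=True)
)

print(m.tolist())         # [True, False, True, True, False, False, True, False]
print(block_id.tolist())  # [1, 1, 2, 3, 3, 3, 4, 4]
print(merged)
```

Because user2 appears in two separate blocks (ids 2 and 4), those messages stay separate, which is the difference from grouping on author_username alone.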