Merging rows with same username in order

Time:12-15

I have a chat dataframe with an author_username column and a content column.

I'm trying to merge the rows where the same user has replied several times in a row, so that each run of consecutive messages becomes a single row, like this:

author_username    Content
Denise             I want to die so bad. I don’t feel the need to do anything but
                   with an exam coming up, she threw me away like trash. With all the 
                   pressure, I don’t want to live.
Kenton             Please stay strong, I can feel you. My test just ended next week,
                   back then i feel i don't have hope, and when pandemic first 
                   started. I lost contact With all my friends.
Denise             Oh
Kenton             But look at me now
Denise             I cant see you
Kenton             ? wdym?
Denise             I can't see you
Kenton             I know. That is a sentence that people use to make example of 
                   themself. So I use that sentence
Denise             Ok sry

I have tried

df[['content','author_username']].groupby(['author_username'])['content'].transform(lambda x: ','.join(x))

and this too

for index, row in df.iterrows():
  if index > 0 and row['author_username'] == df.loc[index - 1]['author_username']:
    df.loc[index]['concatenated_message'] = df.loc[index - 1]['concatenated_message'] + ' ' + row['content']
  else:
    df.loc[index]['concatenated_message'] = row['content']

but it doesn't produce the expected result.

I also tried

df3 = df2.groupby(['author_username']).agg({'content': ' '.join})

but it just groups all of each user's chats under one name instead of keeping them in order of appearance.

CodePudding user response:

Setup

print(df)

  author_username content
0           user1       x
1           user1       y
2           user2       z
3           user3       p
4           user3       q
5           user3       r
6           user2       s
7           user2       t

To merge the chats in order of appearance, compare each row's author_username with the previous row's to build a boolean mask that is True wherever the author changes. Taking cumsum of this mask gives every run of consecutive chats from the same user its own block number. Then group the dataframe on these block numbers and aggregate, keeping the first username and joining the content.

m = df['author_username'] != df['author_username'].shift()
df.groupby(m.cumsum(), as_index=False).agg({'author_username': 'first', 'content': ' '.join})

Result

  author_username content
0           user1     x y
1           user2       z
2           user3   p q r
3           user2     s t