Combining rows in a DataFrame

Time:01-01

I have the pandas DataFrame shown below, which consists of three columns.

import pandas as pd
data = [[1, "User1", "Hello."], [1, "User1", "How are you?"], [1, "User2", "I'm fine."], [2, "User1", "Nice to meet you."], [2, "User2", "Hello."], [2, "User2", "I'm happy."], [2, "User2", "Goodbye."], [3, "User2", "Hello."]]
df = pd.DataFrame(data, columns=['Conversation', 'User', 'Text'])
   Conversation   User               Text
0             1  User1             Hello.
1             1  User1       How are you?
2             1  User2          I'm fine.
3             2  User1  Nice to meet you.
4             2  User2             Hello.
5             2  User2         I'm happy.
6             2  User2           Goodbye.
7             3  User2             Hello.

I would like to merge the Text of groups of consecutive Users, but not over conversation boundaries. If in a Conversation a User has several consecutive rows, I would like to merge these rows into one row by combining the Text with whitespace. When a new Conversation starts, it should not be combined. For the example, the result should look as follows:

   Conversation   User                        Text
0             1  User1         Hello. How are you?
2             1  User2                   I'm fine.
3             2  User1           Nice to meet you.
4             2  User2  Hello. I'm happy. Goodbye.
7             3  User2                      Hello.

How can this be achieved in an efficient way (I have a big DataFrame)?

Answer:

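I changed the example data so that it exercises the specified conditions: a user whose rows are interrupted by another user, and a conversation ID that reappears later. One way to do this is to add a temporary Boundary column that increments whenever the conversation or the user changes from one row to the next, then group by Boundary, Conversation and User and join the Text. The Boundary column can be dropped afterwards if it is not needed.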
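
To see why a shift-and-compare followed by a cumulative sum yields one label per consecutive run, here is a minimal sketch on toy data. The Series `users` and the names `changed` and `run_id` are made up for illustration and are not part of the question.

```python
import pandas as pd

# Toy data (hypothetical, not from the question): one speaker label per row.
users = pd.Series(["A", "A", "B", "A", "A", "A"])

# True wherever the value differs from the row above. The first row is
# compared against NaN, so it always counts as a change.
changed = users != users.shift()

# The running total of the change flags gives each consecutive run its own id.
run_id = changed.cumsum()

print(run_id.tolist())  # [1, 1, 2, 3, 3, 3]
```

Note that the second run of "A" gets a different id (3) than the first (1), which is what keeps non-adjacent rows of the same user apart when grouping.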

import pandas as pd

data = [
    [1, "User1", "Hello."], 
    [1, "User2", "How are you?"], 
    [1, "User1", "I'm fine."], 
    [2, "User1", "Nice to meet you."], 
    [2, "User2", "Hello."], 
    [2, "User2", "I'm happy."], 
    [2, "User2", "Goodbye."], 
    [1, "User1", "Bye now."]]
df = pd.DataFrame(data, columns=['Conversation', 'User', 'Text'])

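
# Label each run of consecutive rows that share the same conversation and user.
key = df['Conversation'].astype(str) + df['User']
df['Boundary'] = (key != key.shift()).cumsum()
# Join the text within each run, then drop the helper column from the result.
print(df.groupby(['Boundary', 'Conversation', 'User'], as_index=False, sort=True).agg(' '.join).drop(columns=['Boundary']))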

Output:

   Conversation   User                        Text
0             1  User1                      Hello.
1             1  User2                How are you?
2             1  User1                   I'm fine.
3             2  User1           Nice to meet you.
4             2  User2  Hello. I'm happy. Goodbye.
5             1  User1                    Bye now.
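
The question's desired result keeps the row label of the first row in each group (0, 2, 3, 4, 7). A possible variant that does this on the question's original data is sketched below. It is an alternative, not the method above: it compares the two key columns directly instead of building a string key, and it adds no helper column to `df`. The names `keys`, `new_run`, `run_id` and `merged` are illustrative, and the sketch assumes `df` has no existing column called `index`.

```python
import pandas as pd

data = [[1, "User1", "Hello."], [1, "User1", "How are you?"],
        [1, "User2", "I'm fine."], [2, "User1", "Nice to meet you."],
        [2, "User2", "Hello."], [2, "User2", "I'm happy."],
        [2, "User2", "Goodbye."], [3, "User2", "Hello."]]
df = pd.DataFrame(data, columns=["Conversation", "User", "Text"])

keys = df[["Conversation", "User"]]
# A row starts a new run when either key column differs from the previous row.
new_run = keys.ne(keys.shift()).any(axis=1)
run_id = new_run.cumsum()

merged = (
    df.reset_index()  # expose the original row labels as a column named "index"
      .groupby(run_id.to_numpy(), sort=False)
      .agg({"index": "first", "Conversation": "first",
            "User": "first", "Text": " ".join})
      .set_index("index")   # restore the first row label of each run
      .rename_axis(None)
)
print(merged)
```

On this data, `merged` has the row labels 0, 2, 3, 4 and 7, and the Text of rows 0 and 4 becomes "Hello. How are you?" and "Hello. I'm happy. Goodbye." respectively.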