I have a Pandas DataFrame shown below consisting of three columns.
import pandas as pd
data = [[1, "User1", "Hello."], [1, "User1", "How are you?"], [1, "User2", "I'm fine."], [2, "User1", "Nice to meet you."], [2, "User2", "Hello."], [2, "User2", "I'm happy."], [2, "User2", "Goodbye."], [3, "User2", "Hello."]]
df = pd.DataFrame(data, columns=['Conversation', 'User', 'Text'])
Conversation User Text
0 1 User1 Hello.
1 1 User1 How are you?
2 1 User2 I'm fine.
3 2 User1 Nice to meet you.
4 2 User2 Hello.
5 2 User2 I'm happy.
6 2 User2 Goodbye.
7 3 User2 Hello.
I would like to merge the Text of groups of consecutive Users, but not over conversation boundaries. If in a Conversation a User has several consecutive rows, I would like to merge these rows into one row by combining the Text with whitespace. When a new Conversation starts, it should not be combined. For the example, the result should look as follows:
Conversation User Text
0 1 User1 Hello. How are you?
2 1 User2 I'm fine.
3 2 User1 Nice to meet you.
4 2 User2 Hello. I'm happy. Goodbye.
7 3 User2 Hello.
How can this be achieved in an efficient way (I have a big DataFrame)?
CodePudding user response:
I've changed the example to include the conditions that were specified. One way to do this is to introduce a temporary boundary column, and then group by the boundary, conversation and user. If so desired the boundary column can be removed.
import pandas as pd
data = [
[1, "User1", "Hello."],
[1, "User2", "How are you?"],
[1, "User1", "I'm fine."],
[2, "User1", "Nice to meet you."],
[2, "User2", "Hello."],
[2, "User2", "I'm happy."],
[2, "User2", "Goodbye."],
[1, "User1", "Bye now."]]
df = pd.DataFrame(data, columns=['Conversation', 'User', 'Text'])
df['Boundary'] = ((df['Conversation'].astype(str) df['User']) != (df['Conversation'].astype(str) df['User']).shift()).cumsum()
print(df.groupby(['Boundary','Conversation', 'User'], as_index=False, sort=True).agg(' '.join).drop(columns=['Boundary']))
Output:
Conversation User Text
0 1 User1 Hello.
1 1 User2 How are you?
2 1 User1 I'm fine.
3 2 User1 Nice to meet you.
4 2 User2 Hello. I'm happy. Goodbye.
5 1 User1 Bye now.