I have dialog data that looks like the table below:
speaker_label | start_time | end_time | text |
---|---|---|---|
Speaker 0 | 00:00:06 | 00:00:06 | Hi |
Speaker 0 | 00:00:06 | 00:00:06 | John |
Speaker 0 | 00:00:06 | 00:00:06 | , |
Speaker 0 | 00:00:06 | 00:00:06 | how |
Speaker 0 | 00:00:07 | 00:00:07 | are |
Speaker 0 | 00:00:07 | 00:00:07 | you |
Speaker 0 | 00:00:07 | 00:00:08 | ? |
Speaker 1 | 00:00:08 | 00:00:08 | Hello |
Speaker 1 | 00:00:08 | 00:00:08 | I'm |
Speaker 1 | 00:00:08 | 00:00:08 | good |
Speaker 1 | 00:00:09 | 00:00:09 | . |
Speaker 1 | 00:00:09 | 00:00:09 | You |
Speaker 1 | 00:00:09 | 00:00:09 | ? |
Speaker 0 | 00:00:10 | 00:00:10 | Good |
Speaker 0 | 00:00:10 | 00:00:10 | , |
Speaker 0 | 00:00:10 | 00:00:10 | good |
Speaker 0 | 00:00:10 | 00:00:11 | . |
I need to transform the table to look like this:
speaker_label | start_time | end_time | text |
---|---|---|---|
Speaker 0 | 00:00:06 | 00:00:08 | Hi John, how are you? |
Speaker 1 | 00:00:08 | 00:00:09 | Hello I'm good. You? |
Speaker 0 | 00:00:10 | 00:00:11 | Good, good. |
The text column should be concatenated for each consecutive run of rows with the same speaker label, and the start and end times should span that same run.
Is there an efficient way (iterrows, itertuples, lambda) to transform my table to the desired state?
Thanks in advance to anyone who can provide ideas. Answers to similar problems are welcome too.
CodePudding user response:
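You can use the groupby function with a different aggregation method for each column:
df.groupby("speaker_label").agg({"start_time": "min", "end_time": "max", "text": " ".join})
Note that this groups on the label alone, so all rows for a given speaker end up in a single row, even when their turns are not adjacent.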
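To check what grouping on the label alone returns, here is a minimal sketch on a cut-down sample. The four rows and the name `per_speaker` are made up for illustration, not taken from the question's data.

```python
import pandas as pd

# Hypothetical cut-down sample in the same shape as the question's table.
df = pd.DataFrame(
    {
        "speaker_label": ["Speaker 0", "Speaker 0", "Speaker 1", "Speaker 0"],
        "start_time": ["00:00:06", "00:00:07", "00:00:08", "00:00:10"],
        "end_time": ["00:00:06", "00:00:08", "00:00:09", "00:00:11"],
        "text": ["Hi", "John", "Hello", "Good"],
    }
)

# Grouping on the label alone yields one row per speaker, not one per turn:
# both "Speaker 0" turns are merged, although "Speaker 1" spoke in between.
per_speaker = df.groupby("speaker_label").agg(
    {"start_time": "min", "end_time": "max", "text": " ".join}
)
print(per_speaker)
```

With this sample the result has two rows, so it would not reproduce the three-row table the question asks for.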
CodePudding user response:
You can use a custom groupby.agg that groups consecutive rows from the same speaker:
# The group id increases by one each time the speaker changes
group = df['speaker_label'].ne(df['speaker_label'].shift()).cumsum()
out = (df.groupby([group, 'speaker_label'], as_index=False)
.agg({'start_time': 'min', 'end_time': 'max', 'text': ' '.join})
)
output:
speaker_label start_time end_time text
0 Speaker 0 00:00:06 00:00:08 Hi John , how are you ?
1 Speaker 1 00:00:08 00:00:09 Hello I'm good . You ?
2 Speaker 0 00:00:10 00:00:11 Good , good .
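For anyone who wants to run this end to end, here is a self-contained sketch of the same consecutive-run idea on the question's data. It is one possible way to write it, not the only one. The names `turn` and `rows`, the named-aggregation style, and the final regex that removes the space before punctuation are my own additions. It also assumes the times are zero-padded `HH:MM:SS` strings, so that string `min`/`max` matches chronological order.

```python
import pandas as pd

rows = [
    ("Speaker 0", "00:00:06", "00:00:06", "Hi"),
    ("Speaker 0", "00:00:06", "00:00:06", "John"),
    ("Speaker 0", "00:00:06", "00:00:06", ","),
    ("Speaker 0", "00:00:06", "00:00:06", "how"),
    ("Speaker 0", "00:00:07", "00:00:07", "are"),
    ("Speaker 0", "00:00:07", "00:00:07", "you"),
    ("Speaker 0", "00:00:07", "00:00:08", "?"),
    ("Speaker 1", "00:00:08", "00:00:08", "Hello"),
    ("Speaker 1", "00:00:08", "00:00:08", "I'm"),
    ("Speaker 1", "00:00:08", "00:00:08", "good"),
    ("Speaker 1", "00:00:09", "00:00:09", "."),
    ("Speaker 1", "00:00:09", "00:00:09", "You"),
    ("Speaker 1", "00:00:09", "00:00:09", "?"),
    ("Speaker 0", "00:00:10", "00:00:10", "Good"),
    ("Speaker 0", "00:00:10", "00:00:10", ","),
    ("Speaker 0", "00:00:10", "00:00:10", "good"),
    ("Speaker 0", "00:00:10", "00:00:11", "."),
]
df = pd.DataFrame(rows, columns=["speaker_label", "start_time", "end_time", "text"])

# True wherever the speaker differs from the previous row; the cumulative
# sum turns those change points into a run id (1, 1, ..., 2, 2, ..., 3, ...).
turn = df["speaker_label"].ne(df["speaker_label"].shift()).cumsum().rename("turn")

out = (
    df.groupby(turn)
    .agg(
        speaker_label=("speaker_label", "first"),
        start_time=("start_time", "min"),
        end_time=("end_time", "max"),
        text=("text", " ".join),
    )
    .reset_index(drop=True)
)

# Joining with spaces leaves "John , how"; drop whitespace before , . ? !
out["text"] = out["text"].str.replace(r"\s+([,.?!])", r"\1", regex=True)
print(out)
```

Grouping on the run id rather than the label keeps the two `Speaker 0` turns separate. The punctuation step is optional and only covers the four marks listed in the pattern.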