Concatenate and transform dialog data on pandas

Time:10-18

I have dialog data that looks like the table below:

speaker_label start_time end_time text
Speaker 0 00:00:06 00:00:06 Hi
Speaker 0 00:00:06 00:00:06 John
Speaker 0 00:00:06 00:00:06 ,
Speaker 0 00:00:06 00:00:06 how
Speaker 0 00:00:07 00:00:07 are
Speaker 0 00:00:07 00:00:07 you
Speaker 0 00:00:07 00:00:08 ?
Speaker 1 00:00:08 00:00:08 Hello
Speaker 1 00:00:08 00:00:08 I'm
Speaker 1 00:00:08 00:00:08 good
Speaker 1 00:00:09 00:00:09 .
Speaker 1 00:00:09 00:00:09 You
Speaker 1 00:00:09 00:00:09 ?
Speaker 0 00:00:10 00:00:10 Good
Speaker 0 00:00:10 00:00:10 ,
Speaker 0 00:00:10 00:00:10 good
Speaker 0 00:00:10 00:00:11 .

I need to transform the table to look like this:

speaker_label start_time end_time text
Speaker 0 00:00:06 00:00:07 Hi John, how are you?
Speaker 1 00:00:08 00:00:09 Hello I'm good. You?
Speaker 0 00:00:10 00:00:11 Good, good.

The text column should be concatenated for each consecutive run of rows that share the same speaker label. The start and end times should then be the earliest start time and the latest end time of that run.

Is there an efficient way (iterrows, itertuples, lambda) to transform my table into the desired state?

Thanks in advance to anyone who can provide ideas. Pointers to similar answers are also welcome.

CodePudding user response:

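You can use the groupby function with a different aggregation method for each column. Note that grouping on speaker_label alone produces one row per speaker, so separate turns by the same speaker are merged:

df.groupby("speaker_label").agg({"start_time": "min", "end_time": "max", "text": " ".join})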
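To see what a plain per-speaker aggregation returns on the sample data from the question, here is a small sketch. The DataFrame is rebuilt by hand from the question's table, and the variable names (rows, per_speaker) are illustrative only.

```python
import pandas as pd

# Sample rows copied from the question (one token per row).
rows = [
    ("Speaker 0", "00:00:06", "00:00:06", "Hi"),
    ("Speaker 0", "00:00:06", "00:00:06", "John"),
    ("Speaker 0", "00:00:06", "00:00:06", ","),
    ("Speaker 0", "00:00:06", "00:00:06", "how"),
    ("Speaker 0", "00:00:07", "00:00:07", "are"),
    ("Speaker 0", "00:00:07", "00:00:07", "you"),
    ("Speaker 0", "00:00:07", "00:00:08", "?"),
    ("Speaker 1", "00:00:08", "00:00:08", "Hello"),
    ("Speaker 1", "00:00:08", "00:00:08", "I'm"),
    ("Speaker 1", "00:00:08", "00:00:08", "good"),
    ("Speaker 1", "00:00:09", "00:00:09", "."),
    ("Speaker 1", "00:00:09", "00:00:09", "You"),
    ("Speaker 1", "00:00:09", "00:00:09", "?"),
    ("Speaker 0", "00:00:10", "00:00:10", "Good"),
    ("Speaker 0", "00:00:10", "00:00:10", ","),
    ("Speaker 0", "00:00:10", "00:00:10", "good"),
    ("Speaker 0", "00:00:10", "00:00:11", "."),
]
df = pd.DataFrame(rows, columns=["speaker_label", "start_time", "end_time", "text"])

# Grouping by speaker only: every row of a speaker lands in the same group,
# regardless of where it occurs in the dialog.
per_speaker = df.groupby("speaker_label").agg(
    {"start_time": "min", "end_time": "max", "text": " ".join}
)
print(per_speaker)
```

On this data the result has two rows, not three: both Speaker 0 turns end up in a single row spanning 00:00:06 to 00:00:11. The zero-padded HH:MM:SS strings compare correctly with min and max because they sort lexicographically.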

CodePudding user response:

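You can use a custom grouper with groupby.agg. The grouper increments each time the speaker changes, so every consecutive run of rows by the same speaker gets its own group:

# True where the speaker differs from the previous row; cumsum turns that into a run id
group = df['speaker_label'].ne(df['speaker_label'].shift()).cumsum()
out = (df.groupby(group.rename('turn'))
         .agg({'speaker_label': 'first', 'start_time': 'min',
               'end_time': 'max', 'text': ' '.join})
         .reset_index(drop=True)
      )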

output:

  speaker_label start_time  end_time                     text
0     Speaker 0   00:00:06  00:00:08  Hi John , how are you ?
1     Speaker 1   00:00:08  00:00:09   Hello I'm good . You ?
2     Speaker 0   00:00:10  00:00:11            Good , good .
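
The output above still has a space before each punctuation mark, while the desired table in the question does not. One possible cleanup, sketched below, is a regex replace after the join. The DataFrame is rebuilt by hand from the question, the name turn is illustrative, and the punctuation set [,.?!] is an assumption that only covers the marks seen in the sample.

```python
import pandas as pd

# Sample rows copied from the question (one token per row).
rows = [
    ("Speaker 0", "00:00:06", "00:00:06", "Hi"),
    ("Speaker 0", "00:00:06", "00:00:06", "John"),
    ("Speaker 0", "00:00:06", "00:00:06", ","),
    ("Speaker 0", "00:00:06", "00:00:06", "how"),
    ("Speaker 0", "00:00:07", "00:00:07", "are"),
    ("Speaker 0", "00:00:07", "00:00:07", "you"),
    ("Speaker 0", "00:00:07", "00:00:08", "?"),
    ("Speaker 1", "00:00:08", "00:00:08", "Hello"),
    ("Speaker 1", "00:00:08", "00:00:08", "I'm"),
    ("Speaker 1", "00:00:08", "00:00:08", "good"),
    ("Speaker 1", "00:00:09", "00:00:09", "."),
    ("Speaker 1", "00:00:09", "00:00:09", "You"),
    ("Speaker 1", "00:00:09", "00:00:09", "?"),
    ("Speaker 0", "00:00:10", "00:00:10", "Good"),
    ("Speaker 0", "00:00:10", "00:00:10", ","),
    ("Speaker 0", "00:00:10", "00:00:10", "good"),
    ("Speaker 0", "00:00:10", "00:00:11", "."),
]
df = pd.DataFrame(rows, columns=["speaker_label", "start_time", "end_time", "text"])

# Run id that increases by one each time the speaker changes.
turn = df["speaker_label"].ne(df["speaker_label"].shift()).cumsum().rename("turn")

# One row per consecutive run of the same speaker.
out = (
    df.groupby(turn)
      .agg({"speaker_label": "first", "start_time": "min",
            "end_time": "max", "text": " ".join})
      .reset_index(drop=True)
)

# Drop the whitespace that " ".join leaves in front of punctuation tokens.
# Assumption: only these four marks need handling.
out["text"] = out["text"].str.replace(r"\s+([,.?!])", r"\1", regex=True)
print(out)
```

This keeps the whole transformation vectorized, so no iterrows or itertuples loop is needed. Other token types (quotes, brackets, contractions split into separate rows) would need additional rules.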