I have a transcription from an interview which I've converted into a df (I have only recently started learning the pandas framework). The structure of the df is as follows:
Timestamp Content Speaker
00:00:00.000-00:00:01.100 Content Speaker 1
00:00:03.470-00:00:07.760 Content Speaker 1
00:00:09.340-00:00:13.690 Content Speaker 1
00:00:13.690-00:00:13.760 Content Speaker 2
00:00:14.550-00:00:14.880 Content Speaker 2
00:00:17.000-00:00:30.510 Content Speaker 1
00:00:30.740-00:00:41.930 Content Speaker 2
What I want to achieve is to collapse/merge (sorry, not sure which term is most suitable here) consecutive rows from the same speaker and retain the start and end time of that speaker's turn. The output should look like the one below:
Timestamp Content Speaker
00:00:00.000-00:00:13.690 ContentContentContent Speaker 1
00:00:13.690-00:00:14.880 ContentContent Speaker 2
00:00:14.880-00:00:30.510 Content Speaker 1
00:00:30.740-00:00:41.930 Content Speaker 2
I don't think I can use groupby because it will not retain the hierarchical order of the speakers, e.g. it will place all the rows for speaker 1 on top, then all the rows of speaker 2 further down. And from my understanding I cannot use merge or concatenate. I have the feeling this will be a much more complex operation than I was expecting. Any help/advice/pointers will be much appreciated. Thank you in advance.
CodePudding user response:
You can use GroupBy.agg and use the inequality between successive rows as the grouper:
(df
 .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum(),
          as_index=False
          )
 .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first'})
)
output:
Timestamp Content Speaker
0 00:00:09.340-00:00:13.690 ContentContentContent Speaker 1
1 00:00:14.550-00:00:14.880 ContentContent Speaker 2
2 00:00:17.000-00:00:30.510 Content Speaker 1
3 00:00:30.740-00:00:41.930 Content Speaker 2
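To see why the grouper keeps the speakers in their original order, it can help to look at the intermediate values. The following is a minimal sketch on the Speaker column from the question only; the variable names changed and run_id are just illustrative:

```python
import pandas as pd

# Speaker column from the question, rebuilt by hand for illustration.
speakers = pd.Series(
    ["Speaker 1", "Speaker 1", "Speaker 1",
     "Speaker 2", "Speaker 2", "Speaker 1", "Speaker 2"],
    name="Speaker",
)

# True wherever a row's speaker differs from the previous row's speaker.
# The first row is compared with NaN, so it is always True.
changed = speakers.ne(speakers.shift())

# The running sum of those flags gives each consecutive run its own id.
run_id = changed.cumsum()

print(run_id.tolist())  # [1, 1, 1, 2, 2, 3, 4]
```

Because the ids only ever increase, the second block of Speaker 1 gets id 3 rather than being folded back into id 1, which is what a plain groupby on the Speaker column would do.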
update: reworking the Timestamps:
df2 = (df
   .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum())
   .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first'})
   .assign(Timestamp=lambda d: (s:=d['Timestamp'].str.extract('-(.*)', expand=False))
                               .shift(fill_value='00:00:00.000')
                               + '-' + s
          )
)
output:
Timestamp Content Speaker
Speaker
1 00:00:00.000-00:00:13.690 ContentContentContent Speaker 1
2 00:00:13.690-00:00:14.880 ContentContent Speaker 2
3 00:00:14.880-00:00:30.510 Content Speaker 1
4 00:00:30.510-00:00:41.930 Content Speaker 2
CodePudding user response:
You can compare adjacent cells and filter the df.
UPDATE
Following the comments, here is an update. It builds on @mozway's answer, since groupby is needed now (I think):
df[['start', 'end']] = df['Timestamp'].str.split('-', expand=True)
res = (df
   .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum())
   .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first', 'start': 'first', 'end': 'last'})
   .assign(Timestamp=lambda x: x[['start', 'end']].apply('-'.join, axis=1))
   .drop(['start', 'end'], axis=1)
)
print(res)
Timestamp Content Speaker
Speaker
1 00:00:00.000-00:00:13.690 ContentContentContent Speaker 1
2 00:00:13.690-00:00:14.880 ContentContent Speaker 2
3 00:00:17.000-00:00:30.510 Content Speaker 1
4 00:00:30.740-00:00:41.930 Content Speaker 2
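For anyone who wants to run this end to end, here is a self-contained sketch of the same idea (first start and last end per run). The sample data is copied from the question; the helper names bounds and run_id and the use of named aggregation are my own choices, not part of the answer above, and the original df is left unmodified:

```python
import pandas as pd

df = pd.DataFrame({
    "Timestamp": [
        "00:00:00.000-00:00:01.100",
        "00:00:03.470-00:00:07.760",
        "00:00:09.340-00:00:13.690",
        "00:00:13.690-00:00:13.760",
        "00:00:14.550-00:00:14.880",
        "00:00:17.000-00:00:30.510",
        "00:00:30.740-00:00:41.930",
    ],
    "Content": ["Content"] * 7,
    "Speaker": ["Speaker 1"] * 3 + ["Speaker 2"] * 2 + ["Speaker 1", "Speaker 2"],
})

# Split "start-end" into two helper columns (names are illustrative).
bounds = df["Timestamp"].str.split("-", expand=True)
bounds.columns = ["start", "end"]

# One id per consecutive run of the same speaker.
run_id = df["Speaker"].ne(df["Speaker"].shift()).cumsum().rename("run")

res = (
    df.join(bounds)
    .groupby(run_id)
    .agg(
        start=("start", "first"),        # start of the first row in the run
        end=("end", "last"),             # end of the last row in the run
        Content=("Content", "".join),    # concatenate the text of the run
        Speaker=("Speaker", "first"),
    )
    .reset_index(drop=True)
)

# Rebuild the "start-end" string and restore the original column order.
res["Timestamp"] = res["start"] + "-" + res["end"]
res = res[["Timestamp", "Content", "Speaker"]]
print(res)
```

Note that this keeps the real start of each run (00:00:17.000 for the third row), so silences between speakers stay visible. If the segments should be contiguous as in the question's expected output, the shifted-end approach from the other answer is the one to use.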
Old answer:
res = df[df['Speaker'].ne(df['Speaker'].shift(-1))]
print(res)
Timestamp Content Speaker
2 00:00:09.340-00:00:13.690 Content Speaker 1
4 00:00:14.550-00:00:14.880 Content Speaker 2
5 00:00:17.000-00:00:30.510 Content Speaker 1
6 00:00:30.740-00:00:41.930 Content Speaker 2
Here is how that works:
mask = df['Speaker'].ne(df['Speaker'].shift(-1))
print(mask)
0 False
1 False
2 True
3 False
4 True
5 True
6 True
Name: Speaker, dtype: bool
The mask is only True where the next row has a different speaker (and on the final row, which has no next row), so you will always get the last row of each consecutive run of a speaker.
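The direction of the shift decides which end of each run is selected. As a small sketch on the Speaker column from the question (is_last and is_first are illustrative names), shift(-1) compares with the next row and shift() with the previous one:

```python
import pandas as pd

speaker = pd.Series(
    ["Speaker 1"] * 3 + ["Speaker 2"] * 2 + ["Speaker 1", "Speaker 2"],
    name="Speaker",
)

is_last = speaker.ne(speaker.shift(-1))  # differs from the next row
is_first = speaker.ne(speaker.shift())   # differs from the previous row

print(speaker.index[is_last].tolist())   # [2, 4, 5, 6]
print(speaker.index[is_first].tolist())  # [0, 3, 5, 6]
```

Rows 5 and 6 appear in both lists because each of them is a run of length one.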