Pandas - collapse rows on cell value but retain order

I have a transcription from an interview which I've converted into a df (I have only recently started learning the pandas framework). The structure of the df is as follows:

Timestamp                   Content Speaker
00:00:00.000-00:00:01.100   Content Speaker 1
00:00:03.470-00:00:07.760   Content Speaker 1
00:00:09.340-00:00:13.690   Content Speaker 1
00:00:13.690-00:00:13.760   Content Speaker 2
00:00:14.550-00:00:14.880   Content Speaker 2
00:00:17.000-00:00:30.510   Content Speaker 1
00:00:30.740-00:00:41.930   Content Speaker 2

What I want to achieve is to collapse/merge (sorry, not sure which term is most suitable here) consecutive rows of the same speaker and retain the start and end time for that part of the speaker. The output should look like the one below:

Timestamp                   Content                 Speaker
00:00:00.000-00:00:13.690   ContentContentContent   Speaker 1
00:00:13.690-00:00:14.880   ContentContent          Speaker 2
00:00:14.880-00:00:30.510   Content                 Speaker 1
00:00:30.740-00:00:41.930   Content                 Speaker 2

I don't think I can use groupby because it will not retain the hierarchical order of the speakers e.g. it will place all the rows for speaker 1 on top, then all the rows of speaker 2 further down. And from my understanding I cannot use merge or concatenate. I have the feeling this will be a much more complex operation than I was expecting. Any help/advice/pointers will be much appreciated. Thank you in advance.

CodePudding user response：

You can use GroupBy.agg, with the inequality between successive rows as the grouper (this first version keeps only the last timestamp of each group):

(df
 .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum(),
          as_index=False
         )
 .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first'})
)

output:

                   Timestamp                Content    Speaker
0  00:00:09.340-00:00:13.690  ContentContentContent  Speaker 1
1  00:00:14.550-00:00:14.880         ContentContent  Speaker 2
2  00:00:17.000-00:00:30.510                Content  Speaker 1
3  00:00:30.740-00:00:41.930                Content  Speaker 2
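
To see why this grouper keeps the row order, it can help to print the intermediate steps. This is a minimal sketch on the Speaker column from the question; the names `changed` and `run_id` are only for illustration and are not part of the answer above.

```python
import pandas as pd

# Speaker column from the question's sample data
speaker = pd.Series(
    ["Speaker 1", "Speaker 1", "Speaker 1",
     "Speaker 2", "Speaker 2", "Speaker 1", "Speaker 2"],
    name="Speaker",
)

# True wherever a row's speaker differs from the row above it
# (the first row is compared against NaN, so it is always True)
changed = speaker.ne(speaker.shift())

# Running total of the changes: one id per consecutive run of a speaker
run_id = changed.cumsum()

print(run_id.tolist())  # [1, 1, 1, 2, 2, 3, 4]
```

Because the second block of Speaker 1 gets id 3 rather than 1, grouping on `run_id` keeps it separate from the first block, which a plain groupby on 'Speaker' would not.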

Update: reworking the timestamps so that each row starts where the previous one ended (the := operator needs Python 3.8+):

df2 = (df
 .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum())
 .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first'})
 # s = end time of each group; the start is the previous group's end
 .assign(Timestamp=lambda d: (s:=d['Timestamp'].str.extract('-(.*)', expand=False))
                              .shift(fill_value='00:00:00.000')
                              + '-' + s
        )
)

output:

                         Timestamp                Content    Speaker
Speaker                                                             
1        00:00:00.000-00:00:13.690  ContentContentContent  Speaker 1
2        00:00:13.690-00:00:14.880         ContentContent  Speaker 2
3        00:00:14.880-00:00:30.510                Content  Speaker 1
4        00:00:30.510-00:00:41.930                Content  Speaker 2
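
If the assignment expression is hard to read, or you are on a Python version older than 3.8, the same idea can be written in separate steps. This is only a sketch under the assumption that the df looks like the sample in the question; `run_id`, `start` and `end` are names I made up for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Timestamp": [
        "00:00:00.000-00:00:01.100", "00:00:03.470-00:00:07.760",
        "00:00:09.340-00:00:13.690", "00:00:13.690-00:00:13.760",
        "00:00:14.550-00:00:14.880", "00:00:17.000-00:00:30.510",
        "00:00:30.740-00:00:41.930",
    ],
    "Content": ["Content"] * 7,
    "Speaker": ["Speaker 1", "Speaker 1", "Speaker 1",
                "Speaker 2", "Speaker 2", "Speaker 1", "Speaker 2"],
})

# One id per consecutive run of the same speaker
run_id = df["Speaker"].ne(df["Speaker"].shift()).cumsum()

df2 = df.groupby(run_id).agg(
    {"Timestamp": "max", "Content": "".join, "Speaker": "first"}
)

# End time of each run: everything after the hyphen
end = df2["Timestamp"].str.extract("-(.*)", expand=False)
# Start time of each run: the previous run's end, zero for the first run
start = end.shift(fill_value="00:00:00.000")

df2["Timestamp"] = start + "-" + end
print(df2)
```

Note that this makes the intervals contiguous: any silence between two speakers is counted towards the speaker who talks next.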

CodePudding user response：

You can compare adjacent rows and filter the df (see the old answer below).

Update, following the comments: since the start of the first row and the end of the last row of each block are needed, a groupby is required after all (I think), so this builds on @mozway's answer:

df[['start', 'end']] = df['Timestamp'].str.split('-',expand=True)
res = (df
 .groupby(df['Speaker'].ne(df['Speaker'].shift()).cumsum())
 .agg({'Timestamp': 'max', 'Content': ''.join, 'Speaker': 'first', 'start':'first', 'end':'last'})
 .assign(Timestamp= lambda x: x[['start', 'end']].apply('-'.join, axis=1))
 .drop(['start', 'end'],axis=1)
)

print(res)

                         Timestamp                Content    Speaker
Speaker                                                             
1        00:00:00.000-00:00:13.690  ContentContentContent  Speaker 1
2        00:00:13.690-00:00:14.880         ContentContent  Speaker 2
3        00:00:17.000-00:00:30.510                Content  Speaker 1
4        00:00:30.740-00:00:41.930                Content  Speaker 2
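
A possible variation, not from the answer above: named aggregation lets you skip the placeholder 'Timestamp': 'max' entry, and working on a copy leaves the original df without the helper columns. Treat it as a sketch that assumes the sample data from the question; `parts`, `tmp` and `run_id` are illustrative names.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Timestamp": [
        "00:00:00.000-00:00:01.100", "00:00:03.470-00:00:07.760",
        "00:00:09.340-00:00:13.690", "00:00:13.690-00:00:13.760",
        "00:00:14.550-00:00:14.880", "00:00:17.000-00:00:30.510",
        "00:00:30.740-00:00:41.930",
    ],
    "Content": ["Content"] * 7,
    "Speaker": ["Speaker 1", "Speaker 1", "Speaker 1",
                "Speaker 2", "Speaker 2", "Speaker 1", "Speaker 2"],
})

# Split "start-end" into two helper columns on a copy of df
parts = df["Timestamp"].str.split("-", expand=True)
tmp = df.assign(start=parts[0], end=parts[1])

# One id per consecutive run of the same speaker
run_id = tmp["Speaker"].ne(tmp["Speaker"].shift()).cumsum()

res = (tmp
 .groupby(run_id)
 .agg(start=("start", "first"),      # start of the first row in the run
      end=("end", "last"),           # end of the last row in the run
      Content=("Content", "".join),
      Speaker=("Speaker", "first"))
 .reset_index(drop=True)
)

res["Timestamp"] = res["start"] + "-" + res["end"]
res = res[["Timestamp", "Content", "Speaker"]]
print(res)
```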

Old answer:

res = df[df['Speaker'].ne(df['Speaker'].shift(-1))]

print(res)
                   Timestamp  Content    Speaker
2  00:00:09.340-00:00:13.690  Content  Speaker 1
4  00:00:14.550-00:00:14.880  Content  Speaker 2
5  00:00:17.000-00:00:30.510  Content  Speaker 1
6  00:00:30.740-00:00:41.930  Content  Speaker 2

Here is how that works:

mask = df['Speaker'].ne(df['Speaker'].shift(-1))
print(mask)

0    False
1    False
2     True
3    False
4     True
5     True
6     True
Name: Speaker, dtype: bool

The mask is True only where the next row has a different speaker (or where there is no next row), so you always get the last row of each consecutive block of a speaker.
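
The direction of the shift decides which row of each block survives. A small sketch on the Speaker column from the question (the names `is_last` and `is_first` are just for illustration):

```python
import pandas as pd

# Speaker column from the question's sample data
speaker = pd.Series(
    ["Speaker 1", "Speaker 1", "Speaker 1",
     "Speaker 2", "Speaker 2", "Speaker 1", "Speaker 2"],
    name="Speaker",
)

# Compare with the NEXT row: True on the last row of each block
is_last = speaker.ne(speaker.shift(-1))

# Compare with the PREVIOUS row: True on the first row of each block
is_first = speaker.ne(speaker.shift())

print(speaker[is_last].index.tolist())   # [2, 4, 5, 6]
print(speaker[is_first].index.tolist())  # [0, 3, 5, 6]
```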