I'm trying to do the following with an SRT (subtitles) file:
- while a row does not appear on the screen for at least 5s
- add text from the next row to current row with a space between AND replace current End_Time with next row End_Time
- delete next row
- go to next row
I have to do that on the dataframe dfClean, which has the edited timestamp fields, and then do the same to dfSRTForm, the dataframe with the original SRT time format, so I can export the latter as an SRT file later.
My code to do that is this:
    for i in dfClean.index:
        while dfClean.at[i, 'Difference'] < 5:
            dfClean.at[i, 'Text'] = dfClean.at[i, 'Text'] + ' ' + dfClean.at[i + 1, 'Text']
            dfSRTForm.at[i, 'Text'] = dfSRTForm.at[i, 'Text'] + ' ' + dfSRTForm.at[i + 1, 'Text']
            dfClean.at[i, 'End_Time'] = dfClean.at[i + 1, 'End_Time']
            dfSRTForm.at[i, 'End_Time'] = dfSRTForm.at[i + 1, 'End_Time']
            dfClean = dfClean.drop(i + 1)
            dfSRTForm = dfSRTForm.drop(i + 1)
But I get this error:
KeyError: 3
UPDATE (keeping previous if anyone else is having the same issue):
I found a way to reset the index to avoid KeyError: 3
My current code is:
    for i in dfClean.index:
        while dfClean.at[i, 'Difference'] < 5:
            dfClean.at[i, 'Text'] = dfClean.at[i, 'Text'] + ' ' + dfClean.at[i + 1, 'Text']
            dfSRTForm.at[i, 'Text'] = dfSRTForm.at[i, 'Text'] + ' ' + dfSRTForm.at[i + 1, 'Text']
            dfClean.at[i, 'End_Time'] = dfClean.at[i + 1, 'End_Time']
            dfSRTForm.at[i, 'End_Time'] = dfSRTForm.at[i + 1, 'End_Time']
            dfClean = dfClean.drop(i + 1)
            dfSRTForm = dfSRTForm.drop(i + 1)
            dfClean = dfClean.reset_index()
            dfClean = dfClean.drop(columns='index')
            dfSRTForm = dfSRTForm.reset_index()
            dfSRTForm = dfSRTForm.drop(columns='index')
            dfClean['Difference'] = (dfClean['End_Time'] - dfClean['Start_Time']).astype('timedelta64[s]')
But now I get KeyError: 267, and I'm pretty sure it's because the loop condenses the dataframe to 266 rows while still iterating over the original index.
Is there a way to put "or end of index" or "or last row" in the while loop without hard-coding the 266 lines? I want to use it for other SRT files with varying numbers of rows.
CodePudding user response:
You can define an empty list, then loop over your dataframe rows and, whenever a row doesn't fulfil your condition, save its index to that list.
After the loop, drop all of those rows in one call:
    df = df.drop(index=your_indices)
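Applied to the merging problem in the question, that idea could look roughly like the sketch below. It is only an illustration under assumptions: the function name merge_short_rows is made up, the times are plain floats in seconds rather than real timestamps, and the column names are taken from the question. The point is that nothing is dropped while the loop is running, so the index never changes underneath it.

```python
import pandas as pd

def merge_short_rows(df, min_seconds=5):
    """Merge following rows into any row shown for less than min_seconds."""
    df = df.copy()
    your_indices = []  # labels of rows that were absorbed into an earlier row
    keep = None        # label of the row currently being extended
    for idx in df.index:
        if keep is not None and df.at[keep, "Difference"] < min_seconds:
            # The kept row is still too short: absorb this row into it.
            df.at[keep, "Text"] = df.at[keep, "Text"] + " " + df.at[idx, "Text"]
            df.at[keep, "End_Time"] = df.at[idx, "End_Time"]
            df.at[keep, "Difference"] = df.at[keep, "End_Time"] - df.at[keep, "Start_Time"]
            your_indices.append(idx)
        else:
            # The kept row is long enough (or there is none yet): start a new one.
            keep = idx
    # Drop all absorbed rows in one go, after the loop has finished.
    return df.drop(index=your_indices).reset_index(drop=True)

# Hypothetical sample data: times in seconds as floats
df = pd.DataFrame({
    "Start_Time": [0.0, 2.0, 4.0, 10.0, 11.0],
    "End_Time": [2.0, 4.0, 10.0, 11.0, 20.0],
    "Text": ["a", "b", "c", "d", "e"],
})
df["Difference"] = df["End_Time"] - df["Start_Time"]
result = merge_short_rows(df)
```

Because the loop only reads labels that exist in the unchanged index, it stops naturally at the last row and no row count has to be hard-coded.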
CodePudding user response:
Without seeing your data I can't give a precise solution, but the code below should serve as an example of how to accomplish what you are doing:
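    import pandas as pd

    # Duration of each row in seconds, as a plain float
    dfClean['Difference'] = (dfClean['End_Time'] - dfClean['Start_Time']).dt.total_seconds()
    merged = None  # row currently being built up
    new_data = []
    for i, row in dfClean.iterrows():
        if merged is None:
            # start a new merged row from the current one
            merged = dict(row)
        else:
            # too short so far: append this row's text and extend the end time
            merged['Text'] = ' '.join([merged['Text'], row['Text']])
            merged['End_Time'] = row['End_Time']
            merged['Difference'] = (merged['End_Time'] - merged['Start_Time']).total_seconds()
        if merged['Difference'] >= 5:
            new_data.append(merged)
            merged = None
    if merged is not None:
        # the last rows never reached 5 s; keep them rather than lose the text
        new_data.append(merged)
    new_df = pd.DataFrame(new_data)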
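The question also needs the same merge applied to dfSRTForm. One possible way to keep the two dataframes in step, sketched here under assumptions, is to work out a group number per row from dfClean once and then collapse both dataframes with the same groups. The helper names assign_groups and collapse are made up for this example, the sample data is invented, and it assumes both dataframes have the same rows in the same order with Start_Time, End_Time and Text columns.

```python
import pandas as pd

def assign_groups(df, min_seconds=5):
    """Return one group number per row; rows sharing a number get merged."""
    groups = []
    group_id = 0
    group_start = None  # start time of the group currently being filled
    for start, end in zip(df["Start_Time"], df["End_Time"]):
        if group_start is None:
            group_start = start
        groups.append(group_id)
        # Close the group once it has been on screen long enough.
        if (end - group_start).total_seconds() >= min_seconds:
            group_id += 1
            group_start = None
    return groups

def collapse(df, groups):
    """Merge the rows of df that share a group number."""
    keys = pd.Series(groups, index=df.index)
    agg = {"Start_Time": "first", "End_Time": "last", "Text": " ".join}
    return df.groupby(keys).agg(agg).reset_index(drop=True)

# Hypothetical sample data standing in for the two dataframes
dfClean = pd.DataFrame({
    "Start_Time": pd.to_timedelta([0, 2, 4, 10, 11], unit="s"),
    "End_Time": pd.to_timedelta([2, 4, 10, 11, 20], unit="s"),
    "Text": ["a", "b", "c", "d", "e"],
})
dfSRTForm = pd.DataFrame({
    "Start_Time": ["00:00:00,000", "00:00:02,000", "00:00:04,000",
                   "00:00:10,000", "00:00:11,000"],
    "End_Time": ["00:00:02,000", "00:00:04,000", "00:00:10,000",
                 "00:00:11,000", "00:00:20,000"],
    "Text": ["a", "b", "c", "d", "e"],
})

groups = assign_groups(dfClean)
newClean = collapse(dfClean, groups)
newSRTForm = collapse(dfSRTForm, groups)
```

Since the grouping is computed once and reused, the SRT-formatted dataframe never needs its timestamps parsed, and a trailing group that never reaches 5 s is kept as its own row.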