I have a following dataframe:
Start | Finish |
---|---|
16.11.2021 21:36:31 | 16.11.2021 21:37:31 |
16.11.2021 21:39:31 | 16.11.2021 21:40:28 |
16.11.2021 21:41:39 | 16.11.2021 21:40:31 |
16.11.2021 21:56:31 | 16.11.2021 21:48:31 |
16.11.2021 21:50:35 | |
16.11.2021 21:58:56 |
I have errors in my dataset - finish time occurs before start time in some rows due to bad data. In rows where that occurs, I'd like to remove the cell in the 'Finish' column and shift other cells up until start time <= finish time. Therefore, the expected output would be:
Start | Finish |
---|---|
16.11.2021 21:36:31 | 16.11.2021 21:37:31 |
16.11.2021 21:39:31 | 16.11.2021 21:40:28 |
16.11.2021 21:41:39 | 16.11.2021 21:48:31 |
16.11.2021 21:56:31 | 16.11.2021 21:58:56 |
My attempt:
df['start_time'] = pd.to_datetime(df['start_time'])
df['finish_time'] = pd.to_datetime(df['finish_time'])
for i,row in df.iterrows():
if row["start_time"] > row["finish_time"]:
row["finish_time"] = ""
row.dropna()
df.at[i, 'finish_time'] = row["finish_time"]
I appreciate any help in advance.
CodePudding user response:
You can use pandas.merge_asof
.
This is assuming that the values are sorted in both columns (if not, add a sort_values()
step).
df['Start'] = pd.to_datetime(df['Start'])
df['Finish'] = pd.to_datetime(df['Finish'])
pd.merge_asof(df['Start'].dropna(), df['Finish'],
left_on='Start', right_on='Finish', direction='forward')
Output:
Start Finish
0 2021-11-16 21:36:31 2021-11-16 21:37:31
1 2021-11-16 21:39:31 2021-11-16 21:40:28
2 2021-11-16 21:41:39 2021-11-16 21:48:31
3 2021-11-16 21:56:31 2021-11-16 21:58:56