import pandas as pd
df = pd.DataFrame(
    [
        ['China', '08/06/2022 20:00', '08/10/2022 20:00'],
        ['China', '8/13/2022 00:54', '8/14/2022 00:54'],
        ['China', '8/14/2022 00:54', '8/14/2022 12:54'],
        ['United Kingdom', '8/27/2022 06:36', '8/31/2022 21:08'],
        ['United Kingdom', '9/01/2022 21:08', '09/02/2022 21:38'],
        ['China', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Arrival', 'Departure']
)
Suppose I have the DataFrame above. I want to collapse each run of consecutive rows with the same Country into a single row, keeping the first Arrival and replacing Departure with the value from the last row of that run. Duplicates that are not consecutive should not be merged.
After the cleanup it should look like this:
df = pd.DataFrame(
    [
        ['China', '08/06/2022 20:00', '8/14/2022 12:54'],
        ['United Kingdom', '8/27/2022 06:36', '09/02/2022 21:38'],
        ['China', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Arrival', 'Departure']
)
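To make the requirement concrete, here is a small sketch of why a plain drop_duplicates on Country is not enough: it treats every China row as a duplicate of the first one, so the later, separate China stay disappears as well. The variable name naive is only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['China', '08/06/2022 20:00', '08/10/2022 20:00'],
        ['China', '8/13/2022 00:54', '8/14/2022 00:54'],
        ['China', '8/14/2022 00:54', '8/14/2022 12:54'],
        ['United Kingdom', '8/27/2022 06:36', '8/31/2022 21:08'],
        ['United Kingdom', '9/01/2022 21:08', '09/02/2022 21:38'],
        ['China', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Arrival', 'Departure']
)

# Keeps only the first row per Country, regardless of position,
# so the final China row (a separate stay) is dropped too.
naive = df.drop_duplicates(subset='Country')
print(naive['Country'].tolist())  # ['China', 'United Kingdom']
```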
CodePudding user response:
You can try
out = (df.groupby(df['Country'].ne(df['Country'].shift()).cumsum(), group_keys=False)
         # build a one-row DataFrame per group: Arrival is the first value, Departure is the last
         .apply(lambda g: pd.DataFrame(
             [[g['Country'].iloc[0], g['Arrival'].iloc[0], g['Departure'].iloc[-1]]],
             columns=g.columns))
         .reset_index(drop=True))
print(out)
          Country           Arrival         Departure
0           China  08/06/2022 20:00   8/14/2022 12:54
1  United Kingdom   8/27/2022 06:36  09/02/2022 21:38
2           China  09/04/2022 21:38  09/06/2022 21:38
CodePudding user response:
IIUC, use:
group = df['Country'].ne(df['Country'].shift()).cumsum()
out = (df.groupby(group, as_index=False)
         .agg({'Country': 'first', 'Arrival': 'first', 'Departure': 'last'})
      )
output:
          Country           Arrival         Departure
0           China  08/06/2022 20:00   8/14/2022 12:54
1  United Kingdom   8/27/2022 06:36  09/02/2022 21:38
2           China  09/04/2022 21:38  09/06/2022 21:38
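If the grouping key looks opaque, the following sketch breaks it into steps and prints the intermediate values. It follows the same idea as the code above; the names changed and run_id are my own, and I use reset_index(drop=True) instead of as_index=False purely as a personal choice to keep the run id out of the result.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['China', '08/06/2022 20:00', '08/10/2022 20:00'],
        ['China', '8/13/2022 00:54', '8/14/2022 00:54'],
        ['China', '8/14/2022 00:54', '8/14/2022 12:54'],
        ['United Kingdom', '8/27/2022 06:36', '8/31/2022 21:08'],
        ['United Kingdom', '9/01/2022 21:08', '09/02/2022 21:38'],
        ['China', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Arrival', 'Departure']
)

# True wherever Country differs from the previous row
# (the first row compares against NaN, so it is always True).
changed = df['Country'].ne(df['Country'].shift())

# The running total of those changes gives each consecutive run its own id.
run_id = changed.cumsum().rename('run_id')
print(run_id.tolist())  # [1, 1, 1, 2, 2, 3]

# One row per run: first Country and Arrival, last Departure.
out = (df.groupby(run_id)
         .agg({'Country': 'first', 'Arrival': 'first', 'Departure': 'last'})
         .reset_index(drop=True))
print(out)
```

Because the second China stay gets its own run id (3), it is kept as a separate row rather than being merged with the first one.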