From this sample of my dataset, I need to drop duplicate rows that have the same values in all columns except for "parent_station". The duplicate to remove is the one with NaN in the "parent_station" column; the row with a non-NaN "parent_station" value should be kept. In this example, the row that needs to be removed is the 4th row, with index 7789. How can I do this? I have not been able to figure it out.
stop_id stop_name parent_station trip_id arrival_time departure_time stop_sequence route_id trip_headsign
7022 87413385 Gare de Yvetot StopArea:OCE87413385 OCESN003100F140147152 05:49:00 05:50:00 2.0 OCE1506035 3100.0
3518 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017 OCESN003100F140147152 06:12:00 06:15:00 3.0 OCE1506035 3100.0
8040 87413013 Gare de Le Havre StopArea:OCE87413013 OCESN003100F140147152 05:20:00 05:20:00 0.0 OCE1506035 3100.0
7789 87413013 Gare de Le Havre NaN OCESN003100F140147152 05:20:00 05:20:00 0.0 OCE1506035 3100.0
7197 87413344 Gare de Bréauté-Beuzeville NaN OCESN003100F140147152 05:35:00 05:36:00 1.0 OCE1506035 3100.0
CodePudding user response:
You can use a boolean mask:
out = df[~df.drop('parent_station', axis=1).duplicated(keep=False) | pd.notna(df['parent_station'])]
Output:
stop_id stop_name parent_station \
index
7022 87413385 Gare de Yvetot StopArea:OCE87413385
3518 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017
8040 87413013 Gare de Le Havre StopArea:OCE87413013
7197 87413344 Gare de Bréauté-Beuzeville NaN
trip_id arrival_time departure_time stop_sequence \
index
7022 OCESN003100F140147152 05:49:00 05:50:00 2.0
3518 OCESN003100F140147152 06:12:00 06:15:00 3.0
8040 OCESN003100F140147152 05:20:00 05:20:00 0.0
7197 OCESN003100F140147152 05:35:00 05:36:00 1.0
route_id trip_headsign
index
7022 OCE1506035 3100.0
3518 OCE1506035 3100.0
8040 OCE1506035 3100.0
7197 OCE1506035 3100.0
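To see why the mask works, here is a minimal sketch on a toy frame (the column names and values are illustrative, not the asker's data). `duplicated(keep=False)` marks every member of a duplicate group as True, so the negation keeps only rows with no duplicate at all, and the second condition rescues the duplicated rows whose parent_station is filled in:

```python
import pandas as pd
import numpy as np

# Toy frame: rows 0 and 1 are duplicates except for "parent_station";
# row 2 has no duplicate at all.
df = pd.DataFrame({
    'stop_id': ['A', 'A', 'B'],
    'parent_station': ['StopArea:A', np.nan, np.nan],
})

# ~duplicated(keep=False) -> rows that are not duplicated at all
# pd.notna(...)           -> duplicated rows with a real parent_station
mask = ~df.drop('parent_station', axis=1).duplicated(keep=False) \
       | pd.notna(df['parent_station'])
out = df[mask]
print(out)  # rows 0 and 2 survive; the NaN duplicate (row 1) is dropped
```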
CodePudding user response:
Use duplicated with a boolean mask:
cols = df.columns[df.columns != 'parent_station']
out = df[~(df.duplicated(cols, keep=False) & df['parent_station'].isna())]
print(out)
# Output
stop_id stop_name parent_station trip_id arrival_time departure_time stop_sequence route_id trip_headsign
7022 87413385 Gare de Yvetot StopArea:OCE87413385 OCESN003100F140147152 05:49:00 05:50:00 2.0 OCE1506035 3100.0
3518 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017 OCESN003100F140147152 06:12:00 06:15:00 3.0 OCE1506035 3100.0
8040 87413013 Gare de Le Havre StopArea:OCE87413013 OCESN003100F140147152 05:20:00 05:20:00 0.0 OCE1506035 3100.0
7197 87413344 Gare de Bréauté-Beuzeville NaN OCESN003100F140147152 05:35:00 05:36:00 1.0 OCE1506035 3100.0
CodePudding user response:
A very easy solution! I hope this is what you were looking for:
import pandas as pd
import numpy as np
df = pd.DataFrame({'stop_id': ['87413385', '87411017', '87413013', '87413013', '87413344'],
'stop_name': ['Gare de Yvetot', 'Gare de Rouen-Rive-Droite', 'Gare de Le Havre', 'Gare de Le Havre', 'Gare de Bréauté-Beuzeville'],
'parent_station': ['StopArea:OCE87413385', 'StopArea:OCE87411017', 'StopArea:OCE87413013', np.nan, np.nan]})
is_duplicated = df.duplicated(subset=['stop_id', 'stop_name'])  # keep='first' by default
is_nan = df['parent_station'].isna()
print(df[~(is_duplicated & is_nan)])
# stop_id stop_name parent_station
# 0 87413385 Gare de Yvetot StopArea:OCE87413385
# 1 87411017 Gare de Rouen-Rive-Droite StopArea:OCE87411017
# 2 87413013 Gare de Le Havre StopArea:OCE87413013
# 4 87413344 Gare de Bréauté-Beuzeville NaN
CodePudding user response:
Even though it's slightly longer than the solutions above, this version makes the intent explicit:
ss = df.columns.drop('parent_station')
keep_rows = df[(df.duplicated(subset=ss, keep=False)) & (~df.parent_station.isna())]
non_duplicates = df[~(df.duplicated(subset=ss, keep=False))]
df = pd.concat([non_duplicates, keep_rows])
Here I make a clear distinction between the rows which have no double at all (non_duplicates) and the duplicate rows that you want to keep (keep_rows).
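Run on a toy frame (illustrative data, not the asker's), the two pieces and their concatenation look like this:

```python
import pandas as pd
import numpy as np

# Toy frame: rows 0 and 1 duplicate each other on stop_id;
# row 2 is unique.
df = pd.DataFrame({
    'stop_id': ['A', 'A', 'B'],
    'parent_station': ['StopArea:A', np.nan, np.nan],
})

ss = df.columns.drop('parent_station')
# Duplicated rows whose parent_station is filled in: keep these.
keep_rows = df[df.duplicated(subset=ss, keep=False) & ~df['parent_station'].isna()]
# Rows that have no duplicate at all.
non_duplicates = df[~df.duplicated(subset=ss, keep=False)]
result = pd.concat([non_duplicates, keep_rows])
print(result)
```

Note that concat changes the row order (unique rows first); a final sort_index() restores the original order if it matters.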
CodePudding user response:
First sort_values, then drop_duplicates. While sorting, the na_position parameter lets you send NaN values to the first or last positions; the default is 'last'. drop_duplicates has a similar parameter named keep.
References: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
output_df = df.sort_values(['parent_station'], na_position='last').drop_duplicates(['stop_id', 'stop_name'], keep='first')
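A minimal sketch of this approach on toy data (illustrative values, not the asker's). The NaN row is deliberately listed first to show that the sort, not the input order, decides which duplicate survives:

```python
import pandas as pd
import numpy as np

# The NaN duplicate comes before its non-NaN twin on purpose.
df = pd.DataFrame({
    'stop_id': ['A', 'A', 'B'],
    'parent_station': [np.nan, 'StopArea:A', np.nan],
})

# na_position='last' pushes NaN rows to the bottom of each sort,
# so keep='first' retains the row with a real parent_station.
out = (df.sort_values(['parent_station'], na_position='last')
         .drop_duplicates(['stop_id'], keep='first'))
print(out)
```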