Dataframe conditional drop duplicate rows


From this sample of my dataset, I need to drop duplicate rows that have the same values in all columns except "parent_station". Of each duplicate pair, the row to remove is the one with NaN in the "parent_station" column, keeping the row whose "parent_station" value is not NaN. In this example, the row that needs to be removed is the 4th row, with index 7789. How can I do this? I have not yet been able to figure it out.


           stop_id         stop_name                    parent_station              trip_id                 arrival_time   departure_time     stop_sequence   route_id      trip_headsign
7022       87413385        Gare de Yvetot               StopArea:OCE87413385        OCESN003100F140147152   05:49:00       05:50:00           2.0             OCE1506035    3100.0
3518       87411017        Gare de Rouen-Rive-Droite    StopArea:OCE87411017        OCESN003100F140147152   06:12:00       06:15:00           3.0             OCE1506035    3100.0
8040       87413013        Gare de Le Havre             StopArea:OCE87413013        OCESN003100F140147152   05:20:00       05:20:00           0.0             OCE1506035    3100.0
7789       87413013        Gare de Le Havre             NaN                         OCESN003100F140147152   05:20:00       05:20:00           0.0             OCE1506035    3100.0
7197       87413344        Gare de Bréauté-Beuzeville   NaN                         OCESN003100F140147152   05:35:00       05:36:00           1.0             OCE1506035    3100.0

CodePudding user response:

You can use a boolean mask:

out = df[~df.drop('parent_station', axis=1).duplicated(keep=False) | pd.notna(df['parent_station'])]

Output:

        stop_id                   stop_name        parent_station  \
index                                                               
7022   87413385              Gare de Yvetot  StopArea:OCE87413385   
3518   87411017   Gare de Rouen-Rive-Droite  StopArea:OCE87411017   
8040   87413013            Gare de Le Havre  StopArea:OCE87413013   
7197   87413344  Gare de Bréauté-Beuzeville                   NaN   

                     trip_id arrival_time departure_time  stop_sequence  \
index                                                                     
7022   OCESN003100F140147152     05:49:00       05:50:00            2.0   
3518   OCESN003100F140147152     06:12:00       06:15:00            3.0   
8040   OCESN003100F140147152     05:20:00       05:20:00            0.0   
7197   OCESN003100F140147152     05:35:00       05:36:00            1.0   

         route_id  trip_headsign  
index                             
7022   OCE1506035         3100.0  
3518   OCE1506035         3100.0  
8040   OCE1506035         3100.0  
7197   OCE1506035         3100.0  
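To see why the mask works, here is a minimal runnable sketch using a hypothetical two-column subset of the question's data: a row is kept if it is not part of a duplicate group (ignoring `parent_station`), or if its `parent_station` is not NaN.

```python
import numpy as np
import pandas as pd

# Minimal stand-in for the question's data: rows 8040 and 7789 are
# identical except for parent_station, which is NaN on row 7789.
df = pd.DataFrame(
    {
        "stop_id": ["87413385", "87413013", "87413013", "87413344"],
        "stop_name": ["Gare de Yvetot", "Gare de Le Havre",
                      "Gare de Le Havre", "Gare de Bréauté-Beuzeville"],
        "parent_station": ["StopArea:OCE87413385", "StopArea:OCE87413013",
                           np.nan, np.nan],
    },
    index=[7022, 8040, 7789, 7197],
)

# ~duplicated(keep=False): True for rows with no duplicate at all.
# pd.notna(...): True for rows with a real parent_station.
# A row survives if either condition holds.
out = df[~df.drop("parent_station", axis=1).duplicated(keep=False)
         | pd.notna(df["parent_station"])]
print(out.index.tolist())  # [7022, 8040, 7197] -- row 7789 is dropped
```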

CodePudding user response:

Use duplicated on all columns except parent_station:

cols = df.columns[df.columns != 'parent_station']
out = df[~(df.duplicated(cols, keep=False) & df['parent_station'].isna())]
print(out)

# Output
       stop_id                   stop_name        parent_station                trip_id arrival_time departure_time  stop_sequence    route_id  trip_headsign
7022  87413385              Gare de Yvetot  StopArea:OCE87413385  OCESN003100F140147152     05:49:00       05:50:00            2.0  OCE1506035         3100.0
3518  87411017   Gare de Rouen-Rive-Droite  StopArea:OCE87411017  OCESN003100F140147152     06:12:00       06:15:00            3.0  OCE1506035         3100.0
8040  87413013            Gare de Le Havre  StopArea:OCE87413013  OCESN003100F140147152     05:20:00       05:20:00            0.0  OCE1506035         3100.0
7197  87413344  Gare de Bréauté-Beuzeville                   NaN  OCESN003100F140147152     05:35:00       05:36:00            1.0  OCE1506035         3100.0
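The same logic in a minimal runnable form (the row labels and column subset are illustrative, not the full dataset): a row is dropped only when it is a duplicate on every other column *and* its `parent_station` is NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "stop_id": ["87413013", "87413013", "87413344"],
        "parent_station": ["StopArea:OCE87413013", np.nan, np.nan],
    },
    index=[8040, 7789, 7197],
)

# Every column except parent_station defines the duplicate key.
cols = df.columns[df.columns != "parent_station"]

# Drop a row only when it is a duplicate AND parent_station is NaN.
out = df[~(df.duplicated(cols, keep=False) & df["parent_station"].isna())]
print(out.index.tolist())  # [8040, 7197]
```

Note that with `keep=False` every member of a duplicate group is flagged, so if *all* rows in a group have NaN in `parent_station`, all of them are dropped.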

CodePudding user response:

Very Easy solution! I hope this is what you were looking for:

import pandas as pd
import numpy as np

df = pd.DataFrame({'stop_id': ['87413385', '87411017', '87413013', '87413013', '87413344'],
                   'stop_name': ['Gare de Yvetot', 'Gare de Rouen-Rive-Droite', 'Gare de Le Havre', 'Gare de Le Havre', 'Gare de Bréauté-Beuzeville'],
                   'parent_station': ['StopArea:OCE87413385', 'StopArea:OCE87411017', 'StopArea:OCE87413013', np.nan, np.nan]})
is_duplicated = df.duplicated(subset=['stop_id', 'stop_name'])
is_nan = df['parent_station'].isna()
print(df[~(is_duplicated & is_nan)])


#     stop_id                   stop_name        parent_station
# 0  87413385              Gare de Yvetot  StopArea:OCE87413385
# 1  87411017   Gare de Rouen-Rive-Droite  StopArea:OCE87411017
# 2  87413013            Gare de Le Havre  StopArea:OCE87413013
# 4  87413344  Gare de Bréauté-Beuzeville                   NaN

CodePudding user response:

This is slightly longer than the answers above, but it makes each step explicit.

ss = df.columns.drop('parent_station')

keep_rows = df[(df.duplicated(subset=ss, keep=False)) & (~df.parent_station.isna())]
non_duplicates = df[~(df.duplicated(subset=ss, keep=False))]
df = pd.concat([non_duplicates, keep_rows])

Here I make a clear distinction between the rows which have no double at all (non_duplicates) and the duplicate rows that you want to keep (keep_rows).
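A minimal runnable sketch of this two-step approach, on a hypothetical subset of the data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "stop_id": ["87413385", "87413013", "87413013"],
        "parent_station": ["StopArea:OCE87413385", "StopArea:OCE87413013",
                           np.nan],
    },
    index=[7022, 8040, 7789],
)

ss = df.columns.drop("parent_station")

# Duplicate rows worth keeping: part of a duplicate group but with a
# real parent_station value.
keep_rows = df[df.duplicated(subset=ss, keep=False)
               & ~df.parent_station.isna()]

# Rows that have no duplicate at all.
non_duplicates = df[~df.duplicated(subset=ss, keep=False)]

df = pd.concat([non_duplicates, keep_rows])
print(df.index.tolist())  # [7022, 8040]
```

One thing to watch: `pd.concat` changes the row order (non-duplicates come first); chain `.sort_index()` on the result if the original order matters.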

CodePudding user response:

First sort_values, then drop_duplicates. While sorting, the na_position option lets you send NaN values to the first or last positions; by default it is 'last'. drop_duplicates has a similar parameter named keep.

References: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

output_df = df.sort_values(['parent_station'], na_position='last').drop_duplicates(['stop_id', 'stop_name'], keep='first')
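Demonstrated on a hypothetical subset: sorting puts the NaN `parent_station` rows last within each duplicate group, so `drop_duplicates(keep='first')` retains the row with a real value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "stop_id": ["87413013", "87413013", "87413344"],
        "stop_name": ["Gare de Le Havre", "Gare de Le Havre",
                      "Gare de Bréauté-Beuzeville"],
        "parent_station": ["StopArea:OCE87413013", np.nan, np.nan],
    },
    index=[8040, 7789, 7197],
)

# NaN parent_station rows sort last, so keep='first' retains the row
# with a real parent_station from each (stop_id, stop_name) group.
output_df = (
    df.sort_values(["parent_station"], na_position="last")
      .drop_duplicates(["stop_id", "stop_name"], keep="first")
)
print(output_df.index.tolist())  # [8040, 7197]
```

Unlike the mask-based answers, this also collapses duplicate groups where more than one row has a non-NaN `parent_station`, keeping only the first after sorting.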