Home > Blockchain >  How to replace incomplete rows from complete rows per group in pandas
How to replace incomplete rows from complete rows per group in pandas

Time:03-02

Trying to clean a dataset and I suspect I'm dealing with incomplete rows with the right information elsewhere in the data frame. For example, something like

ride_id start_station_name start_lat start_lng
12398213 Clark & Vermont 85.56 40.34
12398129 NaN 85.56 40.34

This would just be one of many such cases (for multiple stations). Curious how you guys might go about searching through the data frame replacing the "NaN" with "Clark and Vermont" using start_lat and start_lng.

CodePudding user response:

Group by the latitude and longitude, then forward-fill and backward-fill the station name.

Using a slightly bigger dataframe for demonstration:

df = pd.DataFrame({'ride_id': range(100, 105), 'station_name': ['Clark & Vermont', np.nan, np.nan, 'Foo & Bar', np.nan], 'start_lat': [85.56]*2   [15.59]*3, 'start_lng': [40.34]*2   [20.30]*3})

#    ride_id     station_name  start_lat  start_lng
# 0      100  Clark & Vermont      85.56      40.34
# 1      101              NaN      85.56      40.34
# 2      102              NaN      15.59      20.30
# 3      103        Foo & Bar      15.59      20.30
# 4      104              NaN      15.59      20.30

Output:

df['station_name'] = df.groupby(['start_lat', 'start_lng'])['station_name'].ffill().bfill()

#    ride_id     station_name  start_lat  start_lng
# 0      100  Clark & Vermont      85.56      40.34
# 1      101  Clark & Vermont      85.56      40.34
# 2      102        Foo & Bar      15.59      20.30
# 3      103        Foo & Bar      15.59      20.30
# 4      104        Foo & Bar      15.59      20.30
  • Related