How to replace incomplete rows from complete rows per group in pandas-CodePudding

Trying to clean a dataset and I suspect I'm dealing with incomplete rows with the right information elsewhere in the data frame. For example, something like

ride_id	start_station_name	start_lat	start_lng
12398213	Clark & Vermont	85.56	40.34
12398129	NaN	85.56	40.34

This would just be one of many such cases (for multiple stations). Curious how you guys might go about searching through the data frame replacing the "NaN" with "Clark and Vermont" using start_lat and start_lng.

CodePudding user response：

Group by the latitude and longitude, then forward-fill and backward-fill the station name.

Using a slightly bigger dataframe for demonstration:

df = pd.DataFrame({'ride_id': range(100, 105), 'station_name': ['Clark & Vermont', np.nan, np.nan, 'Foo & Bar', np.nan], 'start_lat': [85.56]*2   [15.59]*3, 'start_lng': [40.34]*2   [20.30]*3})

#    ride_id     station_name  start_lat  start_lng
# 0      100  Clark & Vermont      85.56      40.34
# 1      101              NaN      85.56      40.34
# 2      102              NaN      15.59      20.30
# 3      103        Foo & Bar      15.59      20.30
# 4      104              NaN      15.59      20.30

Output:

df['station_name'] = df.groupby(['start_lat', 'start_lng'])['station_name'].ffill().bfill()

#    ride_id     station_name  start_lat  start_lng
# 0      100  Clark & Vermont      85.56      40.34
# 1      101  Clark & Vermont      85.56      40.34
# 2      102        Foo & Bar      15.59      20.30
# 3      103        Foo & Bar      15.59      20.30
# 4      104        Foo & Bar      15.59      20.30