Trying to clean a dataset and I suspect I'm dealing with incomplete rows with the right information elsewhere in the data frame. For example, something like
ride_id | start_station_name | start_lat | start_lng |
---|---|---|---|
12398213 | Clark & Vermont | 85.56 | 40.34 |
12398129 | NaN | 85.56 | 40.34 |
This would just be one of many such cases (for multiple stations). Curious how you guys might go about searching through the data frame replacing the "NaN" with "Clark and Vermont" using start_lat
and start_lng
.
CodePudding user response:
Group by the latitude and longitude, then forward-fill and backward-fill the station name.
Using a slightly bigger dataframe for demonstration:
df = pd.DataFrame({'ride_id': range(100, 105), 'station_name': ['Clark & Vermont', np.nan, np.nan, 'Foo & Bar', np.nan], 'start_lat': [85.56]*2 [15.59]*3, 'start_lng': [40.34]*2 [20.30]*3})
# ride_id station_name start_lat start_lng
# 0 100 Clark & Vermont 85.56 40.34
# 1 101 NaN 85.56 40.34
# 2 102 NaN 15.59 20.30
# 3 103 Foo & Bar 15.59 20.30
# 4 104 NaN 15.59 20.30
Output:
df['station_name'] = df.groupby(['start_lat', 'start_lng'])['station_name'].ffill().bfill()
# ride_id station_name start_lat start_lng
# 0 100 Clark & Vermont 85.56 40.34
# 1 101 Clark & Vermont 85.56 40.34
# 2 102 Foo & Bar 15.59 20.30
# 3 103 Foo & Bar 15.59 20.30
# 4 104 Foo & Bar 15.59 20.30