I have two datasets with different lengths (see below) with lat/lon for different dates. I want to use each location of flooding and calculate distance from that to all storm surge locations for each date and select the location of smallest distance. For example: in first dataset I have one entry for 1979-01-24
and in second dataset I have five entries for same date. I want to use the location coor (flood)
from the first dataset and calculate distances to each entries in second dataset and then select the location (with all data) which is closest to the coor (flood)
.
flood_df.head()
lon_flood lat_flood Percentiles_moisture_flood Percentiles_precip_flood Percentiles_discharge_flood Sum_perc_flood coor (flood) Hazard_start
0 -80.78125 25.15625 0.777194 0.849623 0.960600 2.587417 (-80.78125, 25.15625) 1979-01-24
1 -80.65625 25.28125 0.694346 0.960126 0.980657 2.635129 (-80.65625, 25.28125) 1979-04-29
2 -80.40625 25.28125 0.974628 0.974852 0.933265 2.882745 (-80.40625, 25.28125) 1979-09-03
3 -80.40625 25.28125 0.994069 0.995098 0.995548 2.984716 (-80.40625, 25.28125) 1979-09-12
4 -80.40625 25.28125 0.895422 0.998689 0.997343 2.891454 (-80.40625, 25.28125) 1981-08-16
surge_df.head()
latitude longitude surge Date Percentiles coor (surge)
0 25.737 -80.142 96.0 1979-01-24 0.981429 (-80.142, 25.737)
1 25.591 -80.142 91.0 1979-01-24 0.978333 (-80.142, 25.591)
2 25.942 -80.112 103.0 1979-01-24 0.985455 (-80.112, 25.942)
3 25.415 -80.171 88.0 1979-01-24 0.977000 (-80.171, 25.415)
4 25.327 -80.229 95.0 1979-01-24 0.982143 (-80.229, 25.327)
I am using following code to calculate the distances and it's giving me the distances but I also want to have their associated percentiles of surge and other rows.
# libraries
import numpy as np
import pandas as pd
# progress bar
from tqdm.notebook import tqdm, trange
# dealing with date and time
from datetime import datetime as dt
from datetime import datetime, timedelta
# for calculating distance
import haversine as hs
# calculate distance
distance_coord_flood_surge = []
for i, r in zip(locations_flood_surge['coor (flood)'], locations_flood_surge['coor (surge)'] ):
# print(i, r)
distance = hs.haversine(i, r)
distance_coord_flood_surge.append(distance)
I also tried as below but still I am not getting all the information.
# find nearest storm surge location to flood location
# dataframe
surge_flood_dist_df = pd.DataFrame()
# get the beginning and ending date of an event
for i in tqdm(zip(range(len(events_dates_2018)))):
start_date = events_dates_2018['Hazard_start'].iloc[i]
end_date = events_dates_2018['Hazard_end'].iloc[i]
#lat and lon
# coord_flood = flood_vars_output_df['coor (flood)']
# get the first and last date of the window
first_window_date = start_date - timedelta(days=1)
last_window_date = end_date timedelta(days=1)
# get the date range of the window
window_dates = pd.date_range(first_window_date, last_window_date)
# get the values of the window
window_values_surge = surge_df.loc[surge_df.index.intersection(window_dates)]
window_values_flood = flood_df.loc[flood_vars_output_df.index.intersection(window_dates)]
# get the closest storm surge location to the flood location
for j, k in zip(window_values_flood['coor (flood)'], window_values_surge['coor (surge)']):
distance = hs.haversine(j, k)
# select the closest storm surge location to the flood location and append to the dataframe
surge_flood_dist_df.append(distance)
Could you please help me on how to do it? I will appreciate any help. Thank you.
CodePudding user response:
IIUC, merge your two dataframes on date columns then compute the haversine distance and keep only the closest distance for a same date:
from haversine import haversine_vector
# Merge
out = flood_df.merge(surge_df, left_on='Hazard_start', right_on='Date', how='left')
# Haversine
out['haversine'] = haversine_vector(out[['lon_flood', 'lat_flood']],
out[['longitude', 'latitude']])
# Select the closest
out = out.loc[out.groupby('Hazard_start', dropna=False)['haversine']
.rank(method='dense').fillna(1)
.loc[lambda x: x==1].index]
Output:
>>> out
lon_flood lat_flood Percentiles_moisture_flood Percentiles_precip_flood Percentiles_discharge_flood Sum_perc_flood coor (flood) Hazard_start latitude longitude surge Date Percentiles coor (surge) haversine
4 -80.78125 25.15625 0.777194 0.849623 0.960600 2.587417 (-80.78125, 25.15625) 1979-01-24 25.327 -80.229 95.0 1979-01-24 0.982143 (-80.229, 25.327) 61.487236
5 -80.65625 25.28125 0.694346 0.960126 0.980657 2.635129 (-80.65625, 25.28125) 1979-04-29 NaN NaN NaN NaN NaN NaN NaN
6 -80.40625 25.28125 0.974628 0.974852 0.933265 2.882745 (-80.40625, 25.28125) 1979-09-03 NaN NaN NaN NaN NaN NaN NaN
7 -80.40625 25.28125 0.994069 0.995098 0.995548 2.984716 (-80.40625, 25.28125) 1979-09-12 NaN NaN NaN NaN NaN NaN NaN
8 -80.40625 25.28125 0.895422 0.998689 0.997343 2.891454 (-80.40625, 25.28125) 1981-08-16 NaN NaN NaN NaN NaN NaN NaN
Update
Is there a way to use a time window on surge_df based on flood_df (such as allowing few days before and after the Hazard_start) to calculate the distances within the window and select values with minimum distance.
Use how='cross'
as parameter of merge
to create the cartesian product of dataframes then filter rows according a time window:
# You can remove pd.to_datetime if 'Date' and 'Hazard_start' are already datetime64
# Adjust the windows, here 3 days
between_dates = lambda x: pd.to_datetime(x['Date']).sub(pd.to_datetime(x['Hazard_start'])).abs() < pd.Timedelta(days=3)
out = flood_df.merge(surge_df, how='cross').loc[between_dates]