Calculate distances and find closest location to the location of flooding-CodePudding

I have two datasets with different lengths (see below) with lat/lon for different dates. I want to use each location of flooding and calculate distance from that to all storm surge locations for each date and select the location of smallest distance. For example: in first dataset I have one entry for 1979-01-24 and in second dataset I have five entries for same date. I want to use the location coor (flood) from the first dataset and calculate distances to each entries in second dataset and then select the location (with all data) which is closest to the coor (flood).

flood_df.head()

   lon_flood  lat_flood  Percentiles_moisture_flood  Percentiles_precip_flood  Percentiles_discharge_flood  Sum_perc_flood           coor (flood) Hazard_start
0  -80.78125   25.15625                    0.777194                  0.849623                     0.960600        2.587417  (-80.78125, 25.15625)   1979-01-24
1  -80.65625   25.28125                    0.694346                  0.960126                     0.980657        2.635129  (-80.65625, 25.28125)   1979-04-29
2  -80.40625   25.28125                    0.974628                  0.974852                     0.933265        2.882745  (-80.40625, 25.28125)   1979-09-03
3  -80.40625   25.28125                    0.994069                  0.995098                     0.995548        2.984716  (-80.40625, 25.28125)   1979-09-12
4  -80.40625   25.28125                    0.895422                  0.998689                     0.997343        2.891454  (-80.40625, 25.28125)   1981-08-16

surge_df.head()

   latitude  longitude  surge        Date  Percentiles       coor (surge)
0    25.737    -80.142   96.0  1979-01-24     0.981429  (-80.142, 25.737)
1    25.591    -80.142   91.0  1979-01-24     0.978333  (-80.142, 25.591)
2    25.942    -80.112  103.0  1979-01-24     0.985455  (-80.112, 25.942)
3    25.415    -80.171   88.0  1979-01-24     0.977000  (-80.171, 25.415)
4    25.327    -80.229   95.0  1979-01-24     0.982143  (-80.229, 25.327)

I am using following code to calculate the distances and it's giving me the distances but I also want to have their associated percentiles of surge and other rows.

# libraries
import numpy as np
import pandas as pd

# progress bar
from tqdm.notebook import tqdm, trange

# dealing with date and time
from datetime import datetime as dt
from datetime import datetime, timedelta

# for calculating distance
import haversine as hs

# calculate distance
distance_coord_flood_surge = []

for i, r in zip(locations_flood_surge['coor (flood)'], locations_flood_surge['coor (surge)'] ):
    # print(i, r)
    distance = hs.haversine(i, r)
    distance_coord_flood_surge.append(distance)

I also tried as below but still I am not getting all the information.

# find nearest storm surge location to flood location
# dataframe 
surge_flood_dist_df = pd.DataFrame()
        
# get the beginning and ending date of an event
for i in tqdm(zip(range(len(events_dates_2018)))):
    start_date = events_dates_2018['Hazard_start'].iloc[i]
    end_date = events_dates_2018['Hazard_end'].iloc[i]
    
    #lat and lon
    # coord_flood = flood_vars_output_df['coor (flood)']
 
    # get the first and last date of the window
    first_window_date = start_date - timedelta(days=1)
    last_window_date = end_date   timedelta(days=1)
    # get the date range of the window
    window_dates = pd.date_range(first_window_date, last_window_date)
    
    # get the values of the window
    window_values_surge = surge_df.loc[surge_df.index.intersection(window_dates)]

    window_values_flood = flood_df.loc[flood_vars_output_df.index.intersection(window_dates)]

    # get the closest storm surge location to the flood location
    for j, k in zip(window_values_flood['coor (flood)'], window_values_surge['coor (surge)']):
        distance = hs.haversine(j, k)

        # select the closest storm surge location to the flood location and append to the dataframe 
        surge_flood_dist_df.append(distance)

Could you please help me on how to do it? I will appreciate any help. Thank you.

CodePudding user response：

IIUC, merge your two dataframes on date columns then compute the haversine distance and keep only the closest distance for a same date:

from haversine import haversine_vector

# Merge
out = flood_df.merge(surge_df, left_on='Hazard_start', right_on='Date', how='left')

# Haversine
out['haversine'] = haversine_vector(out[['lon_flood', 'lat_flood']], 
                                    out[['longitude', 'latitude']])

# Select the closest
out = out.loc[out.groupby('Hazard_start', dropna=False)['haversine']
                 .rank(method='dense').fillna(1)
                 .loc[lambda x: x==1].index]

Output:

>>> out
   lon_flood  lat_flood  Percentiles_moisture_flood  Percentiles_precip_flood  Percentiles_discharge_flood  Sum_perc_flood           coor (flood) Hazard_start  latitude  longitude  surge        Date  Percentiles       coor (surge)  haversine
4  -80.78125   25.15625                    0.777194                  0.849623                     0.960600        2.587417  (-80.78125, 25.15625)   1979-01-24    25.327    -80.229   95.0  1979-01-24     0.982143  (-80.229, 25.327)  61.487236
5  -80.65625   25.28125                    0.694346                  0.960126                     0.980657        2.635129  (-80.65625, 25.28125)   1979-04-29       NaN        NaN    NaN         NaN          NaN                NaN        NaN
6  -80.40625   25.28125                    0.974628                  0.974852                     0.933265        2.882745  (-80.40625, 25.28125)   1979-09-03       NaN        NaN    NaN         NaN          NaN                NaN        NaN
7  -80.40625   25.28125                    0.994069                  0.995098                     0.995548        2.984716  (-80.40625, 25.28125)   1979-09-12       NaN        NaN    NaN         NaN          NaN                NaN        NaN
8  -80.40625   25.28125                    0.895422                  0.998689                     0.997343        2.891454  (-80.40625, 25.28125)   1981-08-16       NaN        NaN    NaN         NaN          NaN                NaN        NaN

Update

Is there a way to use a time window on surge_df based on flood_df (such as allowing few days before and after the Hazard_start) to calculate the distances within the window and select values with minimum distance.

Use how='cross' as parameter of merge to create the cartesian product of dataframes then filter rows according a time window:

# You can remove pd.to_datetime if 'Date' and 'Hazard_start' are already datetime64
# Adjust the windows, here 3 days
between_dates = lambda x: pd.to_datetime(x['Date']).sub(pd.to_datetime(x['Hazard_start'])).abs() < pd.Timedelta(days=3)

out = flood_df.merge(surge_df, how='cross').loc[between_dates]