Is it possible to drop near duplicates, rather than only exact ones?
E.g. in this case I'd like to drop duplicates whose latitude and longitude are each within 0.1 of another site's.
import numpy as np
import pandas as pd

site_df_1 = pd.DataFrame(
    np.array([["12345", "Wrexham Cwtch", "52.10", "-2.06"],
              ["12354", "Horse & Hound", "52.21", "-1.95"],
              ["12435", "Round Of Gras Badsey", "52.33", "-1.99"]]),
    columns=['Site Number', 'Site Name', 'Longitude', 'Latitude'])

site_df_2 = pd.DataFrame(
    np.array([["52938", "Valkyrie Café Bar", "53.22", "-3.00"],
              ["12435", "Round Of Badsey", "52.33", "-1.99"],
              ["12345", "Cwtch", "52.11", "-2.00"]]),
    columns=['Site Number', 'Site Name', 'Longitude', 'Latitude'])

# desired result:
matched_sites = pd.DataFrame(
    np.array([["12345", "Wrexham Cwtch", "52.10", "-2.06"],
              ["12435", "Round Of Gras Badsey", "52.33", "-1.99"],
              ["52938", "Valkyrie Café Bar", "53.22", "-3.00"]]),
    columns=['Site Number', 'Site Name', 'Longitude', 'Latitude'])
Thanks
CodePudding user response:
IIUC, you can compute the pairwise coordinate distances with scipy.spatial.distance.cdist and use them to drop the near-duplicate rows of one of the DataFrames before the concat:
from scipy.spatial.distance import cdist
threshold = 0.1
# longitude within threshold
d1 = cdist(site_df_1[['Longitude']].astype(float), site_df_2[['Longitude']].astype(float)) < threshold
# latitude within threshold
d2 = cdist(site_df_1[['Latitude']].astype(float), site_df_2[['Latitude']].astype(float)) < threshold
# both within threshold
mask = d1 & d2
# array([[False, False, True],
# [False, False, False],
# [False, True, False]])
# drop "duplicated" coordinates and concat
out = pd.concat([site_df_1, site_df_2.loc[~mask.any(axis=0)]])
print(out)
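For what it's worth, the two cdist calls can be folded into one. With the Chebyshev metric the distance between two points is max(|ΔLongitude|, |ΔLatitude|), which is below the threshold exactly when both coordinate differences are, so this sketch (using the same frames and threshold as above) builds the same mask:

# Chebyshev distance = max of the per-coordinate differences,
# so "< threshold" means both coordinates are within the threshold
mask = cdist(site_df_1[['Longitude', 'Latitude']].astype(float),
             site_df_2[['Longitude', 'Latitude']].astype(float),
             metric='chebyshev') < threshold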
NB. be aware that geographic coordinates are not linear: a longitude difference of 0.1° spans much less ground near the poles than near the Equator, so a fixed degree threshold is only a rough box (see the haversine sketch after the output).
Output:
Site Number Site Name Longitude Latitude
0 12345 Wrexham Cwtch 52.10 -2.06
1 12354 Horse & Hound 52.21 -1.95
2 12435 Round Of Gras Badsey 52.33 -1.99
0 52938 Valkyrie Café Bar 53.22 -3.00
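If you need a true distance cutoff rather than a degree box, one option is to build the mask from great-circle distances instead. Below is a minimal sketch, assuming the columns hold decimal degrees and using a hypothetical 10 km cutoff (incidentally, the sample data seems to label the UK latitudes as 'Longitude' and vice versa; the sketch simply uses the columns as named):

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance in km between points given in decimal degrees
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))  # mean Earth radius ~6371 km

# broadcast site_df_1 rows against site_df_2 rows to get a distance matrix
lat1 = site_df_1['Latitude'].astype(float).to_numpy()[:, None]
lon1 = site_df_1['Longitude'].astype(float).to_numpy()[:, None]
lat2 = site_df_2['Latitude'].astype(float).to_numpy()[None, :]
lon2 = site_df_2['Longitude'].astype(float).to_numpy()[None, :]

mask = haversine_km(lat1, lon1, lat2, lon2) < 10  # hypothetical 10 km cutoff
out = pd.concat([site_df_1, site_df_2.loc[~mask.any(axis=0)]])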
CodePudding user response:
Yes, it is possible to drop near duplicates rather than only exact ones, but note that pandas' DataFrame.duplicated() and drop_duplicates() only detect exact matches and do not accept a custom comparison function. Instead, define the comparison yourself and keep a row only when it is not within the threshold of any row already kept. For example:
import pandas as pd

def compare_rows(row1, row2, threshold=0.1):
    # rows count as duplicates when both coordinates differ by less than threshold
    diff_lat = abs(float(row1['Latitude']) - float(row2['Latitude']))
    diff_lon = abs(float(row1['Longitude']) - float(row2['Longitude']))
    return diff_lat < threshold and diff_lon < threshold

# site_df_1 and site_df_2 as defined in the question
sites = pd.concat([site_df_1, site_df_2], ignore_index=True)

# keep the first occurrence; drop any later row close to one already kept
kept = []
for _, row in sites.iterrows():
    if not any(compare_rows(row, other) for other in kept):
        kept.append(row)
matched_sites = pd.DataFrame(kept).reset_index(drop=True)
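Run against the sample frames, this keeps Wrexham Cwtch, Horse & Hound, Round Of Gras Badsey and Valkyrie Café Bar, the same rows as the cdist answer above. The pairwise Python loop is O(n²), though, so for large frames the vectorized cdist approach will scale much better.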