Is it possible to drop near duplicates, rather than only exact ones?
E.g. in this case I'd like to drop duplicates whose latitude and longitude are each within 0.1 of another site's.
import numpy as np
import pandas as pd

site_df_1 = pd.DataFrame(
    np.array([["12345", "Wrexham Cwtch", "52.10", "-2.06"],
              ["12354", "Horse & Hound", "52.21", "-1.95"],
              ["12435", "Round Of Gras Badsey", "52.33", "-1.99"]]),
    columns=['Site Number', 'Site Name', 'Longitude', 'Latitude'])

site_df_2 = pd.DataFrame(
    np.array([["52938", "Valkyrie Café Bar", "53.22", "-3.00"],
              ["12435", "Round Of Badsey", "52.33", "-1.99"],
              ["12345", "Cwtch", "52.11", "-2.00"]]),
    columns=['Site Number', 'Site Name', 'Longitude', 'Latitude'])

# desired result:
matched_sites = pd.DataFrame(
    np.array([["12345", "Wrexham Cwtch", "52.10", "-2.06"],
              ["12435", "Round Of Gras Badsey", "52.33", "-1.99"],
              ["52938", "Valkyrie Café Bar", "53.22", "-3.00"]]),
    columns=['Site Number', 'Site Name', 'Longitude', 'Latitude'])
Thanks
CodePudding user response:
IIUC, you can compute the pairwise coordinate distances with scipy.spatial.distance.cdist and use them to drop the near-duplicate rows of one of the DataFrames before the concat:
from scipy.spatial.distance import cdist
threshold = 0.1
# longitude within threshold
d1 = cdist(site_df_1[['Longitude']].astype(float), site_df_2[['Longitude']].astype(float)) < threshold
# latitude within threshold
d2 = cdist(site_df_1[['Latitude']].astype(float), site_df_2[['Latitude']].astype(float)) < threshold
# both within threshold
mask = d1 & d2
# array([[False, False, True],
# [False, False, False],
# [False, True, False]])
# drop "duplicated" coordinates and concat
out = pd.concat([site_df_1, site_df_2.loc[~mask.any(axis=0)]])
print(out)
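For what it's worth, the two cdist calls can be folded into one. With the Chebyshev metric the distance between two points is max(|ΔLongitude|, |ΔLatitude|), which is below the threshold exactly when both coordinate differences are, so this sketch (using the same frames and threshold as above) builds the same mask:

# Chebyshev distance = max of the per-coordinate differences,
# so "< threshold" means both coordinates are within the threshold
mask = cdist(site_df_1[['Longitude', 'Latitude']].astype(float),
             site_df_2[['Longitude', 'Latitude']].astype(float),
             metric='chebyshev') < threshold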
NB. be aware that geographic coordinates are not linear: a longitude difference of 0.1° spans much less ground near the poles than near the Equator, so a fixed degree threshold is only a rough box (see the haversine sketch after the output).
Output:
Site Number Site Name Longitude Latitude
0 12345 Wrexham Cwtch 52.10 -2.06
1 12354 Horse & Hound 52.21 -1.95
2 12435 Round Of Gras Badsey 52.33 -1.99
0 52938 Valkyrie Café Bar 53.22 -3.00
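If you need a true distance cutoff rather than a degree box, one option is to build the mask from great-circle distances instead. Below is a minimal sketch, assuming the columns hold decimal degrees and using a hypothetical 10 km cutoff (incidentally, the sample data seems to label the UK latitudes as 'Longitude' and vice versa; the sketch simply uses the columns as named):

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance in km between points given in decimal degrees
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))  # mean Earth radius ~6371 km

# broadcast site_df_1 rows against site_df_2 rows to get a distance matrix
lat1 = site_df_1['Latitude'].astype(float).to_numpy()[:, None]
lon1 = site_df_1['Longitude'].astype(float).to_numpy()[:, None]
lat2 = site_df_2['Latitude'].astype(float).to_numpy()[None, :]
lon2 = site_df_2['Longitude'].astype(float).to_numpy()[None, :]

mask = haversine_km(lat1, lon1, lat2, lon2) < 10  # hypothetical 10 km cutoff
out = pd.concat([site_df_1, site_df_2.loc[~mask.any(axis=0)]])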
CodePudding user response:
Yes, it is possible to drop near duplicates rather than only exact ones, but note that pandas' DataFrame.duplicated() and drop_duplicates() only detect exact matches and do not accept a custom comparison function. Instead, define the comparison yourself and keep a row only when it is not within the threshold of any row already kept. For example:
import pandas as pd

def compare_rows(row1, row2, threshold=0.1):
    # rows count as duplicates when both coordinates differ by less than threshold
    diff_lat = abs(float(row1['Latitude']) - float(row2['Latitude']))
    diff_lon = abs(float(row1['Longitude']) - float(row2['Longitude']))
    return diff_lat < threshold and diff_lon < threshold

# site_df_1 and site_df_2 as defined in the question
sites = pd.concat([site_df_1, site_df_2], ignore_index=True)

# keep the first occurrence; drop any later row close to one already kept
kept = []
for _, row in sites.iterrows():
    if not any(compare_rows(row, other) for other in kept):
        kept.append(row)
matched_sites = pd.DataFrame(kept).reset_index(drop=True)
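Run against the sample frames, this keeps Wrexham Cwtch, Horse & Hound, Round Of Gras Badsey and Valkyrie Café Bar, the same rows as the cdist answer above. The pairwise Python loop is O(n²), though, so for large frames the vectorized cdist approach will scale much better.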