I have a dataframe, df, of xyz points with 4 columns, x, y, z, and Classification, all containing floats. The dataframe was created by concatenating two other dataframes, df1 and df2.
All of the points in df2 are a subset of those in df1, but the two dataframes underwent different processing in different software. All of the points in df2 have a classification of 14; none of the points in df1 are of class 14. Thus df contains pairs of rows that are essentially xyz duplicates (there are len(df2) such pairs), and half of the rows involved are class 14. I want to find these duplicates and discard those that are not class 14.
I say the rows are essentially xyz duplicates because floating point error has been introduced to many of the rows during the previous processing.
>>> import numpy as np
>>> import pandas as pd
>>> # make example data,
>>> # small differences introduced on 2nd data
>>> x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
>>> x2 = x1[:4] + np.array([0, 0, 3e-13, 0])
>>> y1 = np.array([10.0, 11.01, 12.0, 13.0, 14.0, 15.0, 16.0])
>>> y2 = y1[:4] + np.array([0, 0, 3e-13, 4e-13])
>>> z1 = np.array([10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6])
>>> z2 = z1[:4] + np.array([0.0, 0.0, 3e-13, 4e-13])
>>> df1 = pd.DataFrame(columns=['x', 'y', 'z', 'Classification'])
>>> df1['x'] = x1
>>> df1['y'] = y1
>>> df1['z'] = z1
>>> df1['Classification'] = 0
>>> df2 = pd.DataFrame(columns=['x', 'y', 'z', 'Classification'])
>>> df2['x'] = x2
>>> df2['y'] = y2
>>> df2['z'] = z2
>>> df2['Classification'] = 14
>>> df1
x y z Classification
0 0.0 10.00 10.0 0
1 1.0 11.01 10.1 0
2 2.0 12.00 10.2 0
3 3.0 13.00 10.3 0
4 4.0 14.00 10.4 0
5 5.0 15.00 10.5 0
6 6.0 16.00 10.6 0
>>> df2
x y z Classification
0 0.0 10.00 10.0 14
1 1.0 11.01 10.1 14
2 2.0 12.00 10.2 14
3 3.0 13.00 10.3 14
>>> df = pd.concat([df1, df2], axis=0)
>>> df
x y z Classification
0 0.0 10.00 10.0 0
1 1.0 11.01 10.1 0
2 2.0 12.00 10.2 0
3 3.0 13.00 10.3 0
4 4.0 14.00 10.4 0
5 5.0 15.00 10.5 0
6 6.0 16.00 10.6 0
0 0.0 10.00 10.0 14
1 1.0 11.01 10.1 14
2 2.0 12.00 10.2 14
3 3.0 13.00 10.3 14
Originally I tried
>>> df0 = df.loc[~((df.duplicated(subset=['x', 'y', 'z'], keep=False))
...               & (df.Classification != 14))]
>>> df0
x y z Classification
2 2.0 12.00 10.2 0
3 3.0 13.00 10.3 0
4 4.0 14.00 10.4 0
5 5.0 15.00 10.5 0
6 6.0 16.00 10.6 0
0 0.0 10.00 10.0 14
1 1.0 11.01 10.1 14
2 2.0 12.00 10.2 14
3 3.0 13.00 10.3 14
>>> len(df), len(df0)
(11, 9)
This discarded all of the non-14-classified exact duplicates, but missed the close (in x, y, and z) duplicates created by the floating point error.
I need to do something like df.duplicated but with the behavior of numpy.isclose, so that values within a certain tolerance are considered duplicates.
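For illustration, something with roughly the effect I'm after, rounding to an assumed tolerance before checking for duplicates (a sketch only; a fixed number of decimal places may not suit the real data):
>>> # round the coordinates to an assumed tolerance before calling duplicated,
>>> # so rows that differ only by ~1e-13 compare as equal
>>> rounded = df[['x', 'y', 'z']].round(9)
>>> dupes = rounded.duplicated(keep=False)
>>> df0 = df.loc[~(dupes & (df.Classification != 14))]
>>> len(df), len(df0)
(11, 7)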
Thanks
CodePudding user response:
One way to do this is to remove the duplicates before the concat, since, as I understand it, you are trying to remove rows in df1 that are already in df2. The trick here is to use isin with a key column that makes the two dataframes easy to compare; you can build the key with string formatting at a fixed number of decimal places (the example below uses 5).
# build a rounded string key from the coordinate columns of each dataframe
df1['key'] = df1.apply(lambda row: f'{row["x"]:.5f}-{row["y"]:.5f}-{row["z"]:.5f}', axis=1)
df2['key'] = df2.apply(lambda row: f'{row["x"]:.5f}-{row["y"]:.5f}-{row["z"]:.5f}', axis=1)
# drop the df1 rows whose key already appears in df2, then concatenate
df1 = df1[~df1['key'].isin(df2['key'])]
combined = pd.concat([df1, df2])
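For what it's worth, a quick check of this approach against the example data in the question (assuming the code above has been run with the question's df1 and df2); the helper key column can be dropped afterwards:
>>> # the four near-duplicate rows of df1 are gone; only the class-14 copies remain
>>> len(combined), len(combined[combined['Classification'] == 14])
(7, 4)
>>> combined = combined.drop(columns='key')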