Home > Back-end >  Pandas drop_duplicates not finding all duplicates
Pandas drop_duplicates not finding all duplicates

Time:03-18

I have a problem with drop_duplicates in a pandas dataframe. I'm importing lots of mixed data from an excel file into a dataframe and then doing various things to clean up the data. One of the stages is to remove any duplicates based on their coordinates.

In general this is working pretty well and importantly it's very fast, but I've had some problems and after an extensive search of the dataset I've found out that pandas is always a few duplicates.

Here's my test dataset:

             x          y      z  radius  scale    type
0   128.798699  76.038331  0.000   1.172  1.000  Node_B
1   136.373699  78.068331  0.000   1.172  1.000  Node_B
2   133.171699  74.866331  0.000   1.172  1.000  Node_B
3   135.201699  76.038331  0.000   1.172  1.000  Node_B
4   135.201699  82.442331  0.000   1.172  1.000  Node_B
5   136.373699  80.412331  0.000   1.172  1.000  Node_B
6   133.171699  83.614331  0.000   1.172  1.000  Node_B
7   127.626699  78.068331  0.000   1.172  1.000  Node_B
8   131.999699  79.240331  0.000   2.750  1.000  Node_A
9    90.199699  94.795331  0.626   0.325  0.650   Rib_B
10   85.799699  95.445331  0.626   0.325  0.650   Rib_B
11   90.199699  95.445331  0.626   0.325  0.650   Rib_B
12   91.865699  95.557331  0.537   0.438  0.876   Rib_B
13  128.798699  82.442331  0.000   1.172  1.000  Node_B
14  136.373699  80.412331  0.000   1.172  1.000  Node_B
15  158.373699  38.448331  0.000   1.172  1.000  Node_B
16  152.827699  35.246331  0.000   1.172  1.000  Node_B
17  157.201699  36.418331  0.000   1.172  1.000  Node_B
18  155.171699  35.246331  0.000   1.172  1.000  Node_B
19  215.626699  80.412331  0.000   1.172  1.000  Node_B
20  218.827699  83.614331  0.000   1.172  1.000  Node_B
21  216.798699  82.442331  0.000   1.172  1.000  Node_B
22  131.999699  79.240331  0.000   2.750  1.000  Node_A
23  128.798699  76.038331  0.000   1.172  1.000  Node_B
24  136.373699  78.068331  0.000   1.172  1.000  Node_B
25  162.051699  70.180331  0.626   0.325  0.650   Rib_D
26  162.619699  70.496331  0.626   0.325  0.650   Rib_D
27  189.948699  70.180331  0.626   0.325  0.650   Rib_D

I'm finding duplicates based on the x,y,z coordinates as these should be unique locations so I use df.drop_duplicates(subset=['x', 'y', 'z'], inplace=True) to remove any duplicates from the data frame. This seems to remove about 90% of my duplicates but it always seem to be missing some.

In the example dataframe there are several duplicates [0==23, 1==24, 6==14, 8==22] but pandas fails to remove them.

I found this using numpy and a very slow iterative loop that is comparing every point to every other point. It's ok for 50 or 100 points, but takes 15-20 minutes when I have 100-200K records in the dataframe.

How do I fix this? There is no precision parameter for drop_duplicates so why does it miss some?

CodePudding user response:

You can use round as suggested by @mozway:

PRECISION = 3

df.drop(df[['x', 'y', 'z']].round(PRECISION).duplicated().loc[lambda x: x].index, inplace=True)
print(df)

# Output
             x          y      z  radius  scale    type
0   128.798699  76.038331  0.000   1.172  1.000  Node_B
1   136.373699  78.068331  0.000   1.172  1.000  Node_B
2   133.171699  74.866331  0.000   1.172  1.000  Node_B
3   135.201699  76.038331  0.000   1.172  1.000  Node_B
4   135.201699  82.442331  0.000   1.172  1.000  Node_B
5   136.373699  80.412331  0.000   1.172  1.000  Node_B
6   133.171699  83.614331  0.000   1.172  1.000  Node_B
7   127.626699  78.068331  0.000   1.172  1.000  Node_B
8   131.999699  79.240331  0.000   2.750  1.000  Node_A
9    90.199699  94.795331  0.626   0.325  0.650   Rib_B
10   85.799699  95.445331  0.626   0.325  0.650   Rib_B
11   90.199699  95.445331  0.626   0.325  0.650   Rib_B
12   91.865699  95.557331  0.537   0.438  0.876   Rib_B
13  128.798699  82.442331  0.000   1.172  1.000  Node_B
15  158.373699  38.448331  0.000   1.172  1.000  Node_B
16  152.827699  35.246331  0.000   1.172  1.000  Node_B
17  157.201699  36.418331  0.000   1.172  1.000  Node_B
18  155.171699  35.246331  0.000   1.172  1.000  Node_B
19  215.626699  80.412331  0.000   1.172  1.000  Node_B
20  218.827699  83.614331  0.000   1.172  1.000  Node_B
21  216.798699  82.442331  0.000   1.172  1.000  Node_B
25  162.051699  70.180331  0.626   0.325  0.650   Rib_D
26  162.619699  70.496331  0.626   0.325  0.650   Rib_D
27  189.948699  70.180331  0.626   0.325  0.650   Rib_D
  • Related