I have a problem with drop_duplicates in a pandas dataframe. I'm importing a lot of mixed data from an Excel file into a dataframe and then doing various things to clean it up. One of the stages is to remove any duplicates based on their coordinates.
In general this works pretty well and, importantly, it's very fast, but I've had some problems, and after an extensive search of the dataset I've found that pandas is always missing a few duplicates.
Here's my test dataset:
x y z radius scale type
0 128.798699 76.038331 0.000 1.172 1.000 Node_B
1 136.373699 78.068331 0.000 1.172 1.000 Node_B
2 133.171699 74.866331 0.000 1.172 1.000 Node_B
3 135.201699 76.038331 0.000 1.172 1.000 Node_B
4 135.201699 82.442331 0.000 1.172 1.000 Node_B
5 136.373699 80.412331 0.000 1.172 1.000 Node_B
6 133.171699 83.614331 0.000 1.172 1.000 Node_B
7 127.626699 78.068331 0.000 1.172 1.000 Node_B
8 131.999699 79.240331 0.000 2.750 1.000 Node_A
9 90.199699 94.795331 0.626 0.325 0.650 Rib_B
10 85.799699 95.445331 0.626 0.325 0.650 Rib_B
11 90.199699 95.445331 0.626 0.325 0.650 Rib_B
12 91.865699 95.557331 0.537 0.438 0.876 Rib_B
13 128.798699 82.442331 0.000 1.172 1.000 Node_B
14 136.373699 80.412331 0.000 1.172 1.000 Node_B
15 158.373699 38.448331 0.000 1.172 1.000 Node_B
16 152.827699 35.246331 0.000 1.172 1.000 Node_B
17 157.201699 36.418331 0.000 1.172 1.000 Node_B
18 155.171699 35.246331 0.000 1.172 1.000 Node_B
19 215.626699 80.412331 0.000 1.172 1.000 Node_B
20 218.827699 83.614331 0.000 1.172 1.000 Node_B
21 216.798699 82.442331 0.000 1.172 1.000 Node_B
22 131.999699 79.240331 0.000 2.750 1.000 Node_A
23 128.798699 76.038331 0.000 1.172 1.000 Node_B
24 136.373699 78.068331 0.000 1.172 1.000 Node_B
25 162.051699 70.180331 0.626 0.325 0.650 Rib_D
26 162.619699 70.496331 0.626 0.325 0.650 Rib_D
27 189.948699 70.180331 0.626 0.325 0.650 Rib_D
I'm finding duplicates based on the x, y, z coordinates, as these should be unique locations, so I use df.drop_duplicates(subset=['x', 'y', 'z'], inplace=True) to remove any duplicates from the dataframe. This removes about 90% of my duplicates, but it always seems to miss some.
In the example dataframe there are several duplicates [0==23, 1==24, 5==14, 8==22], but pandas fails to remove them.
I found these using numpy and a very slow iterative loop that compares every point to every other point. It's fine for 50 or 100 points, but it takes 15-20 minutes when I have 100-200K records in the dataframe.
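Roughly, the brute-force check looks like this (a simplified sketch, not the exact code; the 1e-3 tolerance and the use of np.allclose are illustrative choices):
import numpy as np

coords = df[['x', 'y', 'z']].to_numpy()
dupes = []
for i in range(len(coords)):
    for j in range(i + 1, len(coords)):
        # compare within a tolerance rather than by exact equality
        if np.allclose(coords[i], coords[j], atol=1e-3):
            dupes.append((df.index[i], df.index[j]))
print(dupes)  # e.g. [(0, 23), (1, 24), ...]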
How do I fix this? There is no precision parameter for drop_duplicates, so why does it miss some duplicates?
CodePudding user response:
You can use round, as suggested by @mozway. drop_duplicates compares the exact floating-point values, so coordinates that print identically can still differ beyond the displayed precision and won't be treated as duplicates; rounding before comparing takes care of that:
PRECISION = 3
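# keep only the first occurrence of each rounded (x, y, z) and drop the rest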
df.drop(df[['x', 'y', 'z']].round(PRECISION).duplicated().loc[lambda x: x].index, inplace=True)
print(df)
# Output
x y z radius scale type
0 128.798699 76.038331 0.000 1.172 1.000 Node_B
1 136.373699 78.068331 0.000 1.172 1.000 Node_B
2 133.171699 74.866331 0.000 1.172 1.000 Node_B
3 135.201699 76.038331 0.000 1.172 1.000 Node_B
4 135.201699 82.442331 0.000 1.172 1.000 Node_B
5 136.373699 80.412331 0.000 1.172 1.000 Node_B
6 133.171699 83.614331 0.000 1.172 1.000 Node_B
7 127.626699 78.068331 0.000 1.172 1.000 Node_B
8 131.999699 79.240331 0.000 2.750 1.000 Node_A
9 90.199699 94.795331 0.626 0.325 0.650 Rib_B
10 85.799699 95.445331 0.626 0.325 0.650 Rib_B
11 90.199699 95.445331 0.626 0.325 0.650 Rib_B
12 91.865699 95.557331 0.537 0.438 0.876 Rib_B
13 128.798699 82.442331 0.000 1.172 1.000 Node_B
15 158.373699 38.448331 0.000 1.172 1.000 Node_B
16 152.827699 35.246331 0.000 1.172 1.000 Node_B
17 157.201699 36.418331 0.000 1.172 1.000 Node_B
18 155.171699 35.246331 0.000 1.172 1.000 Node_B
19 215.626699 80.412331 0.000 1.172 1.000 Node_B
20 218.827699 83.614331 0.000 1.172 1.000 Node_B
21 216.798699 82.442331 0.000 1.172 1.000 Node_B
25 162.051699 70.180331 0.626 0.325 0.650 Rib_D
26 162.619699 70.496331 0.626 0.325 0.650 Rib_D
27 189.948699 70.180331 0.626 0.325 0.650 Rib_D
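If you don't need the inplace drop, an equivalent (and arguably more readable) keep-first formulation would be something along these lines:
out = df[~df[['x', 'y', 'z']].round(PRECISION).duplicated()]
Note that rounding only catches near-duplicates that land in the same rounding bucket: two values straddling a boundary (e.g. 1.0004 and 1.0006 with PRECISION = 3) still round to different values and are kept as distinct rows.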