I loaded a .csv file with around 620k rows and 6 columns into a Jupyter notebook. The data looks like this:
col_1  col_2   col_3   col_4     col_5
ID_1   388343  388684  T.45396D  2.400000e-03
ID_1   388343  388684  T.45708S  3.400000e-04
ID_1   388343  388684  T.48892G  2.200000e-10
ID_1   388343  388684  T.56898F  1.900000e-21
ID_1   388343  388684  T.64122D  2.300000e-04
I need to rebuild the table such that each ID (col_1) is unique, keeping the row with the smallest value of col_5. What I've done is:
for i in unique_col_1:
    index = data[data['col_1'] == i].index
    min_value = data['col_5'].loc[index].min()
    # drop every row for this ID except the one(s) holding the minimum
    index = data[(data['col_1'] == i) & (data['col_5'] != min_value)].index
    data.drop(index, inplace=True)
but this is too slow: the processing speed is around 6.5 it/s on my machine, and about 8 it/s when I run it on Google Colaboratory.
Is there a faster way to do this?
CodePudding user response:
This might not be the fastest possible implementation, but it is certainly faster than looping over all values of col_1 and iteratively dropping rows:

df.sort_values("col_5").drop_duplicates(subset="col_1", keep="first")
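
For example, here is a minimal, self-contained sketch of this one-liner on data shaped like the sample above (the ID_2 rows and their values are made up for illustration):

import pandas as pd

# toy frame matching the question's layout; the ID_2 rows are hypothetical
df = pd.DataFrame({
    "col_1": ["ID_1", "ID_1", "ID_1", "ID_2", "ID_2"],
    "col_2": [388343, 388343, 388343, 400000, 400000],
    "col_3": [388684, 388684, 388684, 400500, 400500],
    "col_4": ["T.45396D", "T.45708S", "T.48892G", "T.10001A", "T.10002B"],
    "col_5": [2.4e-03, 3.4e-04, 2.2e-10, 1.9e-21, 2.3e-04],
})

# sorting ascending puts the smallest col_5 first within each ID,
# so keep="first" retains exactly the row with the per-ID minimum
result = df.sort_values("col_5").drop_duplicates(subset="col_1", keep="first")
print(result)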
There are two major performance considerations at issue with your implementation:
- Vectorization: pandas functions such as sort_values, drop_duplicates, and many other operations are written in Cython (a Python extension language which builds compiled modules that run as C or C++). These functions are hundreds or thousands of times faster than Python code written with for loops on large datasets, so whenever possible, use built-in pandas operations on the whole array at once rather than looping over the data yourself.
- Iterative array resizing: pandas is built on numpy, and uses contiguous arrays in memory to store columns of numeric data. Allocating these arrays is (relatively) slow; performing operations on them is fast. When you resize an array, you need to re-allocate it and copy the data to the new, resized array. So when you loop over an array and in each iteration do something like drop or append (which has been deprecated for exactly this reason), you're re-allocating the entire dataframe's array on every iteration. Better would be to build a list of row indices you want to drop and then drop them all at once at the end of the loop; best is to use a vectorized solution and skip the for loop in the first place (see the sketch after this list).
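
To make both points concrete, here is a minimal sketch (the column names follow the question; groupby/idxmin is one vectorized alternative to the sort-and-dedupe one-liner above):

import numpy as np
import pandas as pd

# toy frame with the question's column names; the values are made up
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col_1": rng.choice(["ID_1", "ID_2", "ID_3"], size=1_000),
    "col_5": rng.random(1_000),
})

# point 2: collect all indices to drop, then drop once at the end,
# paying for a single re-allocation instead of one per iteration
to_drop = []
for i in df["col_1"].unique():
    group = df[df["col_1"] == i]
    to_drop.extend(group.index[group["col_5"] != group["col_5"].min()])
df_loop = df.drop(to_drop)

# point 1: fully vectorized, no Python-level loop at all;
# idxmin returns the row label of the smallest col_5 within each group
df_vec = df.loc[df.groupby("col_1")["col_5"].idxmin()]

Note that on ties the two differ slightly: the loop keeps every row holding the minimum, while idxmin keeps a single row per ID, matching the drop_duplicates behavior.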