I loaded a .csv file with around 620k rows and 6 columns into a Jupyter notebook. The data looks like this:
col_1  col_2   col_3   col_4     col_5
ID_1   388343  388684  T.45396D  2.400000e-03
ID_1   388343  388684  T.45708S  3.400000e-04
ID_1   388343  388684  T.48892G  2.200000e-10
ID_1   388343  388684  T.56898F  1.900000e-21
ID_1   388343  388684  T.64122D  2.300000e-04
I need to rebuild the table such that each ID (col_1) is unique, keeping the row with the smallest value of col_5. What I've done is:
for i in unique_col_1:
    index = data[data['col_1'] == i].index
    min_value = data['col_5'].loc[index].min()
    # drop every row for this ID except the one(s) holding the minimum
    index = data[(data['col_1'] == i) & (data['col_5'] != min_value)].index
    data.drop(index, inplace=True)
but this is too slow: the processing speed is around 6.5 it/s on my machine, and about 8 it/s when I run it on Google Colaboratory.
Is there a faster way to do this?
CodePudding user response:
This might not be the fastest possible implementation, but it is certainly faster than looping over all values of col_1 and iteratively dropping rows:

df.sort_values("col_5").drop_duplicates(subset="col_1", keep="first")
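
For example, here is a minimal, self-contained sketch of this one-liner on data shaped like the sample above (the ID_2 rows and their values are made up for illustration):

import pandas as pd

# toy frame matching the question's layout; the ID_2 rows are hypothetical
df = pd.DataFrame({
    "col_1": ["ID_1", "ID_1", "ID_1", "ID_2", "ID_2"],
    "col_2": [388343, 388343, 388343, 400000, 400000],
    "col_3": [388684, 388684, 388684, 400500, 400500],
    "col_4": ["T.45396D", "T.45708S", "T.48892G", "T.10001A", "T.10002B"],
    "col_5": [2.4e-03, 3.4e-04, 2.2e-10, 1.9e-21, 2.3e-04],
})

# sorting ascending puts the smallest col_5 first within each ID,
# so keep="first" retains exactly the row with the per-ID minimum
result = df.sort_values("col_5").drop_duplicates(subset="col_1", keep="first")
print(result)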
There are two major performance considerations at issue with your implementation:
- Vectorization: pandas functions such as sort_values, drop_duplicates, and many other operations are written in Cython (a Python extension language which builds compiled modules that run as C or C++). These functions are hundreds or thousands of times faster than Python code written with for loops on large datasets, so whenever possible, use built-in pandas operations on the whole array at once rather than looping over the data yourself.
- Iterative array resizing: pandas is built on numpy, and uses contiguous arrays in memory to store columns of numeric data. Allocating these arrays is (relatively) slow; performing operations on them is fast. When you resize an array, you need to re-allocate it and copy the data to the new, resized array. So when you loop over an array and in each iteration do something like drop or append (which has been deprecated for exactly this reason), you're re-allocating the entire dataframe's array on every iteration. Better would be to build a list of row indices you want to drop and then drop them all at once at the end of the loop; best is to use a vectorized solution and skip the for loop in the first place (see the sketch after this list).
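
To make both points concrete, here is a minimal sketch (the column names follow the question; groupby/idxmin is one vectorized alternative to the sort-and-dedupe one-liner above):

import numpy as np
import pandas as pd

# toy frame with the question's column names; the values are made up
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col_1": rng.choice(["ID_1", "ID_2", "ID_3"], size=1_000),
    "col_5": rng.random(1_000),
})

# point 2: collect all indices to drop, then drop once at the end,
# paying for a single re-allocation instead of one per iteration
to_drop = []
for i in df["col_1"].unique():
    group = df[df["col_1"] == i]
    to_drop.extend(group.index[group["col_5"] != group["col_5"].min()])
df_loop = df.drop(to_drop)

# point 1: fully vectorized, no Python-level loop at all;
# idxmin returns the row label of the smallest col_5 within each group
df_vec = df.loc[df.groupby("col_1")["col_5"].idxmin()]

Note that on ties the two differ slightly: the loop keeps every row holding the minimum, while idxmin keeps a single row per ID, matching the drop_duplicates behavior.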