I have a large df (14 × 1,000,000) and I want to subset it. The calculation unsurprisingly takes a lot of time, and I wonder how to improve the speed.
What I want is to subset, for each Name, the row with the lowest value of Total_time, while ignoring zero values and picking only the first row if more than one row has the lowest value of Total_time. Then I want it all appended into a new dataframe unique.
Is there a general mistake in my code that makes it inefficient?
import pandas as pd

unique = pd.DataFrame([])
i = 0
for pair in df['Name'].unique():
    i = i + 1
    # all rows for this Name
    temp = df[df["Name"] == pair]
    # drop zero times (the mask must come from temp, not df)
    temp2 = temp.loc[temp['Total_time'] != 0]
    lowest = temp2['Total_time'].min()
    # keep only the first row with the minimum
    temp3 = temp2[temp2["Total_time"] == lowest].head(1)
    unique = unique.append(temp3)
    print("finished " + pair + " " + str(i))
CodePudding user response:
In general, you don't want to iterate over each item.
If you just want the smallest Total_time for each Name:
new_df = df[df["Total_time"] != 0].copy() # you seem to be throwing away 0
out = new_df.groupby("Name")["Total_time"].min()
If you need the rest of the columns:
new_df.loc[new_df.groupby("Name")["Total_time"].idxmin()]
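Putting it together, a minimal sketch of a loop-free replacement (assuming df has the Name and Total_time columns from the question; idxmin returns the label of the first minimum, so ties resolve to the first row, matching the head(1) in the loop):

import pandas as pd

# toy stand-in for the 14 x 1,000,000 frame in the question
df = pd.DataFrame({
    "Name": ["a", "a", "a", "b", "b"],
    "Total_time": [0.0, 2.0, 2.0, 5.0, 3.0],
    "Other": [1, 2, 3, 4, 5],
})

new_df = df[df["Total_time"] != 0]  # ignore zero values
# one row per Name: the first row holding the group minimum
unique = new_df.loc[new_df.groupby("Name")["Total_time"].idxmin()]
print(unique)
#   Name  Total_time  Other
# 1    a         2.0      2
# 4    b         3.0      4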
CodePudding user response:
"What I want is to subset, for each Name, the lowest value of Total_time, while ignoring zero values and picking only the first one if more than one row has the lowest value of Total_time."
This sounds like a task for pandas.Series.idxmin. Consider the following simple example:
import pandas as pd
df = pd.DataFrame({"X":["A","B","C","D","E"],"Y":[5.5,0.0,5.5,1.5,1.5]})
first_min = df.Y.replace(0,float("nan")).idxmin()
print(df.iloc[first_min])
Output:
X D
Y 1.5
Name: 3, dtype: object
Explanation: replace 0 with NaN so zeros are not considered, then use idxmin to get the index of the first minimum, which can be used with .iloc here because the index is the default RangeIndex.
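The same idea extends to the per-Name problem from the question with groupby, as a rough sketch (the column names Name and Total_time are taken from the question; note that a Name whose times are all zero becomes an all-NaN group, on which idxmin raises in recent pandas):

import pandas as pd

df = pd.DataFrame({"Name": ["A", "A", "B", "B"],
                   "Total_time": [0.0, 1.5, 2.0, 2.0]})

# replace zeros with NaN so idxmin ignores them, then take the
# index of the first minimum within each Name group
s = df["Total_time"].replace(0, float("nan"))
first_min = s.groupby(df["Name"]).idxmin()
print(df.loc[first_min])
#   Name  Total_time
# 1    A         1.5
# 2    B         2.0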