I hope you can help me optimize a For loop that goes over unique IDs.
I have a data frame with an Index, ID, and Sport as follows:
I want to remove all sports after our target sport which is Volleyball (keep only sport before our target), the output will be like this:
The data is sorted out by index.
I tried to use a For loop, but when I have a large data set, I have a large amount of time for execution
df = pd.read_csv('data.csv')
#Target sport
TargetArgs={'End_sport':'Volleyball'}
#empty dataframe to append all the dataframes from the loop
df_selected = pd.DataFrame()
# select unique IDs to use them in the loop
unique_id = df['ID'].unique()
for i in unique_id:
# create df for each ID a dataFrame
df_id = df[df['ID']==i]
# select the last index where the target sport is last seen
last_index = df_id['Sport'].where(df_id['Sport'] == TargetArgs.get('End_sport')).last_valid_index()
# select the first index of an ID has first occurs
first_index = df_id['ID'].where(df_id['ID']== i).first_valid_index()
# create a dataframe which is the selection from first index and last index
df_for_id = df_id.loc[first_index:last_index]
# append dataframe for each id
df_selected = df_selected.append(df_for_id)
CodePudding user response:
Dataframes are not meant to be processed with for loops. It's like
taking individual bristles of your toothbrush and brushing your teeth with them one by one. You can achieve what you want with a combination of optimized, pandas-internal methods like groupby
, loc
and idxmax
.
import pandas as pd
import numpy as np
# set up test data
data = {'ID': [1, 1, 1, 2, 2, 2, 3],
'Sport': ['Volleyball',
'Foot',
'Basket',
'Tennis',
'Volleyball',
'Swimming',
'Volleyball']}
df = pd.DataFrame(data, index=np.arange(7) 1)
ID | Sport | |
---|---|---|
1 | 1 | Volleyball |
2 | 1 | Foot |
3 | 1 | Basket |
4 | 2 | Tennis |
5 | 2 | Volleyball |
6 | 2 | Swimming |
7 | 3 | Volleyball |
For each ID
group, get all rows up to the first occurrence of "Volleyball"
in the Sport
column and reset the index:
df2 = (df.groupby("ID")
.apply(lambda x:x.loc[:x.Sport.eq("Volleyball").idxmax()])
.reset_index(level=0, drop=True)
)
Result:
ID | Sport | |
---|---|---|
1 | 1 | Volleyball |
4 | 2 | Tennis |
5 | 2 | Volleyball |
7 | 3 | Volleyball |