Optimize For Loop goes over a unique IDs to remove Sport after target Sport-CodePudding

I hope you can help me optimize a For loop that goes over unique IDs.

I have a data frame with an Index, ID, and Sport as follows:

I want to remove all sports after our target sport which is Volleyball (keep only sport before our target), the output will be like this:

The data is sorted out by index.

I tried to use a For loop, but when I have a large data set, I have a large amount of time for execution

df = pd.read_csv('data.csv')
#Target sport
TargetArgs={'End_sport':'Volleyball'}
#empty dataframe to append all the dataframes from the loop
df_selected = pd.DataFrame()
# select unique IDs to use them in the loop
unique_id = df['ID'].unique()
for i in unique_id:
    # create df for each ID a dataFrame
    df_id = df[df['ID']==i]
    # select the last index where the target sport is last seen
    last_index = df_id['Sport'].where(df_id['Sport'] == TargetArgs.get('End_sport')).last_valid_index()
    # select the first index of an ID has first occurs
    first_index = df_id['ID'].where(df_id['ID']== i).first_valid_index()
    # create a dataframe which is the selection from first index and last index 
    df_for_id = df_id.loc[first_index:last_index]
    # append dataframe for each id
    df_selected = df_selected.append(df_for_id)

CodePudding user response：

Dataframes are not meant to be processed with for loops. It's like taking individual bristles of your toothbrush and brushing your teeth with them one by one. You can achieve what you want with a combination of optimized, pandas-internal methods like groupby, loc and idxmax.

import pandas as pd
import numpy as np

# set up test data
data = {'ID': [1, 1, 1, 2, 2, 2, 3],
       'Sport': ['Volleyball',
                 'Foot',
                 'Basket',
                 'Tennis',
                 'Volleyball',
                 'Swimming',
                 'Volleyball']}

df = pd.DataFrame(data, index=np.arange(7)   1)

	ID	Sport
1	1	Volleyball
2	1	Foot
3	1	Basket
4	2	Tennis
5	2	Volleyball
6	2	Swimming
7	3	Volleyball

For each ID group, get all rows up to the first occurrence of "Volleyball" in the Sport column and reset the index:

df2 = (df.groupby("ID")
       .apply(lambda x:x.loc[:x.Sport.eq("Volleyball").idxmax()])
       .reset_index(level=0, drop=True)
       )

Result:

	ID	Sport
1	1	Volleyball
4	2	Tennis
5	2	Volleyball
7	3	Volleyball