Track a customer status in a list over time and check if they have an ID present : Pandas-CodePudding

I have a pandas dataframe as shown below :

PA_ID    Date                  Name   Status      Age   ID 
100      10/06/2021 13:32      Ash    Clean       40    1001
101      10/06/2021 13:33      Ash    Not Clean   40    NaN
102      10/07/2021 14:33      Eva    Clean       35    NaN
103      10/07/2021 15:33      Eva    Clean       35    NaN
104      10/08/2021 15:33      Jon    Clean       25    1210
105      10/08/2021 15:33      James  Clean       25    NaN
106      10/08/2021 15:33      James  Clean       25    NaN

I want to track the status of each person and put that into a list as a new column "Tracking" and then merge it with my original dataframe, similarly I will be check if any of the PA_ID correspond to a person have any ID present, if yes; then ID_Present is yes , else no.

This is how I am doing it

# Part one to track the customer status

df1 = df.groupby('Name')['Status'].apply(list).reset_index(name='Tracking')
df2 = df.merge(df1,how='inner',on='Name')

# Part two ; checking if ID is present

df3 = df.groupby('Name')['ID'].apply(list).reset_index(name='Mylist')
df3['Mylist'] = df3[['Mylist']].applymap(lambda x: [*filter(pd.notna, x)])
df3['ID_Present'] = df3['Mylist'].apply(lambda y: 'no' if len(y)==0 else 'yes')

# Fetching the final dataset
 
final = df2.merge(df3[['Name','ID_Present']],how='inner',on='Name')

Is there an efficient way to do this, less code and fast execution. I am pretty sure there are some pandas methods that I can use to do this cleanly but my knowledge with pandas is limited. thanks in advance

Expected Output:

PA_ID    Date                  Name   Status      Age   ID     Tracking              ID_Present
100      10/06/2021 13:32      Ash    Clean       40    1001   [Clean, Not Clean]     yes
101      10/06/2021 13:33      Ash    Not Clean   40           [Clean, Not Clean]     yes
102      10/07/2021 14:33      Eva    Clean       35           [Clean,  Clean]        no
103      10/07/2021 15:33      Eva    Clean       35           [Clean,  Clean]        no
104      10/08/2021 15:33      Jon    Clean       25    1210   [Clean]                yes
105      10/08/2021 15:31      James  Clean       25           [Clean,  Clean, Clean] no
106      10/08/2021 15:35      James  Clean       25           [Clean,  Clean, Clean] no
108      10/08/2021 15:37      James  Clean       25           [Clean,  Clean, Clean] no

CodePudding user response：

The bottleneck is likely the second part (the first one is fine).

You can use instead:

df['ID_Present'] = np.where(df['ID'].notna().groupby(df['Name']).transform('any'),
                            'yes', 'no')

NB. you use df['ID'].notna().groupby(df['Name']).any() and merge as previously

output:

   PA_ID              Date   Name     Status  Age      ID ID_Present
0    100  10/06/2021 13:32    Ash      Clean   40  1001.0        yes
1    101  10/06/2021 13:33    Ash  Not Clean   40     NaN        yes
2    102  10/07/2021 14:33    Eva      Clean   35     NaN         no
3    103  10/07/2021 15:33    Eva      Clean   35     NaN         no
4    104  10/08/2021 15:33    Jon      Clean   25  1210.0        yes
5    105  10/08/2021 15:33  James      Clean   25     NaN         no
6    106  10/08/2021 15:33  James      Clean   25     NaN         no

CodePudding user response：

If performance important avoid groupby:

df['ID_Present'] = np.where(df['Name'].isin(df.loc[df['ID'].notna(), 'Name']),
                           'yes', 'no')
print (df)
   PA_ID              Date   Name     Status  Age      ID ID_Present
0    100  10/06/2021 13:32    Ash      Clean   40  1001.0        yes
1    101  10/06/2021 13:33    Ash  Not Clean   40     NaN        yes
2    102  10/07/2021 14:33    Eva      Clean   35     NaN         no
3    103  10/07/2021 15:33    Eva      Clean   35     NaN         no
4    104  10/08/2021 15:33    Jon      Clean   25  1210.0        yes
5    105  10/08/2021 15:33  James      Clean   25     NaN         no
6    106  10/08/2021 15:33  James      Clean   25     NaN         no

How it working: First get all Names with non missing values by column ID:

print (df.loc[df['ID'].notna(), 'Name'])
0    Ash
4    Jon
Name: Name, dtype: object

And then test column Name by Series.isin:

print (df['Name'].isin(df.loc[df['ID'].notna(), 'Name']))
0     True
1     True
2    False
3    False
4     True
5    False
6    False
Name: Name, dtype: bool