I have a pandas dataframe as shown below :
PA_ID Date Name Status Age ID
100 10/06/2021 13:32 Ash Clean 40 1001
101 10/06/2021 13:33 Ash Not Clean 40 NaN
102 10/07/2021 14:33 Eva Clean 35 NaN
103 10/07/2021 15:33 Eva Clean 35 NaN
104 10/08/2021 15:33 Jon Clean 25 1210
105 10/08/2021 15:33 James Clean 25 NaN
106 10/08/2021 15:33 James Clean 25 NaN
I want to track the status of each person and put that into a list as a new column "Tracking" and then merge it with my original dataframe, similarly I will be check if any of the PA_ID correspond to a person have any ID present, if yes; then ID_Present is yes , else no.
This is how I am doing it
# Part one to track the customer status
df1 = df.groupby('Name')['Status'].apply(list).reset_index(name='Tracking')
df2 = df.merge(df1,how='inner',on='Name')
# Part two ; checking if ID is present
df3 = df.groupby('Name')['ID'].apply(list).reset_index(name='Mylist')
df3['Mylist'] = df3[['Mylist']].applymap(lambda x: [*filter(pd.notna, x)])
df3['ID_Present'] = df3['Mylist'].apply(lambda y: 'no' if len(y)==0 else 'yes')
# Fetching the final dataset
final = df2.merge(df3[['Name','ID_Present']],how='inner',on='Name')
Is there an efficient way to do this, less code and fast execution. I am pretty sure there are some pandas methods that I can use to do this cleanly but my knowledge with pandas is limited. thanks in advance
Expected Output:
PA_ID Date Name Status Age ID Tracking ID_Present
100 10/06/2021 13:32 Ash Clean 40 1001 [Clean, Not Clean] yes
101 10/06/2021 13:33 Ash Not Clean 40 [Clean, Not Clean] yes
102 10/07/2021 14:33 Eva Clean 35 [Clean, Clean] no
103 10/07/2021 15:33 Eva Clean 35 [Clean, Clean] no
104 10/08/2021 15:33 Jon Clean 25 1210 [Clean] yes
105 10/08/2021 15:31 James Clean 25 [Clean, Clean, Clean] no
106 10/08/2021 15:35 James Clean 25 [Clean, Clean, Clean] no
108 10/08/2021 15:37 James Clean 25 [Clean, Clean, Clean] no
CodePudding user response:
The bottleneck is likely the second part (the first one is fine).
You can use instead:
df['ID_Present'] = np.where(df['ID'].notna().groupby(df['Name']).transform('any'),
'yes', 'no')
NB. you use df['ID'].notna().groupby(df['Name']).any()
and merge as previously
output:
PA_ID Date Name Status Age ID ID_Present
0 100 10/06/2021 13:32 Ash Clean 40 1001.0 yes
1 101 10/06/2021 13:33 Ash Not Clean 40 NaN yes
2 102 10/07/2021 14:33 Eva Clean 35 NaN no
3 103 10/07/2021 15:33 Eva Clean 35 NaN no
4 104 10/08/2021 15:33 Jon Clean 25 1210.0 yes
5 105 10/08/2021 15:33 James Clean 25 NaN no
6 106 10/08/2021 15:33 James Clean 25 NaN no
CodePudding user response:
If performance important avoid groupby
:
df['ID_Present'] = np.where(df['Name'].isin(df.loc[df['ID'].notna(), 'Name']),
'yes', 'no')
print (df)
PA_ID Date Name Status Age ID ID_Present
0 100 10/06/2021 13:32 Ash Clean 40 1001.0 yes
1 101 10/06/2021 13:33 Ash Not Clean 40 NaN yes
2 102 10/07/2021 14:33 Eva Clean 35 NaN no
3 103 10/07/2021 15:33 Eva Clean 35 NaN no
4 104 10/08/2021 15:33 Jon Clean 25 1210.0 yes
5 105 10/08/2021 15:33 James Clean 25 NaN no
6 106 10/08/2021 15:33 James Clean 25 NaN no
How it working: First get all Name
s with non missing values by column ID
:
print (df.loc[df['ID'].notna(), 'Name'])
0 Ash
4 Jon
Name: Name, dtype: object
And then test column Name
by Series.isin
:
print (df['Name'].isin(df.loc[df['ID'].notna(), 'Name']))
0 True
1 True
2 False
3 False
4 True
5 False
6 False
Name: Name, dtype: bool