I have 2 pandas dataframes which are read from 2 existing spreadsheets: The first one (df_summary) looks like this
Unique_ID | Age | Name | Surname |
---|---|---|---|
1 | 19 | Paul | Smith |
2 | 22 | Chris | Lorne |
The second one (df_mass) looks like
Unique_ID | Time | Mass |
---|---|---|
1 | 1 | 62.5 |
1 | 2 | 62.3 |
1 | 3 | 63.1 |
1 | 4 | 63.4 |
2 | 1 | 72.1 |
2 | 2 | 71.2 |
2 | 3 | 71.5 |
2 | 4 | 71.7 |
2 | 5 | 70.8 |
Showing the mass of people at different points in time. I would like to query df_mass to find the last recorded mass and add it to df_summary. At the moment I have it working like:
for unique_ID in df_summary['Unique_ID']:
data = df_mass [df_mass ['Unique_ID'] == unique_ID]
df_summary.loc[df_summary['Unique_ID'] == unique_ID, 'Final Mass'] = data['Mass'].tolist()[-1]
Which works but isn't very elegant and is quite slow when I have many participants and many time points for each. Any help appreciated
CodePudding user response:
Try this:
df_mass.sort_values(by='Time', ascending=True, inplace=True)['Mass']
data = df_mass.groupby('Unique_ID').agg({'Time': 'last', 'Mass': 'last'})
df_summary = df_summary.merge(data, on='Unique_ID', copy=False, how='left')
df_summary.rename({'Mass': 'Final Mass'}, inplace=True)
CodePudding user response:
Use groupby_apply
:
>>> df1.merge(df2.groupby('Unique_ID')
.apply(lambda x: x.loc[x['Time'].idxmax(), 'Mass'])
.rename('Final Mass').reset_index(), how='left')
Unique_ID Age Name Surname Final Mass
0 1 19 Paul Smith 63.4
1 2 22 Chris Lorne 70.8
CodePudding user response:
You can use:
Optionally sort df_mass
by Unique_ID
and Time
if not already in this order
df_mass = df_mass.sort_values(['Unique_ID', 'Time'])
Step 1: Create a df_Final_Mass
dataframe by aggregating the last entry of Mass
per Unique_ID
by GroupBy.last()
:
df_Final_Mass = df_mass.groupby('Unique_ID')['Mass'].last().reset_index(name='Final Mass')
Unique_ID Final Mass
0 1 63.4
1 2 70.8
Step 2: Merge df_summary
with df_Final_Mass
on same Unique_ID
:
df_final = df_summary.merge(df_Final_Mass, on='Unique_ID', how='left')
Result:
print(df_final)
Unique_ID Age Name Surname Final Mass
0 1 19 Paul Smith 63.4
1 2 22 Chris Lorne 70.8