Fastest way of querying and filtering pandas dataframes-CodePudding

I have 2 pandas dataframes which are read from 2 existing spreadsheets: The first one (df_summary) looks like this

Unique_ID	Age	Name	Surname
1	19	Paul	Smith
2	22	Chris	Lorne

The second one (df_mass) looks like

Unique_ID	Time	Mass
1	1	62.5
1	2	62.3
1	3	63.1
1	4	63.4
2	1	72.1
2	2	71.2
2	3	71.5
2	4	71.7
2	5	70.8

Showing the mass of people at different points in time. I would like to query df_mass to find the last recorded mass and add it to df_summary. At the moment I have it working like:

for unique_ID in df_summary['Unique_ID']:
        data = df_mass [df_mass ['Unique_ID'] == unique_ID]
        df_summary.loc[df_summary['Unique_ID'] == unique_ID, 'Final Mass'] = data['Mass'].tolist()[-1]

Which works but isn't very elegant and is quite slow when I have many participants and many time points for each. Any help appreciated

CodePudding user response：

Try this:

df_mass.sort_values(by='Time', ascending=True, inplace=True)['Mass']
data = df_mass.groupby('Unique_ID').agg({'Time': 'last', 'Mass': 'last'})
df_summary = df_summary.merge(data, on='Unique_ID', copy=False, how='left')
df_summary.rename({'Mass': 'Final Mass'}, inplace=True)

CodePudding user response：

Use groupby_apply:

>>> df1.merge(df2.groupby('Unique_ID')
                 .apply(lambda x: x.loc[x['Time'].idxmax(), 'Mass'])
                 .rename('Final Mass').reset_index(), how='left')

   Unique_ID  Age   Name Surname  Final Mass
0          1   19   Paul   Smith        63.4
1          2   22  Chris   Lorne        70.8

CodePudding user response：

You can use:

Optionally sort df_mass by Unique_ID and Time if not already in this order

df_mass = df_mass.sort_values(['Unique_ID', 'Time'])

Step 1: Create a df_Final_Mass dataframe by aggregating the last entry of Mass per Unique_ID by GroupBy.last():

df_Final_Mass = df_mass.groupby('Unique_ID')['Mass'].last().reset_index(name='Final Mass')


   Unique_ID  Final Mass
0          1        63.4
1          2        70.8

Step 2: Merge df_summary with df_Final_Mass on same Unique_ID:

df_final = df_summary.merge(df_Final_Mass, on='Unique_ID', how='left')

Result:

print(df_final)


   Unique_ID  Age   Name Surname  Final Mass
0          1   19   Paul   Smith        63.4
1          2   22  Chris   Lorne        70.8