Getting the latest record available from a DataFrame

Time:11-15

Currently I have a DataFrame as below:

import pandas as pd
import numpy as np

d = {'name': ['a', 'a','a','b','b','b','c','c','c'],
     'Year': ['2000', '2010', '2020', '2000', '2010', '2020', '2000', '2010', '2020'],
     'v1': [np.nan, np.nan, np.nan, 41, 51, 61, 71, 81, 91],
     'v2': [12, 22, 32, np.nan, 52, np.nan, 72, 82, 92],
     'v3': [13, 23, 33, 43, 53, 63, np.nan, 83, np.nan]}

df = pd.DataFrame(d)
df

name  Year    v1   v2   v3
 a    2000   NaN   12   13
 a    2010   NaN   22   23
 a    2020   NaN   32   33
 b    2000    41  NaN   43
 b    2010    51   52   53
 b    2020    61  NaN   63
 c    2000    71   72  NaN
 c    2010    81   82   83
 c    2020    91   92  NaN

I'm trying to get the latest available (non-null) value of each column for each name, as follows:

name    v1   v2   v3
 a     NaN   32   33
 b      61   52   63
 c      91   92   83

Is there a way to achieve this?

CodePudding user response:

Use groupby with as_index=False and call last(), which returns the last non-null value of each column within each group (the rows here are already ordered by Year):

df.groupby('name', as_index=False).last()

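Alternatively, if you know that the last year is "2020", you can select those rows directly. Note that this keeps the NaN values present in the 2020 rows (v2 for b, v3 for c) rather than falling back to earlier years: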

df.query('Year == "2020"')

Output of the groupby call:

  name  Year    v1    v2    v3
0    a  2020   NaN  32.0  33.0
1    b  2020  61.0  52.0  63.0
2    c  2020  91.0  92.0  83.0
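
If the rows are not guaranteed to be in chronological order, or the Year column is not wanted in the result, a sketch along the following lines may help. It assumes Year sorts correctly as a string (true for the four-digit values in the question), and the variable name latest is only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'Year': ['2000', '2010', '2020', '2000', '2010', '2020', '2000', '2010', '2020'],
    'v1': [np.nan, np.nan, np.nan, 41, 51, 61, 71, 81, 91],
    'v2': [12, 22, 32, np.nan, 52, np.nan, 72, 82, 92],
    'v3': [13, 23, 33, 43, 53, 63, np.nan, 83, np.nan],
})

latest = (
    df.sort_values(['name', 'Year'])     # make sure the newest row comes last in each group
      .drop(columns='Year')              # Year is not part of the desired result
      .groupby('name', as_index=False)
      .last()                            # last non-null value per column, per name
)
print(latest)
```

A column that is entirely NaN for a name (v1 for a) stays NaN, since there is no earlier value to fall back on.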