Copy the row in pandas from the nearest year

Time:11-05

Let's say that I have the following datasets.

import numpy as np
import pandas as pd

d = {'name': ['a', 'a', 'a'], 'year': [2000, 2001, 2002], 'value': [10, 17, 22] }
data_frame  = pd.DataFrame(data=d)
data_frame

d1 = {'name': ['a', 'a', 'a'], 'year': [2001, 2002, 2003], 'value': [10, 17, 22] }
data_frame1  = pd.DataFrame(data=d1)
data_frame1

d2 = {'name': ['a', 'a', 'a'], 'year': [2000, 2002, 2003], 'value': [10, 17, 22] }
data_frame2  = pd.DataFrame(data=d2)
data_frame2

What I want to do is the following:

  1. In data_frame, the observation for 2003 is missing, so I want to copy the row for 2002 (the nearest available year, data_frame[data_frame['year']==2002]), append it to data_frame, and replace 2002 with 2003. Everything else stays the same.

  2. In data_frame1, the observation for 2000 is missing. I want to do the same as in 1, but using the row for 2001.

  3. In data_frame2, there is no observation for 2001. Here I want to use the earlier year, i.e. 2000.

Is there a way to do this with a general function, without inspecting each data frame separately?

Thanks.

CodePudding user response:

Use DataFrame.reindex over the full range of years with the parameter method='nearest'. Note that in data_frame2 the missing 2001 is equally far from 2000 and 2002, and the next year (2002) is used:

dfs = [data_frame, data_frame1, data_frame2]

for df in dfs:
    df = df.set_index('year').reindex(range(2000, 2004), method='nearest')
    print(df)

     name  value
year            
2000    a     10
2001    a     17
2002    a     22
2003    a     22
     name  value
year            
2000    a     10
2001    a     10
2002    a     17
2003    a     22
     name  value
year            
2000    a     10
2001    a     17
2002    a     17
2003    a     22
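
Since the question asks for a general function, the loop above could be wrapped roughly as follows. This is only a sketch: the helper name add_missing_years and its parameters are made up for illustration, and it assumes the year column is sorted and has no duplicate years.

```python
import pandas as pd


def add_missing_years(df, first_year, last_year, year_col="year"):
    """Hypothetical helper: return one row per year in [first_year, last_year],
    copying each missing year from the nearest existing year."""
    years = range(first_year, last_year + 1)
    # reindex(method='nearest') needs a sorted (monotonic) index
    out = df.set_index(year_col).reindex(years, method="nearest")
    # make sure the index is named before turning it back into a column
    return out.rename_axis(year_col).reset_index()


data_frame = pd.DataFrame(
    {"name": ["a", "a", "a"], "year": [2000, 2001, 2002], "value": [10, 17, 22]}
)
data_frame1 = pd.DataFrame(
    {"name": ["a", "a", "a"], "year": [2001, 2002, 2003], "value": [10, 17, 22]}
)

result = add_missing_years(data_frame, 2000, 2003)
result1 = add_missing_years(data_frame1, 2000, 2003)
print(result)
print(result1)
```

The original frames are left untouched; each call returns a new frame with year as a regular column again.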

If you need the previous year when it exists, and the next year only as a fallback, use:

for df in dfs:
    df = df.set_index('year').reindex(range(2000, 2004), method='ffill').bfill()
    print(df)

     name  value
year            
2000    a     10
2001    a     17
2002    a     22
2003    a     22
     name  value
year            
2000    a   10.0
2001    a   10.0
2002    a   17.0
2003    a   22.0
     name  value
year            
2000    a     10
2001    a     10
2002    a     17
2003    a     22
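
The second output above shows value as 10.0, 17.0, 22.0: the forward fill leaves a NaN at year 2000 first, which turns the column into floats before bfill fills it. A possible way to package this variant as a function and cast the columns back is sketched below. The function name fill_years_prefer_previous is hypothetical, and the cast assumes the input frame itself contains no missing values.

```python
import pandas as pd


def fill_years_prefer_previous(df, first_year, last_year, year_col="year"):
    """Hypothetical helper: fill each missing year from the previous year,
    falling back to the next year when no previous year exists."""
    years = range(first_year, last_year + 1)
    out = (
        df.set_index(year_col)
        .sort_index()                    # method='ffill' needs a sorted index
        .reindex(years, method="ffill")  # copy from the previous year
        .bfill()                         # leading gaps: copy from the next year
    )
    # restore the original column dtypes (assumes no NaN is left after filling)
    out = out.astype(df.dtypes.drop(year_col).to_dict())
    return out.rename_axis(year_col).reset_index()


data_frame1 = pd.DataFrame(
    {"name": ["a", "a", "a"], "year": [2001, 2002, 2003], "value": [10, 17, 22]}
)
data_frame2 = pd.DataFrame(
    {"name": ["a", "a", "a"], "year": [2000, 2002, 2003], "value": [10, 17, 22]}
)

fixed1 = fill_years_prefer_previous(data_frame1, 2000, 2003)
fixed2 = fill_years_prefer_previous(data_frame2, 2000, 2003)
print(fixed1)
print(fixed2)
```

Note that bfill() also fills any NaN that was already present in the data, not only the rows created by reindex.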

Solution per group, for data frames that contain more than one name:

for df in dfs:
    df = (df.set_index('year').groupby('name', group_keys=False)
            .apply(lambda x: x.reindex(range(2000, 2004), method='ffill').bfill()))
    print(df)

     name  value
year            
2000    a     10
2001    a     17
2002    a     22
2003    a     22
     name  value
year            
2000    a   10.0
2001    a   10.0
2002    a   17.0
2003    a   22.0
     name  value
year            
2000    a     10
2001    a     10
2002    a     17
2003    a     22
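
The sample frames contain only the name a, so the per-group version is hard to see in action. The sketch below uses a made-up frame with two names. It also swaps in a different technique from the one above: it reindexes on a (name, year) MultiIndex and fills within each name using transform, because, as far as I know, newer pandas releases changed how groupby.apply treats the grouping column. The data and variable names are illustrative only.

```python
import pandas as pd

# made-up example data with two names and different missing years
df = pd.DataFrame(
    {
        "name": ["a", "a", "a", "b", "b"],
        "year": [2000, 2001, 2002, 2001, 2003],
        "value": [10, 17, 22, 5, 8],
    }
)

# every (name, year) combination that should exist in the result
full_index = pd.MultiIndex.from_product(
    [df["name"].unique(), range(2000, 2004)], names=["name", "year"]
)

# missing combinations become NaN rows
out = df.set_index(["name", "year"]).reindex(full_index)

# fill inside each name only: previous year first, next year as fallback
out = out.groupby(level="name").transform(lambda s: s.ffill().bfill())
out = out.reset_index()
print(out)
```

Filling inside each group keeps values from one name from leaking into another, which a plain ffill over the whole frame would not guarantee.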