Let's say that I have the following datasets:
import numpy as np
import pandas as pd
d = {'name': ['a', 'a', 'a'], 'year': [2000, 2001, 2002], 'value': [10, 17, 22] }
data_frame = pd.DataFrame(data=d)
data_frame
d1 = {'name': ['a', 'a', 'a'], 'year': [2001, 2002, 2003], 'value': [10, 17, 22] }
data_frame1 = pd.DataFrame(data=d1)
data_frame1
d2 = {'name': ['a', 'a', 'a'], 'year': [2000, 2002, 2003], 'value': [10, 17, 22] }
data_frame2 = pd.DataFrame(data=d2)
data_frame2
What I want to do is the following:
1. In data_frame, the observation for 2003 is missing, so I want to copy the row for 2002 (the nearest earlier year, data_frame[data_frame['year']==2002]), append it to data_frame, and replace 2002 with 2003. The rest stays the same.
2. In data_frame1, the observation for 2000 is missing. I want to do the same as in 1, but in this case using the row for 2001.
3. In data_frame2, I do not have an observation for 2001. Here I want to use the earlier year, i.e. 2000.
Is there a way to do this with one general function, without inspecting each data frame separately?
Thanks.
CodePudding user response:
Use DataFrame.reindex over all years with the parameter method='nearest'. Note that for data_frame2, where 2001 is equally distant from 2000 and 2002, the next year (2002) is used:
dfs = [data_frame, data_frame1, data_frame2]
for df in dfs:
    df = df.set_index('year').reindex(range(2000, 2004), method='nearest')
    print(df)
      name  value
year
2000     a     10
2001     a     17
2002     a     22
2003     a     22
      name  value
year
2000     a     10
2001     a     10
2002     a     17
2003     a     22
      name  value
year
2000     a     10
2001     a     17
2002     a     17
2003     a     22
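Since the question asks for a general function, here is a minimal sketch that wraps the reindex call above. The function name fill_years_nearest and the year_col parameter are made up for illustration, and the sketch assumes each year appears at most once in the frame. It sorts by year first, because reindex with a fill method needs a monotonic index.

```python
import pandas as pd

def fill_years_nearest(df, years, year_col="year"):
    """Return df with one row per year, copying the nearest existing year."""
    # reindex(method=...) needs a sorted (monotonic) index, so sort first.
    indexed = df.sort_values(year_col).set_index(year_col)
    filled = indexed.reindex(years, method="nearest")
    # Put the year back as a regular column.
    return filled.rename_axis(year_col).reset_index()

# Same data as data_frame1 in the question: 2000 is missing.
data_frame1 = pd.DataFrame(
    {"name": ["a", "a", "a"], "year": [2001, 2002, 2003], "value": [10, 17, 22]}
)
result = fill_years_nearest(data_frame1, range(2000, 2004))
print(result)
```

The same call works unchanged for the other two frames, so there is no need to check by hand which year is missing.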
If you need the previous year whenever it exists, and the next year only as a fallback, use forward filling followed by back filling:
for df in dfs:
    df = df.set_index('year').reindex(range(2000, 2004), method='ffill').bfill()
    print(df)
      name  value
year
2000     a     10
2001     a     17
2002     a     22
2003     a     22
      name  value
year
2000     a   10.0
2001     a   10.0
2002     a   17.0
2003     a   22.0
      name  value
year
2000     a     10
2001     a     10
2002     a     17
2003     a     22
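The second output above shows value turning into floats (10.0), because the missing first year is briefly NaN before bfill fills it. If that matters, one possible way is to cast back to the original dtypes afterwards. This is only a sketch: fill_years_prev_then_next and year_col are hypothetical names, and the cast assumes no NaN is left after filling (that is, the frame has at least one row and no missing values of its own).

```python
import pandas as pd

def fill_years_prev_then_next(df, years, year_col="year"):
    """Fill missing years from the previous year, else from the next one."""
    indexed = df.sort_values(year_col).set_index(year_col)
    # Forward fill on the new index, then back fill any leading gap.
    filled = indexed.reindex(years, method="ffill").bfill()
    # The temporary NaN upcasts integer columns to float; cast them back.
    filled = filled.astype(indexed.dtypes.to_dict())
    return filled.rename_axis(year_col).reset_index()

# Same data as data_frame1 in the question: 2000 is missing.
data_frame1 = pd.DataFrame(
    {"name": ["a", "a", "a"], "year": [2001, 2002, 2003], "value": [10, 17, 22]}
)
result = fill_years_prev_then_next(data_frame1, range(2000, 2004))
print(result)
print(result["value"].dtype)
```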
Solution per group, for data frames that contain more than one name:
for df in dfs:
    df = (df.set_index('year')
            .groupby('name', group_keys=False)
            .apply(lambda x: x.reindex(range(2000, 2004), method='ffill').bfill()))
    print(df)
      name  value
year
2000     a     10
2001     a     17
2002     a     22
2003     a     22
      name  value
year
2000     a   10.0
2001     a   10.0
2002     a   17.0
2003     a   22.0
      name  value
year
2000     a     10
2001     a     10
2002     a     17
2003     a     22
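The sample frames contain only one name, so the per-group version prints the same result as before. To illustrate where grouping matters, here is a sketch with a second, made-up name 'b'. With several names the years repeat, and reindex does not accept duplicate index labels, so each name is filled separately. The function name fill_years_per_name and its parameters are hypothetical, and it uses a plain loop over the groups instead of groupby.apply.

```python
import pandas as pd

def fill_years_per_name(df, years, group_col="name", year_col="year"):
    """Fill missing years separately for every value of group_col."""
    pieces = []
    for key, grp in df.groupby(group_col):
        part = (grp.sort_values(year_col)
                   .set_index(year_col)
                   .reindex(years, method="ffill")  # previous year if present
                   .bfill())                        # otherwise the next year
        part[group_col] = key  # make sure the group label is set on new rows
        pieces.append(part.rename_axis(year_col).reset_index())
    return pd.concat(pieces, ignore_index=True)

# Illustrative data: 'a' lacks 2003, 'b' lacks 2000 and 2002.
df = pd.DataFrame({
    "name": ["a", "a", "a", "b", "b"],
    "year": [2000, 2001, 2002, 2001, 2003],
    "value": [10, 17, 22, 5, 7],
})
result = fill_years_per_name(df, range(2000, 2004))
print(result)
```

Here 'b' gets 5 for 2000 (taken from the next year, 2001) and 5 for 2002 (taken from the previous year), and values from 'a' never leak into 'b'.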