Suppose I have a dataframe like this:
country | year | value |
---|---|---|
A | 2008 | 1 |
A | 2011 | 1 |
B | 2008 | 1 |
B | 2011 | 1 |
I want to add missing year per country, in this case 2009 and 2010, with desired output like this:
country | year | value |
---|---|---|
A | 2008 | 1 |
A | 2009 | |
A | 2010 | |
A | 2011 | 1 |
B | 2008 | 1 |
B | 2009 | |
B | 2010 | |
B | 2011 | 1 |
How can I do that? Thanks in advance!
CodePudding user response:
Let's create a dataframe first as follows :
import pandas as pd
data = {'country' : ['A', 'A', 'B', 'B'],
'year' : ['2008', '2011', '2008', '2011'],
'value':[1,1,1,1]}
df = pd.DataFrame(data=data)
Created dataset :
country year value
0 A 2008 1
1 A 2011 1
2 B 2008 1
3 B 2011 1
Lets define the years we need to consider :
yr_list = ['2008', '2009', '2010', '2011']
Lets modify the dataset based on our requirement :
for country in df['country'].unique() :
for yr in yr_list :
if yr not in list(df.loc[df['country'] == country, 'year']):
update_data = {'country' : country, 'year' : yr}
df = df.append(update_data, ignore_index = True)
final_df = df.sort_values(by = ['country', 'year'],ignore_index=True)
print(final_df)
The final output :
country year value
0 A 2008 1.0
1 A 2009 NaN
2 A 2010 NaN
3 A 2011 1.0
4 B 2008 1.0
5 B 2009 NaN
6 B 2010 NaN
7 B 2011 1.0
CodePudding user response:
First let's create your dataset for the MCVE sake:
import pandas as pd
frame = pd.DataFrame([
{"country": "A", "year": 2008, "value": 1},
{"country": "A", "year": 2011, "value": 1},
{"country": "B", "year": 2008, "value": 1},
{"country": "B", "year": 2011, "value": 1},
])
Then we create the missing data by ruling from min(year)
to max(year)
:
extension = frame.groupby("country")["year"].agg(["min", "max"]).reset_index()
extension["year"] = extension.apply(lambda x: list(range(x["min"], x["max"] 1)), axis=1)
# country min max year
# 0 A 2008 2011 [2008, 2009, 2010, 2011]
# 1 B 2008 2011 [2008, 2009, 2010, 2011]
Exploding the structure gives the correct format but without values:
extension = extension.explode("year")[["country", "year"]]
extension["year"] = extension["year"].astype(int)
# country year
# 0 A 2008
# 0 A 2009
# 0 A 2010
# 0 A 2011
# 1 B 2008
# 1 B 2009
# 1 B 2010
# 1 B 2011
Then we merge back with the original data to get the values:
results = frame.merge(extension, how="right", on=["country", "year"])
# country year value
# 0 A 2008 1.0
# 1 A 2009 NaN
# 2 A 2010 NaN
# 3 A 2011 1.0
# 4 B 2008 1.0
# 5 B 2009 NaN
# 6 B 2010 NaN
# 7 B 2011 1.0
The advantage of this method - in addition of being purely pandas - is that it is robust against data variation:
frame = pd.DataFrame([
{"country": "A", "year": 2008, "value": 1},
{"country": "A", "year": 2011, "value": 2},
{"country": "B", "year": 2005, "value": 1},
{"country": "B", "year": 2009, "value": 2},
{"country": "C", "year": 2008, "value": 1},
{"country": "C", "year": 2010, "value": 2},
{"country": "C", "year": 2012, "value": 3},
])
# country year value
# 0 A 2008 1.0
# 1 A 2009 NaN
# 2 A 2010 NaN
# 3 A 2011 2.0
# 4 B 2005 1.0
# 5 B 2006 NaN
# 6 B 2007 NaN
# 7 B 2008 NaN
# 8 B 2009 2.0
# 9 C 2008 1.0
# 10 C 2009 NaN
# 11 C 2010 2.0
# 12 C 2011 NaN
# 13 C 2012 3.0
CodePudding user response:
arr1 = [['A', 2008, 1],['A', 2011, 1],['B', 2008, 1],['B', 2011, 1]]
arr2 = [['A', 2008, 1],['A', 2009, None],['A', 2010, None],à['A', 2011, 1],['B', 2008, 1],['B', 2009, None],['B', 2010, None],['B', 2011, 1]]
for elm in arr2:
if elm not in arr1:
arr1.append(elm)