I have two Pandas dataframes as below and I am trying to replace the 0
values from column Price
in the first one.
The first dataframe looks like this:
Region | Country | Product | Year | Price |
---|---|---|---|---|
Africa | South Africa | ABC | 2016 | 500 |
Africa | South Africa | ABC | 2017 | 400 |
Africa | South Africa | ABC | 2018 | 0 |
Africa | South Africa | ABC | 2019 | 450 |
Africa | South Africa | XYZ | 2016 | 750 |
Africa | South Africa | XYZ | 2017 | 0 |
Africa | South Africa | XYZ | 2018 | 0 |
Africa | South Africa | XYZ | 2019 | 890 |
Asia | Japan | DEF | 2016 | 0 |
Asia | Japan | DEF | 2017 | 0 |
Asia | Japan | DEF | 2018 | 415 |
Asia | Japan | DEF | 2019 | 0 |
It contains price for each type of product for each country for a given year. Some countries have values of 0
for certain years as that data is missing.
The second dataframe contains the average yearly price growth for each region and product and looks like this.
Region | Product | Average Yearly Price Growth |
---|---|---|
Africa | ABC | 0.043 |
Africa | DEF | 0.071 |
Africa | XYZ | 0.0128 |
Asia | XYZ | 0.05 |
Asia | ABC | -0.009 |
Asia | DEF | 0.021 |
The task is to replace all 0
values in the first dataframe with the previous year's value for that particular product in that particular country, plus the average yearly rate increase (assuming the earliest data point is available).
In the case of product DEF
in Japan
, I would want to detect the earliest year where data is available and calculate all the previous years where Price
is 0
from that point as follows:
Product DEF
in Japan
in 2018
has a price
of 415
and is the earliest available data point for Price
. Therefore the price
for 2017
should be 415/(1 0.021)=406.464
. After that, the value for 2016
should be 406.464/(1 0.021)
. Once these values have been calculated, the value for DEF
in Japan
in 2019
should be 415 415*0.021=423.715
.
For other countries such as South Africa
where the price for the earliest year (2016
) is available it should be easier. For example, for product XYZ
in South Africa
in 2017
, instead of 0
the value should be 750 750*0.0128=759.6
. Where 750
is the previous year value for that product in that country and 0.0128
is the average yearly price growth for that product in Africa (from dataframe 2).
Then for the same product but in 2018, the value should be the previously calculated one, 759.6 759.6*0.0128=769.32
.
Here is the sample data:
data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
'Price': [500, 400, 0,450,750,0,0,890,0,0,415,0]}
data2 = {'Region': ['Africa','Africa','Africa','Asia','Asia','Asia'],
'Product': ['ABC','DEF','XYZ','XYZ','ABC','DEF'],
'Average Yearly Price Growth': [0.043, 0.071, 0.0128,0.05,-0.009,0.021]}
df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
My code so far is below, but it only works for situations where the earlies available point is available, like South Africa
. Any ideas on how to deal with situations like Japan
?
for index,rows in df.iterrows():
if rows['Price']==0:
try:
past_year_value = df[(df['Country']==rows['Country']) & (df['Product']==rows['Product']) & (df['Region']==rows['Region']) & (df['Year']==rows['Year']-1)]['Price'].values[0]
df.at[index,'Price'] = past_year_value past_year_value*df2[(df2['Product']==rows['Product']) & (df2['Region']==rows['Region'])]['Average Yearly Price Growth'].values[0]
except:
pass
CodePudding user response:
Simple fix add another for loop after that, this is more like forward fill and backward fill for the value of 0 , since we need to step , just like bfill
and ffill
chain
for index,rows in df.iterrows():
if rows['Price']==0:
try:
past_year_value = df[(df['Country']==rows['Country']) & (df['Product']==rows['Product']) & (df['Region']==rows['Region']) & (df['Year']==rows['Year']-1)]['Price'].values[0]
df.at[index,'Price'] = past_year_value past_year_value*df2[(df2['Product']==rows['Product']) & (df2['Region']==rows['Region'])]['Average Yearly Price Growth'].values[0]
except:
pass
df = df.iloc[::-1]
for index,rows in df.iterrows():
if rows['Price']==0:
try:
past_year_value = df[(df['Country']==rows['Country']) & (df['Product']==rows['Product']) & (df['Region']==rows['Region']) & (df['Year']==rows['Year'] 1)]['Price'].values[0]
df.at[index,'Price'] = past_year_value /(1 df2[(df2['Product']==rows['Product']) & (df2['Region']==rows['Region'])]['Average Yearly Price Growth'].values[0])
except:
pass
df = df.iloc[::-1]
Out[88]:
Region Country Product Year Price
0 Africa South Africa ABC 2016 500
1 Africa South Africa ABC 2017 400
2 Africa South Africa ABC 2018 417
3 Africa South Africa ABC 2019 450
4 Africa South Africa XYZ 2016 750
5 Africa South Africa XYZ 2017 759
6 Africa South Africa XYZ 2018 768
7 Africa South Africa XYZ 2019 890
8 Asia Japan DEF 2016 397
9 Asia Japan DEF 2017 406
10 Asia Japan DEF 2018 415
11 Asia Japan DEF 2019 423