Home > Software design >  Replace values of 0 in dataframe using mathematical formulas
Replace values of 0 in dataframe using mathematical formulas

Time:07-05

I have two Pandas dataframes as below and I am trying to replace the 0 values from column Price in the first one.

The first dataframe looks like this:

Region Country Product Year Price
Africa South Africa ABC 2016 500
Africa South Africa ABC 2017 400
Africa South Africa ABC 2018 0
Africa South Africa ABC 2019 450
Africa South Africa XYZ 2016 750
Africa South Africa XYZ 2017 0
Africa South Africa XYZ 2018 0
Africa South Africa XYZ 2019 890
Asia Japan DEF 2016 0
Asia Japan DEF 2017 0
Asia Japan DEF 2018 415
Asia Japan DEF 2019 0

It contains price for each type of product for each country for a given year. Some countries have values of 0 for certain years as that data is missing.

The second dataframe contains the average yearly price growth for each region and product and looks like this.

Region Product Average Yearly Price Growth
Africa ABC 0.043
Africa DEF 0.071
Africa XYZ 0.0128
Asia XYZ 0.05
Asia ABC -0.009
Asia DEF 0.021

The task is to replace all 0 values in the first dataframe with the previous year's value for that particular product in that particular country, plus the average yearly rate increase (assuming the earliest data point is available).

In the case of product DEF in Japan, I would want to detect the earliest year where data is available and calculate all the previous years where Price is 0 from that point as follows:

Product DEF in Japan in 2018 has a price of 415 and is the earliest available data point for Price. Therefore the price for 2017 should be 415/(1 0.021)=406.464. After that, the value for 2016 should be 406.464/(1 0.021). Once these values have been calculated, the value for DEF in Japan in 2019 should be 415 415*0.021=423.715.

For other countries such as South Africa where the price for the earliest year (2016) is available it should be easier. For example, for product XYZ in South Africa in 2017, instead of 0 the value should be 750 750*0.0128=759.6. Where 750 is the previous year value for that product in that country and 0.0128 is the average yearly price growth for that product in Africa (from dataframe 2).

Then for the same product but in 2018, the value should be the previously calculated one, 759.6 759.6*0.0128=769.32.

Here is the sample data:

data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
         'Price': [500, 400, 0,450,750,0,0,890,0,0,415,0]}

data2 = {'Region': ['Africa','Africa','Africa','Asia','Asia','Asia'],
         'Product': ['ABC','DEF','XYZ','XYZ','ABC','DEF'],
         'Average Yearly Price Growth': [0.043, 0.071, 0.0128,0.05,-0.009,0.021]}

df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

My code so far is below, but it only works for situations where the earlies available point is available, like South Africa. Any ideas on how to deal with situations like Japan?

for index,rows in df.iterrows():
     if rows['Price']==0:
        try:
                    past_year_value = df[(df['Country']==rows['Country']) & (df['Product']==rows['Product']) & (df['Region']==rows['Region']) & (df['Year']==rows['Year']-1)]['Price'].values[0]
                    df.at[index,'Price'] = past_year_value   past_year_value*df2[(df2['Product']==rows['Product']) & (df2['Region']==rows['Region'])]['Average Yearly Price Growth'].values[0]
        except:
                    pass

CodePudding user response:

Simple fix add another for loop after that, this is more like forward fill and backward fill for the value of 0 , since we need to step , just like bfill and ffill chain

for index,rows in df.iterrows():
     if rows['Price']==0:
        try:
                    past_year_value = df[(df['Country']==rows['Country']) & (df['Product']==rows['Product']) & (df['Region']==rows['Region']) & (df['Year']==rows['Year']-1)]['Price'].values[0]
                    df.at[index,'Price'] = past_year_value   past_year_value*df2[(df2['Product']==rows['Product']) & (df2['Region']==rows['Region'])]['Average Yearly Price Growth'].values[0]
        except:
                    pass

df = df.iloc[::-1]
for index,rows in df.iterrows():
     if rows['Price']==0:
        try:
                    past_year_value = df[(df['Country']==rows['Country']) & (df['Product']==rows['Product']) & (df['Region']==rows['Region']) & (df['Year']==rows['Year'] 1)]['Price'].values[0]
                    df.at[index,'Price'] = past_year_value /(1 df2[(df2['Product']==rows['Product']) & (df2['Region']==rows['Region'])]['Average Yearly Price Growth'].values[0])
        except:
                    pass
df = df.iloc[::-1]
Out[88]: 
    Region       Country Product  Year  Price
0   Africa  South Africa     ABC  2016    500
1   Africa  South Africa     ABC  2017    400
2   Africa  South Africa     ABC  2018    417
3   Africa  South Africa     ABC  2019    450
4   Africa  South Africa     XYZ  2016    750
5   Africa  South Africa     XYZ  2017    759
6   Africa  South Africa     XYZ  2018    768
7   Africa  South Africa     XYZ  2019    890
8     Asia         Japan     DEF  2016    397
9     Asia         Japan     DEF  2017    406
10    Asia         Japan     DEF  2018    415
11    Asia         Japan     DEF  2019    423
  • Related