I have two dataframes and I am trying to replace all values of 0 from the first dataframe with certain values from the second dataframe.
The first dataframe looks like this:
Region | Country | Product | Quantity | Price |
---|---|---|---|---|
Africa | South Africa | ABC | 500 | 1200 |
Africa | South Africa | DEF | 200 | 400 |
Africa | South Africa | XYZ | 0 | 0 |
Africa | Nigeria | DEF | 150 | 450 |
Africa | Nigeria | XYZ | 200 | 750 |
Asia | Japan | XYZ | 100 | 500 |
Asia | Japan | ABC | 200 | 500 |
Asia | Japan | DEF | 0 | 0 |
Asia | India | XYZ | 250 | 600 |
Asia | India | ABC | 100 | 400 |
Asia | India | DEF | 40 | 220 |
It contains quantities and total price for each type of product for each country. Some country and product combinations have values of 0.
The second dataframe looks like this:
Region | Product | Quantity | Price |
---|---|---|---|
Africa | ABC | 375 | 825 |
Africa | DEF | 125 | 250 |
Africa | XYZ | 110 | 300 |
Asia | XYZ | 200 | 500 |
Asia | ABC | 200 | 500 |
Asia | DEF | 120 | 300 |
The second dataframe contains the region averages for quantity and price across each product.
The task is to replace all 0 values in dataframe 1 with the averages for that region and product from dataframe 2.
For example, the 0 values for South Africa for product XYZ in dataframe 1 should be replaced with the values for product XYZ for Africa in dataframe 2. The other values for South Africa should stay the same and so on.
Any tips on how to do this?
CodePudding user response:
First, get all average values merged into the first df, a joining is suffice.
df1_merged = df1.merge(df2, how='left', on=['Region', 'Product'], suffixes=('','_Average'))
we get df1_merged
as
Region | Country | Product | Quantity | Price | Quantity_Average | Price_Average | |
---|---|---|---|---|---|---|---|
0 | Africa | South Africa | ABC | 500 | 1200 | 375 | 825 |
1 | Africa | South Africa | DEF | 200 | 400 | 125 | 250 |
2 | Africa | South Africa | XYZ | 0 | 0 | 110 | 300 |
3 | Africa | Nigeria | DEF | 150 | 450 | 125 | 250 |
4 | Africa | Nigeria | XYZ | 200 | 750 | 110 | 300 |
5 | Asia | Japan | XYZ | 100 | 500 | 200 | 500 |
6 | Asia | Japan | ABC | 200 | 500 | 200 | 500 |
7 | Asia | Japan | DEF | 0 | 0 | 120 | 300 |
8 | Asia | India | XYZ | 250 | 600 | 200 | 500 |
9 | Asia | India | ABC | 100 | 400 | 200 | 500 |
10 | Asia | India | DEF | 40 | 220 | 120 | 300 |
Then we can fillin these zero with Average
columns with conditional loc
df1_merged.loc[df1_merged['Quantity'] == 0, 'Quantity'] = \
df1_merged['Quantity_Average']
df1_merged.loc[df1_merged['Price'] == 0, 'Price'] = \
df1_merged['Price_Average']
now we get df1_merged
as
Region | Country | Product | Quantity | Price | Quantity_Average | Price_Average | |
---|---|---|---|---|---|---|---|
0 | Africa | South Africa | ABC | 500 | 1200 | 375 | 825 |
1 | Africa | South Africa | DEF | 200 | 400 | 125 | 250 |
2 | Africa | South Africa | XYZ | 110 | 300 | 110 | 300 |
3 | Africa | Nigeria | DEF | 150 | 450 | 125 | 250 |
4 | Africa | Nigeria | XYZ | 200 | 750 | 110 | 300 |
5 | Asia | Japan | XYZ | 100 | 500 | 200 | 500 |
6 | Asia | Japan | ABC | 200 | 500 | 200 | 500 |
7 | Asia | Japan | DEF | 120 | 300 | 120 | 300 |
8 | Asia | India | XYZ | 250 | 600 | 200 | 500 |
9 | Asia | India | ABC | 100 | 400 | 200 | 500 |
10 | Asia | India | DEF | 40 | 220 | 120 | 300 |
Finally, drop unneeded columns with
df1_merged = df1_merged.drop(['Quantity_Average', 'Price_Average'], axis=1)
and we get
Region | Country | Product | Quantity | Price | |
---|---|---|---|---|---|
0 | Africa | South Africa | ABC | 500 | 1200 |
1 | Africa | South Africa | DEF | 200 | 400 |
2 | Africa | South Africa | XYZ | 110 | 300 |
3 | Africa | Nigeria | DEF | 150 | 450 |
4 | Africa | Nigeria | XYZ | 200 | 750 |
5 | Asia | Japan | XYZ | 100 | 500 |
6 | Asia | Japan | ABC | 200 | 500 |
7 | Asia | Japan | DEF | 120 | 300 |
8 | Asia | India | XYZ | 250 | 600 |
9 | Asia | India | ABC | 100 | 400 |
10 | Asia | India | DEF | 40 | 220 |
CodePudding user response:
You can use update
after replacing 0
with np.nan
and setting a common index
between the two dataframes.
Be wary of two things:
- Use
overwrite=False
to only fill the null values update
modifiesinplace
common_index = ['Region','Product']
df_indexed = df.replace(0,np.nan).set_index(common_index)
df2_indexed = df2.set_index(common_index)
df_indexed.update(df2_indexed,overwrite=False)
print(df_indexed.reset_index())
Region Product Country Quantity Price
0 Africa ABC South Africa 500.0 1200.0
1 Africa DEF South Africa 200.0 400.0
2 Africa XYZ South Africa 110.0 300.0
3 Africa DEF Nigeria 150.0 450.0
4 Africa XYZ Nigeria 200.0 750.0
5 Asia XYZ Japan 100.0 500.0
6 Asia ABC Japan 200.0 500.0
7 Asia DEF Japan 120.0 300.0
8 Asia XYZ India 250.0 600.0
9 Asia ABC India 100.0 400.0
10 Asia DEF India 40.0 220.0