Home > other >  Replace values from a dataframe with values from another with Pandas
Replace values from a dataframe with values from another with Pandas

Time:07-08

I have two dataframes with identical columns, but different values and different number of rows.

import pandas as pd

data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
         'Price': [500, 400, 0,450,750,0,0,890,500,470,0,415]}

data1 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017],
         'Price': [200, 100, 30,750,350,120,400,370]}

df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

df is the complete dataset but with some old values, whereas df2 only has the updated values. I want to replace all the values that are in df with the values in df2, all while keeping the values from df that aren't in df2.

So for example, in df, the value for Country = Japan, for Product = DEF, in Year = 2016, the Price should be updated from 470 to 400. The same for 2017, while 2018 and 2019 stay the same.

So far I have the following code that doesn't seem to work:

common_index = ['Region','Country','Product','Year']
df = df.set_index(common_index)
df2 = df2.set_index(common_index)
df.update(df2, overwrite = True)

But this only updates df with the values from df2 and deletes everything else.

Expected output should look like this:

data3 = {'Region': ['Africa','Africa','Africa','Africa','Africa','Africa','Africa','Africa','Asia','Asia','Asia','Asia'],
         'Country': ['South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','South Africa','Japan','Japan','Japan','Japan'],
         'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ','XYZ','DEF','DEF','DEF','DEF'],
         'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018, 2019,2016, 2017, 2018, 2019],
         'Price': [200, 100, 30,750,350,120,0,890,400,370,0,415]}

df3 = pd.DataFrame(data3)

Any suggestions on how I can do this?

CodePudding user response:

You can use merge and update:

df.update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'],
                   how='left', suffixes=('_old', None)))

NB. the update is in place.

output:

    Region       Country Product  Year  Price
0   Africa  South Africa     ABC  2016  200.0
1   Africa  South Africa     ABC  2017  100.0
2   Africa  South Africa     ABC  2018   30.0
3   Africa  South Africa     ABC  2019  750.0
4   Africa  South Africa     XYZ  2016  350.0
5   Africa  South Africa     XYZ  2017  120.0
6   Africa  South Africa     XYZ  2018    0.0
7   Africa  South Africa     XYZ  2019  890.0
8     Asia         Japan     DEF  2016  400.0
9     Asia         Japan     DEF  2017  370.0
10    Asia         Japan     DEF  2018    0.0
11    Asia         Japan     DEF  2019  415.0

CodePudding user response:

You can use

df['Price'].update(df.merge(df2, on=['Region', 'Country', 'Product', 'Year'], how='left')['Price_y'])
print(df)

    Region       Country Product  Year  Price
0   Africa  South Africa     ABC  2016    200
1   Africa  South Africa     ABC  2017    100
2   Africa  South Africa     ABC  2018     30
3   Africa  South Africa     ABC  2019    750
4   Africa  South Africa     XYZ  2016    350
5   Africa  South Africa     XYZ  2017    120
6   Africa  South Africa     XYZ  2018      0
7   Africa  South Africa     XYZ  2019    890
8     Asia         Japan     DEF  2016    400
9     Asia         Japan     DEF  2017    370
10    Asia         Japan     DEF  2018      0
11    Asia         Japan     DEF  2019    415

CodePudding user response:

I don't know if this is the case but what if df2 carry something not listed in df1? Here I'm adding a row to df2 with data Asia, Japan, DEF, 2020, 400.

import pandas as pd
import numpy as np

data1 = {
    'Region': ['Africa','Africa','Africa','Africa',
               'Africa','Africa','Africa','Africa',
               'Asia','Asia','Asia','Asia'],
    'Country': ['South Africa','South Africa',
                'South Africa','South Africa','South Africa',
                'South Africa','South Africa','South Africa',
                'Japan','Japan','Japan','Japan'],
    'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','XYZ',
                'XYZ','DEF','DEF','DEF','DEF'],
    'Year': [2016, 2017, 2018, 2019,2016, 2017, 2018,
             2019,2016, 2017, 2018, 2019],
    'Price': [500, 400, 0,450,750,0,0,890,500,
              470,0,415]}

data2 = {
    'Region': ['Africa','Africa','Africa','Africa','Africa',
               'Africa','Asia','Asia', 'Asia'],
    'Country': ['South Africa','South Africa','South Africa',
                'South Africa','South Africa',
                'South Africa','Japan','Japan', 'Japan'],
    'Product': ['ABC','ABC','ABC','ABC','XYZ','XYZ','DEF',
                'DEF', 'DEF'],
    'Year': [2016, 2017, 2018, 2019,2016, 2017,2016, 2017, 2020],
    'Price': [200, 100, 30,750,350,120,400,370, 400]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

Here I call df1 the first dataframe instead of df. Then I'm adding few step so we know exactly what is going on.

First I rename Price to Price_new in df2 then I'll do an outer join between the 2 dataframes.

df2 = df2.rename(columns={"Price": "Price_new"})
cols_merge = ['Region', 'Country', 'Product', 'Year']
df = pd.merge(df1, df2, how="outer", on=cols_merge)

which gives

    Region       Country Product  Year  Price  Price_new
0   Africa  South Africa     ABC  2016  500.0      200.0
1   Africa  South Africa     ABC  2017  400.0      100.0
2   Africa  South Africa     ABC  2018    0.0       30.0
3   Africa  South Africa     ABC  2019  450.0      750.0
4   Africa  South Africa     XYZ  2016  750.0      350.0
5   Africa  South Africa     XYZ  2017    0.0      120.0
6   Africa  South Africa     XYZ  2018    0.0        NaN
7   Africa  South Africa     XYZ  2019  890.0        NaN
8     Asia         Japan     DEF  2016  500.0      400.0
9     Asia         Japan     DEF  2017  470.0      370.0
10    Asia         Japan     DEF  2018    0.0        NaN
11    Asia         Japan     DEF  2019  415.0        NaN
12    Asia         Japan     DEF  2020    NaN      400.0

Now wherever Price_new is not null we update the Price column

df["Price"] = np.where(
    df["Price_new"].notnull(),
    df["Price_new"],
    df["Price"])

The output being

    Region       Country Product  Year  Price  Price_new
0   Africa  South Africa     ABC  2016  200.0      200.0
1   Africa  South Africa     ABC  2017  100.0      100.0
2   Africa  South Africa     ABC  2018   30.0       30.0
3   Africa  South Africa     ABC  2019  750.0      750.0
4   Africa  South Africa     XYZ  2016  350.0      350.0
5   Africa  South Africa     XYZ  2017  120.0      120.0
6   Africa  South Africa     XYZ  2018    0.0        NaN
7   Africa  South Africa     XYZ  2019  890.0        NaN
8     Asia         Japan     DEF  2016  400.0      400.0
9     Asia         Japan     DEF  2017  370.0      370.0
10    Asia         Japan     DEF  2018    0.0        NaN
11    Asia         Japan     DEF  2019  415.0        NaN
12    Asia         Japan     DEF  2020  400.0      400.0

And you can evertually remove the extra column with

df = df.drop(columns=["Price_new"])

Note

The other solutions are great and I upvoted them. I added this to show you that sometime is better to use less specific code in order to have better control and maintainability in your code.

  • Related