(Pandas) How to replace certain values of a column from a different dataset but leave other values i

Let's say I have the dataset:

import pandas as pd

df1 = pd.DataFrame()
df1['number'] = [0,0,0,0,0]
df1["decade"] = ["1970", "1980", "1990", "2000", "2010"]

print(df1)

#output:
   number decade
0       0   1970
1       0   1980
2       0   1990
3       0   2000
4       0   2010

and I want to merge it with another dataset:

df2 = pd.DataFrame()
df2['number'] = [1,1]
df2["decade"] = ["1990", "2010"]

print(df2)

#output:
   number decade
0       1   1990
1       1   2010

such that it takes values only for the decades that appear in df2 and leaves the others untouched, yielding:

   number decade
0       0   1970
1       0   1980
2       1   1990
3       0   2000
4       1   2010

How would one go about doing that in pandas? I've tried join, merge, and concat, but they all seem to either not give the desired result or fail because the two datasets have different dimensions. Any suggestions regarding which function I should be looking at?

Thank you so much!

CodePudding user response：

You can use pandas.DataFrame.merge with pandas.DataFrame.fillna:

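out = (
        df1[["decade"]]
            # left merge keeps every decade from df1; decades missing in df2 get NaN
            .merge(df2, on="decade", how="left")
            # fill those NaNs from df1 (aligned on the index); downcast restores ints
            # note: the downcast argument is deprecated in newer pandas releases
            .fillna({"number": df1["number"]}, downcast="infer")
      )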

# Output :

print(out)

  decade  number
0   1970       0
1   1980       0
2   1990       1
3   2000       0
4   2010       1
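
For a version that does not rely on the downcast argument, the same merge-then-fill idea can be sketched as below. This is only an illustration, not the answerer's code: the "_new" suffix and the intermediate name merged are choices made here, and it assumes each decade appears at most once in df2 (otherwise the left merge would duplicate rows).

```python
import pandas as pd

df1 = pd.DataFrame({"number": [0, 0, 0, 0, 0],
                    "decade": ["1970", "1980", "1990", "2000", "2010"]})
df2 = pd.DataFrame({"number": [1, 1], "decade": ["1990", "2010"]})

# Left merge keeps every row of df1; df2's values land in "number_new",
# which is NaN for decades that do not occur in df2.
merged = df1.merge(df2, on="decade", how="left", suffixes=("", "_new"))

# Prefer df2's value where present, otherwise keep df1's original value,
# then cast back to df1's integer dtype (the NaNs forced a float column).
merged["number"] = (
    merged["number_new"].fillna(merged["number"]).astype(df1["number"].dtype)
)
out = merged.drop(columns="number_new")
print(out)
```

Keeping both number columns side by side until the final drop makes it easy to inspect which rows were actually overwritten.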

CodePudding user response：

What about using apply?

First, create a function:

def validation(previous,latest):
    if pd.isna(latest):
        return previous
    else:
        return latest

Then you can use DataFrame.apply to compare each row of df1 against df2:

df1['number'] = df1.apply(lambda row: validation(row['number'],df2.loc[df2['decade'] == row.decade].number.max()),axis = 1)

Your result:

   number decade
0       0   1970
1       0   1980
2       1   1990
3       0   2000
4       1   2010
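
The row-by-row lookup above filters df2 once per row of df1, which can get slow on larger frames. A vectorized sketch of the same idea, using Series.map, might look like the following. The name lookup is introduced here for illustration, and the approach assumes each decade appears only once in df2, since map needs a uniquely indexed Series.

```python
import pandas as pd

df1 = pd.DataFrame({"number": [0, 0, 0, 0, 0],
                    "decade": ["1970", "1980", "1990", "2000", "2010"]})
df2 = pd.DataFrame({"number": [1, 1], "decade": ["1990", "2010"]})

# Build a decade -> number lookup from df2 (decades must be unique here).
lookup = df2.set_index("decade")["number"]

# map() yields NaN for decades absent from df2; fillna() restores df1's
# original value there, and astype() undoes the float upcast caused by NaN.
df1["number"] = (
    df1["decade"].map(lookup).fillna(df1["number"]).astype(df1["number"].dtype)
)
print(df1)
```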