Here's my dataset df1
Id Value month Year
1 672 4 2020
1 356 6 2020
2 682 6 2019
3 366 4 2021
Here's my dataset df2
Id Value month Year
1 671 4 2020
1 353 6 2020
2 682 6 2019
3 363 4 2021
Here's my expected dataset df, which uses df2 from month=5, Year=2020 onward and df1 before that
Id Value month Year
1 671 4 2020
1 353 6 2020
2 682 6 2019
3 363 4 2021
Note: The original need is using pyspark, but in this question I'm exploring pandas alternatives
My Idea:
df1['code'] = df1['Year']*100 + df1['month']
df2['code'] = df2['Year']*100 + df2['month']
df1 = df1[df1['code'] <= 202004]
df2 = df2[df2['code'] >= 202005]
df = df1.append(df2)
I think there's a more effective way to do this.
CodePudding user response:
Another option is to use mask:
df = df1.mask((df1['month'].ge(5) & df1['Year'].eq(2020)) | df1['Year'].ge(2021), df2)
Output:
Id Value month Year
0 1 672 4 2020
1 1 353 6 2020
2 2 682 6 2019
3 3 363 4 2021
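For reference, a minimal runnable sketch of this answer, rebuilding the sample frames from the question (column names as given there):

```python
import pandas as pd

# sample data from the question
df1 = pd.DataFrame({'Id': [1, 1, 2, 3], 'Value': [672, 356, 682, 366],
                    'month': [4, 6, 6, 4], 'Year': [2020, 2020, 2019, 2021]})
df2 = pd.DataFrame({'Id': [1, 1, 2, 3], 'Value': [671, 353, 682, 363],
                    'month': [4, 6, 6, 4], 'Year': [2020, 2020, 2019, 2021]})

# rows where the condition holds are replaced with df2; the rest keep df1
df = df1.mask((df1['month'].ge(5) & df1['Year'].eq(2020)) | df1['Year'].ge(2021), df2)
print(df)
```

Note that the boolean Series condition is broadcast across all columns, so whole rows are swapped at once; this relies on df1 and df2 sharing the same index.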
CodePudding user response:
Assuming the dataframes are aligned, you can use combine_first:
(df1
.where(df1['Year'].mul(100).add(df1['month']).lt(202004))
.combine_first(df2)
.convert_dtypes()
)
Or more simply:
(df1
.where(df1['Year'].mul(100).add(df1['month']).lt(202004), df2)
)
Output:
Id Value month Year
0 1 671 4 2020
1 1 353 6 2020
2 2 682 6 2019
3 3 363 4 2021
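A self-contained sketch of the combine_first variant, assuming the same sample frames as in the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 1, 2, 3], 'Value': [672, 356, 682, 366],
                    'month': [4, 6, 6, 4], 'Year': [2020, 2020, 2019, 2021]})
df2 = pd.DataFrame({'Id': [1, 1, 2, 3], 'Value': [671, 353, 682, 363],
                    'month': [4, 6, 6, 4], 'Year': [2020, 2020, 2019, 2021]})

# build a sortable YYYYMM-style key, e.g. 202004
code = df1['Year'].mul(100).add(df1['month'])

# where() blanks out rows with code >= 202004; combine_first fills them from df2
df = df1.where(code.lt(202004)).combine_first(df2).convert_dtypes()
print(df)
```

The convert_dtypes call restores integer dtypes after where() introduces NaN (and hence floats) in the masked rows.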
Otherwise your approach is correct, just use concat in place of append (which was removed in pandas 2.0):
df1 = df1[(df1['Year']*100 + df1['month']) <= 202004]
df2 = df2[(df2['Year']*100 + df2['month']) >= 202005]
df = pd.concat([df1,df2])
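A runnable version of this filter-and-concat approach on the sample data; the final sort_index (an addition here) restores the original row order after concatenation:

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 1, 2, 3], 'Value': [672, 356, 682, 366],
                    'month': [4, 6, 6, 4], 'Year': [2020, 2020, 2019, 2021]})
df2 = pd.DataFrame({'Id': [1, 1, 2, 3], 'Value': [671, 353, 682, 363],
                    'month': [4, 6, 6, 4], 'Year': [2020, 2020, 2019, 2021]})

# YYYYMM-style cutoff key
df1['code'] = df1['Year']*100 + df1['month']
df2['code'] = df2['Year']*100 + df2['month']

# keep df1 up to 2020-04, df2 from 2020-05 on, then stitch them back together
df = pd.concat([df1[df1['code'] <= 202004], df2[df2['code'] >= 202005]])
df = df.sort_index().drop(columns='code')
print(df)
```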