Home > Software engineering >  How to do append based on multiple filter on pandas dataframe more effectively
How to do append based on multiple filter on pandas dataframe more effectively

Time:04-21

Here's my dataset df1

Id   Value  month   Year
1    672        4   2020  
1    356        6   2020
2    682        6   2019  
3    366        4   2021

Here's my dataset df2

Id   Value  month   Year
1    671        4   2020  
1    353        6   2020
2    682        6   2019  
3    363        4   2021

Here's my expected dataset df that is used df2 from month=5 Year=2020 and before that using df1

Id   Value  month   Year
1    671        4   2020  
1    353        6   2020
2    682        6   2019  
3    363        4   2021

Note: The original need is using pyspark, but in this question I'm exploring pandas alternatives

My Idea:

df1['code'] = df1['year']*100   df1['month']
df2['code'] = df2['year']*100   df2['month']
df1 = df1[df1['code'] =< 202004]
df2 = df2[df2['code'] => 202005]
df = df1.append(df2)

I think there's a way that do that more effective way

CodePudding user response:

Another option is to use mask:

df = df1.mask((df1['month'].ge(5) & df1['Year'].eq(2020)) | df1['Year'].ge(2021), df2)

Output:

   Id  Value  month  Year
0   1    672      4  2020
1   1    353      6  2020
2   2    682      6  2019
3   3    363      4  2021

CodePudding user response:

Assuming the dataframes are aligned, you can use combine_first:

(df1
 .where(df1['Year'].mul(100).add(df1['month']).lt(202004))
 .combine_first(df2)
 .convert_dtypes()
 )

Or more simply:

(df1
 .where(df1['Year'].mul(100).add(df1['month']).lt(202004), df2)
 )

Output:

   Id  Value  month  Year
0   1    671      4  2020
1   1    353      6  2020
2   2    682      6  2019
3   3    363      4  2021

Otherwise your approach is correct, just use concat in place of append

df1 = df1[(df1['year']*100   df1['month']) =< 202004]
df2 = df2[(df2['year']*100   df2['month']) => 202005]
df = pd.concat([df1,df2])
  • Related