Matching in python

Time:07-14

I have two dataframes and want to match them on the following conditions:

  • state matches between df1 and df2
  • the pre_year count matches, i.e. if pre_year is 2018, then column year_2018 must be equal in df1 and df2

If a match is found in df2, I want to create a new dataframe with all the info from df1. From df2 I want the ID and the post-year count (if the pre year is 2018, the post year is 2020; if the post year would be later than 2020, the value should be NA).

Df1

ID    state  pre_year  year_2018  year_2019  year_2020
100A  GA     2018      10         9          7
300A  FL     2020      5          2          6

Df2

ID    state  year_2018  year_2019  year_2020
500A  GA     10         0          0
600A  NY     0          3          0
700A  FL     0          0          0
800A  GA     10         4          1

expected Final Df

df_1_ID  df_1_state  df_1_pre_year  df_1_year_2018  df_1_year_2019  df_1_year_2020  df_2_match_ID  df_2_post_year
100A     GA          2018           10              9               7               500A           0
100A     GA          2018           10              9               7               800A           1

I started with a loop, but I can't figure out how to match the pre-year count.

import pandas as pd

df1 = pd.DataFrame({'ID': ['100A', '300A'],
                    'state': ['GA', 'FL'],
                    'pre_year': [2018, 2020],
                    'year_2018': [10, 5],
                    'year_2019': [9, 2],
                    'year_2020': [7, 6]
                    })

df2 = pd.DataFrame({'ID': ['500A', '600A', '700A', '800A'],
                    'state': ['GA', 'NY', 'FL', 'GA'],
                    'year_2018': [10, 0, 0, 10],
                    'year_2019': [0, 3, 0, 4],
                    'year_2020': [0, 0, 0, 1]
                    })


CodePudding user response:

There are probably several ways to solve this. I would first melt df2 so that the years become one column and the counts another:

df2 = df2.melt(id_vars=['ID', 'state'], 
               var_name="year", 
               value_name="value")
df2.year = df2.year.str.replace('year_', '').astype(int)
df2['pre_year'] = df2.year - 2  # calculate the pre_year for the later pd.merge

     ID state  year  value  pre_year
0  500A    GA  2018     10      2016
1  600A    NY  2018      0      2016
..  ...   ...   ...    ...       ...

With the year column in place, you can now implement your logic, e.g. by looping over the rows.

You can also use the calculated pre_year column (added above): merge on ['state', 'pre_year'] and keep only the rows where pre_year + 2 == year:

df = df1.merge(df2, on=['state', 'pre_year'], how='left', suffixes=('_df1', '_df2'))
df = df[df.pre_year + 2 == df.year]  # also drops df1 rows without a match

  ID_df1 state  pre_year  year_2018  year_2019  year_2020 ID_df2  year  value
0   100A    GA      2018         10          9          7   500A  2020      0
1   100A    GA      2018         10          9          7   800A  2020      1

If needed you can rename the columns or drop columns in a next step.
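
The merge above joins on state and pre_year only, so it may be worth also checking the pre-year count condition from the question. Below is a minimal sketch of one way to do that, not the only one. It assumes the post year is always the pre year plus 2 (the question only gives the 2018 to 2020 example), and the helper names `df2_long`, `left`, `pre`, `post`, `pre_count` and `post_year` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID': ['100A', '300A'],
    'state': ['GA', 'FL'],
    'pre_year': [2018, 2020],
    'year_2018': [10, 5],
    'year_2019': [9, 2],
    'year_2020': [7, 6],
})
df2 = pd.DataFrame({
    'ID': ['500A', '600A', '700A', '800A'],
    'state': ['GA', 'NY', 'FL', 'GA'],
    'year_2018': [10, 0, 0, 10],
    'year_2019': [0, 3, 0, 4],
    'year_2020': [0, 0, 0, 1],
})

# Long form of df2: one row per (ID, state, year) with its count.
df2_long = df2.melt(id_vars=['ID', 'state'], var_name='year', value_name='count')
df2_long['year'] = df2_long['year'].str.replace('year_', '', regex=False).astype(int)

# df1 with prefixed columns plus two helper columns (hypothetical names).
left = df1.add_prefix('df_1_')
# Count that each df1 row has in its own pre year, e.g. year_2018 for pre_year 2018.
left['pre_count'] = [df1.at[i, f'year_{y}'] for i, y in df1['pre_year'].items()]
left['post_year'] = left['df_1_pre_year'] + 2  # assumption: post year = pre year + 2

# Both conditions at once: same state and same count in the pre year.
pre = df2_long.rename(columns={'ID': 'df_2_match_ID', 'state': 'df_1_state',
                               'year': 'df_1_pre_year', 'count': 'pre_count'})
matches = left.merge(pre, on=['df_1_state', 'df_1_pre_year', 'pre_count'], how='inner')

# Look up the post-year count; the left merge leaves NA if df2 has no such year.
post = df2_long.rename(columns={'ID': 'df_2_match_ID', 'year': 'post_year',
                                'count': 'df_2_post_year'}).drop(columns='state')
result = (matches.merge(post, on=['df_2_match_ID', 'post_year'], how='left')
                 .drop(columns=['pre_count', 'post_year']))
print(result)
```

With the sample data, only 100A finds matches (500A and 800A); 300A is dropped because no FL row in df2 has a 2020 count of 6. A matched row whose post year falls after 2020 would get NA in df_2_post_year, since df2 has no column for that year.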
