I have two dataframes and want to match them on the following conditions:
- state matches between df1 and df2
- the pre_year count matches, i.e. if pre_year is 2018, then column year_2018 in df1 and df2 should match
If a match is found in df2, I want to create a new df with all the info from df1. From df2 I want the ID and the post-year count (if the pre year is 2018, the post year is 2020; if the post year would be later than 2020, then NA).
Df1
ID | state | pre_year | year_2018 | year_2019 | year_2020 |
---|---|---|---|---|---|
100A | GA | 2018 | 10 | 9 | 7 |
300A | FL | 2020 | 5 | 2 | 6 |
Df2
ID | state | year_2018 | year_2019 | year_2020 |
---|---|---|---|---|
500A | GA | 10 | 0 | 0 |
600A | NY | 0 | 3 | 0 |
700A | FL | 0 | 0 | 0 |
800A | GA | 10 | 4 | 1 |
Expected final df
df_1_ID | df_1_state | df_1_pre_year | df_1_year_2018 | df_1_year_2019 | df_1_year_2020 | df_2_match_ID | df_2_post_year |
---|---|---|---|---|---|---|---|
100A | GA | 2018 | 10 | 9 | 7 | 500A | 0 |
100A | GA | 2018 | 10 | 9 | 7 | 800A | 1 |
I started with a loop, but I can't figure out how to match the pre-year count.
import pandas as pd
df1 = pd.DataFrame({'ID' : ['100A', '300A'],
'state':['GA', 'FL'],
'pre_year':[2018, 2020],
'year_2018':[10, 5],
'year_2019':[9, 2],
'year_2020':[7, 6]
})
df2 = pd.DataFrame({'ID' : ['500A', '600A', '700A', '800A'],
'state':['GA', 'NY', 'FL','GA'],
'year_2018':[10, 0,0,10],
'year_2019':[0,3,0,4],
'year_2020':[0,0,0,1]
})
CodePudding user response:
There are probably several solutions to your problem. I would first melt df2 so that the years become one column and the values another:
df2 = df2.melt(id_vars=['ID', 'state'],
var_name="year",
value_name="value")
df2.year = df2.year.str.replace('year_', '').astype(int)
df2['pre_year'] = df2.year - 2 # calculate the pre_year for the later pd.merge
| | ID | state | year | value | pre_year |
---|---|---|---|---|---|
0 | 500A | GA | 2018 | 10 | 2016 |
1 | 600A | NY | 2018 | 0 | 2016 |
... | ... | ... | ... | ... | ... |
With the year column in place, you can now implement your logic, e.g. by looping over the rows.
Alternatively, since the calculated pre_year column was already added above, you can merge on ['state', 'pre_year'] and then keep only the rows where pre_year + 2 == year:
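# merge on the shared keys; df1 has no 'year' column, so 'pre_year' is the join key
df = df1.merge(df2, on=['state', 'pre_year'], how='left', suffixes=('_df1', '_df2'))
# drop df1 rows without a match in df2 (their year is NaN after the left merge)
df = df[df.pre_year + 2 == df.year]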
| | ID_df1 | state | pre_year | year_2018 | year_2019 | year_2020 | ID_df2 | year | value |
---|---|---|---|---|---|---|---|---|---|
0 | 100A | GA | 2018 | 10 | 9 | 7 | 500A | 2020 | 0 |
1 | 100A | GA | 2018 | 10 | 9 | 7 | 800A | 2020 | 1 |
If needed, you can rename or drop columns in a next step.
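Note that the merge above joins on state and pre_year only. It gives the expected result for the sample data, but it does not compare the pre-year counts themselves. Here is a minimal sketch that also enforces that condition, assuming the post year is always pre_year + 2 as in the question. The helper `to_long` and the intermediate names (`pre_count`, `post_year`, `result`) are made up for illustration and are not part of the original code.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['100A', '300A'],
                    'state': ['GA', 'FL'],
                    'pre_year': [2018, 2020],
                    'year_2018': [10, 5],
                    'year_2019': [9, 2],
                    'year_2020': [7, 6]})
df2 = pd.DataFrame({'ID': ['500A', '600A', '700A', '800A'],
                    'state': ['GA', 'NY', 'FL', 'GA'],
                    'year_2018': [10, 0, 0, 10],
                    'year_2019': [0, 3, 0, 4],
                    'year_2020': [0, 0, 0, 1]})


def to_long(df, id_vars, value_name):
    """Hypothetical helper: turn the year_XXXX columns into (year, value) rows."""
    out = df.melt(id_vars=id_vars, var_name='year', value_name=value_name)
    out['year'] = out['year'].str.replace('year_', '', regex=False).astype(int)
    return out


# Count of each df1 row in its own pre_year
long1 = to_long(df1, ['ID', 'state', 'pre_year'], 'pre_count')
pre1 = long1.loc[long1['year'] == long1['pre_year'], ['ID', 'pre_count']]
left = df1.merge(pre1, on='ID')

# df2 in long form, one row per (ID, state, year)
long2 = to_long(df2, ['ID', 'state'], 'count')

# Condition 1 and 2: same state and same count in the pre_year
pre2 = long2.rename(columns={'ID': 'df_2_match_ID',
                             'year': 'pre_year',
                             'count': 'pre_count'})
matched = left.merge(pre2, on=['state', 'pre_year', 'pre_count'], how='inner')

# Post-year count from df2; stays NaN when pre_year + 2 is not a column of df2
post2 = long2.rename(columns={'ID': 'df_2_match_ID',
                              'year': 'post_year',
                              'count': 'df_2_post_year'})
post2 = post2[['df_2_match_ID', 'post_year', 'df_2_post_year']]
matched['post_year'] = matched['pre_year'] + 2
result = (matched.merge(post2, on=['df_2_match_ID', 'post_year'], how='left')
                 .drop(columns=['pre_count', 'post_year'])
                 .sort_values(['ID', 'df_2_match_ID'])
                 .reset_index(drop=True))
print(result)
```

With the sample data, only 100A finds partners (500A and 800A). 300A is dropped because no FL row in df2 has a 2020 count of 6. If you want to keep unmatched df1 rows, change the first merge on ['state', 'pre_year', 'pre_count'] to how='left'.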