I am trying to merge two large data frames based on two common columns in these data frames. there is a small attempt and debate here but no promising solution
df1.year<=df2.year(same or later year to be manufactured)
df1.maker=df2.maker AND df1.location=df2.location
I prepared a small mock data to explain:
first data frame:
data = np.array([[2014,"toyota","california","corolla"],
[2015,"honda"," california", "civic"],
[2020,"hyndai","florida","accent"],
[2017,"nissan","NaN", "sentra"]])
df = pd.DataFrame(data, columns = ['year', 'make','location','model'])
df
second data frame:
data2 = np.array([[2012,"toyota","california","airbag"],
[2017,"toyota","california", "wheel"],
[2022,"hyndai","newyork","seat"],
[2017,"nissan","london", "light"]])
df2 = pd.DataFrame(data2, columns = ['year', 'make','location','id'])
df2
desired output:
data3 = np.array([[2017,"toyota",'corolla',"california", "wheel"]])
df3 = pd.DataFrame(data3, columns = ['year', 'make','model','location','id'])
df3
I tried to use the below approach but it is to slow and also not so accurate:
df4= pd.merge(df,df2, on=['location','make'], how='outer')
df4=df4.dropna()
df4['year'] = df4.apply(lambda x : x['year_y'] if x['year_y'] >= x['year_x'] else "0", axis=1)
CodePudding user response:
You can achieve it with a merge_asof
(one to one left merge) and dropna
:
# ensure numeric year
df['year'] = pd.to_numeric(df['year'])
df2['year'] = pd.to_numeric(df2['year'])
(pd.merge_asof(df.sort_values('year'),
df2.sort_values('year')
.assign(year2=df2['year']),
on='year', by=['make', 'location'],
direction='forward')
.dropna(subset='id')
.convert_dtypes('year2')
)
NB. The intermediate is the size of df
.
Output:
year make location model id year2
0 2014 toyota california corolla wheel 2017
one to many
As merge_asof
is a one to one left join, if you want a one to many left join (or right join), you can invert the inputs and the direction.
I added an extra row for 2017 to demonstrate the difference.
year make location id
0 2012 toyota california airbag
1 2017 toyota california wheel
2 2017 toyota california windshield
3 2022 hyndai newyork seat
4 2017 nissan london light
Right join:
(pd.merge_asof(df2.sort_values('year'),
df.sort_values('year'),
on='year', by=['make', 'location'],
direction='backward')
.dropna(subset='model')
)
NB. The intermediate is the size of df2
.
Output:
year make location id model
1 2017 toyota california wheel corolla
2 2017 toyota california windshield corolla
CodePudding user response:
this should work:
df4= pd.merge(df,df2, on=['location','make'], how='inner')
df4.where(df4.year_x<=df4.year_y).dropna()
Output:
year_x make location model year_y id
1 2014 toyota california corolla 2017 wheel
CodePudding user response:
Try this code (here 'make'
and 'location'
are common columns):
df_outer = pd.merge(df, df2, on=['make', 'location'], how='inner')
df3 = df_outer[df['year'] <= df2['year']]