Home > Back-end >  merging data frames with multiple conditions
merging data frames with multiple conditions

Time:08-27

I am trying to merge two large data frames based on two common columns in these data frames. there is a small attempt and debate here but no promising solution

df1.year<=df2.year(same or later year to be manufactured)

df1.maker=df2.maker AND df1.location=df2.location

I prepared a small mock data to explain:

first data frame:

data = np.array([[2014,"toyota","california","corolla"], 
                 [2015,"honda"," california", "civic"], 
                 [2020,"hyndai","florida","accent"], 
                 [2017,"nissan","NaN", "sentra"]]) 


df = pd.DataFrame(data, columns = ['year', 'make','location','model'])
df

second data frame:

data2 = np.array([[2012,"toyota","california","airbag"], 
                 [2017,"toyota","california", "wheel"], 
                 [2022,"hyndai","newyork","seat"], 
                 [2017,"nissan","london", "light"]]) 
 

df2 = pd.DataFrame(data2, columns = ['year', 'make','location','id'])
df2

desired output:

data3 = np.array([[2017,"toyota",'corolla',"california", "wheel"]]) 

df3 = pd.DataFrame(data3, columns = ['year', 'make','model','location','id'])
df3

I tried to use the below approach but it is to slow and also not so accurate:

df4= pd.merge(df,df2, on=['location','make'], how='outer')
df4=df4.dropna()
df4['year'] = df4.apply(lambda x : x['year_y'] if x['year_y'] >= x['year_x'] else "0", axis=1)

CodePudding user response:

You can achieve it with a merge_asof (one to one left merge) and dropna:

# ensure numeric year
df['year'] = pd.to_numeric(df['year'])
df2['year'] = pd.to_numeric(df2['year'])

(pd.merge_asof(df.sort_values('year'),
               df2.sort_values('year')
                  .assign(year2=df2['year']),
               on='year', by=['make', 'location'],
               direction='forward')
   .dropna(subset='id')
   .convert_dtypes('year2')
 )

NB. The intermediate is the size of df.

Output:

   year    make    location    model     id  year2
0  2014  toyota  california  corolla  wheel   2017

one to many

As merge_asof is a one to one left join, if you want a one to many left join (or right join), you can invert the inputs and the direction.

I added an extra row for 2017 to demonstrate the difference.

   year    make    location          id
0  2012  toyota  california      airbag
1  2017  toyota  california       wheel
2  2017  toyota  california  windshield
3  2022  hyndai     newyork        seat
4  2017  nissan      london       light

Right join:

(pd.merge_asof(df2.sort_values('year'),
               df.sort_values('year'),
               on='year', by=['make', 'location'],
               direction='backward')
   .dropna(subset='model')
 )

NB. The intermediate is the size of df2.

Output:

   year    make    location          id    model
1  2017  toyota  california       wheel  corolla
2  2017  toyota  california  windshield  corolla

CodePudding user response:

this should work:

df4= pd.merge(df,df2, on=['location','make'], how='inner')
df4.where(df4.year_x<=df4.year_y).dropna()

Output:

    year_x  make    location    model   year_y  id
1   2014    toyota  california  corolla 2017    wheel

CodePudding user response:

Try this,

  • Code :

    Here 'make' and 'location' are common columns

    df_outer = pd.merge(df, df2, on=['make', 'location'], how='inner') df3 = df_outer[df['year'] <= df2['year']]

  • Related