Pandas conditional join

Time:12-13

I have a df like the one below

df = pd.DataFrame([[1,1], [2,1], [3,1], [1,2], [2,2], [3,2]], columns = ['Var1', 'Var2'])

Var1    Var2
1       1
2       1
3       1
1       2
2       2
3       2

I would like to merge the df on itself and create a new variable. In SQL I would achieve it by the following statement

select
 a.*
,b.var2 as Var3
from df as a
left join
df as b
on a.var1 = b.var1 and a.var2 + 1 = b.var2

To achieve the following result

Var1    Var2    Var3
1       1       2
2       1       2
3       1       2
1       2   
2       2   
3       2   

When trying to do this in pandas I use the following statement

df['Var3'] = df.merge(df['Var2'], how='left', left_on=['Var1', 'Var2' + 1], right_on=['Var1', 'Var2'])

However it throws the error below at me as it thinks that I want to concatenate the 'Var2' rather than adding on 1 to it.

TypeError: can only concatenate str (not "int") to str 

How do I go about achieving this conditional join on a dataframe in pandas?

Thanks!

CodePudding user response:

Try this:

df['Var3'] = df.assign(tmp=df['Var2'] + 1).merge(df, how='left', left_on=['Var1', 'tmp'], right_on=['Var1', 'Var2'])['Var2_y']

Output:

>>> df
   Var1  Var2  Var3
0     1     1   2.0
1     2     1   2.0
2     3     1   2.0
3     1     2   NaN
4     2     2   NaN
5     3     2   NaN

df.assign(tmp=df['Var2'] + 1) creates a temporary column, tmp, holding every value of Var2 incremented by one. left_on=['Var1', 'Var2' + 1] can't work in Python, because 'Var2' + 1 tries to add an int to the string 'Var2' (hence the TypeError), so you can use the assign trick instead.
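The same one-liner can be sketched step by step. The names left and merged are only illustrative, and the validate check and the Int64 cast are optional extras that are not part of the answer above: validate="many_to_one" raises an error if the right-hand keys are not unique (which would add rows and break the assignment back to df), and the nullable Int64 dtype keeps the matched values as integers instead of floats.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1], [2, 1], [3, 1], [1, 2], [2, 2], [3, 2]],
    columns=["Var1", "Var2"],
)

# Step 1: materialise the "Var2 + 1" join key as a temporary column.
left = df.assign(tmp=df["Var2"] + 1)

# Step 2: left-join on (Var1, tmp) == (Var1, Var2).
# validate is an optional safety check (an addition, not from the answer):
# it fails loudly if (Var1, Var2) is duplicated on the right side.
merged = left.merge(
    df,
    how="left",
    left_on=["Var1", "tmp"],
    right_on=["Var1", "Var2"],
    validate="many_to_one",
)

# Step 3: copy the matched value back; rows without a match become <NA>.
df["Var3"] = merged["Var2_y"].astype("Int64")
```

The assignment in step 3 relies on index alignment: a left merge keeps the left rows in order, and the result here has the same default 0..n-1 index as df.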

Explanation for Var2_y:

The intermediate dataframe immediately after merging looks like this:

>>> df.assign(tmp=df['Var2'] + 1).merge(df, how='left', left_on=['Var1', 'tmp'], right_on=['Var1', 'Var2'])
   Var1  Var2_x  tmp  Var2_y
0     1       1    2     2.0
1     2       1    2     2.0
2     3       1    2     2.0
3     1       2    3     NaN
4     2       2    3     NaN
5     3       2    3     NaN

Both dataframes have a column named Var2, so instead of overwriting one of them, pandas appends the suffixes _x (left side) and _y (right side). The column we want is the right-hand one, Var2_y, so that's the one we select.
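If you'd rather avoid the suffixed columns altogether, one possible variation (a sketch, not from the answer above) is to shift the key on the right-hand side instead and carry the original value along under its final name. The names right and result are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1], [2, 1], [3, 1], [1, 2], [2, 2], [3, 2]],
    columns=["Var1", "Var2"],
)

# Right side: keep the original Var2 as Var3, and lower the join key by one,
# so that "a.Var2 + 1 = b.Var2" becomes a plain equality on Var2.
right = df.assign(Var3=df["Var2"], Var2=df["Var2"] - 1)

# Both key columns now share their names, so on= is enough and
# no _x/_y suffixes are produced.
result = df.merge(right, how="left", on=["Var1", "Var2"])
```

Here result has the columns Var1, Var2 and Var3 directly, with NaN in Var3 for the rows that have no match.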
