I have a DataFrame like the one below:
import pandas as pd
df = pd.DataFrame([[1, 1], [2, 1], [3, 1], [1, 2], [2, 2], [3, 2]], columns=['Var1', 'Var2'])
Var1 Var2
1 1
2 1
3 1
1 2
2 2
3 2
I would like to merge the df with itself and create a new variable. In SQL I would achieve it with the following statement:
select
a.*
,b.var2 as Var3
from df as a
left join
df as b
on a.var1 = b.var1 and a.var2 + 1 = b.var2
To achieve the following result
Var1 Var2 Var3
1 1 2
2 1 2
3 1 2
1 2
2 2
3 2
When trying to do this in pandas I use the following statement
df['Var3'] = df.merge(df['Var2'], how='left', left_on=['Var1', 'Var2' + 1], right_on=['Var1', 'Var2'])
However, it throws the error below, because Python tries to concatenate the string 'Var2' with the integer 1 rather than adding 1 to the column's values.
TypeError: can only concatenate str (not "int") to str
How do I go about achieving this conditional join on a dataframe in pandas?
Thanks!
CodePudding user response:
Try this:
df['Var3'] = df.assign(tmp=df['Var2'] + 1).merge(df, how='left', left_on=['Var1', 'tmp'], right_on=['Var1', 'Var2'])['Var2_y']
Output:
>>> df
   Var1  Var2  Var3
0     1     1   2.0
1     2     1   2.0
2     3     1   2.0
3     1     2   NaN
4     2     2   NaN
5     3     2   NaN
df.assign(tmp=df['Var2'] + 1) creates a temporary column containing the values of Var2 incremented by one. You can't write left_on=['Var1', 'Var2' + 1] in Python, because 'Var2' + 1 is evaluated as string plus integer before pandas ever sees it, so you can use the assign trick instead.
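Spelled out step by step, the assign trick might look like the sketch below. It assumes the sample data from the question, a default RangeIndex on df (the final assignment aligns on index), and unique (Var1, Var2) pairs so the left merge returns exactly one row per original row. The names left and merged are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [2, 1], [3, 1], [1, 2], [2, 2], [3, 2]],
                  columns=['Var1', 'Var2'])

# Left side: add a helper key holding Var2 + 1 (the name 'tmp' is arbitrary).
left = df.assign(tmp=df['Var2'] + 1)

# Self-join: Var1 must match, and left.tmp must equal right.Var2.
merged = left.merge(df, how='left',
                    left_on=['Var1', 'tmp'], right_on=['Var1', 'Var2'])

# The right-hand Var2 is suffixed to Var2_y; it becomes the new column.
df['Var3'] = merged['Var2_y']
print(df)
```

Var3 comes back as float because the unmatched rows hold NaN.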
Explanation for Var2_y:
The intermediate dataframe right after merging looks like this:
>>> df.assign(tmp=df['Var2'] + 1).merge(df, how='left', left_on=['Var1', 'tmp'], right_on=['Var1', 'Var2'])
   Var1  Var2_x  tmp  Var2_y
0     1       1    2     2.0
1     2       1    2     2.0
2     3       1    2     2.0
3     1       2    3     NaN
4     2       2    3     NaN
5     3       2    3     NaN
Both sides of the merge have a column named Var2, so instead of overwriting either one, pandas appends the suffixes _x (left side) and _y (right side) to the clashing names. The column we want is Var2_y, so that's the one we select.
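If you would rather not deal with the _x/_y suffixes at all, one option (a sketch, not the only way) is to rename the right-hand column before merging so that no names clash. The names right and result are illustrative, and the sample data from the question is assumed.

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [2, 1], [3, 1], [1, 2], [2, 2], [3, 2]],
                  columns=['Var1', 'Var2'])

# Right side: same data, but Var2 is renamed to Var3 up front,
# so the merge has no overlapping non-key column names to suffix.
right = df.rename(columns={'Var2': 'Var3'})

result = (
    df.assign(tmp=df['Var2'] + 1)              # helper key: Var2 + 1
      .merge(right, how='left',
             left_on=['Var1', 'tmp'], right_on=['Var1', 'Var3'])
      .drop(columns='tmp')                     # helper key no longer needed
)
print(result)
```

This returns a new dataframe with Var1, Var2 and Var3 instead of assigning a column back onto df, so it does not rely on index alignment.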