Home > database >  Pandas: how to make a join based on a condition between two columns of 2 seperate dataframes
Pandas: how to make a join based on a condition between two columns of 2 seperate dataframes

Time:05-19

Sorry for novice question. I just have no idea how to approach it.

I have 2 dataframes that I would like to join and create another dataframe based on the condition where df1['D'] - df2['D'] results in 1.

df1:

df1 = pd.DataFrame(
{
    "A": 1.0,
    "B": pd.date_range('2015-02-24', periods=4, freq='D'),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([5] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
}

)

and df 2

df2 = pd.DataFrame(
    {
        "B": pd.date_range('2015-02-24', periods=4, freq='D'),
        "Y": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([1,2,3,4]),
        
    }
)

that look like this

df1:

      A          B        C     D     E       F
0   1.00    2015-02-24  1.00    5   test    foo
1   1.00    2015-02-25  1.00    5   train   foo
2   1.00    2015-02-26  1.00    5   test    foo
3   1.00    2015-02-27  1.00    5   train   foo

df2:

        B        Y      D
0   2015-02-24  1.00    1
1   2015-02-25  1.00    2
2   2015-02-26  1.00    3
3   2015-02-27  1.00    4

desired result:

df3

     A           B        C     D     E       F   Y     D
    1.00    2015-02-27  1.00    5   train   foo  1.00   4

I have tried solutions from these questions:

Pandas: Join dataframe with condition

How to do/workaround a conditional join in python Pandas?

pandas join dataframes based on conditions

and while they do resolve the issue for this particular question it is not exactly what I want to do.

What i want to do is to create condition df1['D'] - df2['D'] and then make python check each row of both dataframes and only then create a left join on date.

SQL version of this is:

SELECT *
FROM df1
LEFT JOIN df2
ON df1.b = df2.b
WHERE df1.d - df2.d = 1;

P.S. I apologise in advance if I cant upvote the answers. still dont have enough rep =(

CodePudding user response:

one liner:

df3 = pd.merge(df1,df2, on="B")\
    .where((df1.D-df2.D)==1)\
    .dropna()\
    .reset_index(drop=True)

out:

     A          B    C  D_x      E    F    Y  D_y
0  1.0 2015-02-27  1.0  5.0  train  foo  1.0  4.0
  • Related