Home > other >  How does rsuffix and lsuffix work while joining multiple dataframes?
How does rsuffix and lsuffix work while joining multiple dataframes?

Time:04-08

I have written the following code however I am unable to understand how to name the rsuffix and lsuffix parameters

dfs_list = []
for cycle in email_df.cycle_end_date.unique():
    temp = email_df[email_df.cycle_end_date == cycle].transpose()\
                    .join(flash_df[flash_df.cycle_end_date == cycle].transpose(), how='outer', lsuffix='email', rsuffix='flash')\
                    .join(sms_df[sms_df.cycle_end_date == cycle].transpose(), how='outer', lsuffix='email', rsuffix='flash')\
                    .join(upi_df[upi_df.cycle_end_date==cycle].transpose(),how='outer',lsuffix='lsuf', rsuffix='rsuf')\
                    .join(ivr_df[ivr_df.cycle_end_date==cycle].transpose(),how='outer',lsuffix='lsuff', rsuffix='rsuff')
    dfs_list.append(temp)

All my dfs have same column names

example:

cycle_end_date | triggered | delivered | cost | payment_value | delivery%
2021-15-01  | 34 | 32 | 4 | 7899 | 5%
2021-31-01  | 45 | 49 | 8 | 1500 | 4%

When I am printing dfs_list[2].reset_index() I do get my expected output but I am unable to comprehend the suffix names. How do we define it?

output:

**index |   2email |    1lsuff |    2flash |    2   | 1rsuff**
0   absolute_cost   3.00    9.40    9.40    0.00    6.00
1   bill_paid_percent   3.28    0.33    1.87    68139.72    0.28
2   bill_paid_using_reminder    21.20   0.70    9.45    1.78    0.64
3   bounced_email   5018    NaN NaN NaN NaN
4   clicked_email   13385   NaN NaN NaN NaN
5   cycle_end_date  2022-02-28  2022-02-28  2022-02-28  2022-02-28  2022-02-28

Can someone throw light on how to name the suffix to get the exact numbers of the dataframe considered?

CodePudding user response:

lsuffix and rsuffix only work when there are overlapping columns between joined dataframes.

Let's look at the continous join in your script

temp = email_df[email_df.cycle_end_date == cycle].transpose()\
         .join(flash_df[flash_df.cycle_end_date == cycle].transpose(), how='outer', lsuffix='email', rsuffix='flash')\
         .join(sms_df[sms_df.cycle_end_date == cycle].transpose(), how='outer', lsuffix='email', rsuffix='flash')\
         .join(upi_df[upi_df.cycle_end_date==cycle].transpose(),how='outer',lsuffix='lsuf', rsuffix='rsuf')\
         .join(ivr_df[ivr_df.cycle_end_date==cycle].transpose(),how='outer',lsuffix='lsuff', rsuffix='rsuff')

Since you do a transpose to all your joined dataframe, so actually you are using dataframe where column names are original index.

                         1
cycle_end_date  2021-31-01
triggered               45
delivered               49
cost                     8
payment_value         1500
delivery%               4%

After you join email_df with flash_df, the 1 might overlap, so the joined df is

                    1email       1flash
cycle_end_date  2021-15-01  2021-15-01
triggered               34          34
delivered               32          32
cost                     4           4
payment_value         7899        7899
delivery%               5%          5%

In the next join with sms_df whose column name is index doesn't overlap with above joined df, so the output might like

                    0email      0flash           1
cycle_end_date  2021-15-01  2021-15-01  2021-15-01
triggered               34          34          34
delivered               32          32          32
cost                     4           4           4
payment_value         7899        7899        7899
delivery%               5%          5%          5%

And this process goes on...

  • Related