pandas: append rows to another dataframe under the similar row based on multiple columns


I asked quite a similar question here, but I was wondering whether there is a way to tackle the issue when one has to rely on multiple columns to perform the append. The dataframes look as follows:

    import pandas as pd

    d1 = {'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
          'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
          'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
          'col4': ['min', 'max', 'mid', 'min'],
          'col5': ['min', 'max', 'max', 'max']}

    d2 = {'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
          'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
          'col3': ['I went out', ' the house was amazing', 'the house is nice'],
          'col4': ['min', 'max', 'max'],
          'col5': ['max', 'min', 'mid']}

    df1 = pd.DataFrame(d1)
    df2 = pd.DataFrame(d2)

This time, I would like to append the rows from df2 under the similar rows in df1 only if the rows match in all of col1, col2 and col3, so the output is:

                           col1                 col2               col3 col4 col5
    0              I ate dinner         I ate dinner         I went out  min  min
    1              I ate dinner         I ate dinner         I went out  min  max
    2  the play was inetresting  the gold is shining  the house is nice  min  max
    3  the play was inetresting  the gold is shining  the house is nice  max  mid

So I tried the following,

    df = pd.concat(df1[df1.set_index(['col1','col2','col3']).index.isin(df2.set_index(['col1','col2','col3']).index)]).sort_values(df1.set_index(['col1','col2','col3']).index, ignore_index=True)

but I get this error,

 TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"

CodePudding user response:

Another solution is to use pd.merge and pd.wide_to_long:

out = (
    pd.wide_to_long(
        pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3']).reset_index(),
        stubnames=['col4', 'col5'], i='index', j='val', sep='_', suffix=r'[xy]')
      .sort_index().reset_index(drop=True)[df1.columns]
)

Output:

>>> out
                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid

Step by step

# Step 1: merge
>>> out = pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3']).reset_index()
   index                      col1                 col2               col3 col4_x col5_x col4_y col5_y
0      0              I ate dinner         I ate dinner         I went out    min    min    min    max
1      1  the play was inetresting  the gold is shining  the house is nice    min    max    max    mid

# Step 2: wide_to_long
>>> out = pd.wide_to_long(out, stubnames=['col4', 'col5'], i='index', j='val', sep='_', suffix=r'[xy]')
                        col3                 col2                      col1 col4 col5
index val                                                                            
0     x           I went out         I ate dinner              I ate dinner  min  min
1     x    the house is nice  the gold is shining  the play was inetresting  min  max
0     y           I went out         I ate dinner              I ate dinner  min  max
1     y    the house is nice  the gold is shining  the play was inetresting  max  mid

# Step 3: reorder dataframe
>>> out = out.sort_index().reset_index(drop=True)[df1.columns]
                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid
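As a side note (not part of the answer above): the same result can be reached without wide_to_long by stacking the `_x` and `_y` halves of the merge, since here the suffixes only mark which frame a value came from. A sketch using the question's data:

```python
import pandas as pd

# the question's data
d1 = {'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
      'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
      'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
      'col4': ['min', 'max', 'mid', 'min'],
      'col5': ['min', 'max', 'max', 'max']}
d2 = {'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
      'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
      'col3': ['I went out', ' the house was amazing', 'the house is nice'],
      'col4': ['min', 'max', 'max'],
      'col5': ['max', 'min', 'mid']}
df1, df2 = pd.DataFrame(d1), pd.DataFrame(d2)

keys = ['col1', 'col2', 'col3']
merged = pd.merge(df1, df2, how='inner', on=keys)

# rename each suffixed half back to col4/col5 and stack the two halves
x_half = merged[keys + ['col4_x', 'col5_x']].set_axis(keys + ['col4', 'col5'], axis=1)
y_half = merged[keys + ['col4_y', 'col5_y']].set_axis(keys + ['col4', 'col5'], axis=1)

out = (
    pd.concat([x_half, y_half])
      .sort_values(keys, kind='stable')  # stable sort keeps each x row before its y row
      .reset_index(drop=True)
)
print(out)
```

This only works as a drop-in equivalent because both frames share the same value columns (col4, col5); wide_to_long is the more general tool when the suffixed columns do not pair up so neatly.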

CodePudding user response:

OK, I realized my own mistake and will post an answer here in case it is interesting for anyone (the answer is based on the link in the question):

print(
    pd.concat([df1,
               df2[df2.set_index(['col1', 'col2', 'col3']).index
                      .isin(df1.set_index(['col1', 'col2', 'col3']).index)]])
      .sort_values(['col1', 'col2', 'col3'], ignore_index=True)
)
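For context on the TypeError in the question: pd.concat expects a list (or other iterable) of pandas objects, not a single DataFrame, which is exactly what the traceback says. A minimal sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})

try:
    pd.concat(df)  # a bare DataFrame raises TypeError
except TypeError as e:
    print(e)

out = pd.concat([df, df])  # wrapping the frames in a list works
print(len(out))
```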

CodePudding user response:

I strongly advise you to post a minimal numeric example instead of a text-based one: it is easier to read and easier to understand. That being said, if I understand correctly, you want, for each row of df1, to:

  1. check whether some rows of df2 have the same values as that row in some columns;
  2. append those rows to df1, right behind the said row.

Of course, we can discuss the case of duplicates in df1 and how you want to process them. Then we can write two solutions: one using a for loop, and another using functional programming from pandas (depending on your skills, habits, and other preferences).

A for-loop approach

Let's assume there is no duplicate in df1. Note that DataFrame.append was deprecated and removed in pandas 2.0, so the loop collects the matching pieces in a list and concatenates them once at the end:

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

cols = ["col1", "col2", "col3"]

## collect the pieces of the new dataframe
pieces = []

## loop over all rows of df1
for _, row1 in df1.iterrows():

    ## check for equality with df2: all elements of cols must be equal to select these rows
    is_eq_to_row1 = df2[cols].eq(row1[cols]).all(axis=1)

    ## if at least one row of df2 is equal to row1, append them
    if is_eq_to_row1.any():
        ## first the row from df1 ...
        pieces.append(row1.to_frame().T)
        ## ... then all rows of df2 equal to row1
        pieces.append(df2[is_eq_to_row1])

new_df = pd.concat(pieces, ignore_index=True)

A functional approach

I didn't have time yet to write a proper solution, but I guess it involves groupby, apply and a bunch of pandas-related functions. For each row x, we still use df2[df2.eq(x.loc[cols]).loc[:, cols].all(axis=1)] to select df2's rows equal to x.

We just need to "loop" over all rows, and a tool designed for that is groupby. Then we don't have to care about duplicates anymore.

new_df = df1.groupby(cols).apply(
    lambda x: pd.concat([x, df2[df2.eq(x.iloc[0].loc[cols]).loc[:, cols].all(axis=1)]])
)

There is still some work to do: do not append groups for which no df2 row was found, and clean up the output.
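One possible way to finish that clean-up (a sketch, not the answerer's final code): returning None from the applied function drops df1 groups without a df2 counterpart, and grouping by a tuple key built outside df1 keeps all of df1's columns visible inside apply across pandas versions.

```python
import pandas as pd

# the question's data
d1 = {'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
      'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
      'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
      'col4': ['min', 'max', 'mid', 'min'],
      'col5': ['min', 'max', 'max', 'max']}
d2 = {'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
      'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
      'col3': ['I went out', ' the house was amazing', 'the house is nice'],
      'col4': ['min', 'max', 'max'],
      'col5': ['max', 'min', 'mid']}
df1, df2 = pd.DataFrame(d1), pd.DataFrame(d2)

cols = ['col1', 'col2', 'col3']

def with_matches(group):
    # rows of df2 equal to this group's key on all three columns
    matches = df2[df2[cols].eq(group.iloc[0][cols]).all(axis=1)]
    if matches.empty:
        return None  # returning None drops the group from the result
    return pd.concat([group, matches])

new_df = (
    df1.groupby(df1[cols].apply(tuple, axis=1))  # external key: df1's columns stay in `group`
       .apply(with_matches)
       .reset_index(drop=True)
)
print(new_df)
```

On the question's data this reproduces the four-row output shown earlier; groups are emitted in sorted key order, which happens to match the expected ordering here.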
