I asked a quite similar question here, but I was wondering whether there is a way to tackle the issue when one has to rely on multiple columns to perform the append. The dataframes look as follows:
import pandas as pd

d1 = {'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
      'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
      'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
      'col4': ['min', 'max', 'mid', 'min'],
      'col5': ['min', 'max', 'max', 'max']}
d2 = {'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
      'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
      'col3': ['I went out', ' the house was amazing', 'the house is nice'],
      'col4': ['min', 'max', 'max'],
      'col5': ['max', 'min', 'mid']}

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
So this time, I would like to append each row from df2 under the matching row in df1, but only if the rows are equal in all of col1, col2 and col3, so that the output is:
                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid
So I tried the following:
df = pd.concat(
    df1[df1.set_index(['col1', 'col2', 'col3']).index.isin(
        df2.set_index(['col1', 'col2', 'col3']).index)]
).sort_values(df1.set_index(['col1', 'col2', 'col3']).index, ignore_index=True)
but I get this error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
CodePudding user response:
Another solution is to use pd.merge and pd.wide_to_long:
out = (
    pd.wide_to_long(
        pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3']).reset_index(),
        stubnames=['col4', 'col5'], i='index', j='val', sep='_', suffix=r'[xy]')
    .sort_index()
    .reset_index(drop=True)[df1.columns]
)
Output:
>>> out
                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid
Step by step
# Step 1: merge
>>> out = pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3']).reset_index()
   index                      col1                 col2               col3 col4_x col5_x col4_y col5_y
0      0              I ate dinner         I ate dinner         I went out    min    min    min    max
1      1  the play was inetresting  the gold is shining  the house is nice    min    max    max    mid
# Step 2: wide_to_long
>>> out = pd.wide_to_long(out, stubnames=['col4', 'col5'], i='index', j='val', sep='_', suffix=r'[xy]')
                        col3                 col2                      col1 col4 col5
index val
0     x           I went out         I ate dinner              I ate dinner  min  min
1     x    the house is nice  the gold is shining  the play was inetresting  min  max
0     y           I went out         I ate dinner              I ate dinner  min  max
1     y    the house is nice  the gold is shining  the play was inetresting  max  mid
# Step 3: reorder dataframe
>>> out = out.sort_index().reset_index(drop=True)[df1.columns]
                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid
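As a side note, since only the col4/col5 pairs need to be stacked here, the same result can be obtained without wide_to_long by concatenating the _x and _y halves of the merged frame. This is just a sketch for comparison (the names merged, key and halves are my own):

merged = pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3'])
key = ['col1', 'col2', 'col3']

# split the merged frame into its df1 (_x) and df2 (_y) halves,
# renaming col4_x/col5_x (resp. _y) back to col4/col5
halves = [merged[key + [f'col4_{s}', f'col5_{s}']]
          .rename(columns={f'col4_{s}': 'col4', f'col5_{s}': 'col5'})
          for s in ('x', 'y')]

# a stable sort on the original row index interleaves the halves
# in the x-before-y order of the expected output
out = pd.concat(halves).sort_index(kind='stable').reset_index(drop=True)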
CodePudding user response:
OK, I realized my own mistake and will post an answer here in case it is interesting for anyone (the answer is based on the link in the question). The TypeError came from passing a DataFrame directly to pd.concat, which expects an iterable of pandas objects, so the frames have to be wrapped in a list; sort_values then takes the column labels rather than an index object:
mask = df2.set_index(['col1', 'col2', 'col3']).index.isin(
    df1.set_index(['col1', 'col2', 'col3']).index)
print(pd.concat([df1, df2[mask]]).sort_values(['col1', 'col2', 'col3'], ignore_index=True))
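Note that this keeps every row of df1, including the ones with no counterpart in df2, so it prints six rows rather than the four shown in the question. If df1 should also be restricted to the shared keys, a symmetric filter would do it; a minimal sketch (the names key and in_both are my own):

key = ['col1', 'col2', 'col3']
# keep only the df1 rows whose (col1, col2, col3) also occur in df2
in_both = df1.set_index(key).index.isin(df2.set_index(key).index)

# ...then append df2's matches and sort on the key columns
out = (pd.concat([df1[in_both],
                  df2[df2.set_index(key).index.isin(df1.set_index(key).index)]])
         .sort_values(key, ignore_index=True))
print(out)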
CodePudding user response:
I strongly advise you to provide a numeric minimal example instead of a text-based one: it is easier to read and easier to understand. That being said, if I understand correctly, you want, for each row of df1, to:
- check whether some rows of df2 have the same values in the given columns;
- append those rows to df1, right behind the said row.
Of course, we can discuss how duplicates in df1 should be handled. Then, we can write two solutions: one using a for loop and another using Pandas' functional style (which to prefer depends on your skills, habits, and other preferences).
A for-loop approach
Let's assume there are no duplicates in df1. Then:
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

cols = ["col1", "col2", "col3"]

## collect the matching pieces, then concatenate once at the end
## (DataFrame.append was removed in pandas 2.0)
pieces = []

## loop over all rows of df1
for _, row1 in df1.iterrows():
    ## check for equality with df2: all elements of cols must match to select a row
    is_eq_to_row_1 = df2.eq(row1.loc[cols]).loc[:, cols].all(axis=1)
    ## if at least one row of df2 is equal to row1, append them
    if is_eq_to_row_1.any():
        ## first the row of df1 itself...
        pieces.append(row1.to_frame().T)
        ## ...then all rows of df2 equal to it
        pieces.append(df2[is_eq_to_row_1])

new_df = pd.concat(pieces, ignore_index=True)
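With the example data above, this should print the same four rows as the expected output in the question:

>>> print(new_df)
                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid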
A functional approach
I didn't have time to write a proper solution yet, but I guess it involves groupby, apply, and a bunch of related Pandas functions. For each row x, we still use df2[df2.eq(x.loc[cols]).loc[:, cols].all(axis=1)] to select the rows of df2 equal to x. We just need to "loop" over all rows, and a tool designed for that is groupby. Then we don't have to care about duplicates anymore.
new_df = df1.groupby(cols).apply(
    lambda x: pd.concat([x, df2[df2.eq(x.iloc[0, :].loc[cols])
                                   .loc[:, cols].all(axis=1)]]))
There is still some work to do: don't append the df1 rows for which no matching df2 row was found, and clean up the output (index and row order).
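One possible cleanup, as a sketch under the assumption that unmatched df1 rows should be dropped entirely (the helper name match_and_stack is my own; depending on the pandas version, apply may warn about operating on the grouping columns):

cols = ["col1", "col2", "col3"]

def match_and_stack(x):
    # rows of df2 equal to this group's key on cols
    is_eq = df2.eq(x.iloc[0].loc[cols]).loc[:, cols].all(axis=1)
    # returning None makes groupby.apply drop the group entirely
    if not is_eq.any():
        return None
    return pd.concat([x, df2[is_eq]])

new_df = (df1.groupby(cols, sort=False)   # sort=False keeps df1's row order
             .apply(match_and_stack)
             .reset_index(drop=True))     # flatten the (col1, col2, col3, index) MultiIndex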