Home > Software design >  merge two dataframe that share the same column value
merge two dataframe that share the same column value

Time:11-04

I have two dataframe df1 =

C L O D LN M x y z w
FR 312 73 2021-04-09 1 W 1 0 0 0
FR 312 73 2021-04-09 1 W 0 1 0 0
FR 312 73 2021-04-09 1 W 0 0 1 0
FR 312 73 2021-04-09 3 W 1 0 0 0
FR 312 73 2021-04-09 3 W 0 1 0 0
FR 312 73 2021-04-09 3 W 0 0 1 0
FR 312 73 2021-04-09 4 W 1 0 0 0
FR 312 73 2021-04-09 4 W 0 1 0 0
FR 312 73 2021-04-09 4 W 0 0 1 0
FR 312 73 2021-04-09 5 W 1 0 0 0
FR 312 73 2021-04-09 5 W 0 1 0 0

df2 =

C L O D LN M x y z w
FR 312 73 2021-04-09 1 L 1 0 0 1

I would like to concatenate or merge those 2 df on C | L | O | D | LN

output desired =

C L O D LN M x y z w
FR 312 73 2021-04-09 1 W 1 0 0 0
FR 312 73 2021-04-09 1 W 0 1 0 0
FR 312 73 2021-04-09 1 W 0 0 1 0
FR 312 73 2021-04-09 1 L 0 0 0 1

thanks for your help

CodePudding user response:

You could set up the indexes such that they align with the data you want to match:

df1.set_index(['C','L','O','D','LN'], inplace=True) 
df2.set_index(['C','L','O','D','LN'], inplace=True) 

then you could merge both dfs into a single df containing all records

df_concat = pd.concat([df1, df2], axis=0)

All thats left is to find the indices that exist in both dataframes, and use it on the concatenated df:

i2keep = set(df1.index.tolist()) and set(df2.index.tolist())
df_concat.loc[i2keep,:].reset_index()

yields

    C    L   O         D  LN  M  x  y  z  w                                                                             
0  FR  312  73  4/9/2021   1  W  1  0  0  0                                                                             
1  FR  312  73  4/9/2021   1  W  0  1  0  0                                                                             
2  FR  312  73  4/9/2021   1  W  0  0  1  0                                                                             
3  FR  312  73  4/9/2021   1  L  1  0  0  1 

CodePudding user response:

As I was trying to point out in the comments, a possible solution is:

  • duplicate df2, and drop columns that are not wanted (for merging)
  • use inner merge on df2_dupli and df1
  • possibly remove Nan rows, not sure if this step is required
  • combine df2 and df_merge (concat)
  • profit
# initialize
df1 = pd.DataFrame(data={'A': [1, 2, 3], 'B': ["abc", "abc", "cde"], 'C': [0, 1, 0]})
df2 = pd.DataFrame(data={'A': [1, 2], 'B': ["abc", "abc"], 'C': [5, 5]})

# duplicate and drop
df2_dupli = df2.copy()
df2_dupli = df2_dupli.drop(columns=['C'])

# inner merge
df_merge = pd.merge(df1, df2_dupli,  how='inner', left_on=['A','B'], right_on = ['A','B'])

# combine
df_out = pd.concat([df_merge, df2])
>>> df1
   A    B  C
0  1  abc  0
1  2  abc  1
2  3  cde  0

>>> df2
   A    B  C
0  1  abc  5
1  2  abc  5

>>> df_out
   A    B  C
0  1  abc  0
1  2  abc  1
0  1  abc  5
1  2  abc  5

df_out now contains all rows from df2, and rows from df1 that have the same values in specified columns as df2.

  • Related