Merge three DataFrames, count the matches and add new columns

Time:10-12

I have three DataFrames (df1, df2, df3). I want to merge them on a column (ID) and add two new columns: one that says which DataFrames matched, and one that says how many of them matched.

import pandas as pd

# df1
data = {'ID': ["M1", "M2", "M3", "M4"],
        'Movie': ["Top gun", "Thor", "Batman", "MadMax"],
        'Actor' : ["Tom", "Chris", "Bale", "Tom"],
        'type': ["Action", "SciFi", "Comic", "SciFi"]}
df1 = pd.DataFrame(data)

# df2
data = {'ID': ["M1", "M2", "M3"],
        'highlight': ["Flight school", "Love and thunder", "I am Batman"]}
df2 = pd.DataFrame(data)

# df3
data = {'ID': ["M2", "M3"],
       'no of parts': [3, 3],
       'co-star' : ["portman", "neeson"],
       'award': ["yes", "yes"]}
df3 = pd.DataFrame(data)

The expected output was posted as an image, which is not reproduced here.

The match and No of match columns are the new columns.

Thank you for your time

Any help would be much appreciated

CodePudding user response:

You can merge your three dataframes on ID with left joins, and use the indicator parameter of merge to record which dataframes had a row for each ID. Mapping those indicator values to labels gives the match column. You can then count the number of | characters in match, plus one, to get the No of match column:

import pandas as pd

data = {'ID': ["M1", "M2", "M3", "M4"], 'Movie': ["Top gun", "Thor", "Batman", "MadMax"], 'Actor' : ["Tom", "Chris", "Bale", "Tom"], 'type': ["Action", "SciFi", "Comic", "SciFi"]}
df1 = pd.DataFrame(data)

data = {'ID': ["M1", "M2", "M3"], 'highlight': ["Flight school", "Love and thunder", "I am Batman"]}
df2 = pd.DataFrame(data)

data = {'ID': ["M2", "M3"], 'no of parts': [3, 3], 'co-star' : ["portman", "neeson"], 'award': ["yes", "yes"]}
df3 = pd.DataFrame(data)

df = df1.merge(df2, on='ID', how='left', indicator='df1df2').merge(df3, on='ID', how='left', indicator='df3')
df['match'] = df['df1df2'].map({'both': 'df1|df2', 'left_only': 'df1'}) + df['df3'].map({'both': '|df3', 'left_only': ''})
df['No of match'] = df['match'].str.count(r'\|') + 1

df = df.drop(['df1df2', 'df3'], axis=1)

Output:

   ID    Movie  Actor    type         highlight  no of parts  co-star award        match  No of match
0  M1  Top gun    Tom  Action     Flight school          NaN      NaN   NaN      df1|df2            2
1  M2     Thor  Chris   SciFi  Love and thunder          3.0  portman   yes  df1|df2|df3            3
2  M3   Batman   Bale   Comic       I am Batman          3.0   neeson   yes  df1|df2|df3            3
3  M4   MadMax    Tom   SciFi               NaN          NaN      NaN   NaN          df1            1
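
To see what the indicator parameter produces before the helper columns are mapped and dropped, here is a minimal sketch on the same sample data, trimmed to fewer columns. The column names in_df2 and in_df3 and the variables step and flags are illustrative choices, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ["M1", "M2", "M3", "M4"],
                    'Movie': ["Top gun", "Thor", "Batman", "MadMax"]})
df2 = pd.DataFrame({'ID': ["M1", "M2", "M3"],
                    'highlight': ["Flight school", "Love and thunder", "I am Batman"]})
df3 = pd.DataFrame({'ID': ["M2", "M3"], 'award': ["yes", "yes"]})

# indicator=<name> adds a categorical column whose possible values are
# 'left_only', 'right_only' and 'both'.
step = (df1.merge(df2, on='ID', how='left', indicator='in_df2')
           .merge(df3, on='ID', how='left', indicator='in_df3'))

# With a left join every row is either 'both' (the ID was found in the
# right-hand frame) or 'left_only' (it was not).
flags = step[['ID', 'in_df2', 'in_df3']].astype(str)
print(flags)
```

Because 'right_only' never occurs with how='left', the two-entry dictionaries passed to map in the answer cover every value that can appear.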

CodePudding user response:

Use DataFrame.merge with a left join and the indicator parameter to see which DataFrames matched. Then use DataFrame.pop to remove each indicator column while mapping it through a dictionary with Series.map, append the mapped df3 column to the mapped df2 column, and finally count the | characters with Series.str.count:

df = (df1.merge(df2, on='ID', how='left', indicator='df2')
         .merge(df3, on='ID', how='left', indicator='df3'))

df['match'] = (df.pop('df2').map({'both':'df1|df2', 'left_only':'df1'}) +
               df.pop('df3').map({'both':'|df3', 'left_only':''}))
df['No of match'] = df['match'].str.count(r'\|') + 1
print (df)
   ID    Movie  Actor    type         highlight  no of parts  co-star award  \
0  M1  Top gun    Tom  Action     Flight school          NaN      NaN   NaN   
1  M2     Thor  Chris   SciFi  Love and thunder          3.0  portman   yes   
2  M3   Batman   Bale   Comic       I am Batman          3.0   neeson   yes   
3  M4   MadMax    Tom   SciFi               NaN          NaN      NaN   NaN   

         match  No of match  
0      df1|df2            2  
1  df1|df2|df3            3  
2  df1|df2|df3            3  
3          df1            1  
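
The same indicator idea can be written as a loop so that it is not tied to exactly three frames. This is only a sketch under assumptions: the first frame is treated as the base, every frame has a unique ID column, and the labels in the list named others are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ["M1", "M2", "M3", "M4"],
                    'Movie': ["Top gun", "Thor", "Batman", "MadMax"]})
df2 = pd.DataFrame({'ID': ["M1", "M2", "M3"],
                    'highlight': ["Flight school", "Love and thunder", "I am Batman"]})
df3 = pd.DataFrame({'ID': ["M2", "M3"], 'award': ["yes", "yes"]})

out = df1.copy()
out['match'] = 'df1'                      # every row comes from the base frame

others = [('df2', df2), ('df3', df3)]     # hypothetical (label, frame) pairs
for name, frame in others:
    # indicator=True adds a temporary '_merge' column for this join only
    out = out.merge(frame, on='ID', how='left', indicator=True)
    hit = out.pop('_merge').eq('both')    # True where the ID exists in `frame`
    out.loc[hit, 'match'] += '|' + name   # extend the label for matching rows

out['No of match'] = out['match'].str.count(r'\|') + 1
print(out)
```

Unlike the two-step mapping, each label is appended independently, so a row found in df3 but not in df2 would be reported as df1|df3.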

CodePudding user response:

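You can also try this one, which merges with functools.reduce and a lambda. It treats the first non-ID column of each dataframe as a marker for whether that dataframe matched: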

from functools import reduce

dfs = [df1, df2, df3]
# outer-merge all dataframes on ID, one after another
df_final = reduce(lambda left, right: pd.merge(left, right, on='ID', how='outer'), dfs)
# first non-ID column of each dataframe: NaN here means "no row from that dataframe"
df_temp = df_final[[df1.columns[1], df2.columns[1], df3.columns[1]]]
df_final["match"] = df_temp.apply(lambda x: "|".join(["df" + str(idx + 1) for idx, i in enumerate(x) if not pd.isna(i)]), axis=1)
df_final["No of match"] = df_final["match"].apply(lambda x: x.count("|") + 1)

Output:

   ID    Movie  Actor    type  ...  co-star  award        match No of match
0  M1  Top gun    Tom  Action  ...      NaN    NaN      df1|df2           2
1  M2     Thor  Chris   SciFi  ...  portman    yes  df1|df2|df3           3
2  M3   Batman   Bale   Comic  ...   neeson    yes  df1|df2|df3           3
3  M4   MadMax    Tom   SciFi  ...      NaN    NaN          df1           1
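
One caveat with using a data column as the marker: if that column legitimately contains NaN, the row would be counted as missing from its dataframe. A possible variant, sketched here on trimmed sample data, tests membership of the ID itself with Series.isin instead. The names merged and present are illustrative, and the sketch assumes IDs are never NaN.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'ID': ["M1", "M2", "M3", "M4"],
                    'Movie': ["Top gun", "Thor", "Batman", "MadMax"]})
df2 = pd.DataFrame({'ID': ["M1", "M2", "M3"],
                    'highlight': ["Flight school", "Love and thunder", "I am Batman"]})
df3 = pd.DataFrame({'ID': ["M2", "M3"], 'award': ["yes", "yes"]})

dfs = [df1, df2, df3]
merged = reduce(lambda left, right: pd.merge(left, right, on='ID', how='outer'), dfs)

# One boolean column per input frame: does that frame contain this row's ID?
present = pd.DataFrame({f'df{i}': merged['ID'].isin(d['ID'])
                        for i, d in enumerate(dfs, start=1)})

labels = present.columns.to_numpy()
# For each row, keep the labels whose flag is True and join them with '|'
merged['match'] = ['|'.join(labels[row]) for row in present.to_numpy()]
merged['No of match'] = present.sum(axis=1)
print(merged)
```

Counting the True flags directly also avoids deriving the count from the number of | characters.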

CodePudding user response:

You can use pandas.concat on a list of the input DataFrames. This will work on any number of input DataFrames (not just 3):

# dataframes will be later named in order: 1->2->3
# you can easily tweak this solution to use a dictionary
# and custom names if desired
dfs = [df1, df2, df3]

out = (pd
 .concat([d.set_index('ID').assign(ID=f'df{i}')
          for i,d in enumerate(dfs, start=1)], axis=1)
 .assign(**{'match': lambda d: d[['ID']].agg(lambda x: '|'.join(x.dropna()),
                                             axis=1),
            'No of matches': lambda d: d[['ID']].notna().sum(axis=1)
           })
 .drop('ID', axis=1).reset_index()
)

NB. This approach uses a temporary ID column, so make sure that name is not already used by another column in any of the input DataFrames. You can choose another name for safety if needed.

output:

   ID    Movie  Actor    type         highlight  no of parts  co-star award        match  No of matches
0  M1  Top gun    Tom  Action     Flight school          NaN      NaN   NaN      df1|df2              2
1  M2     Thor  Chris   SciFi  Love and thunder          3.0  portman   yes  df1|df2|df3              3
2  M3   Batman   Bale   Comic       I am Batman          3.0   neeson   yes  df1|df2|df3              3
3  M4   MadMax    Tom   SciFi               NaN          NaN      NaN   NaN          df1              1
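
The code comments above mention that a dictionary with custom names could be used. A rough sketch of that tweak follows, on trimmed sample data. The labels (movies, highlights, extras), the temporary column name _src and the output column n_matches are all hypothetical choices made for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ["M1", "M2", "M3", "M4"],
                    'Movie': ["Top gun", "Thor", "Batman", "MadMax"]})
df2 = pd.DataFrame({'ID': ["M1", "M2", "M3"],
                    'highlight': ["Flight school", "Love and thunder", "I am Batman"]})
df3 = pd.DataFrame({'ID': ["M2", "M3"], 'award': ["yes", "yes"]})

# hypothetical custom labels, one per input frame
frames = {'movies': df1, 'highlights': df2, 'extras': df3}

# Tag each frame with its label in a temporary '_src' column, then align on ID.
tagged = pd.concat(
    [d.set_index('ID').assign(_src=name) for name, d in frames.items()],
    axis=1)

# '_src' now appears once per input frame, so this selects all of them.
src = tagged['_src']

out = (tagged.drop(columns='_src')
             .assign(match=src.apply(lambda r: '|'.join(r.dropna()), axis=1),
                     n_matches=src.notna().sum(axis=1))
             .rename_axis('ID')      # make sure the index is named before resetting
             .reset_index())
print(out)
```

A temporary name such as _src is less likely to collide with a real column than reusing ID.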

CodePudding user response:

May not be the best way, but you can try this:

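def f(x):
  # x holds, in order: the current match list, 'no of parts', 'co-star', 'award'
  if x.iloc[1:].isna().all():
    # no df3 data for this row: keep the df1/df2 result (['df1'] or ['df1','df2'])
    return x.iloc[0]
  else:
    # assumes a row with df3 data also matched df2, as in the sample data
    return (['df1','df2','df3'])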

df = df1.merge(df2,how='outer')
df['match'] = df['highlight'].apply(lambda x: ['df1','df2'] if str(x)!= 'nan' else ['df1'])
df = df.merge(df3,how='outer')
df['match'] = df[['match','no of parts','co-star','award']].apply(f,axis=1)
df['No of match'] = df['match'].apply(lambda x:len(x))
df['match'] = df['match'].apply(lambda x:'|'.join(x))
df = df[['ID', 'Movie', 'Actor', 'type', 'highlight', 'no of parts',
       'co-star', 'award','match', 'No of match']]