I have 3 dataframes (df1, df2, df3).
I want to merge these dataframes on a column and add two new columns: one should say which dataframes matched, and the other how many of them matched.
import pandas as pd

# df1
data = {'ID': ["M1", "M2", "M3", "M4"],
        'Movie': ["Top gun", "Thor", "Batman", "MadMax"],
        'Actor': ["Tom", "Chris", "Bale", "Tom"],
        'type': ["Action", "SciFi", "Comic", "SciFi"]}
df1 = pd.DataFrame(data)
# df2
data = {'ID': ["M1", "M2", "M3"],
        'highlight': ["Flight school", "Love and thunder", "I am Batman"]}
df2 = pd.DataFrame(data)
# df3
data = {'ID': ["M2", "M3"],
        'no of parts': [3, 3],
        'co-star': ["portman", "neeson"],
        'award': ["yes", "yes"]}
df3 = pd.DataFrame(data)
In the expected output, match and no of match are the new columns.
Thank you for your time. Any help would be much appreciated.
CodePudding user response:
You can merge your three dataframes on ID, then use the indicator parameter of merge to determine which dataframes had valid data, and use this info to generate the match column. You can then count the number of | characters in match to determine the No of match column:
import pandas as pd
data = {'ID': ["M1", "M2", "M3", "M4"], 'Movie': ["Top gun", "Thor", "Batman", "MadMax"], 'Actor' : ["Tom", "Chris", "Bale", "Tom"], 'type': ["Action", "SciFi", "Comic", "SciFi"]}
df1 = pd.DataFrame(data)
data = {'ID': ["M1", "M2", "M3"], 'highlight': ["Flight school", "Love and thunder", "I am Batman"]}
df2 = pd.DataFrame(data)
data = {'ID': ["M2", "M3"], 'no of parts': [3, 3], 'co-star' : ["portman", "neeson"], 'award': ["yes", "yes"]}
df3 = pd.DataFrame(data)
df = df1.merge(df2, on='ID', how='left', indicator='df1df2').merge(df3, on='ID', how='left',indicator='df3')
df['match'] = df['df1df2'].map({'both':'df1|df2', 'left_only':'df1'}) + df['df3'].map({'both':'|df3', 'left_only':''})
df['No of match'] = df['match'].str.count(r'\|') + 1
df = df.drop(['df1df2', 'df3'], axis=1)
Output:
   ID    Movie  Actor    type         highlight  no of parts  co-star  award        match  No of match
0  M1  Top gun    Tom  Action     Flight school          NaN      NaN    NaN      df1|df2            2
1  M2     Thor  Chris   SciFi  Love and thunder          3.0  portman    yes  df1|df2|df3            3
2  M3   Batman   Bale   Comic       I am Batman          3.0   neeson    yes  df1|df2|df3            3
3  M4   MadMax    Tom   SciFi               NaN          NaN      NaN    NaN          df1            1
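To see what the indicator parameter contributes on its own, here is a minimal sketch on two small frames. The frames `left` and `right` and the column name `source` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy frames, only to show what `indicator` records.
left = pd.DataFrame({"ID": ["M1", "M2", "M4"]})
right = pd.DataFrame({"ID": ["M1", "M2"], "highlight": ["a", "b"]})

# Passing a string to `indicator` names the extra column;
# indicator=True would call it "_merge" instead.
merged = left.merge(right, on="ID", how="left", indicator="source")

# With a left join each row is either "both" (key found in `right`)
# or "left_only" (key missing from `right`).
flags = merged["source"].astype(str).tolist()
print(flags)  # ['both', 'both', 'left_only']
```

Mapping those two labels to strings is what turns the indicator column into a piece of the match column.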
CodePudding user response:
Use DataFrame.merge with a left join and the indicator parameter to see which DataFrames matched. Then use DataFrame.pop to remove each indicator column while processing it with Series.map and a dictionary, append the df3 column mapped with another dictionary, and finally count the | characters with Series.str.count:
df = (df1.merge(df2, on='ID', how='left', indicator='df2')
         .merge(df3, on='ID', how='left', indicator='df3'))
df['match'] = (df.pop('df2').map({'both':'df1|df2', 'left_only':'df1'}) +
               df.pop('df3').map({'both':'|df3', 'left_only':''}))
df['No of match'] = df['match'].str.count(r'\|') + 1
print(df)
   ID    Movie  Actor    type         highlight  no of parts  co-star  award  \
0  M1  Top gun    Tom  Action     Flight school          NaN      NaN    NaN
1  M2     Thor  Chris   SciFi  Love and thunder          3.0  portman    yes
2  M3   Batman   Bale   Comic       I am Batman          3.0   neeson    yes
3  M4   MadMax    Tom   SciFi               NaN          NaN      NaN    NaN

         match  No of match
0      df1|df2            2
1  df1|df2|df3            3
2  df1|df2|df3            3
3          df1            1
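The indicator-and-pop pattern can be wrapped in a loop so it handles more than three frames. The following is only a sketch: the helper `merge_with_match`, the temporary column `_src`, and the trimmed sample frames are assumptions made for illustration. It also assumes that no input frame already has a column named match or _src, and that the key is unique in every right-hand frame.

```python
import pandas as pd

def merge_with_match(frames, key="ID"):
    """Left-join every frame onto the first one and record which frames matched.

    `frames` is a dict of label -> DataFrame; the first entry is the base.
    """
    labels = list(frames)
    out = frames[labels[0]].copy()
    out["match"] = labels[0]
    for label in labels[1:]:
        out = out.merge(frames[label], on=key, how="left", indicator="_src")
        # True on rows where the right-hand frame contained the key.
        hit = out.pop("_src").eq("both")
        # Append "|label" only on those rows; keep the others unchanged.
        out["match"] = out["match"].where(~hit, out["match"] + "|" + label)
    # Move "match" to the end, then derive the count from the separators.
    out["match"] = out.pop("match")
    out["No of match"] = out["match"].str.count(r"\|") + 1
    return out

frames = {
    "df1": pd.DataFrame({"ID": ["M1", "M2", "M3", "M4"],
                         "Movie": ["Top gun", "Thor", "Batman", "MadMax"]}),
    "df2": pd.DataFrame({"ID": ["M1", "M2", "M3"],
                         "highlight": ["Flight school", "Love and thunder", "I am Batman"]}),
    "df3": pd.DataFrame({"ID": ["M2", "M3"], "award": ["yes", "yes"]}),
}
result = merge_with_match(frames)
print(result[["ID", "match", "No of match"]])
```

Because each label is appended in dict order, the labels in match always follow the order in which the frames were passed.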
CodePudding user response:
You can also try this approach, which merges with reduce and a lambda:
from functools import reduce

dfs = [df1, df2, df3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='ID', how='outer'), dfs)
# first non-key column of each frame: NaN there means that frame had no row for the ID
df_temp = df_final[[df1.columns[1], df2.columns[1], df3.columns[1]]]
df_final["match"] = df_temp.apply(lambda x: "|".join(["df" + str(idx + 1) for idx, i in enumerate(x) if not pd.isna(i)]), axis=1)
df_final["No of match"] = df_final["match"].apply(lambda x: x.count("|") + 1)
Output:
   ID    Movie  Actor    type  ...  co-star  award        match  No of match
0  M1  Top gun    Tom  Action  ...      NaN    NaN      df1|df2            2
1  M2     Thor  Chris   SciFi  ...  portman    yes  df1|df2|df3            3
2  M3   Batman   Bale   Comic  ...   neeson    yes  df1|df2|df3            3
3  M4   MadMax    Tom   SciFi  ...      NaN    NaN          df1            1
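One caveat with checking a data column for NaN is that a frame can contain the ID while that particular column is legitimately empty. A possible variation, sketched below, tests the key itself with Series.isin instead. The names `named`, `merged`, and `present` are illustrative, and the missing highlight for M3 is made up to show the difference.

```python
from functools import reduce
import pandas as pd

df1 = pd.DataFrame({"ID": ["M1", "M2", "M3", "M4"],
                    "Movie": ["Top gun", "Thor", "Batman", "MadMax"]})
# Made-up gap: M3 is present in df2 but has no highlight text.
df2 = pd.DataFrame({"ID": ["M1", "M2", "M3"],
                    "highlight": ["Flight school", "Love and thunder", None]})
df3 = pd.DataFrame({"ID": ["M2", "M3"], "award": ["yes", "yes"]})

named = {"df1": df1, "df2": df2, "df3": df3}
merged = reduce(lambda left, right: pd.merge(left, right, on="ID", how="outer"),
                named.values())

# One boolean column per frame: True where that frame contains the row's ID.
present = pd.DataFrame({name: merged["ID"].isin(d["ID"]) for name, d in named.items()})

merged["match"] = present.apply(
    lambda row: "|".join(name for name, flag in row.items() if flag), axis=1)
merged["No of match"] = present.sum(axis=1)
print(merged[["ID", "match", "No of match"]])
```

Here M3 is still reported as df1|df2|df3 even though its highlight is missing, because membership is decided on ID rather than on the data columns.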
CodePudding user response:
You can use pandas.concat on a list of the input DataFrames. This works for any number of input DataFrames (not just 3):
# dataframes will be later named in order: 1->2->3
# you can easily tweak this solution to use a dictionary
# and custom names if desired
dfs = [df1, df2, df3]
out = (pd
   .concat([d.set_index('ID').assign(ID=f'df{i}')
            for i, d in enumerate(dfs, start=1)], axis=1)
   .assign(**{'match': lambda d: d[['ID']].agg(lambda x: '|'.join(x.dropna()),
                                               axis=1),
              'No of matches': lambda d: d[['ID']].notna().sum(axis=1)
             })
   .drop('ID', axis=1).reset_index()
)
NB. This approach uses a temporary ID column (the original ID is moved to the index first), so make sure no input DataFrame has another column with that name. You can choose a different temporary name for safety if needed.
Output:
   ID    Movie  Actor    type         highlight  no of parts  co-star  award        match  No of matches
0  M1  Top gun    Tom  Action     Flight school          NaN      NaN    NaN      df1|df2              2
1  M2     Thor  Chris   SciFi  Love and thunder          3.0  portman    yes  df1|df2|df3              3
2  M3   Batman   Bale   Comic       I am Batman          3.0   neeson    yes  df1|df2|df3              3
3  M4   MadMax    Tom   SciFi               NaN          NaN      NaN    NaN          df1              1
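The dictionary tweak mentioned in the code comments could look roughly like this. It is a sketch rather than the answer's own code: the labels in `named`, the temporary column name `_source`, and the trimmed sample frames are assumptions chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": ["M1", "M2", "M3", "M4"],
                    "Movie": ["Top gun", "Thor", "Batman", "MadMax"]})
df2 = pd.DataFrame({"ID": ["M1", "M2", "M3"],
                    "highlight": ["Flight school", "Love and thunder", "I am Batman"]})
df3 = pd.DataFrame({"ID": ["M2", "M3"], "award": ["yes", "yes"]})

# Custom labels instead of df1/df2/df3 (hypothetical names).
named = {"movies": df1, "highlights": df2, "extras": df3}
tmp = "_source"  # temporary column name, assumed absent from every input

# Each frame contributes one `tmp` column holding its own label.
wide = pd.concat(
    [d.set_index("ID").assign(**{tmp: name}) for name, d in named.items()],
    axis=1,
)

# Selecting a duplicated column name returns all copies as a DataFrame;
# a cell is NaN where that input had no row for the ID.
sources = wide[tmp]
out = wide.drop(columns=tmp)
out["match"] = sources.apply(lambda row: "|".join(row.dropna()), axis=1)
out["No of matches"] = sources.notna().sum(axis=1)
out = out.reset_index()
print(out[["ID", "match", "No of matches"]])
```

Using a name such as `_source` for the temporary column also sidesteps the clash with the ID key described in the note above.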
CodePudding user response:
May not be the best way, but you can try this:
import pandas as pd

def f(x):
    # x holds the current match list followed by the three df3 columns
    if x.iloc[1:].isna().all():
        # no df3 data on this row: keep the list built from df1/df2
        return x.iloc[0]
    # df3 matched: append it to the existing list
    return x.iloc[0] + ['df3']

df = df1.merge(df2, how='outer')
df['match'] = df['highlight'].apply(lambda x: ['df1', 'df2'] if pd.notna(x) else ['df1'])
df = df.merge(df3, how='outer')
df['match'] = df[['match', 'no of parts', 'co-star', 'award']].apply(f, axis=1)
df['No of match'] = df['match'].apply(len)
df['match'] = df['match'].apply(lambda x: '|'.join(x))
df = df[['ID', 'Movie', 'Actor', 'type', 'highlight', 'no of parts',
         'co-star', 'award', 'match', 'No of match']]