Merging two different dataframes-CodePudding

I'm having an issue merging two different data frames based on column names.

Code:

import os, json, xlsxwriter
import pandas as pd

left = pd.DataFrame({'CompID': ['Computer-8', 'Computer-D', 'Computer-4', 'Computer-Z'], 'WindowsOsVersion': ['7', '11', 'XP', ''],'MacOsVersion': ['', '', '', 'Zebra']})
print ("left df:")
print (left)
right = pd.DataFrame({'OsName': ['XP', '7', '11', 'Zebra'], 'Upgrade': ['7', '8', 'none', 'Lion']})
print ("right df:")
print (right)

new_df = pd.merge(left, right,  how='inner', left_on=['WindowsOsVersion'], right_on = ['OsName'])
new_df2 = pd.merge(left, right,  how='inner', left_on=['MacOsVersion'], right_on = ['OsName'])
print ("WindowsOsVersion df:")
print (new_df)
print ("MacOsVersion df:")
print (new_df2)
tester = pd.merge(new_df, new_df2, on="CompID")
print ("Merge: ")
print (tester)
#print ("new df: ",left.merge(right, left_on=['WindowsOsVersion','MacOsVersion'], right_on='OsName'))

Current result:

left df:

CompId	WindowsOsVersion	MacOsVersion
Computer-8	7
Computer-D	11
Computer-4	XP
Computer-Z		Zebra

right df:

OsName	Upgrade	Cost
XP	7	£5
7	8	£10
11	none	£0
Zebra	Lion	£10

the outcome i want:

CompId	WindowsOsVersion	MacOsVersion	OsName	Upgrade	Cost
Computer-8	7		7	8	£10
Computer-D	11		11	none	£0
Computer-4	XP		XP	7	£5
Computer-Z		Zebra	Zebra	Lion	£10

Any help would be appreciated

Updated code:

import os, json, xlsxwriter
import pandas as pd

left = pd.DataFrame({'CompID': ['Computer-8', 'Computer-D', 'Computer-4', 'Computer-Z'], 'WindowsOsVersion': ['7', '11', 'XP', ''],'MacOsVersion': ['', '', '', 'Zebra']})
print ("left df:")
print (left)
right = pd.DataFrame({'OsName': ['XP', '7', '11', 'Zebra'], 'Upgrade': ['7', '8', 'none', 'Lion']})
print ("right df:")
print (right)

new_df = pd.merge(left, right,  how='left', left_on=['WindowsOsVersion'], right_on = ['OsName'])
new_df2 = pd.merge(left, right,  how='left', left_on=['MacOsVersion'], right_on = ['OsName'])
print ("WindowsOsVersion df:")
print (new_df)
print ("MacOsVersion df:")
print (new_df2)
tester = pd.merge(new_df, new_df2, on="CompID", how='outer',suffixes=('', '_y'))
for col in tester:
        if col.endswith('_x'):
            tester.rename(columns = lambda col:col.rstrip('_x'),inplace=True)
        elif col.endswith('_y'):
            to_drop = [col for col in tester if col.endswith('_y')]
            tester.drop(to_drop,axis=1,inplace=True)
        else:
            pass
print ("Merge: ")
print (tester)
#print ("new df: ",left.merge(right, left_on=['WindowsOsVersion','MacOsVersion'], right_on='OsName'))

Current table:

CompId	WindowsOsVersion	MacOsVersion	OsName	Upgrade
Computer-8	7		7	8
Computer-D	11		11	none
Computer-4	XP		XP	7
Computer-Z		Zebra	NaN	NaN

I'm not sure why the last column is not displaying the correct information?

Should be Zebra, Lion

CodePudding user response：

In a simplistic way, you can do the following:

First create the merged dataframes.

new_df = pd.merge(left, right,  how='left', left_on=['WindowsOsVersion'], right_on = ['OsName'])
new_df = pd.merge(new_df, right,  how='left', left_on=['MacOsVersion'], right_on = ['OsName'])

The dataframe at this point looks like this:

       CompID WindowsOsVersion MacOsVersion OsName_x Upgrade_x OsName_y Upgrade_y
0  Computer-8                7                     7         8      NaN       NaN
1  Computer-D               11                    11      none      NaN       NaN
2  Computer-4               XP                    XP         7      NaN       NaN
3  Computer-Z                         Zebra      NaN       NaN    Zebra      Lion

Now you can use fillna() to combine the column data. This can also be achieved with combine_first()

new_df['OsName_x'].fillna(new_df['OsName_y'], inplace = True)
new_df['Upgrade_x'].fillna(new_df['Upgrade_y'], inplace = True)

The resultant dataframe now looks like this:

       CompID WindowsOsVersion MacOsVersion OsName_x Upgrade_x OsName_y Upgrade_y
0  Computer-8                7                     7         8      NaN       NaN
1  Computer-D               11                    11      none      NaN       NaN
2  Computer-4               XP                    XP         7      NaN       NaN
3  Computer-Z                         Zebra    Zebra      Lion    Zebra      Lion

You can now drop and rename columns as you have done in your existing code.

Why your code isn't producing the desired outcome is due to two reasons. While creating the 'tester' dataframe the specified suffixes are '' and '_y' instead of '_x' and '_y'. Then the subsequent code tries to rename columns with suffix '_x' (there no columns!), and drop the ones with suffix '_y'(the last 4 columns!). Before the renaming and dropping operation, the dataframe 'tester' looks like this:

       CompID WindowsOsVersion MacOsVersion OsName Upgrade WindowsOsVersion_y MacOsVersion_y OsName_y Upgrade_y
0  Computer-8                7                   7       8                  7                     NaN       NaN
1  Computer-D               11                  11    none                 11                     NaN       NaN
2  Computer-4               XP                  XP       7                 XP                     NaN       NaN
3  Computer-Z                         Zebra    NaN     NaN                             Zebra    Zebra      Lion