Home > Mobile >  Merging two different dataframes
Merging two different dataframes

Time:04-07

I'm having an issue merging two different data frames based on column names.

Code:

import os, json, xlsxwriter
import pandas as pd

left = pd.DataFrame({'CompID': ['Computer-8', 'Computer-D', 'Computer-4', 'Computer-Z'], 'WindowsOsVersion': ['7', '11', 'XP', ''],'MacOsVersion': ['', '', '', 'Zebra']})
print ("left df:")
print (left)
right = pd.DataFrame({'OsName': ['XP', '7', '11', 'Zebra'], 'Upgrade': ['7', '8', 'none', 'Lion']})
print ("right df:")
print (right)

new_df = pd.merge(left, right,  how='inner', left_on=['WindowsOsVersion'], right_on = ['OsName'])
new_df2 = pd.merge(left, right,  how='inner', left_on=['MacOsVersion'], right_on = ['OsName'])
print ("WindowsOsVersion df:")
print (new_df)
print ("MacOsVersion df:")
print (new_df2)
tester = pd.merge(new_df, new_df2, on="CompID")
print ("Merge: ")
print (tester)
#print ("new df: ",left.merge(right, left_on=['WindowsOsVersion','MacOsVersion'], right_on='OsName'))

Current result:

left df:

CompId WindowsOsVersion MacOsVersion
Computer-8 7
Computer-D 11
Computer-4 XP
Computer-Z Zebra

right df:

OsName Upgrade Cost
XP 7 £5
7 8 £10
11 none £0
Zebra Lion £10

the outcome i want:

CompId WindowsOsVersion MacOsVersion OsName Upgrade Cost
Computer-8 7 7 8 £10
Computer-D 11 11 none £0
Computer-4 XP XP 7 £5
Computer-Z Zebra Zebra Lion £10

Any help would be appreciated

Updated code:

import os, json, xlsxwriter
import pandas as pd

left = pd.DataFrame({'CompID': ['Computer-8', 'Computer-D', 'Computer-4', 'Computer-Z'], 'WindowsOsVersion': ['7', '11', 'XP', ''],'MacOsVersion': ['', '', '', 'Zebra']})
print ("left df:")
print (left)
right = pd.DataFrame({'OsName': ['XP', '7', '11', 'Zebra'], 'Upgrade': ['7', '8', 'none', 'Lion']})
print ("right df:")
print (right)

new_df = pd.merge(left, right,  how='left', left_on=['WindowsOsVersion'], right_on = ['OsName'])
new_df2 = pd.merge(left, right,  how='left', left_on=['MacOsVersion'], right_on = ['OsName'])
print ("WindowsOsVersion df:")
print (new_df)
print ("MacOsVersion df:")
print (new_df2)
tester = pd.merge(new_df, new_df2, on="CompID", how='outer',suffixes=('', '_y'))
for col in tester:
        if col.endswith('_x'):
            tester.rename(columns = lambda col:col.rstrip('_x'),inplace=True)
        elif col.endswith('_y'):
            to_drop = [col for col in tester if col.endswith('_y')]
            tester.drop(to_drop,axis=1,inplace=True)
        else:
            pass
print ("Merge: ")
print (tester)
#print ("new df: ",left.merge(right, left_on=['WindowsOsVersion','MacOsVersion'], right_on='OsName'))

Current table:

CompId WindowsOsVersion MacOsVersion OsName Upgrade
Computer-8 7 7 8
Computer-D 11 11 none
Computer-4 XP XP 7
Computer-Z Zebra NaN NaN

I'm not sure why the last column is not displaying the correct information?

Should be Zebra, Lion

CodePudding user response:

In a simplistic way, you can do the following:

First create the merged dataframes.

new_df = pd.merge(left, right,  how='left', left_on=['WindowsOsVersion'], right_on = ['OsName'])
new_df = pd.merge(new_df, right,  how='left', left_on=['MacOsVersion'], right_on = ['OsName'])

The dataframe at this point looks like this:

       CompID WindowsOsVersion MacOsVersion OsName_x Upgrade_x OsName_y Upgrade_y
0  Computer-8                7                     7         8      NaN       NaN
1  Computer-D               11                    11      none      NaN       NaN
2  Computer-4               XP                    XP         7      NaN       NaN
3  Computer-Z                         Zebra      NaN       NaN    Zebra      Lion

Now you can use fillna() to combine the column data. This can also be achieved with combine_first()

new_df['OsName_x'].fillna(new_df['OsName_y'], inplace = True)
new_df['Upgrade_x'].fillna(new_df['Upgrade_y'], inplace = True)

The resultant dataframe now looks like this:

       CompID WindowsOsVersion MacOsVersion OsName_x Upgrade_x OsName_y Upgrade_y
0  Computer-8                7                     7         8      NaN       NaN
1  Computer-D               11                    11      none      NaN       NaN
2  Computer-4               XP                    XP         7      NaN       NaN
3  Computer-Z                         Zebra    Zebra      Lion    Zebra      Lion

You can now drop and rename columns as you have done in your existing code.

Why your code isn't producing the desired outcome is due to two reasons. While creating the 'tester' dataframe the specified suffixes are '' and '_y' instead of '_x' and '_y'. Then the subsequent code tries to rename columns with suffix '_x' (there no columns!), and drop the ones with suffix '_y'(the last 4 columns!). Before the renaming and dropping operation, the dataframe 'tester' looks like this:

       CompID WindowsOsVersion MacOsVersion OsName Upgrade WindowsOsVersion_y MacOsVersion_y OsName_y Upgrade_y
0  Computer-8                7                   7       8                  7                     NaN       NaN
1  Computer-D               11                  11    none                 11                     NaN       NaN
2  Computer-4               XP                  XP       7                 XP                     NaN       NaN
3  Computer-Z                         Zebra    NaN     NaN                             Zebra    Zebra      Lion
  • Related