I'm having an issue merging two different data frames based on column names.
Code:
import os, json, xlsxwriter
import pandas as pd
left = pd.DataFrame({'CompID': ['Computer-8', 'Computer-D', 'Computer-4', 'Computer-Z'], 'WindowsOsVersion': ['7', '11', 'XP', ''],'MacOsVersion': ['', '', '', 'Zebra']})
print ("left df:")
print (left)
right = pd.DataFrame({'OsName': ['XP', '7', '11', 'Zebra'], 'Upgrade': ['7', '8', 'none', 'Lion']})
print ("right df:")
print (right)
new_df = pd.merge(left, right, how='inner', left_on=['WindowsOsVersion'], right_on = ['OsName'])
new_df2 = pd.merge(left, right, how='inner', left_on=['MacOsVersion'], right_on = ['OsName'])
print ("WindowsOsVersion df:")
print (new_df)
print ("MacOsVersion df:")
print (new_df2)
tester = pd.merge(new_df, new_df2, on="CompID")
print ("Merge: ")
print (tester)
#print ("new df: ",left.merge(right, left_on=['WindowsOsVersion','MacOsVersion'], right_on='OsName'))
Current result:
left df:
CompId | WindowsOsVersion | MacOsVersion |
---|---|---|
Computer-8 | 7 | |
Computer-D | 11 | |
Computer-4 | XP | |
Computer-Z | Zebra |
right df:
OsName | Upgrade | Cost |
---|---|---|
XP | 7 | £5 |
7 | 8 | £10 |
11 | none | £0 |
Zebra | Lion | £10 |
the outcome i want:
CompId | WindowsOsVersion | MacOsVersion | OsName | Upgrade | Cost |
---|---|---|---|---|---|
Computer-8 | 7 | 7 | 8 | £10 | |
Computer-D | 11 | 11 | none | £0 | |
Computer-4 | XP | XP | 7 | £5 | |
Computer-Z | Zebra | Zebra | Lion | £10 |
Any help would be appreciated
Updated code:
import os, json, xlsxwriter
import pandas as pd
left = pd.DataFrame({'CompID': ['Computer-8', 'Computer-D', 'Computer-4', 'Computer-Z'], 'WindowsOsVersion': ['7', '11', 'XP', ''],'MacOsVersion': ['', '', '', 'Zebra']})
print ("left df:")
print (left)
right = pd.DataFrame({'OsName': ['XP', '7', '11', 'Zebra'], 'Upgrade': ['7', '8', 'none', 'Lion']})
print ("right df:")
print (right)
new_df = pd.merge(left, right, how='left', left_on=['WindowsOsVersion'], right_on = ['OsName'])
new_df2 = pd.merge(left, right, how='left', left_on=['MacOsVersion'], right_on = ['OsName'])
print ("WindowsOsVersion df:")
print (new_df)
print ("MacOsVersion df:")
print (new_df2)
tester = pd.merge(new_df, new_df2, on="CompID", how='outer',suffixes=('', '_y'))
for col in tester:
if col.endswith('_x'):
tester.rename(columns = lambda col:col.rstrip('_x'),inplace=True)
elif col.endswith('_y'):
to_drop = [col for col in tester if col.endswith('_y')]
tester.drop(to_drop,axis=1,inplace=True)
else:
pass
print ("Merge: ")
print (tester)
#print ("new df: ",left.merge(right, left_on=['WindowsOsVersion','MacOsVersion'], right_on='OsName'))
Current table:
CompId | WindowsOsVersion | MacOsVersion | OsName | Upgrade |
---|---|---|---|---|
Computer-8 | 7 | 7 | 8 | |
Computer-D | 11 | 11 | none | |
Computer-4 | XP | XP | 7 | |
Computer-Z | Zebra | NaN | NaN |
I'm not sure why the last column is not displaying the correct information?
Should be Zebra, Lion
CodePudding user response:
In a simplistic way, you can do the following:
First create the merged dataframes.
new_df = pd.merge(left, right, how='left', left_on=['WindowsOsVersion'], right_on = ['OsName'])
new_df = pd.merge(new_df, right, how='left', left_on=['MacOsVersion'], right_on = ['OsName'])
The dataframe at this point looks like this:
CompID WindowsOsVersion MacOsVersion OsName_x Upgrade_x OsName_y Upgrade_y
0 Computer-8 7 7 8 NaN NaN
1 Computer-D 11 11 none NaN NaN
2 Computer-4 XP XP 7 NaN NaN
3 Computer-Z Zebra NaN NaN Zebra Lion
Now you can use fillna() to combine the column data. This can also be achieved with combine_first()
new_df['OsName_x'].fillna(new_df['OsName_y'], inplace = True)
new_df['Upgrade_x'].fillna(new_df['Upgrade_y'], inplace = True)
The resultant dataframe now looks like this:
CompID WindowsOsVersion MacOsVersion OsName_x Upgrade_x OsName_y Upgrade_y
0 Computer-8 7 7 8 NaN NaN
1 Computer-D 11 11 none NaN NaN
2 Computer-4 XP XP 7 NaN NaN
3 Computer-Z Zebra Zebra Lion Zebra Lion
You can now drop and rename columns as you have done in your existing code.
Why your code isn't producing the desired outcome is due to two reasons. While creating the 'tester' dataframe the specified suffixes are '' and '_y' instead of '_x' and '_y'. Then the subsequent code tries to rename columns with suffix '_x' (there no columns!), and drop the ones with suffix '_y'(the last 4 columns!). Before the renaming and dropping operation, the dataframe 'tester' looks like this:
CompID WindowsOsVersion MacOsVersion OsName Upgrade WindowsOsVersion_y MacOsVersion_y OsName_y Upgrade_y
0 Computer-8 7 7 8 7 NaN NaN
1 Computer-D 11 11 none 11 NaN NaN
2 Computer-4 XP XP 7 XP NaN NaN
3 Computer-Z Zebra NaN NaN Zebra Zebra Lion