I have two tables, PO data and commodity code data. Some genius decided that some material group codes should be the same as they are differentiated at a lower level by GL accounts. Because of that, I can't merge on material groups, as I'll get duplicate rows.
Assume the following:
import pandas as pd
d1 = {'PO':[123456,654321,971358], 'matgrp': ["1001",'803A',"803B"]}
d2 = {'matgrp':["1001", "1001", "803A", "803B"], 'commodity':['foo - 10001', 'bar - 10002', 'spam - 100003','eggs - 10003']}
pos = pd.DataFrame(data=d1)
mat_grp = pd.DataFrame(data=d2)
merged = pd.merge(pos, mat_grp, how='left', on='matgrp')
merged.head()
PO matgrp commodity
0 123456 1001 foo - 10001
1 123456 1001 bar - 10002
2 654321 803A spam - 100003
3 971358 803B eggs - 10003
As you can see, PO 123456 shows up twice, as there are multiple rows for material 1001 in the material groups table.
The desired behavior is that merge only merges once, finds the first entry for the material group, adds it, and nothing else, like how vlookup works. The long commodity code might be incorrect in some cases (always showing the first one), that's an acceptable inaccuracy.
ps.: while suggestions are welcome how to tackle this problem outside of the scope of this question (like merging on GL accounts, which is not feasible for other reasons) assume the following: The available data is a PO list from SAP ME81N and an Excel file with the list of material groups/commodity codes.
CodePudding user response:
pandas' merge
behaves (mostly) like a SQL merge and will provide all combinations of matching keys. If you only want the first item, simply remove it from the data you feed to merge.
Use drop_duplicates
on mat_grp
:
merged = pd.merge(pos, mat_grp.drop_duplicates('matgrp'), how='left', on='matgrp')
output:
PO matgrp commodity
0 123456 1001 foo - 10001
1 654321 803A spam - 100003
2 971358 803B eggs - 10003