I have a reference table that I imported into a dataframe(df2) from a .csv. It's 3 columns and around 400 rows. I have another dataframe (df) that has many columns and rows. I am looking to lookup a value from the reference table and add it to the appropriate column in df.
The data format for the reference table:
MANUF PRODTYPE PRODCODE
ALPHA 1 ALPHA1
ALPHA 2 ALPHA2
BETA 1 BETA1
BETA 2 BETA2
DELTA 1 DELTA1
DELTA 2 DELTA2
The dataframe (df) is set up like this:
MANUF PRODTYPE SERIALNO PRODCODE INVENTORY
ALPHA 1 00001 5
ALPHA 2 00001 3
BETA 1 00001 4
DELTA 1 00001 8
ALPHA 2 00002 3
BETA 1 00002 4
DELTA 2 00001 9
DELTA 2 00002 9
DELTA 1 00002 8
BETA 2 00001 12
ALPHA 2 00003 3
I am trying to populate PRODCODE in df with the appropriate value based on MANUF and PRODTYPE in the reference table.
I tried:
df3 = df.merge(df2, how='left')
and
df3 = df2.merge(df, how='left')
but both gave me either inaccurate or incomplete merges.
CodePudding user response:
I expect this to work in your case:
import sys
import pandas as pd
from io import StringIO
from datetime import datetime
data1 = StringIO("""MANUF;PRODTYPE;PRODCODE
ALPHA;1;ACME1
ALPHA;2;ACME2
BETA;1;BETA1
BETA;2;BETA2
DELTA;1;DELTA1
DELTA;2;DELTA2
""")
df1 = pd.read_csv(data1, sep=";")
print(df1)
data2 = StringIO("""MANUF;PRODTYPE;SERIALNO;PRODCODE;INVENTORY
ALPHA;1;00001;5
ALPHA;2;00001;3
BETA;1;00001;4
DELTA;1;00001;8
ALPHA;2;00002;3
BETA;1;00002;4
DELTA;2;00001;9
DELTA;2;00002;9
DELTA;1;00002;8
BETA;2;00001;12
ALPHA;2;00003;3
""")
df2 = pd.read_csv(data2, sep=";")
print(df2)
df3 = df2.merge(df1, on=['MANUF', 'PRODTYPE'], how='left')
print(df3)
Result:
MANUF PRODTYPE SERIALNO PRODCODE_x INVENTORY PRODCODE_y
0 ALPHA 1 1 5 NaN ACME1
1 ALPHA 2 1 3 NaN ACME2
2 BETA 1 1 4 NaN BETA1
3 DELTA 1 1 8 NaN DELTA1
4 ALPHA 2 2 3 NaN ACME2
5 BETA 1 2 4 NaN BETA1
6 DELTA 2 1 9 NaN DELTA2
7 DELTA 2 2 9 NaN DELTA2
8 DELTA 1 2 8 NaN DELTA1
9 BETA 2 1 12 NaN BETA2
10 ALPHA 2 3 3 NaN ACME2
CodePudding user response:
Another way without merge
would be this:
df2 = df2.set_index(['MANUF', 'PRODTYPE'])
output = df2.combine_first(df1.set_index(['MANUF', 'PRODTYPE'])).reset_index()
print(output)
MANUF PRODTYPE INVENTORY PRODCODE SERIALNO
0 ALPHA 1 5 ALPHA1 1
1 ALPHA 2 3 ALPHA2 1
2 ALPHA 2 3 ALPHA2 2
3 ALPHA 2 3 ALPHA2 3
4 BETA 1 4 BETA1 1
5 BETA 1 4 BETA1 2
6 BETA 2 12 BETA2 1
7 DELTA 1 8 DELTA1 1
8 DELTA 1 8 DELTA1 2
9 DELTA 2 9 DELTA2 1
10 DELTA 2 9 DELTA2 2
Used Input:
df1 = pd.DataFrame({'MANUF': {0: 'ALPHA',
1: 'ALPHA',
2: 'BETA',
3: 'BETA',
4: 'DELTA',
5: 'DELTA'},
'PRODTYPE': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2},
'PRODCODE': {0: 'ALPHA1',
1: 'ALPHA2',
2: 'BETA1',
3: 'BETA2',
4: 'DELTA1',
5: 'DELTA2'}})
df2 = pd.DataFrame({'MANUF': {0: 'ALPHA',
1: 'ALPHA',
2: 'BETA',
3: 'DELTA',
4: 'ALPHA',
5: 'BETA',
6: 'DELTA',
7: 'DELTA',
8: 'DELTA',
9: 'BETA',
10: 'ALPHA'},
'PRODTYPE': {0: 1,
1: 2,
2: 1,
3: 1,
4: 2,
5: 1,
6: 2,
7: 2,
8: 1,
9: 2,
10: 2},
'SERIALNO': {0: 1,
1: 1,
2: 1,
3: 1,
4: 2,
5: 2,
6: 1,
7: 2,
8: 2,
9: 1,
10: 3},
'INVENTORY': {0: 5,
1: 3,
2: 4,
3: 8,
4: 3,
5: 4,
6: 9,
7: 9,
8: 8,
9: 12,
10: 3}})