Update pandas cell in one Dataframe from looked up value in second Dataframe
I have a case where I need to update a cell in one Dataframe, 'Stock', which holds records of stock on-hand, looking up its value in a second Dataframe, 'Items', which is the table of all items. This is a simplified example of the Dataframes with relevant fields.
Stock
item_no qty
0 9H.111 101
1 9H.222 230
2 MODEL_B 136
3 9H.444 344
4 MODEL_E 505
5 9H.666 332
Items
item_no model_no
0 9H.111 MODEL_A
1 9H.222 MODEL_B
2 9H.333 MODEL_B
3 9H.444 MODEL_C
4 9H.555 MODEL_D
5 9H.666 MODEL_E
6 9H.777 MODEL_D
7 9H.888 MODEL_F
The challenge
I have previously done this in PostgreSQL but would like to see if I can do all the processing in Pandas (I plan to link to the PostgreSQL table of items). If we look at the Stock table the item_no
column should only have item numbers, see Items Dataframe (table), but sometimes the users put in the model_no
instead of the item number. So in the Stock
dataframe, row 2 incorrectly has the value MODEL_B
.
What's needed
What is needed to be done is to:
- get the value
MODEL_B
from theitem_no
column in the Stock dataframe - find that in the
model_no
column of the Items dataframe - then get the value from the
item_no
field of the Items dataframe - use that value to replace the (incorrect) model number value in the
item_no
column of the Stock dataframe
It gets a little more challenging... a model may have more than one part number:
1 9H.222 MODEL_B
2 9H.333 MODEL_B
In this case the 'highest' part number, in this case 9H.333
, needs to be used. In SQL I use the MAX() operator.
I would like to perform this using 'set' operations in pandas (not looping), similar to running a query in SQL. So this would mean (?) joining the two dataframes on the fields stock.item_no
<-> items.model_no
(?) - I'm not sure how to go about it hence the question marks.
Generate Dataframes
This code will generate the dataframes discussed above.
stock = pd.DataFrame({
'item_no': ['9H.111', '9H.222', 'MODEL_B', '9H.444', 'MODEL_E', '9H.666'],
'qty': [101, 230, 136, 344, 505, 332],
})
items = pd.DataFrame({
'item_no': ['9H.111', '9H.222', '9H.333', '9H.444', '9H.555', '9H.666', '9H.777', '9H.888'],
'model_no': ['MODEL_A', 'MODEL_B', 'MODEL_B', 'MODEL_C', 'MODEL_D', 'MODEL_E', 'MODEL_D', 'MODEL_F']
})
display(stock)
display(items)
CodePudding user response:
You can use:
# keep max item_no per model_no
# convert to mapping Series
mapper = (items.sort_values(by='item_no')
.drop_duplicates(subset='model_no', keep='last')
.set_index('model_no')['item_no']
)
# identify rows with model_no
m = stock['item_no'].isin(mapper.index)
# replace the values in place
stock.loc[m, 'item_no'] = stock.loc[m, 'item_no'].map(mapper)
updated stock
:
item_no qty
0 9H.111 101
1 9H.222 230
2 9H.333 136
3 9H.444 344
4 9H.666 505
5 9H.666 332