I need to add a new column to a dataframe based on an ID in the other one.
I created a small snippet of what I'm trying to do:
import pandas as pd
import numpy as np
a = pd.DataFrame([['ass-123-43', ['123', '456']],['ass-123-44', ['123', '457']]], columns=['customer_id', 'order_id'])
b = pd.DataFrame([['ass-123-43'], ['ass-123-44']], columns=['customer_id'])
dict_a = a.set_index('customer_id').order_id.to_dict()
b['order_id'] = np.nan
for customer_id, order_id in dict_a.items():
if customer_id in b.customer_id.values:
b.iloc[b.customer_id == customer_id, 1] = pd.Series([order_id])
print(b)
When I use the iloc method, the code works as expected:
customer_id order_id
0 ass-123-43 [123, 456]
1 ass-123-44 [123, 457]
But when I use loc method it doesn't work as expected:
import pandas as pd
import numpy as np
a = pd.DataFrame([['ass-123-43', ['123', '456']],['ass-123-44', ['123', '457']]], columns=['customer_id', 'order_id'])
b = pd.DataFrame([['ass-123-43'], ['ass-123-44']], columns=['customer_id'])
dict_a = a.set_index('customer_id').order_id.to_dict()
b['order_id'] = np.nan
for customer_id, order_id in dict_a.items():
if customer_id in b.customer_id.values:
b.loc[b.customer_id == customer_id, 'order_id'] = pd.Series([order_id])
print(b)
I got this result:
customer_id order_id
0 ass-123-43 [123, 456]
1 ass-123-44 NaN
Beyond loc using labels to assign the row and iloc a number, is there something else I missed?
CodePudding user response:
You can read more about the differences between .iloc
and .loc
here, but for your particular case, the reason you're getting NaN is because of what you're assigning. With .iloc
, it completely ignores the index of the value that you're assigning (which is pd.Series([order_id])
), so it works fine and doesn't produce NaN.
With .loc
, however, it does respect the index. In your example, pd.Series([order_id])
has an index of [0]
, as you can see:
>>> order_id = '123'
>>> pd.Series([order_id])
0 123
dtype: object
Now look at the index of the row where the NaN is occuring. It's 1
. But the index of the value you're trying to assign to it is 0
, as shown above. Mismatched indexes! What happens? The missing value - NaN.
If you want use .loc
instead of .iloc
, you can avoid this mismatched-index problem by converting the Series object to a numpy array (using .to_numpy()
) before assigning:
b.loc[b.customer_id == customer_id, 'order_id'] = pd.Series([order_id]).to_numpy()
That will work as expected.