I'm using .loc to filter my DataFrame and set values from another column. Here's a short example, first setting up a simple DataFrame:
import pandas as pd
df = pd.DataFrame({
    'name': ['Steve', 'Julie', 'Dan', 'Mary'],
    'nickname': ['Steve-o', 'Jules', '', '']
})
# name nickname
# 0 Steve Steve-o
# 1 Julie Jules
# 2 Dan
# 3 Mary
Next I replace the missing values using .loc like this:
df.loc[df['nickname'] == '', 'nickname'] = df['name']
# name nickname
# 0 Steve Steve-o
# 1 Julie Jules
# 2 Dan Dan
# 3 Mary Mary
Why does this even work? I'm assigning a Series of a different length, so how does it know to match only the items with the same index label? Why doesn't it throw an error about the Series being different lengths? This seems very magical to me, and that makes me nervous.
After lots of searching, I can't find any examples of this approach in the official documentation. I have seen a few rare examples of this out in the wild, but with no explanation or reassurance that it's a correct approach.
CodePudding user response:
It works because pandas sets the values by matching index labels. The index of name is 0, 1, 2, 3, but the rows selected by loc have the labels 2 and 3, so pandas reindexes the name column to just the labels 2 and 3 before assigning.
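As a rough illustration of that alignment, the sketch below reverses the order of the source Series before assigning it. The variable names mask, shuffled and aligned are only for this example. Because the match is made on labels rather than on position, the rows should still receive the right names:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Steve', 'Julie', 'Dan', 'Mary'],
    'nickname': ['Steve-o', 'Jules', '', ''],
})
mask = df['nickname'] == ''

# Reverse the source Series; its labels now run 3, 2, 1, 0.
shuffled = df['name'].iloc[::-1]

# Roughly what the assignment does to the right-hand side:
# keep only the labels of the selected rows, in the target's order.
aligned = shuffled.reindex(df.index[mask.to_numpy()])
print(aligned.tolist())

# The assignment matches on labels, so 'Dan' still lands in row 2.
df.loc[mask, 'nickname'] = shuffled
print(df['nickname'].tolist())
```

The explicit reindex call is shown only to make the idea visible; it is not needed in real code.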
Here is a demonstration. Suppose we give the name column a completely different index:
>>> df['name'].set_axis(['a', 'b', 'c', 'd'])
a Steve
b Julie
c Dan
d Mary
Name: name, dtype: object
>>>
Assigning that Series no longer fills in the names:
>>> df.loc[df['nickname'] == '', 'nickname'] = df['name'].set_axis(['a', 'b', 'c', 'd'])
>>> df
name nickname
0 Steve Steve-o
1 Julie Jules
2 Dan NaN
3 Mary NaN
>>>
That shows it: loc assignment is based on the row index labels. In this case none of the labels matched, so the selected rows got missing values (NaN).
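If you assign a list or array instead, there is no index to align on, so this label matching does not happen. The values are assigned by position, and their length generally has to match the number of selected rows.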
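A minimal sketch of the positional case, assuming the same example DataFrame. The names mask and values are only for illustration. The array is built from exactly the selected rows, so its length matches:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Steve', 'Julie', 'Dan', 'Mary'],
    'nickname': ['Steve-o', 'Jules', '', ''],
})
mask = df['nickname'] == ''

# A NumPy array carries no labels, so it is assigned by position.
# It has two elements, one for each selected row.
values = df.loc[mask, 'name'].to_numpy()
df.loc[mask, 'nickname'] = values

print(df['nickname'].tolist())
```

With an array the responsibility for getting the order and length right moves to you, which is why assigning a Series is usually the safer choice.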
This is the "magic" of pandas.
Don't treat the index as incidental: when you assign a Series, pandas matches it up by index label.
For performance, I prefer to filter the right-hand side as well:
df.loc[df['nickname'] == '', 'nickname'] = df.loc[df['nickname'] == '', 'name']
This way pandas has far fewer rows to match up by index.
For some cases with big DataFrames:
df.loc[df['nickname'] == '', 'nickname'] = df['name']
would give a MemoryError, but:
df.loc[df['nickname'] == '', 'nickname'] = df.loc[df['nickname'] == '', 'name']
would not, because fewer rows are involved in the assignment.
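To check that the two forms are interchangeable, here is a small sketch that applies both to the example data and compares the results. The helper names fill_full and fill_filtered are hypothetical, and the memory behaviour on large frames is not something this toy example can show:

```python
import pandas as pd

def fill_full(frame):
    # Hypothetical helper: assign the whole 'name' column and rely on alignment.
    out = frame.copy()
    mask = out['nickname'] == ''
    out.loc[mask, 'nickname'] = out['name']
    return out

def fill_filtered(frame):
    # Hypothetical helper: filter the right-hand side to the same rows first.
    out = frame.copy()
    mask = out['nickname'] == ''
    out.loc[mask, 'nickname'] = out.loc[mask, 'name']
    return out

df = pd.DataFrame({
    'name': ['Steve', 'Julie', 'Dan', 'Mary'],
    'nickname': ['Steve-o', 'Jules', '', ''],
})

same = fill_full(df).equals(fill_filtered(df))
print(same)
```

Computing the mask once and reusing it on both sides also avoids evaluating the comparison twice.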