Is this a safe way to set values in a DataFrame? Why does this work?

Time:09-26

I'm using .loc to filter my DataFrame and set values from another column. Here's a short example, first setting up a simple DataFrame:

import pandas as pd

df = pd.DataFrame({
    'name': ['Steve', 'Julie', 'Dan', 'Mary'],
    'nickname': ['Steve-o', 'Jules', '', '']
})

#     name  nickname
# 0  Steve   Steve-o
# 1  Julie     Jules
# 2    Dan         
# 3   Mary  

Next I replace the missing values using .loc like this:

df.loc[df['nickname'] == '', 'nickname'] = df['name']

#     name  nickname
# 0  Steve   Steve-o
# 1  Julie     Jules
# 2    Dan       Dan
# 3   Mary      Mary

Why does this even work? I'm assigning a Series of a different length, so how does it know to match only the items with the same index label? Why doesn't it throw an error about the Series being different lengths? This seems very magical to me, and it makes me nervous.

After lots of searching, I can't find any examples of this approach in the official documentation. I have seen a few rare examples of this out in the wild, but with no explanation or reassurance that it's a correct approach.

CodePudding user response:

It works because Pandas aligns the assigned values on the index. The name column has index labels 0, 1, 2, 3, but the rows selected by loc have labels 2 and 3, so Pandas effectively reindexes the name column to labels 2 and 3 only and assigns just those values.
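To make that alignment step visible, here is a minimal sketch that does by hand what the assignment appears to do implicitly. The names mask, target_labels and aligned are only illustrative, and reindex is used here as a stand-in for the internal alignment, not as a claim about how Pandas implements it.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Steve', 'Julie', 'Dan', 'Mary'],
    'nickname': ['Steve-o', 'Jules', '', ''],
})

mask = df['nickname'] == ''

# Index labels of the rows selected by the mask (here: 2 and 3).
target_labels = df.index[mask]

# Keep only the values of 'name' whose labels are among the targets.
# This mirrors the effect of index alignment on the right-hand side.
aligned = df['name'].reindex(target_labels)

# Assigning the already-aligned Series gives the same result as
# assigning the full df['name'].
df.loc[mask, 'nickname'] = aligned
```

The full column and the pre-aligned Series end up filling the same two rows, because in both cases only labels 2 and 3 are used.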

Here is a demonstration:

Suppose we give the name column a completely different index:

>>> df['name'].set_axis(['a', 'b', 'c', 'd'])
a    Steve
b    Julie
c      Dan
d     Mary
Name: name, dtype: object
>>> 

Then the assignment no longer fills in the names:

>>> df.loc[df['nickname'] == '', 'nickname'] = df['name'].set_axis(['a', 'b', 'c', 'd'])
>>> df
    name nickname
0  Steve  Steve-o
1  Julie    Jules
2    Dan      NaN
3   Mary      NaN
>>> 

There is the proof! Assignment through loc is aligned on the row index. In this case none of the index labels matched, so the selected rows were filled with missing values (NaN).

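If you use a list or array, which has no index, there is nothing to align on: the values are assigned by position, so the length has to match the number of selected rows. Assigning the full four-element list of names to the two selected rows raises a ValueError instead of quietly picking the right ones.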
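A short sketch of the list case, assuming the same example DataFrame as above. The replacement nicknames 'D-man' and 'Mare' are made up for illustration, and the exact error message depends on the Pandas version, so only the exception type is captured.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Steve', 'Julie', 'Dan', 'Mary'],
    'nickname': ['Steve-o', 'Jules', '', ''],
})
mask = df['nickname'] == ''

# Try the full list of four names against two selected rows.
# Done on a copy so the original frame is untouched by the attempt.
attempt = df.copy()
try:
    attempt.loc[mask, 'nickname'] = attempt['name'].tolist()
    length_error = None
except ValueError as exc:
    # No index to align on, and 4 values do not fit 2 rows.
    length_error = exc

# A list with exactly one value per selected row is assigned by position.
df.loc[mask, 'nickname'] = ['D-man', 'Mare']  # hypothetical nicknames
```

So a plain list only works when you have already cut it down to the selected rows yourself, which is exactly the bookkeeping that index alignment does for a Series.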

This is the "magic" of Pandas.

Don't treat the index as an afterthought: whenever you assign a Series, Pandas matches the values up by index label, not by position.


Performance-wise I prefer to use:

df.loc[df['nickname'] == '', 'nickname'] = df.loc[df['nickname'] == '', 'name']

This way the right-hand side already contains only the rows being set, so Pandas has less alignment work to do than with the whole column.

For some cases with big DataFrames:

df.loc[df['nickname'] == '', 'nickname'] = df['name']

Could give a MemoryError, but:

df.loc[df['nickname'] == '', 'nickname'] = df.loc[df['nickname'] == '', 'name']

Wouldn't, because the right-hand side of the assignment has fewer rows.
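As a sanity check that the two forms are interchangeable, here is a small sketch on a larger generated frame. The size n and the user/nick naming scheme are arbitrary assumptions, and the sketch only checks that the results agree; it makes no claim about timing or memory use.

```python
import pandas as pd

n = 100_000  # arbitrary size chosen for illustration

names = [f'user{i}' for i in range(n)]
# Every fourth nickname is left empty so there is something to fill in.
nicknames = ['' if i % 4 == 0 else f'nick{i}' for i in range(n)]

full = pd.DataFrame({'name': names, 'nickname': nicknames})
filtered = full.copy()

# Form 1: assign the whole 'name' column and let alignment pick the rows.
mask_full = full['nickname'] == ''
full.loc[mask_full, 'nickname'] = full['name']

# Form 2: filter the right-hand side with the same mask first.
mask_filtered = filtered['nickname'] == ''
filtered.loc[mask_filtered, 'nickname'] = filtered.loc[mask_filtered, 'name']

same_result = full.equals(filtered)
```

Storing the mask in a variable also avoids evaluating df['nickname'] == '' twice, which the one-line version does.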
