Why can't Pandas replace NaN with an array of 0s using masks/replace?

Time:05-17

I have a series like this

s = pd.Series([[1,2,3],[1,2,3],np.nan,[1,2,3],[1,2,3],np.nan])

and I simply want the NaN to be replaced by [0,0,0].

I have tried

s.fillna([0,0,0]) # TypeError: "value" parameter must be a scalar or dict, but you passed a "list"

s[s.isna()] = [[0,0,0],[0,0,0]] # just replaces the NaN with a single "0". WHY?!

s.fillna("NAN").replace({"NAN":[0,0,0]}) # ValueError: NumPy boolean array indexing assignment cannot 
                                          #assign 3 input values to the 2 output values where the mask is true


s.fillna("NAN").replace({"NAN":[[0,0,0],[0,0,0]]}) # TypeError: NumPy boolean array indexing assignment
                                                   # requires a 0 or 1-dimensional input, input has 2 dimensions

I really can't understand why the first two approaches don't work (maybe I get the first, but I can't wrap my head around the second).

Thanks to this SO-question and answer, we can do it by

is_na = s.isna()
s.loc[is_na] = s.loc[is_na].apply(lambda x: [0,0,0])

but since apply is often rather slow, I cannot understand why we can't use replace or the slicing shown above.

CodePudding user response:

Pandas handles lists poorly; here is a hacky solution:

s = s.fillna(pd.Series([[0,0,0]] * len(s), index=s.index))
print (s)
0    [1, 2, 3]
1    [1, 2, 3]
2    [0, 0, 0]
3    [1, 2, 3]
4    [1, 2, 3]
5    [0, 0, 0]
dtype: object

CodePudding user response:

Series.reindex

s.dropna().reindex(s.index, fill_value=[0, 0, 0])

0    [1, 2, 3]
1    [1, 2, 3]
2    [0, 0, 0]
3    [1, 2, 3]
4    [1, 2, 3]
5    [0, 0, 0]
dtype: object

CodePudding user response:

The documentation for fillna() states that its value argument cannot be a list:

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

This is probably a limitation of the current implementation, and short of patching the source code you must resort to workarounds (as provided below).


However, if you are not planning to work with jagged arrays, what you really want to do is probably replace pd.Series() with pd.DataFrame(), e.g.:

import numpy as np
import pandas as pd


s = pd.DataFrame(
        [[1, 2, 3],
         [1, 2, 3],
         [np.nan],
         [1, 2, 3],
         [1, 2, 3],
         [np.nan]],
        dtype=pd.Int64Dtype())  # to mix integers with NaNs


s.fillna(0)
#    0  1  2
# 0  1  2  3
# 1  1  2  3
# 2  0  0  0
# 3  1  2  3
# 4  1  2  3
# 5  0  0  0

If you do need jagged arrays, you could use any of the workarounds proposed in the other answers, or you could make one of your attempts work, e.g.:

# s is the original Series of lists here, not the DataFrame above
ii = s.isna()
nn = ii.sum()
s[ii] = pd.Series([[0, 0, 0]] * nn).to_numpy()
# 0    [1, 2, 3]
# 1    [1, 2, 3]
# 2    [0, 0, 0]
# 3    [1, 2, 3]
# 4    [1, 2, 3]
# 5    [0, 0, 0]
# dtype: object

which basically uses NumPy masking to fill in the Series. The trick is to generate a compatible object for the assignment that works at the NumPy level.
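To make the "compatible object" point concrete, here is a minimal sketch (the variable names are only illustrative) comparing what NumPy builds from the nested list directly with what comes out of the Series round-trip. The second form has one element per NaN, which is what the masked assignment appears to need:

```python
import numpy as np
import pandas as pd

fill = [[0, 0, 0]] * 2

# Converted directly, the nested list becomes a 2-D numeric array,
# which does not match the two 1-D slots selected by the mask.
direct = np.array(fill)
print(direct.shape, direct.dtype)

# Going through a Series keeps each inner list as a single element,
# so the result is a 1-D object array with one list per NaN.
wrapped = pd.Series(fill).to_numpy()
print(wrapped.shape, wrapped.dtype)

s = pd.Series([[1, 2, 3], np.nan, [1, 2, 3], np.nan])
mask = s.isna()
s[mask] = wrapped
print(s.tolist())
```

The exact behavior of assigning a plain nested list through the mask may differ between pandas versions, so this only illustrates the shape difference, not the internals.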

If the input contains many NaNs, it is probably faster to work in a similar way but with s.notna() instead, e.g.:

import pandas as pd


result = pd.Series([[0, 0, 0]] * len(s))
result[s.notna()] = s[s.notna()]
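Two details of the snippet above are easy to miss: the new Series gets a default integer index, and [[0, 0, 0]] * len(s) repeats one and the same list object in every row. The following is a small sketch of a variant that keeps the original index and gives each row its own list; the sample index "abcd" is just an assumption for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([[1, 2, 3], np.nan, [1, 2, 3], np.nan], index=list("abcd"))

# One fresh list per row, so the filled rows do not share a single object.
result = pd.Series([[0, 0, 0] for _ in range(len(s))], index=s.index)

# Copy the non-missing entries over the pre-filled defaults.
keep = s.notna()
result[keep] = s[keep]
print(result)
```

Whether the shared list matters depends on whether the filled values are mutated later; if they are only read, the shorter multiplication form is fine.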

Let's try to do some benchmarking, where:

  • replace_nan_isna() is from above

import pandas as pd


def replace_nan_isna(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    ii = s.isna()
    nn = ii.sum()
    s[ii] = pd.Series([value] * nn).to_numpy()
    return s

  • replace_nan_notna() is also from above

def replace_nan_notna(s, value, inplace=False):
    if inplace:
        raise ValueError("In-place not supported!")
    result = pd.Series([value] * len(s))
    result[s.notna()] = s[s.notna()]
    return result

  • replace_nan_reindex() is from the Series.reindex answer

def replace_nan_reindex(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    # reindex() returns a new Series, so the result must be kept
    s = s.dropna().reindex(s.index, fill_value=value)
    return s

  • replace_nan_fillna() is from the fillna answer

def replace_nan_fillna(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    # fillna() returns a new Series, so the result must be kept
    s = s.fillna(pd.Series([value] * len(s), index=s.index))
    return s

with the following code:

import numpy as np
import pandas as pd


def gen_data(n=5, k=2, p=0.7, obj=(1, 2, 3)):
    return pd.Series(([obj] * int(p * n) + [np.nan] * (n - int(p * n))) * k)


funcs = replace_nan_isna, replace_nan_notna, replace_nan_reindex, replace_nan_fillna

value = (0, 0, 0)

# : inspect results
s = gen_data(5, 1)
for func in funcs:
    print(f'{func.__name__:>20s}  {func(s, value)}')
print()

# : generate benchmarks
s = gen_data(100, 1000)
base = funcs[0](s, value)
for func in funcs:
    print(f'{func.__name__:>20s}  {(func(s, value) == base).all()!s:>5}', end='  ')
    %timeit func(s, value)
#     replace_nan_isna   True  100 loops, best of 5: 16.5 ms per loop
#    replace_nan_notna   True  10 loops, best of 5: 46.5 ms per loop
#  replace_nan_reindex   True  100 loops, best of 5: 9.74 ms per loop
#   replace_nan_fillna   True  10 loops, best of 5: 36.4 ms per loop

indicating that reindex() may be the fastest approach.
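Note that %timeit only works inside IPython/Jupyter. As a rough sketch of how a similar measurement could be taken in a plain Python script, the standard timeit module can be used instead. The function name fill_by_mask and the data sizes here are illustrative assumptions, and the timings will differ from the numbers above:

```python
import timeit

import numpy as np
import pandas as pd


def fill_by_mask(s, value):
    # Same idea as the isna()-based approach, applied to a copy.
    s = s.copy()
    mask = s.isna()
    s[mask] = pd.Series([value] * int(mask.sum())).to_numpy()
    return s


# 70% tuples, 30% NaN, repeated to get a Series of 1000 elements.
data = pd.Series(([(1, 2, 3)] * 7 + [np.nan] * 3) * 100)
filled = fill_by_mask(data, (0, 0, 0))

# Three rounds of ten calls each; report the best per-call time.
timings = timeit.repeat(lambda: fill_by_mask(data, (0, 0, 0)), number=10, repeat=3)
print(f"fill_by_mask: {min(timings) / 10 * 1e3:.3f} ms per call")
```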
