Pandas match empty string or spaces-CodePudding

In a large program, I found a bug that numbers were coming out as NaN. I'm having trouble understanding the cause.

In the context of this larger problem, the import data often is detected as object because there are may be blanks "" or spaces " ". I want to change "" or " " (any number of spaces) to np.nan. I want this to work whether the input is a series with one element or many, and if the element is numeric or string.

However, when I try the regex that seems obvious, it changes a valid number to NaN. This is inside the debugger:

ipdb> temp_v1
value    1250.00
Name: standard, dtype: object
ipdb> temp_v1.replace(r"\s*", np.nan, regex=True)
value   NaN
Name: standard, dtype: float64
ipdb> type(temp_v1)
<class 'pandas.core.series.Series'>
ipdb> temp_v1.dtype
dtype('O')
ipdb> temp_v1.str.replace(r"\s*", np.nan, regex=True)
*** TypeError: repl must be a string or callable
ipdb> temp_v1.astype(str).replace(r"\s*", np.nan, regex=True)
value   NaN
Name: standard, dtype: float64

This is a simple mre for experimentation:

import pandas as pd
import numpy as np

x = pd.Series(['1250.00'])

x.replace(r"\s*", np.nan, regex=True)

When I run that, the output is NaN. Usually...

If I fiddle around in the terminal and try this over and over, it works correctly sometimes, but fails usually. This one succeeds, but I have no idea why.

In [19]: x = pd.Series(['123.44'])

In [20]: x.replace(r"\s*", "", regex=True)
Out[20]: 
0    123.44
dtype: object

Do same a few lines later, fail:


In [29]: x = pd.Series(['123.44'])

In [30]: x.replace(r'\s*', np.nan, regex=True)
Out[30]: 
0   NaN
dtype: float64

CodePudding user response：

It appears that python is stumbling on the regex. It can successfully identify the regex for checking for spaces, but sees everything has "no space". If I replace the np.nan with the string "7" then I can see that the replace statement is replacing all of the empty space between characters (instead of just checking for an empty cell).

If I replace your regex with an "or" statement checking for spaces or a completely empty cell, then I get what you are wanting to see.

x = pd.Series(['1250.00','',' '])
x.replace(r'^\s*$| ', np.nan, regex=True)

CodePudding user response：

If you only want to handle blank and multi-space strings ("" or "\s ") you can use pd.to_numeric with errors="coerce"

import pandas as pd  # version 1.1.5

x = pd.Series(["125.00", "", " ", "123.44"])

pd.to_numeric(x, errors="coerce")

Out

0    125.00
1       NaN
2       NaN
3    123.44
dtype: float64

Will also work if other string-values are present, though you may wish to inspect these further:

x = pd.Series(["125.00", "", " ", "123.44", "test"])

pd.to_numeric(x, errors="coerce")

Out

0    125.00
1       NaN
2       NaN
3    123.44
4       NaN
dtype: float64