In a large program, I found a bug that numbers were coming out as NaN
. I'm having trouble understanding the cause.
In the context of this larger problem, the import data often is detected as object because there are may be blanks "" or spaces " ". I want to change "" or " " (any number of spaces) to np.nan
. I want this to work whether the input is a series with one element or many, and if the element is numeric or string.
However, when I try the regex that seems obvious, it changes a valid number to NaN. This is inside the debugger:
ipdb> temp_v1
value 1250.00
Name: standard, dtype: object
ipdb> temp_v1.replace(r"\s*", np.nan, regex=True)
value NaN
Name: standard, dtype: float64
ipdb> type(temp_v1)
<class 'pandas.core.series.Series'>
ipdb> temp_v1.dtype
dtype('O')
ipdb> temp_v1.str.replace(r"\s*", np.nan, regex=True)
*** TypeError: repl must be a string or callable
ipdb> temp_v1.astype(str).replace(r"\s*", np.nan, regex=True)
value NaN
Name: standard, dtype: float64
This is a simple mre for experimentation:
import pandas as pd
import numpy as np
x = pd.Series(['1250.00'])
x.replace(r"\s*", np.nan, regex=True)
When I run that, the output is NaN. Usually...
If I fiddle around in the terminal and try this over and over, it works correctly sometimes, but fails usually. This one succeeds, but I have no idea why.
In [19]: x = pd.Series(['123.44'])
In [20]: x.replace(r"\s*", "", regex=True)
Out[20]:
0 123.44
dtype: object
Do same a few lines later, fail:
In [29]: x = pd.Series(['123.44'])
In [30]: x.replace(r'\s*', np.nan, regex=True)
Out[30]:
0 NaN
dtype: float64
CodePudding user response:
It appears that python is stumbling on the regex. It can successfully identify the regex for checking for spaces, but sees everything has "no space". If I replace the np.nan with the string "7" then I can see that the replace statement is replacing all of the empty space between characters (instead of just checking for an empty cell).
If I replace your regex with an "or" statement checking for spaces or a completely empty cell, then I get what you are wanting to see.
x = pd.Series(['1250.00','',' '])
x.replace(r'^\s*$| ', np.nan, regex=True)
CodePudding user response:
If you only want to handle blank and multi-space strings (""
or "\s "
) you can use pd.to_numeric
with errors="coerce"
import pandas as pd # version 1.1.5
x = pd.Series(["125.00", "", " ", "123.44"])
pd.to_numeric(x, errors="coerce")
Out
0 125.00
1 NaN
2 NaN
3 123.44
dtype: float64
Will also work if other string-values are present, though you may wish to inspect these further:
x = pd.Series(["125.00", "", " ", "123.44", "test"])
pd.to_numeric(x, errors="coerce")
Out
0 125.00
1 NaN
2 NaN
3 123.44
4 NaN
dtype: float64