I have a pandas series where data is of type string
.
import pandas as pd
import numpy as np
df['count']
0 4
1 nan
2 nan
3 1
4 nan
5 227.0
...
635 nan
636 nan
637 1
638 8
639 None
I'd like to replace nan
with np.nan
and set the data type of numeric values to int
I tried this, but it didn't work. Values are still strings.
df['count'] = df['count'].apply(lambda x: np.where(x.isdigit(), x, np.nan))
CodePudding user response:
What I would do is:
df['count'] = pd.to_numeric(df['count'], errors='coerce')
After that, your column will be np.float64
and anything that could not be converted to a float will be np.nan
.
A common way to convert such a column to int
is to chose a value to replace 'nan'. That is application-dependent, of course, but since your column name is 'count', a value of -1 could perhaps be adequate.
Alternatively, you can use pandas' nullable integer.
Example
df = pd.DataFrame('4 nan nan 1 nan 227.0 1 8 None'.split(), columns=['count'])
>>> df
count
0 4
1 nan
2 nan
3 1
4 nan
5 227.0
6 1
7 8
8 None
Method 1: convert to numeric, then to int with -1 to indicate "bad value":
newdf = df.assign(
count=pd.to_numeric(df['count'], errors='coerce')
.fillna(-1)
.astype(int)
)
>>> newdf
count
0 4
1 -1
2 -1
3 1
4 -1
5 227
6 1
7 8
8 -1
Method 2: convert to 'Int64' (nullable integer):
newdf = df.assign(
count=pd.to_numeric(df['count'], errors='coerce')
.astype('Int64')
)
>>> newdf
count
0 4
1 <NA>
2 <NA>
3 1
4 <NA>
5 227
6 1
7 8
8 <NA>
CodePudding user response:
You can run a list comprehension for this:
df['count'] = [int(val) if val != "nan" else np.nan for val in df['count']]
If the casting doesn't work (i.e. you have other numerics that are not int
), you can try casting as a float as a catch all and then convert to an int: int(float(val))
df['count'] = [int(float(val)) if val != "nan" else np.nan for val in df['count']]
Alternatively, if you wish to be more pandthonic, you can try to use one of these approaches (specifically, to_numeric()
).