Home > Software engineering >  Pandas series replace strings with numpy nan
Pandas series replace strings with numpy nan

Time:04-04

I have a pandas series where data is of type string.

import pandas as pd
import numpy as np

df['count']

0         4
1       nan
2       nan
3         1
4       nan
5       227.0
        ... 
635     nan
636     nan
637       1
638       8
639    None

I'd like to replace nan with np.nan and set the data type of numeric values to int

I tried this, but it didn't work. Values are still strings.

df['count'] = df['count'].apply(lambda x: np.where(x.isdigit(), x, np.nan))

CodePudding user response:

What I would do is:

df['count'] = pd.to_numeric(df['count'], errors='coerce')

After that, your column will be np.float64 and anything that could not be converted to a float will be np.nan.

A common way to convert such a column to int is to chose a value to replace 'nan'. That is application-dependent, of course, but since your column name is 'count', a value of -1 could perhaps be adequate.

Alternatively, you can use pandas' nullable integer.

Example

df = pd.DataFrame('4 nan nan 1 nan 227.0 1 8 None'.split(), columns=['count'])

>>> df
   count
0      4
1    nan
2    nan
3      1
4    nan
5  227.0
6      1
7      8
8   None

Method 1: convert to numeric, then to int with -1 to indicate "bad value":

newdf = df.assign(
    count=pd.to_numeric(df['count'], errors='coerce')
    .fillna(-1)
    .astype(int)
)

>>> newdf
   count
0      4
1     -1
2     -1
3      1
4     -1
5    227
6      1
7      8
8     -1

Method 2: convert to 'Int64' (nullable integer):

newdf = df.assign(
    count=pd.to_numeric(df['count'], errors='coerce')
    .astype('Int64')
)

>>> newdf
   count
0      4
1   <NA>
2   <NA>
3      1
4   <NA>
5    227
6      1
7      8
8   <NA>

CodePudding user response:

You can run a list comprehension for this:

df['count'] = [int(val) if val != "nan" else np.nan for val in df['count']]

If the casting doesn't work (i.e. you have other numerics that are not int), you can try casting as a float as a catch all and then convert to an int: int(float(val))

df['count'] = [int(float(val)) if val != "nan" else np.nan for val in df['count']]

Alternatively, if you wish to be more pandthonic, you can try to use one of these approaches (specifically, to_numeric()).

  • Related