Home > Software design >  pandas: convert column with multiple datatypes to int, ignore errors
pandas: convert column with multiple datatypes to int, ignore errors

Time:08-10

I have a column with data that needs some massaging. the column may contain strings or floats. some strings are in exponential form. Id like to best try to format all data in this column as a whole number where possible, expanding any exponential notation to integer. So here is an example

df = pd.DataFrame({'code': ['1170E1', '1.17E 04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].astype(int, errors  = 'ignore')

The above code does not seem to do a thing. i know i can convert the exponential notation and decimals with simply using the int function, and i would think the above astype would do the same, but it does not. for example, the following code work in python:

int(1170E1), int(1.17E 04), int(11700.0)
> (11700, 11700, 11700)

Any help in solving this would be appreciated. What i'm expecting the output to look like is:

0    '11700'
1    '11700'
2    '11700
3    '24477G'
4    '124601'
5    '247602'

CodePudding user response:

You may check with pd.to_numeric

df.code = pd.to_numeric(df.code,errors='coerce').fillna(df.code)
Out[800]: 
0     11700.0
1     11700.0
2     11700.0
3      24477G
4    124601.0
5    247602.0
Name: code, dtype: object

Update

df['code'] = df['code'].astype(object)
s = pd.to_numeric(df['code'],errors='coerce')
df.loc[s.notna(),'code'] = s.dropna().astype(int)
df
Out[829]: 
     code
0   11700
1   11700
2   11700
3  24477G
4  124601
5  247602

CodePudding user response:

BENY's answer should work, although you potentially leave yourself open to catching exceptions and filling that you don't want to. This will also do the integer conversion you are looking for.

def convert(x):
    try:
        return str(int(float(x)))
    except ValueError:
        return x

df = pd.DataFrame({'code': ['1170E1', '1.17E 04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].apply(convert)

outputs

0     11700
1     11700
2     11700
3    24477G
4    124601
5    247602

where each element is a string.

I will be the first to say, I'm not proud of that triple cast.

  • Related