I'm still getting to grips with working in pandas.
What I'd like to do is convert a column's type (from string to integer). The column is encoded as string data but includes mostly integer values. I'd like the whole column to be of type integer. On the few occasions where conversion is not possible, I'd like it to just be NA / nan.
I'm migrating from R, where this behaviour is somewhat expected:
df <- data.frame(
"id" = c(1,2,3),
"age" = c("12", "not_an_age", "34 and a half")
)
converted_df <- dplyr::mutate(df, age = as.numeric(age))
converted_df
### output
# id age
# 1 12
# 2 NA
# 3 NA
In Python
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3], 'age': ['12', 'not_an_age', '34 and a half']})
# not run
# astype only allows errors to be raised or ignored, not coerced
df['age'].astype('int')
How can I create the result I expect from R, in pandas? It feels like there is a function/argument to a function I'm forgetting about.
Thanks
CodePudding user response:
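To handle a column that mixes integers and missing values, coerce it with pd.to_numeric and then cast to one of the nullable integer dtypes (pd.Int16Dtype(), pd.Int64Dtype(), and so on):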
>>> pd.to_numeric(df.age, errors='coerce').astype(pd.Int16Dtype())
0 12
1 <NA>
2 <NA>
Name: age, dtype: Int16
If you cast to plain int instead, it will raise an exception, because a NumPy integer column cannot hold NaN:
>>> pd.to_numeric(df.age, errors='coerce').astype(int)
...
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
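Putting the two steps together, here is a minimal end-to-end sketch that writes the converted column back into the DataFrame from the question. The intermediate variable name numeric and the choice of "Int64" (rather than Int16) are illustrative assumptions; any nullable integer dtype wide enough for your data should behave the same way.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "age": ["12", "not_an_age", "34 and a half"]})

# Step 1: parse strings to numbers; anything unparseable becomes NaN.
# The result is float64 at this point, because NaN is a float.
numeric = pd.to_numeric(df["age"], errors="coerce")

# Step 2: cast to a nullable integer dtype so missing values become <NA>
# while the valid entries are stored as real integers.
df["age"] = numeric.astype("Int64")

print(df)
print(df.dtypes)
```

Note that this works here because every value that parses is a whole number. A string such as "34.5" would parse to a non-integer float, and the cast to a nullable integer dtype would then be refused rather than silently truncated.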