Here is a simple demonstration code. I expect dtypes as{'a':str, 'b':str, 'c':int, 'd':float}
, especially, str b
is honored as the input.
import io
import pandas as pd
s = """a,b,c,d
a1,,1,1.2
a2,00,,3.3
a3,01,3,2.3
a4,10,4,1.3"""
df = pd.read_csv(io.StringIO(s))
df
a b c d
0 a1 NaN 1.0 1.2
1 a2 0.0 NaN 3.3
2 a3 1.0 3.0 2.3
3 a4 10.0 4.0 1.3
Not what I expect (column b
is wrong).
df = pd.read_csv(io.StringIO(s), dtype=str, na_filter=False)
df
a b c d
0 a1 1 1.2
1 a2 00 3.3
2 a3 01 3 2.3
3 a4 10 4 1.3
Looks correct, but failed when converting column c
to int or float.
df['c'] = df['c'].astype(int)
---------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_3903327/3842575137.py in <module>
1 df = pd.read_csv(io.StringIO(s), dtype=str, na_filter=False)
----> 2 df['c'] = df['c'].astype('float')
3 #df['d'] = df['d'].astype(float)
4 df
...
ValueError: invalid literal for int() with base 10: ''
The following code gives the same error
df = pd.read_csv(io.StringIO(s),
dtype={'a':str, 'b':str, 'c':int, 'd':float},
na_filter=False)
df
CodePudding user response:
The error is because regular int
data type is not nullable and your data includes the null value.
To work with the nullable value, Pandas has an extension types. Int8, Int16, Int32, Int64 are the extension of integer type, called Nullable integer data type. The null value will be replaced with null-like value (pd.NA).
In order to use these extension types for read_csv
, you can pass with dtype
. Pick any of "IntXX" variation that is appropriate for your data.
df = pd.read_csv(io.StringIO(s), header=0, dtype={'a': str, 'b': str, 'c': 'Int32', 'd': float})
You can later convert to regular int
using astype
.
df['c'] = df.c.fillna(0).astype(int)
For more operations with the nullable integer type, please refer to the link above.