When reading a CSV file with pandas.read_csv, it seems that when the keyword converters is used, some other preprocessing arguments (tested with decimal and na_values) have no effect. Example:
import pandas as pd
from io import StringIO
txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 n/a 7"
buffer = StringIO(txt)
converters = {i: lambda x: x for i in range(3)} # dummy converter
df1 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"])
buffer.seek(0)
df2 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"], converters=converters)
print(df1)
print(df2)
      A    B    C
0  12.5  4.0  NaN
1   3.1  7.0  5.6
2   8.0  NaN  7.0
      A    B    C
0  12,5    4  xxx
1   3,1    7  5,6
2     8  n/a    7
df1 is imported correctly, while df2 holds unconverted strings (dtype: object). Apparently the arguments decimal and na_values are ignored as soon as a converter is defined explicitly for the column. Even the built-in default NA conversion of the string "n/a" fails (keep_default_na=True).
Question: Is there a way to use converters and other preprocessing arguments together? Or do I have to extend my application-specific converters manually with additional decimal-sign and NA-value handling?
CodePudding user response:
No, it's not possible. The documentation is not clear about parameters such as decimal and thousands; it only mentions dtype:
"If converters are specified, they will be applied INSTEAD of dtype conversion."
Consider these parameters as only used by the default converter of read_csv.
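If the conversion has to stay inside converters, the decimal sign and NA strings can be handled in the converter itself. The following is a minimal sketch using the sample data from the question; the helper to_float and the set NA_STRINGS are hypothetical names chosen for illustration, not part of pandas.

```python
from io import StringIO

import numpy as np
import pandas as pd

txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 n/a 7"

# Assumption: these are the strings that should be treated as missing.
NA_STRINGS = {"", "n/a", "xxx"}


def to_float(value):
    """Hypothetical converter: handles NA markers and a decimal comma itself."""
    value = value.strip()
    if value in NA_STRINGS:
        return np.nan
    return float(value.replace(",", "."))


# The converter receives each raw cell as a string, so it does the work
# that decimal= and na_values= would otherwise do.
df = pd.read_csv(StringIO(txt), sep=" ", converters={c: to_float for c in "ABC"})
print(df)
```

Any application-specific logic would go into to_float after the NA check and the decimal replacement.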
A possible solution is to let read_csv parse your file, then use assign to modify the values of each column.
For instance:
df = (pd.read_csv('input.csv', sep=';', decimal=',', thousands='_')
        # converters stands for your own function applied to a whole column
        .assign(col1=lambda x: converters(x['col1']),
                col2=lambda x: converters(x['col2'])))
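As a self-contained sketch of this approach with the sample data from the question: read_csv does the decimal and NA handling first, then assign applies a column-wise function. The function double is a hypothetical stand-in for whatever application-specific conversion is needed.

```python
from io import StringIO

import pandas as pd

txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 n/a 7"


def double(col):
    """Hypothetical application-specific conversion, applied to a whole column."""
    return col * 2


df = (
    # No converters here, so decimal= and na_values= take effect as usual.
    pd.read_csv(StringIO(txt), sep=" ", decimal=",", na_values=["xxx"])
    # Post-process the already parsed numeric column.
    .assign(A=lambda d: double(d["A"]))
)
print(df)
```

Because the custom function now sees parsed floats and NaN rather than raw strings, it does not need to know about the decimal sign or the NA markers.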
CodePudding user response:
I didn't look into why this is so, but it seems to depend on the engine used. You can specify engine='python' to make it work:
import pandas as pd
from io import StringIO
txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 4 7"
buffer = StringIO(txt)
converters = {i: lambda x: x for i in range(3)} # dummy converter
df1 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"])
buffer.seek(0)
df2 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"], converters=converters, engine='python')
print(df1)
print(df2)
output:
      A  B    C
0  12.5  4  NaN
1   3.1  7  5.6
2   8.0  4  7.0
      A  B    C
0  12.5  4  NaN
1   3.1  7  5.6
2     8  4    7
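Since the behaviour appears to depend on the engine, it may be worth checking both engines against the pandas version actually installed. The sketch below only prints the resulting dtypes for each engine and makes no claim about what they will be; the dictionary results is a hypothetical name used for illustration.

```python
from io import StringIO

import pandas as pd

txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 4 7"
converters = {i: lambda x: x for i in range(3)}  # dummy converter

results = {}
for engine in ("c", "python"):
    # Same arguments for both runs; only the parsing engine differs.
    results[engine] = pd.read_csv(
        StringIO(txt),
        sep=" ",
        decimal=",",
        na_values=["xxx"],
        converters=converters,
        engine=engine,
    )
    print(engine, results[engine].dtypes.to_dict())
```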