pandas.read_csv: Using converters AND other preprocessors together


When reading a CSV file with pandas.read_csv, it seems that once the converters keyword is used, some of the other preprocessing arguments (tested with decimal and na_values) have no effect. Example:

import pandas as pd
from io import StringIO

txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 n/a 7"
buffer = StringIO(txt)

converters = {i: lambda x: x for i in range(3)}  # dummy converter

df1 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"])
buffer.seek(0)
df2 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"], converters=converters)

print(df1)
print(df2)


      A    B    C
0  12.5  4.0  NaN
1   3.1  7.0  5.6
2   8.0  NaN  7.0
      A    B    C
0  12,5    4  xxx
1   3,1    7  5,6
2     8  n/a    7

df1 is imported correctly, while df2 holds unconverted strings (dtype: object). Apparently the arguments decimal and na_values are ignored as soon as a converter is defined explicitly for a column. Even the built-in default NA conversion of the string "n/a" fails (keep_default_na=True).

Question: Is there a way to use converters together with the other preprocessing arguments? Or do I have to extend my application-specific converters with additional decimal-sign and NA-value handling manually? (For illustration, that manual extension might look like the sketch below.)
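A minimal sketch of that manual fallback, assuming a hypothetical wrapper with_preprocessing that handles NA strings and the decimal comma before calling the actual converter (the set of NA markers and the use of float are assumptions for this example, not from the original post):

import pandas as pd
from io import StringIO

NA_STRINGS = {"xxx", "n/a", ""}  # assumed NA markers

def with_preprocessing(converter):
    # Wrap a converter so it also handles NA strings and decimal commas itself.
    def wrapped(value):
        if value in NA_STRINGS:
            return float("nan")
        return converter(value.replace(",", "."))
    return wrapped

converters = {i: with_preprocessing(float) for i in range(3)}

txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 n/a 7"
df = pd.read_csv(StringIO(txt), sep=" ", converters=converters)
print(df)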

CodePudding user response:

No, it's not possible. The documentation does not say how converters interact with the sep, decimal and thousands parameters; it only says so for dtype:

If converters are specified, they will be applied INSTEAD of dtype conversion.

Consider these parameters as being used only by the default converter of read_csv.

A possible solution is to let read_csv parse your file first, then use assign to modify the values of each column.

For instance:

# 'converters' here stands for your own function that post-processes a whole column (Series)
df = (pd.read_csv('input.csv', sep=';', decimal=',', thousands='_')
        .assign(col1=lambda x: converters(x['col1']),
                col2=lambda x: converters(x['col2'])))
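
If the converters are kept in a dict keyed by column, as in the question, one sketch of the same post-processing idea is to apply them after parsing with Series.apply (the column names and dummy converters below are assumptions for illustration):

import pandas as pd
from io import StringIO

txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 n/a 7"

# Let read_csv do the decimal/NA preprocessing first ...
df = pd.read_csv(StringIO(txt), sep=" ", decimal=",", na_values=["xxx"])

# ... then apply the application-specific converters column by column.
post_converters = {"A": lambda x: x, "B": lambda x: x, "C": lambda x: x}  # dummy converters
for col, func in post_converters.items():
    df[col] = df[col].apply(func)

print(df)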

CodePudding user response:

I didn't look into why in detail, but it looks like this depends on the engine used. You can specify engine='python' to make it work:

import pandas as pd
from io import StringIO

txt = "A B C\n12,5 4 xxx\n3,1 7 5,6\n8 4 7"
buffer = StringIO(txt)

converters = {i: lambda x: x for i in range(3)}  # dummy converter

df1 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"])
buffer.seek(0)
df2 = pd.read_csv(buffer, sep=" ", decimal=",", na_values=["xxx"], converters=converters, engine='python')

print(df1)
print(df2)

output:

      A  B    C
0  12.5  4  NaN
1   3.1  7  5.6
2   8.0  4  7.0
      A  B    C
0  12.5  4  NaN
1   3.1  7  5.6
2     8  4    7