Pandas: Read csv where all values are quoted and comma is used as decimal separator-CodePudding

I am currently trying to import a csv file in pandas that has a less than perfect format, as all the values including numbers are quoted. The format looks like this:

Date;Type;Amount;Currency;Category;Person;Account;Counter Account;Group;Note
"19.02.17";"Expenses";"-36,37";"EUR";"Groceries";"";"Bank account";"";"";""

Now, I have tried importing this using the following command:

import pandas

dtypes = {"Type":"string", "Amount": "float"}
table = pandas.read_csv("data.csv", delimiter = ";", decimal = ",", parse_dates = ["Date"], dtype = dtypes, quoting = 3)

So I have basically been trying to tell pandas that the decimal separator is comma, that the field delimiter is semicolon, and that the column "Amount" should be parsed as floats. However, trying to parse the file, I still get the error message:

ValueError: could not convert string to float: '689,15'"

I assume the combination of the quotes and the comma decimal separator somehow is too much for pandas, even though I think I have technically provided it with all the information it needs.

The file is an export from a third-party program, so unfortunately I have no influence on the format. Does anyone know how to get pandas to swallow this?

Bonus question: If I read this file without providing explicit data types, I don't get any columns of type "string" as I would have expected, but instead "object" is used. Why is that?

CodePudding user response：

Ok, I found a way that works using the converters argument to pass a lambda that does the conversion. I wasn't able to find a simpler solution so far.

toFloat = lambda x: float(x.replace(".", "").replace(",", "."))
table = pandas.read_csv("data.csv", delimiter = ";", decimal = ",", parse_dates = ["Date"], converters = {"Amount": toFloat})

The lambda toFloat takes a string, removes all occurrences of the . character (used as digit grouping symbol), replaces all occurrences of , by . and then converts the resulting string to float. This lambda is then passed as a converter for the column "Amount" to the read_csv function. Probably now also the decimal argument is expendable.

I am going to leave this question open for now to see if anyone can come up with a simpler solution.

CodePudding user response：

What about that ?

import pandas

table = pandas.read_csv("data.csv", sep=";", decimal=",")

print(table["Amount"][0])  # -36.37
print(type(table["Amount"][0]))   # <class 'numpy.float64'>
print(table["Amount"][0]   36.37)  # 0.0

Pandas automatically detects a number and converts it to numpy.float64.