Home > Software engineering >  Pandas: Read csv with quoted values, comma as decimal separator, and period as digit grouping symbol
Pandas: Read csv with quoted values, comma as decimal separator, and period as digit grouping symbol

Time:02-28

UPDATE: When initally posting the question, I did not realize that . is used in some entries as the digit grouping symbol. However, this information is crucial to the problem. Here is the original question:

I am currently trying to import a csv file in pandas that has a less than perfect format, as all the values including numbers are quoted. The format looks like this:

Date;Type;Amount;Currency;Category;Person;Account;Counter Account;Group;Note
"19.02.17";"Expenses";"-36,37";"EUR";"Groceries";"";"Bank account";"";"";""

Now, I have tried importing this using the following command:

import pandas

dtypes = {"Type":"string", "Amount": "float"}
table = pandas.read_csv("data.csv", delimiter = ";", decimal = ",", parse_dates = ["Date"], dtype = dtypes, quoting = 3)

So I have basically been trying to tell pandas that the decimal separator is comma, that the field delimiter is semicolon, and that the column "Amount" should be parsed as floats. However, trying to parse the file, I still get the error message:

ValueError: could not convert string to float: '689,15'"

I assume the combination of the quotes and the comma decimal separator somehow is too much for pandas, even though I think I have technically provided it with all the information it needs.

The file is an export from a third-party program, so unfortunately I have no influence on the format. Does anyone know how to get pandas to swallow this?

Bonus question: If I read this file without providing explicit data types, I don't get any columns of type "string" as I would have expected, but instead "object" is used. Why is that?

CodePudding user response:

What about that ?

import pandas

table = pandas.read_csv("data.csv", sep=";", decimal=",")

print(table["Amount"][0])  # -36.37
print(type(table["Amount"][0]))   # <class 'numpy.float64'>
print(table["Amount"][0]   36.37)  # 0.0

Pandas automatically detects a number and converts it to numpy.float64.


Edit:

As @bweber discovered, some values in data.csv ​​contained more than 3 digits, and used a digit grouping symbol '.'. In order to convert the String to Integer, the symbol used must be passed to the read_csv() method:

table = pandas.read_csv("data.csv", sep=";", decimal=",", thousands='.')

CodePudding user response:

UPDATE: While this answer still works, there is an easier solution. I initially did not realise that some of the entries use . as a digit grouping symbol, hence the confusion.

Ok, I found a way that works using the converters argument to pass a lambda that does the conversion. I wasn't able to find a simpler solution so far.

toFloat = lambda x: float(x.replace(".", "").replace(",", "."))
table = pandas.read_csv("data.csv", delimiter = ";", decimal = ",", parse_dates = ["Date"], converters = {"Amount": toFloat})

The lambda toFloat takes a string, removes all occurrences of the . character (used as digit grouping symbol), replaces all occurrences of , by . and then converts the resulting string to float. This lambda is then passed as a converter for the column "Amount" to the read_csv function. Probably now also the decimal argument is expendable.

I am going to leave this question open for now to see if anyone can come up with a simpler solution.

  • Related