UPDATE: When initally posting the question, I did not realize that .
is used in some entries as the digit grouping symbol. However, this information is crucial to the problem. Here is the original question:
I am currently trying to import a csv file in pandas that has a less than perfect format, as all the values including numbers are quoted. The format looks like this:
Date;Type;Amount;Currency;Category;Person;Account;Counter Account;Group;Note
"19.02.17";"Expenses";"-36,37";"EUR";"Groceries";"";"Bank account";"";"";""
Now, I have tried importing this using the following command:
import pandas
dtypes = {"Type":"string", "Amount": "float"}
table = pandas.read_csv("data.csv", delimiter = ";", decimal = ",", parse_dates = ["Date"], dtype = dtypes, quoting = 3)
So I have basically been trying to tell pandas that the decimal separator is comma, that the field delimiter is semicolon, and that the column "Amount" should be parsed as floats. However, trying to parse the file, I still get the error message:
ValueError: could not convert string to float: '689,15'"
I assume the combination of the quotes and the comma decimal separator somehow is too much for pandas, even though I think I have technically provided it with all the information it needs.
The file is an export from a third-party program, so unfortunately I have no influence on the format. Does anyone know how to get pandas to swallow this?
Bonus question: If I read this file without providing explicit data types, I don't get any columns of type "string" as I would have expected, but instead "object" is used. Why is that?
CodePudding user response:
What about that ?
import pandas
table = pandas.read_csv("data.csv", sep=";", decimal=",")
print(table["Amount"][0]) # -36.37
print(type(table["Amount"][0])) # <class 'numpy.float64'>
print(table["Amount"][0] 36.37) # 0.0
Pandas automatically detects a number and converts it to numpy.float64
.
Edit:
As @bweber discovered, some values in data.csv
contained more than 3 digits, and used a digit grouping symbol '.
'. In order to convert the String to Integer, the symbol used must be passed to the read_csv() method:
table = pandas.read_csv("data.csv", sep=";", decimal=",", thousands='.')
CodePudding user response:
UPDATE: While this answer still works, there is an easier solution. I initially did not realise that some of the entries use .
as a digit grouping symbol, hence the confusion.
Ok, I found a way that works using the converters
argument to pass a lambda that does the conversion. I wasn't able to find a simpler solution so far.
toFloat = lambda x: float(x.replace(".", "").replace(",", "."))
table = pandas.read_csv("data.csv", delimiter = ";", decimal = ",", parse_dates = ["Date"], converters = {"Amount": toFloat})
The lambda toFloat
takes a string, removes all occurrences of the .
character (used as digit grouping symbol), replaces all occurrences of ,
by .
and then converts the resulting string to float. This lambda is then passed as a converter for the column "Amount" to the read_csv
function. Probably now also the decimal
argument is expendable.
I am going to leave this question open for now to see if anyone can come up with a simpler solution.