Unable to convert comma-separated integers and non-integer values to float in a Series column in Python

Loading in the data

in: import pandas as pd

in: df = pd.read_csv('name', sep = ';',  encoding='unicode_escape')

in : df.dtypes
out: amount     object

I have an object column with amounts like 150,01 and 43,69. There are about 5,000 rows.

df['amount']
0           31
1       150,01
2           50
3         54,4
4        32,79
         ...  
4950      25,5
4951      39,5
4952     75,56
4953       5,9
4954     43,69
Name: amount, Length: 4955, dtype: object

Naturally, I tried to convert the series with the locale module, which is supposed to turn it into a float format. I got the following error:

In: import locale
    locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')
Out: 'en_US.UTF-8'
In: df['amount'].apply(locale.atof)
Out: ValueError: could not convert string to float: ' -   '

Now that I'm aware that there are non-numeric values in the column, I tried to use the isnumeric methods to turn the non-numeric values into NaN.

Unfortunately, because of the comma-separated format, all the values turned into -1.


0      -1
1      -1
2      -1
3      -1
4      -1
       ..
4950   -1
4951   -1
4952   -1
4953   -1
4954   -1
Name: amount, Length: 4955, dtype: int64

How do I turn the "," into "." after first removing the "-" values? I tried .drop() and .truncate(), but they do not help. Replacing "," with " " via str.replace would also cause trouble, since there are non-integer values.

Please help!

Documentation that I came across

-https://stackoverflow.com/questions/21771133/finding-non-numeric-rows-in-dataframe-in-pandas

-https://stackoverflow.com/questions/56315468/replace-comma-and-dot-in-pandas

p.s. This is my first post, please be kind

CodePudding user response：

Sounds like you have a European-style CSV similar to the one below. If your format is different, please provide actual sample data, as many comments asked:

data.csv

thing;amount
thing1;31
thing2;150,01
thing3;50
thing4;54,4
thing5;1.500,22

To read it, specify the column, decimal and thousands separators as needed:

import pandas as pd

df = pd.read_csv('data.csv',sep=';',decimal=',',thousands='.')
print(df)

Output:

    thing   amount
0  thing1    31.00
1  thing2   150.01
2  thing3    50.00
3  thing4    54.40
4  thing5  1500.22
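
If the column also contains placeholder strings such as the ' -   ' from the error message in the question, one possible approach is to read the column as text, normalise the separators by hand, and let pd.to_numeric coerce whatever is left into NaN. This is only a sketch: the inline sample data and the thing3 placeholder row are made up for illustration, and it assumes "." is a thousands separator and "," the decimal separator.

```python
import io
import pandas as pd

# Hypothetical sample: same layout as data.csv above, plus one placeholder row
csv_text = (
    "thing;amount\n"
    "thing1;31\n"
    "thing2;150,01\n"
    "thing3; -   \n"
    "thing4;1.500,22\n"
)

# Read the amount column as plain strings so nothing is parsed yet
df = pd.read_csv(io.StringIO(csv_text), sep=';', dtype={'amount': str})

cleaned = (
    df['amount']
    .str.strip()                         # drop padding around values like ' -   '
    .str.replace('.', '', regex=False)   # remove thousands separators
    .str.replace(',', '.', regex=False)  # decimal comma -> decimal point
)

# Anything that is still not a number (e.g. '-') becomes NaN
df['amount'] = pd.to_numeric(cleaned, errors='coerce')
print(df)
```

The placeholder row ends up as NaN, so it can be dropped with dropna() or filled later, depending on what the hyphen is meant to represent.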

CodePudding user response：

Posting as an answer since it contains multi-line code, despite not truly answering your question (yet):

Try using chardet. Run pip install chardet to get the package, then add import chardet to your import block.

When importing the file, do something like:

import chardet
import pandas as pd

path = "C:/path/to/file.csv"

# read the raw bytes so chardet sees the file's actual encoding
with open(path, 'rb') as f:
    result = chardet.detect(f.read())
charencode = result['encoding']

# now read the file again, this time with the detected encoding
data = pd.read_csv(path, delimiter=';', encoding=charencode)

Alternatively, for reasons I cannot fathom, passing engine='python' as a parameter often works. You'd just do

data = pd.read_csv('C:/path/to/file.csv', engine='python')

@Mark Tolonen has a more elegant approach to standardizing the actual data, but my (hacky) way of doing it was to just write a function:

def stripThousands(df_column):
    # assumes ',' is a thousands separator, so it is simply removed
    df_column = df_column.replace(',', '', regex=True)
    # values that still cannot be parsed become NaN
    return pd.to_numeric(df_column, errors='coerce')

If you don't care about the entries that are just hyphens, you could use a function like

def screw_hyphens(column):
    # needs `import numpy as np`; only entries that are exactly '-' are replaced
    return column.replace(['-'], np.nan)

Or, if np.nan values will be a problem, you can replace the hyphens with an empty string instead: column = column.replace('-', '')
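
Before replacing anything, it may help to see which entries are actually non-numeric. The following is a rough sketch rather than part of the approach above: the sample Series is invented to mimic the question's column, and it assumes the comma is a decimal separator.

```python
import pandas as pd

# Hypothetical sample mimicking the 'amount' column from the question
amount = pd.Series(['31', '150,01', ' -   ', '54,4', '-'], name='amount')

# Swap the decimal comma for a dot, then try to parse every value;
# entries that cannot be parsed come back as NaN
parsed = pd.to_numeric(amount.str.replace(',', '.', regex=False), errors='coerce')

# Select the original strings that failed to parse
bad_rows = amount[parsed.isna()]
print(bad_rows)
```

Looking at bad_rows first shows whether the placeholders are bare hyphens or hyphens padded with spaces, which matters because an exact-match replace on '-' would miss the padded ones.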

EDIT: There was a typo in the block outlining the usage of chardet. It should be correct now (previously the end of the last line was encoding=charenc).