read_csv with dtypes, thousands and keep_default_na defined


I have an Excel-exported CSV file whose columns should be str (a date), str, float, float, float, float, and int. Some of the Excel cells are empty, so keep_default_na is needed. Some values are double-quoted and contain thousands separators. The combination of parameters seems to confuse the pandas parser, because running the code below raises:

ValueError: could not convert string to float: '1,917.6'

It seems that when keep_default_na is set, thousands gets ignored. When I run this on a CSV without any ,,,,,, lines, it works perfectly.

CSV FILE:

TS
Date,Symbol,Open,High,Low,Close,Volume
6/14/2022 23:59,A,918.1,918.1,918.1,918.1,1
,,,,,,
6/14/2022 23:57,A,"1,917.6",917.6,917.6,917.6,1
,,,,,,
,,,,,,

CODE:

import numpy as np
import pandas as pd

df = pd.read_csv('test.csv',
                 skiprows=1,          # skip the leading "TS" line
                 quotechar='"',
                 thousands=',',
                 keep_default_na=False,
                 dtype={'Open': np.float64, 'High': np.float64,
                        'Low': np.float64, 'Close': np.float64,
                        'Volume': np.uint32, 'Symbol': 'string'})

CodePudding user response:

I think the problem is that the routine which reports the offending value isn't as sophisticated as the full parsing engine: there are multiple dtype issues in these columns, and pandas is pointing at the wrong value as the culprit. Float columns can't hold the value "" (so it needs to be included in na_values), and integer columns can't hold NA values, so you either need to skip the footer or provide na_values and use float dtypes.
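To see the two constraints in isolation, here is a minimal sketch (my own illustration, not part of either fix below):

import numpy as np
import pandas as pd

# float columns: the empty string is not parseable as a float
try:
    float("")
except ValueError as exc:
    print(exc)  # could not convert string to float: ''

# integer columns: NumPy integer dtypes cannot represent missing values
try:
    pd.Series([1.0, np.nan]).astype(np.uint32)
except ValueError as exc:
    print(exc)  # Cannot convert non-finite values (NA or inf) to integer

Both of the following work for me: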

skipfooter works with your arguments, but the C engine doesn't support it, so pandas falls back to the slower Python engine:

In [26]: df = pd.read_csv('test.csv',
    ...:                  skiprows=1,
    ...:                  quotechar='"',
    ...:                  thousands=",",
    ...:                  keep_default_na=False,
    ...:                  dtype = {'Open': np.float64, 'High': np.float64,
    ...:                           'Low': np.float64, 'Close': np.float64,
    ...:                           'Volume': np.uint32, 'Symbol': 'string'},
    ...:                  skipfooter=2,
    ...: )
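If you want to avoid the ParserWarning that pandas emits on that fallback, you can request the Python engine explicitly. This is the same call as above, just with the engine spelled out:

df = pd.read_csv('test.csv',
                 skiprows=1,
                 quotechar='"',
                 thousands=',',
                 keep_default_na=False,
                 dtype={'Open': np.float64, 'High': np.float64,
                        'Low': np.float64, 'Close': np.float64,
                        'Volume': np.uint32, 'Symbol': 'string'},
                 skipfooter=2,
                 engine='python')  # explicit, so no fallback warning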

Alternatively, you could read with float columns and pass na_values=[""]:

In [29]: df = pd.read_csv('test.csv',
    ...:                  skiprows=1,
    ...:                  quotechar='"',
    ...:                  thousands=",",
    ...:                  keep_default_na=False,
    ...:                  dtype = {'Open': np.float64, 'High': np.float64,
    ...:                           'Low': np.float64, 'Close': np.float64,
    ...:                           'Volume': np.float32, 'Symbol': 'string'},
    ...:                  na_values=[""],
    ...: )
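Read this way, the ,,,,,, lines come through as all-NaN rows ('Volume' included, which is why it needs the float32 dtype here). A quick check, not part of the original answer:

df[df['Volume'].isnull()]  # the rows that came from the empty CSV lines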

At this point, you could drop rows based on any condition you like; for example, drop the rows with NaN in 'Volume', then convert that column to int:

In [30]: df = df[df.Volume.notnull()]

In [31]: df["Volume"] = df["Volume"].astype(int)

Alternatively, you could fill the missing entries with a sentinel ("invalid") marker value, e.g.:

In [32]: df["Volume"] = df["Volume"].fillna(-9999).astype(int)
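A third option, assuming pandas >= 0.24 (which introduced nullable integer dtypes), is to cast to the nullable "Int64" dtype, which can represent missing values directly, so no sentinel is needed:

df["Volume"] = df["Volume"].astype("Int64")  # NaN entries become pd.NA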