I have a large dask dataframe and am trying to compute the mean of a column. Each time I try the following command
dask_df.new_deceased.mean(numeric_only=True).compute()
I get
ValueError: could not convert string to float: 'new_confirmed'
When I import the CSV, I specify that the datatype of this column is float and when I call
dask_df.new_confirmed
I get
Dask Series Structure:
npartitions=525
float64
...
...
...
...
Name: new_confirmed, dtype: float64
Dask Name: getitem, 2101 tasks
So clearly the type of the column is correct. An additional source of confusion is why dask cares that new_confirmed is a string when I am computing a statistic on new_deceased.
Any help or advice would be greatly appreciated.
CodePudding user response:
As with pandas, when dask reads a text file it can't rely on embedded metadata specifying column types the way it can when reading parquet files or other binary formats. But in dask's case, parsing the entire file just to determine the type metadata would defeat the purpose. So dask is stuck with two options: either you tell it what the data types are, or it takes a peek at a sample of the data and assumes the rest has the same structure.
From the dask.dataframe.read_csv docs:
Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:
- Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.
- Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.
- Increase the size of the sample using the sample keyword.
It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.
Here's a simple illustration of the problem. Let's say we have the following (malformed) CSV file:
In [8]: myfile = '''
...: day,new_confirmed
...: 0,0
...: 1,4
...: 2,9
...: 3,2
...: 4,12
...: day,new_confirmed
...: 6,18
...: 7,19
...: 8,13
...: 9,20
...: '''.strip()
In [9]: with open('myfile.csv', 'w') as f:
...: f.write(myfile)
...:
If we tried to read this in with pandas, it would encounter the strings on row 6, cast the new_confirmed
column as object type, and you'd know immediately that you have some data cleaning to do.
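That pandas behavior is easy to check directly. Here's a minimal sketch that rebuilds the example file in memory (via StringIO rather than from disk) and inspects the inferred dtype:

```python
import io
import pandas as pd

# Reconstruction of the malformed CSV from the example above.
csv_text = """day,new_confirmed
0,0
1,4
2,9
3,2
4,12
day,new_confirmed
6,18
7,19
8,13
9,20
"""

pdf = pd.read_csv(io.StringIO(csv_text))
print(pdf["new_confirmed"].dtype)  # object - the stray header row forces string dtype
```

Because pandas parses the whole file eagerly, the repeated header row is spotted immediately and the column comes back as object.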
With dask, the situation is tougher. We'll read this file in, and only sample the first 5 rows to infer types. The dask default is higher, but this is a small example :)
In [10]: df = dask.dataframe.read_csv('myfile.csv', sample_rows=5)
In [11]: df
Out[11]:
Dask DataFrame Structure:
day new_confirmed
npartitions=1
int64 int64
... ...
Dask Name: read-csv, 1 tasks
In this case, you can see that dask thinks the column is integer type. This is even carried through into computations - when taking the mean, our buggy integer column was converted into a float:
In [12]: df.new_confirmed.mean()
Out[12]: dd.Scalar<series-..., dtype=float64>
We only find out we have a problem later, when computing the result:
In [13]: df.new_confirmed.mean().compute()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 df.new_confirmed.mean().compute()
...
File ~/opt/miniconda3/envs/myenv/lib/python3.10/site-packages/dask/dataframe/io/csv.py:284, in coerce_dtypes(df, dtypes)
281 msg = "Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.\n\n%s" % (
282 rule.join(filter(None, [dtype_msg, date_msg]))
283 )
--> 284 raise ValueError(msg)
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
--------------- -------- ----------
| Column | Found | Expected |
--------------- -------- ----------
| new_confirmed | object | int64 |
--------------- -------- ----------
The following columns also raised exceptions on conversion:
- new_confirmed
ValueError("invalid literal for int() with base 10: 'new_confirmed'")
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'new_confirmed': 'object'}
to the call to `read_csv`/`read_table`.
The best dask can do is to offer a helpful error message, which it did. So we can use this to understand that we have cleaning work we need to do!
CodePudding user response:
Instructing dask that the dtype is a float is not sufficient to guarantee that every partition is actually convertible to float. The reported dtype reflects only what the data is expected to be; if the actual data does not conform to that expectation, an error is raised during computation.
One way around this is to explicitly convert the data to a numeric dtype, coercing any unparseable values to NaN:
from dask.dataframe import to_numeric
ddf['float_col'] = to_numeric(ddf['dirty_col'], errors='coerce')
ddf['float_col'].mean().compute()
See this blog post for further details.