I'm trying to use Dask to read a large (~500 MB) CSV file containing pairs of float and str columns that are read together, so two lists are held in memory at a time. This works without errors, but I get a DtypeWarning because the CSV is ragged (the column lengths differ significantly between pairs), which mixes nan (float) with strings before I can filter them out with .dropna().
Data structure (much larger in practice):
Col1: [1, 2, 3, 4, nan, nan, nan] -> (ok, no warning)
Col2: ['A', 'B', 'C', 'D', nan, nan, nan] -> (mixed str and float; DtypeWarning)
I want to avoid using low_memory=False since it significantly impacts execution time (~6x slower), but it also seems like good practice not to simply ignore the warning.
Is there a way to improve the example code below to prevent this warning?
import pandas as pd
import dask.dataframe as dd
length = 10000000
ratio = 0.9
A = {'A': [0.07]*length, 'B': ['a']*int(length*ratio) + [float('nan')]*(length-int(length*ratio))}
pd.DataFrame(A).to_csv('testing.csv')
df = dd.read_csv('testing.csv')['B'].compute().dropna().tolist()
CodePudding user response:
Some suggestions:
- If the file fits into memory, it's often more convenient to use plain pandas. The gain from dask's parallel processing kicks in for large tasks/files.
- It might be worth specifying dtype explicitly, e.g. dtype='str' or as appropriate (a sketch follows this list).
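A minimal sketch of the second suggestion, assuming the file and column name from the question ('testing.csv', 'B'); declaring the dtype up front lets the parser skip the per-chunk type inference that triggers the DtypeWarning:
import dask.dataframe as dd
# Tell the parser that column 'B' holds strings, so no mixed-type inference
# is needed; missing values still come through as NaN and can be dropped.
df = dd.read_csv('testing.csv', dtype={'B': 'str'})
values = df['B'].dropna().compute().tolist()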
CodePudding user response:
Answering my own question:
While I was originally concerned about execution time, I found that splitting the file into 10 smaller CSV files makes header extraction with pandas much faster (~1 second in total instead of ~5 seconds for the single large file). That puts it on par with the time dask needs to prepare its lazy CSV representations. Loading the columns I'm after is then actually faster with this method than with dask, even when low_memory=False is used (<1 second for pandas vs ~3 seconds for dask).
This seems to address the issue efficiently, especially compared to loading one large CSV in pandas (~60 seconds becomes ~2 seconds).
Summary:
- Split the large CSV into 10 smaller CSV files
- Collect the column headers
- Read columns as needed, using low_memory=False (a rough sketch of these steps is below)
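A rough sketch of those steps, assuming a simple row-wise split and illustrative file/column names ('testing.csv', 'B') rather than anything from the original data:
import pandas as pd

src = 'testing.csv'
n_parts = 10

# 1) Split the large CSV into n_parts smaller files.
with open(src) as fh:
    n_rows = sum(1 for _ in fh) - 1              # data rows, excluding the header
rows_per_part = -(-n_rows // n_parts)            # ceiling division
part_files = []
for i, chunk in enumerate(pd.read_csv(src, chunksize=rows_per_part)):
    name = f'testing_part{i}.csv'
    chunk.to_csv(name, index=False)
    part_files.append(name)

# 2) Collect the column headers (reading zero rows is enough).
headers = pd.read_csv(part_files[0], nrows=0).columns.tolist()

# 3) Read only the columns that are needed, with low_memory=False.
wanted = ['B']                                   # illustrative column selection
values = pd.concat(
    pd.read_csv(f, usecols=wanted, low_memory=False) for f in part_files
)['B'].dropna().tolist()
The split cost is paid once; after that, only the relevant part files and columns need to be read on each run.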