Reading Parquet files in Dask returns empty dataframe

I'm trying to replicate this Datashader/Parquet/Dask/Dash example in order to do something similar with my own data. Here's the git repository.

The steps to replicate include running the Jupyter notebook to convert the 4 GB CSV file into a Parquet file. I can run this code without issue: it creates a Parquet directory containing many ~70 MB files. But when I try to read the Parquet file back, I get an empty dataframe (though with the correct columns). After reading the CSV into a Dask dataframe and doing some processing, I can check head():

ddf.head()
   radio  mcc  net  area    cell  unit        lon        lat  range  samples  changeable              created              updated  averageSignal        x_3857        y_3857
0   UMTS  262    2   801   86355     0  13.285512  52.522202   1000        7           1  1282569574000000000  1300155341000000000              0  1.478936e+06  6.895103e+06
1    GSM  262    2   801    1795     0  13.276907  52.525714   5716        9           1  1282569574000000000  1300155341000000000              0  1.477979e+06  6.895745e+06
2    GSM  262    2   801    1794     0  13.285064  52.524000   6280       13           1  1282569574000000000  1300796207000000000              0  1.478887e+06  6.895432e+06
3   UMTS  262    2   801  211250     0  13.285446  52.521744   1000        3           1  1282569574000000000  1299466955000000000              0  1.478929e+06  6.895019e+06
4   UMTS  262    2   801   86353     0  13.293457  52.521515   1000        2           1  1282569574000000000  1291380444000000000              0  1.479821e+06  6.894977e+06
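For context, the dataframe above comes out of the notebook roughly like this. The sketch below is an assumption, not the notebook's exact code: the CSV path is hypothetical, and the projection step is inferred from the x_3857/y_3857 columns in the head() output (EPSG:3857, i.e. Web Mercator):

import numpy as np
import dask.dataframe as dd

# Hypothetical input path; the example notebook reads an OpenCelliD-style CSV.
ddf = dd.read_csv('./data/cell_towers.csv')

# Project lon/lat (EPSG:4326) to Web Mercator (EPSG:3857) -- this is what the
# x_3857/y_3857 columns shown above suggest the notebook computes.
R = 6378137.0  # sphere radius used by the Web Mercator projection
ddf['x_3857'] = ddf['lon'] * (R * np.pi / 180)
ddf['y_3857'] = np.log(np.tan((90 + ddf['lat']) * np.pi / 360)) * R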

write it to parquet:

import os

# Write the Parquet file to the ./data directory
os.makedirs('./data', exist_ok=True)
parquet_path = './data/cell_towers.parq'
ddf.to_parquet(parquet_path,
               compression='snappy',
               write_metadata_file=True)
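At this point it helps to sanity-check what the write actually produced. A quick directory listing (the part.N.parquet naming is Dask's default, so treat the expected output as an assumption):

print(sorted(os.listdir(parquet_path)))
# With write_metadata_file=True you should see the data parts plus the
# summary files, e.g.:
# ['_common_metadata', '_metadata', 'part.0.parquet', 'part.1.parquet', ...]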

and attempt to read from parquet:

ddy = dd.read_parquet('./data/cell_towers.parq')

but it returns an empty dataframe, albeit with the correct column names:

ddy.head(3)
> radio  mcc  net  area  cell  unit  lon  lat  range  samples  changeable  created  updated  averageSignal  x_3857  y_3857

len(ddy)
> 0
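One way to tell whether the write or the read is at fault is to open a single part file directly with pandas, bypassing Dask's metadata handling. This is a diagnostic sketch: it assumes pandas has a Parquet engine (pyarrow or fastparquet) installed and that the part files follow Dask's default part.*.parquet naming:

import glob
import pandas as pd

# If this prints rows, the data was written correctly and the problem is on
# the Dask read side (for example, a stale or inconsistent _metadata file).
part_files = sorted(glob.glob('./data/cell_towers.parq/part.*.parquet'))
print(pd.read_parquet(part_files[0]).head())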

This is the first time I'm using Dask dataframes and Parquet. It seems like this should just work, but there may be some basic concept I'm missing.

Minimal reproducible code snippet:

import pandas as pd
import dask.dataframe as dd

ddfx = dd.from_pandas(pd.DataFrame(range(10), columns=['A']), npartitions=2)
parquet_path = './dummy.parq'
ddfx.to_parquet(parquet_path,
                compression='snappy',
                write_metadata_file=True)
ddfy = dd.read_parquet('./dummy.parq')
print('Input DDF length: {0}. Output DDF length: {1}'.format(len(ddfx), len(ddfy)))

Input DDF length: 10. Output DDF length: 0

How can I write a DDF to parquet and then read it?

CodePudding user response:

I am unable to reproduce the error with dask=2022.05.2. There might be a version incompatibility, so I'd recommend installing dask, pandas, and fastparquet together in a fresh, dedicated environment.
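To rule that out quickly, print the installed versions, and pin the same Parquet engine on both the write and the read so the two sides cannot silently disagree. A sketch (engine='fastparquet' is just one choice; 'pyarrow' works the same way, and both were valid values in the Dask versions current at the time):

import dask
import pandas as pd
import dask.dataframe as dd

print('dask:', dask.__version__)
print('pandas:', pd.__version__)

# Pin one engine for both the write and the read.
ddfx = dd.from_pandas(pd.DataFrame(range(10), columns=['A']), npartitions=2)
ddfx.to_parquet('./dummy.parq', engine='fastparquet',
                compression='snappy', write_metadata_file=True)
ddfy = dd.read_parquet('./dummy.parq', engine='fastparquet')
print(len(ddfy))  # expect 10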
