I'm trying to replicate this datashader parquet dask dash example in order to do something similar with my own data. Here's the git code.
The steps to replicate include running the Jupyter notebook to convert the 4 GB CSV file into a Parquet file. I can run this code without issue; it creates a parquet directory containing many ~70 MB files. But when I try to read the Parquet file back, it returns an empty dataframe (with the correct columns). So after reading the CSV into a Dask dataframe and doing some processing, I can check the head():
ddf.head()
radio mcc net area cell unit lon lat range samples changeable created updated averageSignal x_3857 y_3857
0 UMTS 262 2 801 86355 0 13.285512 52.522202 1000 7 1 1282569574000000000 1300155341000000000 0 1.478936e+06 6.895103e+06
1 GSM 262 2 801 1795 0 13.276907 52.525714 5716 9 1 1282569574000000000 1300155341000000000 0 1.477979e+06 6.895745e+06
2 GSM 262 2 801 1794 0 13.285064 52.524000 6280 13 1 1282569574000000000 1300796207000000000 0 1.478887e+06 6.895432e+06
3 UMTS 262 2 801 211250 0 13.285446 52.521744 1000 3 1 1282569574000000000 1299466955000000000 0 1.478929e+06 6.895019e+06
4 UMTS 262 2 801 86353 0 13.293457 52.521515 1000 2 1 1282569574000000000 1291380444000000000 0 1.479821e+06 6.894977e+06
write it to parquet:
import os

# Write parquet file to the ./data directory
os.makedirs('./data', exist_ok=True)
parquet_path = './data/cell_towers.parq'
ddf.to_parquet(parquet_path,
               compression='snappy',
               write_metadata_file=True)
and attempt to read from parquet:
ddy = dd.read_parquet('./data/cell_towers.parq')
but it returns an empty dataframe, albeit with the correct column names:
ddy.head(3)
> radio mcc net area cell unit lon lat range samples changeable created updated averageSignal x_3857 y_3857
len(ddy)
> 0
This is the first time I'm using Dask dataframes and Parquet. It seems like this should just work, so there is probably some basic concept I'm missing here.
Small replicable code snippet:
import pandas as pd
import dask.dataframe as dd

ddfx = dd.from_pandas(pd.DataFrame(range(10), columns=['A']), npartitions=2)
parquet_path = './dummy.parq'
ddfx.to_parquet(parquet_path,
                compression='snappy',
                write_metadata_file=True)
ddfy = dd.read_parquet('./dummy.parq')
print('Input DDF length: {0} . Output DDF length: {1}'.format(len(ddfx), len(ddfy)))
Input DDF length: 10 . Output DDF length: 0
How can I write a DDF to parquet and then read it?
CodePudding user response:
I am unable to reproduce the error using dask=2022.05.2. There might be a version incompatibility, so I'd recommend installing dask, pandas and fastparquet in a dedicated environment.
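As a starting point, here's a minimal diagnostic sketch (nothing beyond the standard version attributes of these libraries is assumed) to check which versions are actually being used in the environment that shows the problem:

import dask
import pandas
import fastparquet

# Report the versions of the libraries involved in the parquet round-trip
print('dask:', dask.__version__)
print('pandas:', pandas.__version__)
print('fastparquet:', fastparquet.__version__)

A fresh environment can be created with something like conda create -n dask-parquet -c conda-forge dask pandas fastparquet. If both pyarrow and fastparquet happen to be installed, it may also be worth passing engine='fastparquet' (or engine='pyarrow') explicitly to both to_parquet and read_parquet so the same engine handles the write and the read; that is a general suggestion rather than a confirmed fix for the case above.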