Does Dask compute store results?

Time:04-04

Consider the following code:

import dask
import dask.dataframe as dd
import pandas as pd

data_dict = {'data1':[1,2,3,4,5,6,7,8,9,10]}
df_pd     = pd.DataFrame(data_dict) 
df_dask   = dd.from_pandas(df_pd,npartitions=2)

df_dask['data1x2'] = df_dask['data1'].apply(lambda x:2*x,meta=('data1x2','int64')).compute()

print('-'*80)
print(df_dask['data1x2'])
print('-'*80)
print(df_dask['data1x2'].compute())
print('-'*80)

What I can't figure out is: why is there a difference between the output of the first and second print? After all, I called compute when I applied the function and stored the result in df_dask['data1x2'].

CodePudding user response:

The first print shows only the lazy representation of the Dask series, df_dask["data1x2"]:

Dask Series Structure:
npartitions=2
0    int64
5      ...
9      ...
Name: data1x2, dtype: int64
Dask Name: getitem, 15 tasks

This shows the number of partitions, the index divisions (if known), the number of tasks needed to produce the final result, and some other metadata. Your .compute() call did run, but it returned a pandas Series, and assigning that Series to a column of a Dask DataFrame wraps it back into a lazy task graph. At this stage, Dask has not computed the column, so the values inside this Dask series are not known. Calling .compute() launches the 15 tasks needed to get the actual values, and that is what the second print shows.
