Consider the following code:
import dask
import dask.dataframe as dd
import pandas as pd
data_dict = {'data1':[1,2,3,4,5,6,7,8,9,10]}
df_pd = pd.DataFrame(data_dict)
df_dask = dd.from_pandas(df_pd,npartitions=2)
df_dask['data1x2'] = df_dask['data1'].apply(lambda x:2*x,meta=('data1x2','int64')).compute()
print('-'*80)
print(df_dask['data1x2'])
print('-'*80)
print(df_dask['data1x2'].compute())
print('-'*80)
What I can't figure out is: why is there a difference between the output of the first and second print? After all, I called compute when I applied the function and stored the result in df_dask['data1x2'].
CodePudding user response:
The first print only shows the lazy version of the dask series, df_dask["data1x2"]:
Dask Series Structure:
npartitions=2
0 int64
5 ...
9 ...
Name: data1x2, dtype: int64
Dask Name: getitem, 15 tasks
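This shows the number of partitions, the index values (if known), the number of tasks needed to get the final result, and some other information. At this stage, dask has not computed the actual series, so the values inside this dask series are not known. The .compute() you called during the assignment only materialized the right-hand side of that statement; df_dask itself is still a dask DataFrame, so selecting a column from it gives you a lazy series again. Calling .compute() on that column launches the 15 tasks needed to get the actual values, and that's what is printed the second time.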