Convert pandas to dask code and it errors out

I have pandas code which works perfectly.

import pandas as pd                                              
                                                                 
courses_df = pd.DataFrame(                                       
    [                                                            
        ["Jay", "MS"],                                           
        ["Jay", "Music"],                                        
        ["Dorsey", "Music"],                                     
        ["Dorsey", "Piano"],                                     
        ["Mark", "MS"],                                          
    ],                                                           
    columns=["Name", "Course"],                                  
)                                                                
                                                                 
pandas_df_json = (                                               
    courses_df.groupby(["Name"])                                 
    .apply(lambda x: x.drop(columns="Name").to_json(orient="records"))                
    .reset_index(name="courses_json")                            
)

But when I convert the DataFrame to Dask and try the same operation, it fails:

from dask import dataframe as dd  
df = dd.from_pandas(courses_df, npartitions=2)                                                           
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(                           
    name="courses_json"                                                                                  
).compute()

And the error I get is:

UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [37], in <module>
      1 from dask import dataframe as dd
      3 df = dd.from_pandas(courses_df, npartitions=2)
----> 4 df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(
      5     name="courses_json"
      6 ).compute()

TypeError: _Frame.reset_index() got an unexpected keyword argument 'name'

The output from Dask should be the same as from pandas, that is:

     Name                             courses_json
0  Dorsey  [{"Course":"Music"},{"Course":"Piano"}]
1     Jay     [{"Course":"MS"},{"Course":"Music"}]
2    Mark                        [{"Course":"MS"}]

How do I achieve this in Dask?

My attempt so far:

from dask import dataframe as dd                                             
                                                                             
df = dd.from_pandas(courses_df, npartitions=2)                               
df.groupby(["Name"]).apply(                                                  
    lambda x: x.drop(columns="Name").to_json(orient="records")               
).compute()                           
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.                
  Before: .apply(func)                                                                                                                                                 
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result                                                                                               
  or:     .apply(func, meta=('x', 'f8'))            for series result                                                                                                  
  df.groupby(["Name"]).apply(                                                                                                                                          
Out[57]:                                                                                                                                                               
Name                                                                                                                                                                   
Dorsey    [{"Course":"Piano"},{"Course":"Music"}]                                                                                                                      
Jay          [{"Course":"MS"},{"Course":"Music"}]                                                                                                                      
Mark                            [{"Course":"MS"}]                                                                                                                      
dtype: object

I want to pass in a meta argument and also want the second column to have a meaningful name like courses_json.

CodePudding user response:

For the meta warning: Dask expects you to describe the result, that is, its name and datatype. This is optional, but if you leave it out, Dask infers the metadata from partial data and may infer the wrong datatypes. One partition could, for example, be inferred as an int type and another as a float, which is particularly likely for sparse datasets. See the docs page for more details:

https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html
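To make the shape of meta concrete, here is a small pandas/NumPy-only sketch. It assumes courses_json as the result name (the code further down uses "Name" and renames afterwards). As far as I understand the Dask documentation, meta can be given either as a (name, dtype) tuple or as an empty pandas Series with the same name and dtype, so treat the second form as an assumption to verify against your Dask version.

```python
import numpy as np
import pandas as pd

# "O" is NumPy's one-character code for the generic Python-object dtype,
# which is how pandas stores the JSON strings in this example.
assert np.dtype("O") == np.dtype(object)

# Tuple form of meta for a Series result: (name, dtype).
# The name "courses_json" is an illustrative choice, not taken from the answer.
meta_tuple = ("courses_json", "O")

# The same description written as an empty pandas Series.
meta_series = pd.Series([], name=meta_tuple[0], dtype=meta_tuple[1])

print(meta_series.name, meta_series.dtype)
```

Either object only describes the result; it carries no data of its own.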

This should resolve the warning and also give the second column a meaningful name:

from dask import dataframe as dd                                             
                                                                             
df = dd.from_pandas(courses_df, npartitions=2)                               
new_df = df.groupby(["Name"]).apply(
     lambda x: x.drop(columns="Name").to_json(orient="records"),
     meta=("Name", "O") 
).to_frame()

# rename columns
new_df.columns = ["courses_json"]

# use numeric int index instead of name as in the given example 
new_df = new_df.reset_index()

new_df.compute()

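The result of your groupby-apply is a Dask Series, not a DataFrame. That is why meta is a (name, dtype) tuple here, with the dtype given as a NumPy type (https://www.w3schools.com/python/numpy/numpy_data_types.asp); "O" stands for object, which is how the JSON strings are stored. A Series consists of an index and values. As the TypeError shows, Dask's reset_index() does not accept the name argument that the pandas version has, so you cannot name the second column directly; convert the Series to a DataFrame with the .to_frame() method first and rename the column there.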
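The pandas-only sketch below illustrates why the .to_frame() route gives the same table as the original reset_index(name=...) call. It does not use Dask at all. The helper to_records_json is a hypothetical stand-in that builds the JSON string with json.dumps instead of to_json, and the groupby selects the Course column directly instead of dropping Name inside the lambda.

```python
import json

import pandas as pd

courses_df = pd.DataFrame(
    [
        ["Jay", "MS"],
        ["Jay", "Music"],
        ["Dorsey", "Music"],
        ["Dorsey", "Piano"],
        ["Mark", "MS"],
    ],
    columns=["Name", "Course"],
)


def to_records_json(courses):
    # Hypothetical helper: one compact JSON array of {"Course": ...} records.
    return json.dumps([{"Course": c} for c in courses], separators=(",", ":"))


# One JSON string per name; the result is a Series indexed by Name.
json_series = courses_df.groupby("Name")["Course"].apply(to_records_json)

# pandas shortcut: name the value column while resetting the index.
via_name = json_series.reset_index(name="courses_json")

# Two-step form that avoids the name argument: name the column first,
# then move Name out of the index.
via_frame = json_series.to_frame("courses_json").reset_index()

assert via_name.equals(via_frame)
print(via_frame)
```

Both forms end with the columns Name and courses_json, so the two-step form can stand in wherever reset_index(name=...) is not available.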