Error comparing dask date month with an integer

In the code below, the function passed to dask's map_partitions compares the month of a date column to an integer. The comparison fails with the following error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What does this error mean, and how can I fix it?

import pandas as pd
import dask
import dask.dataframe as dd
import datetime

pdf = pd.DataFrame({
    'id2': [1, 1, 1, 2, 2],
    'balance': [150, 140, 130, 280, 260],
    'date2' : [datetime.datetime(2021,3,1), datetime.datetime(2021,4,1), 
               datetime.datetime(2021,5,1), datetime.datetime(2021,1,1), 
               datetime.datetime(2021,2,1)]
})

ddf = dd.from_pandas(pdf, npartitions=1) 

def func2(obj):
    m = obj.date2.dt.month
    if m > 10:
        return 1
    else:
        return 2

ddf2 = ddf.map_partitions(func2, meta=int)
ddf2.compute()   # <-- fails here

CodePudding user response:

With .map_partitions, each partition of the dask dataframe (which is a pandas dataframe) is passed to the function func2. As a result, obj.date2.dt.month refers to a Series, not a single value, so m > 10 produces a boolean Series with one entry per row. An if statement needs a single True or False, and pandas raises this ValueError because it cannot decide how to reduce many booleans to one.
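To see the ambiguity in isolation, here is a minimal pandas-only sketch. The Series months is a made-up stand-in for obj.date2.dt.month inside one partition, and the variable names are only illustrative:

```python
import pandas as pd

# Hypothetical stand-in for obj.date2.dt.month within a single partition.
months = pd.Series([3, 4, 5, 1, 2, 11])

# Comparing a Series with an integer is element-wise and yields a boolean Series.
mask = months > 10

# Using that boolean Series in `if` forces bool(mask), which pandas refuses to do.
try:
    if mask:
        pass
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)

# Reducing the mask explicitly gives a single, well-defined bool.
any_late = bool(mask.any())  # True if at least one month is > 10
all_late = bool(mask.all())  # True only if every month is > 10
```

Whether .any() or .all() is the right reduction depends on what you actually want to know about the partition. If you want one result per row, keep the mask and use it for selection instead of reducing it.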

One option is to create a new column whose value depends on the dt.month result, as in the snippet below:

import pandas as pd
import dask
import dask.dataframe as dd
import datetime

pdf = pd.DataFrame({
    'id2': [1, 1, 1, 2, 2],
    'balance': [150, 140, 130, 280, 260],
    'date2' : [datetime.datetime(2021,3,1), datetime.datetime(2021,4,1), 
               datetime.datetime(2021,5,1), datetime.datetime(2021,1,1), 
               datetime.datetime(2021,2,1)]
})

ddf = dd.from_pandas(pdf, npartitions=1) 

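def func2(obj):
    # obj is one partition, i.e. a pandas DataFrame; copy it so the input is not mutated
    obj = obj.copy()
    m = obj.date2.dt.month
    # boolean masks select the rows to fill, so no scalar truth value is needed
    obj.loc[m>10, 'new_int']=1
    obj.loc[m<=10, 'new_int']=2
    return obj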

ddf2 = ddf.map_partitions(func2)
ddf2.compute()
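
As a possible alternative to the two .loc assignments, the per-partition function could build the column in one step with numpy.where. This is only a sketch: add_month_flag and the small frame part are hypothetical names for illustration, and the function is exercised here on a plain pandas DataFrame, which is what map_partitions hands to it.

```python
import datetime

import numpy as np
import pandas as pd


def add_month_flag(part: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of one partition with a new column 'new_int'."""
    month = part["date2"].dt.month
    # np.where evaluates the condition element-wise: 1 where month > 10, else 2.
    return part.assign(new_int=np.where(month > 10, 1, 2))


# A plain pandas frame standing in for a single dask partition.
part = pd.DataFrame({
    "id2": [1, 2],
    "date2": [datetime.datetime(2021, 3, 1), datetime.datetime(2021, 12, 1)],
})
result = add_month_flag(part)
```

With the ddf defined above, this should be usable as ddf.map_partitions(add_month_flag).compute(). Because assign returns a new frame, the input partition is left unchanged.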