In the code below, the function passed to dask's map_partitions takes the month of a datetime column and compares it to an integer. The comparison fails with the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What does this error mean, and how can I fix it?
import pandas as pd
import dask
import dask.dataframe as dd
import datetime
pdf = pd.DataFrame({
    'id2': [1, 1, 1, 2, 2],
    'balance': [150, 140, 130, 280, 260],
    'date2': [datetime.datetime(2021, 3, 1), datetime.datetime(2021, 4, 1),
              datetime.datetime(2021, 5, 1), datetime.datetime(2021, 1, 1),
              datetime.datetime(2021, 2, 1)]
})
ddf = dd.from_pandas(pdf, npartitions=1)

def func2(obj):
    m = obj.date2.dt.month
    if m > 10:
        return 1
    else:
        return 2

ddf2 = ddf.map_partitions(func2, meta=int)
ddf2.compute()  # <-- fails here
Answer:
With .map_partitions, each partition of the dask dataframe (which is a pandas dataframe) is passed to the function func2. As a result, obj.date2.dt.month refers to a Series, not a single value, so m > 10 produces a Series of booleans rather than a single True or False. The if statement needs one truth value, and pandas raises the ValueError because it is not clear how a whole Series should be reduced to one.
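The same behaviour can be reproduced with pandas alone. The sketch below is only an illustration: the months Series is made-up data standing in for obj.date2.dt.month, and the variable names are not from the question.

```python
import pandas as pd

# Illustrative stand-in for obj.date2.dt.month inside one partition.
months = pd.Series([3, 4, 5, 1, 2])

# The comparison is element-wise and yields a boolean Series, not a bool.
mask = months > 10

try:
    if mask:  # Python asks the Series for a single truth value here
        pass
except ValueError as exc:
    error_message = str(exc)

# Explicit reductions state how the Series should collapse to one value.
any_above = bool(mask.any())  # True if at least one month is > 10
all_above = bool(mask.all())  # True only if every month is > 10
```

Whether .any(), .all(), or an element-wise result is the right choice depends on what func2 is meant to return for each partition.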
One option is to create a new column whose value depends on the dt.month result, as in the snippet below:
import pandas as pd
import dask
import dask.dataframe as dd
import datetime
pdf = pd.DataFrame({
    'id2': [1, 1, 1, 2, 2],
    'balance': [150, 140, 130, 280, 260],
    'date2': [datetime.datetime(2021, 3, 1), datetime.datetime(2021, 4, 1),
              datetime.datetime(2021, 5, 1), datetime.datetime(2021, 1, 1),
              datetime.datetime(2021, 2, 1)]
})
ddf = dd.from_pandas(pdf, npartitions=1)

def func2(obj):
    # obj is one partition, i.e. a pandas DataFrame; copy it so the
    # input partition is not modified in place
    obj = obj.copy()
    m = obj.date2.dt.month
    # boolean masks select the rows to fill, one condition at a time
    obj.loc[m > 10, 'new_int'] = 1
    obj.loc[m <= 10, 'new_int'] = 2
    return obj

ddf2 = ddf.map_partitions(func2)
ddf2.compute()
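As a possible variation on the snippet above, the two .loc assignments could be collapsed into a single numpy.where call, which also keeps the new column as integers. This is a minimal sketch rather than the only way to do it: the function name add_month_flag is hypothetical, and the small partition frame (including its November date) is made-up data standing in for one dask partition.

```python
import datetime

import numpy as np
import pandas as pd


def add_month_flag(partition: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the partition with an integer new_int column."""
    month = partition["date2"].dt.month
    # np.where evaluates the condition element-wise: 1 where month > 10, else 2.
    return partition.assign(new_int=np.where(month > 10, 1, 2))


# Stand-in for a single dask partition, which is a plain pandas DataFrame.
partition = pd.DataFrame({
    "id2": [1, 2],
    "date2": [datetime.datetime(2021, 3, 1), datetime.datetime(2021, 11, 1)],
})
flagged = add_month_flag(partition)
```

Assuming the ddf defined above, the function would be applied in the same way as func2, with ddf.map_partitions(add_month_flag). Because assign returns a new frame, the input partition is left unchanged.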