I am trying to convert some Pandas code to Dask.
I have a dataframe that looks like the following:
   ListView_Lead_MyUnreadLeads  ListView_Lead_ViewCustom2
0                            1                          1
1                            1                          0
2                            1                          1
3                            1                          1
4                            1                          1
In Pandas, I can create a Lists column that contains the name of every column whose value in that row is 1, like so:
df['Lists'] = df.dot(df.columns + ",").str.rstrip(",").str.split(",")
So the Lists column looks like:
Lists
0 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
1 [ListView_Lead_MyUnreadLeads]
2 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
3 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
4 [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
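For reference, my understanding of why the dot trick works: multiplying a string by 1 keeps it and multiplying by 0 gives an empty string, so each row ends up as the concatenation of the labels whose value is 1. Below is a minimal pure-Python sketch of that per-row logic. The helper name row_to_list and the sample rows are only for illustration and are not part of Pandas or Dask.

```python
columns = ["ListView_Lead_MyUnreadLeads", "ListView_Lead_ViewCustom2"]
rows = [[1, 1], [1, 0]]


def row_to_list(row, columns):
    """Hypothetical helper mimicking one row of df.dot(df.columns + ",")."""
    # value * "label," keeps the label when value is 1 and drops it when 0.
    joined = "".join(value * (name + ",") for value, name in zip(row, columns))
    # Strip the trailing comma, then split back into individual labels.
    # Note: an all-zero row produces [''] rather than an empty list.
    return joined.rstrip(",").split(",")


lists = [row_to_list(row, columns) for row in rows]
```

This only illustrates the arithmetic. It assumes the values are exactly 0 or 1 and that no column name contains a comma.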
In Dask, the dot function doesn't seem to work the same way. How can I get the same behavior / output?
Any help would be appreciated. Thanks!
Related question in Pandas: How to return headers of columns that match a criteria for every row in a pandas dataframe?
CodePudding user response:
Here are some alternative ways to do it in Pandas. You can check whether they work equally well in Dask.
cols = df.columns.to_numpy()
df['Lists'] = [list(cols[x]) for x in df.eq(1).to_numpy()]
or try:
df['Lists'] = df.eq(1).apply(lambda x: list(x.index[x]), axis=1)
The first solution, which uses a list comprehension, should perform better than the row-wise apply if your dataset is large.
Result:
print(df)
   ListView_Lead_MyUnreadLeads  ListView_Lead_ViewCustom2  Lists
0                            1                          1  [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
1                            1                          0  [ListView_Lead_MyUnreadLeads]
2                            1                          1  [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
3                            1                          1  [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
4                            1                          1  [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
CodePudding user response:
Here's a Dask version with map_partitions:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'ListView_Lead_MyUnreadLeads': [1,1,1,1,1], 'ListView_Lead_ViewCustom2': [1,0,1,1,1] })
ddf = dd.from_pandas(df, npartitions=2)
def myfunc(df):
    # Work on a copy so the incoming partition is not modified in place.
    df = df.copy()
    df['Lists'] = df.dot(df.columns + ",").str.rstrip(",").str.split(",")
    return df
ddf.map_partitions(myfunc).compute()