How to generate array column with values from other columns using Dask Dataframe

Time:10-05

I am trying to convert some Pandas code to Dask.

I have a dataframe that looks like the following:

   ListView_Lead_MyUnreadLeads  ListView_Lead_ViewCustom2 
0                            1                          1   
1                            1                          0   
2                            1                          1   
3                            1                          1   
4                            1                          1   

In Pandas, I can create a Lists column that contains the names of the columns whose value in that row is 1, like so:

df['Lists'] = df.dot(df.columns + ",").str.rstrip(",").str.split(",")

So the Lists column looks like:

                                               Lists
0  [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
1                      [ListView_Lead_MyUnreadLeads]
2  [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
3  [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
4  [ListView_Lead_MyUnreadLeads, ListView_Lead_Vi...
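For readers wondering why a dot product yields column names: the trick relies on Python's integer-times-string repetition, so a flag of 1 keeps a name and a flag of 0 turns it into an empty string. Below is a minimal pure-Python sketch of that per-row arithmetic. The helper `row_to_list` is a hypothetical name used only for illustration, and the sketch mimics the idea rather than pandas' actual implementation.

```python
# Column names and two sample rows of 0/1 flags from the question.
columns = ["ListView_Lead_MyUnreadLeads", "ListView_Lead_ViewCustom2"]
rows = [[1, 1], [1, 0]]


def row_to_list(row, columns):
    """Hypothetical helper: mimic one row of the dot-product trick."""
    # flag * "name," repeats the string `flag` times:
    # 1 keeps "name,", 0 produces "".
    joined = "".join(flag * (name + ",") for flag, name in zip(row, columns))
    # Drop the trailing comma, then split back into individual names.
    return joined.rstrip(",").split(",")


lists = [row_to_list(row, columns) for row in rows]
print(lists)
```

One caveat visible in this sketch: a row containing only zeros produces `['']` rather than an empty list, because splitting an empty string still returns one empty element.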

In Dask, the dot function doesn't seem to work the same way. How can I get the same behavior / output?

Any help would be appreciated. Thanks!

Related question in Pandas: How to return headers of columns that match a criteria for every row in a pandas dataframe?

CodePudding user response:

Here are some alternative ways to do it in Pandas. You can check whether they work equally well in Dask.

cols = df.columns.to_numpy()
df['Lists'] = [list(cols[x]) for x in df.eq(1).to_numpy()]

or try:

df['Lists'] = df.eq(1).apply(lambda x: list(x.index[x]), axis=1)

The first solution, which uses a list comprehension, should perform better on a large dataset, since apply with axis=1 builds a Series object for every row.

Result:

print(df)

   ListView_Lead_MyUnreadLeads  ListView_Lead_ViewCustom2                                                     Lists
0                            1                          1  [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
1                            1                          0                             [ListView_Lead_MyUnreadLeads]
2                            1                          1  [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
3                            1                          1  [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
4                            1                          1  [ListView_Lead_MyUnreadLeads, ListView_Lead_ViewCustom2]
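To check the performance claim on your own data, something like the following sketch could be used. It builds a synthetic frame (the 2,000-row size, the random seed, and the function names `with_comprehension` and `with_apply` are all assumptions made for illustration), confirms that both variants return the same lists, and times each with `timeit`. Timings depend on the machine and the pandas version, so no numbers are given here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic 0/1 data; the size and seed are arbitrary choices for this sketch.
rng = np.random.default_rng(0)
big = pd.DataFrame(
    rng.integers(0, 2, size=(2000, 2)),
    columns=["ListView_Lead_MyUnreadLeads", "ListView_Lead_ViewCustom2"],
)


def with_comprehension(frame):
    # Boolean-mask the array of column names once per row.
    cols = frame.columns.to_numpy()
    return [list(cols[row]) for row in frame.eq(1).to_numpy()]


def with_apply(frame):
    # Row-wise apply: each row arrives as a boolean Series.
    return frame.eq(1).apply(lambda row: list(row.index[row]), axis=1).tolist()


# Both variants should agree before their speed is compared.
same = with_comprehension(big) == with_apply(big)

t_comp = timeit.timeit(lambda: with_comprehension(big), number=3)
t_apply = timeit.timeit(lambda: with_apply(big), number=3)
print(f"same result: {same}")
print(f"list comprehension: {t_comp:.4f}s, apply: {t_apply:.4f}s")
```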

CodePudding user response:

Here's a Dask version with map_partitions:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'ListView_Lead_MyUnreadLeads': [1,1,1,1,1], 'ListView_Lead_ViewCustom2': [1,0,1,1,1] })

ddf = dd.from_pandas(df, npartitions=2)

def myfunc(df):
    df = df.copy()
    df['Lists'] = df.dot(df.columns + ",").str.rstrip(",").str.split(",")
    return df

ddf.map_partitions(myfunc).compute()