Dask: How to create 10000 columns in a dask dataframe with improved performance?


I have a dask dataframe and I would like to add 10000 columns to it. Below is what I tried:

series_dict = {}
for i in range(10000):
    # bind i as a default argument; otherwise every lambda captures the final value of i
    series_dict[f'ab_{i}'] = lambda x, i=i: i * x['a'] * x['b']
df = df.assign(**series_dict)

However, it just hangs on the assign call itself. How can I improve it?

Note: this is a simplified lambda function, but in the real case I will have more complicated functions.

CodePudding user response:

In your example, each of your series is made up of operations on existing series, requiring fragments of the task graph to be generated for every column. You are far better off generating all of the columns in a single task per partition, operating on the contained pandas dataframes:

def create_columns(df):
    # here df is a pandas dataframe (one partition), so all columns are built in a single task
    series_dict = {}
    for i in range(10000):
        # bind i as a default argument; otherwise every lambda captures the final value of i
        series_dict[f'ab_{i}'] = lambda x, i=i: i * x['a'] * x['b']
    return df.assign(**series_dict)

new_df = df.map_partitions(create_columns)
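As a rough end-to-end sketch (the toy dataframe pdf, the partition count, and the reduced column count of 100 are all made up for illustration), the pattern above can be exercised like this:

import pandas as pd
import dask.dataframe as dd

# toy data standing in for the real dataframe, which must have columns 'a' and 'b'
pdf = pd.DataFrame({'a': range(8), 'b': range(8)})
df = dd.from_pandas(pdf, npartitions=2)

def create_columns(df):
    series_dict = {}
    for i in range(100):  # reduced from 10000 to keep the example quick
        series_dict[f'ab_{i}'] = lambda x, i=i: i * x['a'] * x['b']
    return df.assign(**series_dict)

new_df = df.map_partitions(create_columns)
print(new_df.compute().shape)  # (8, 102): the original two columns plus 100 new ones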