Dask: How to create 10000 columns in a dask dataframe with improved performance?


I have a dask dataframe and I would like to add 10000 columns to it. Below is what I tried:

series_dict = {}
for i in range(10000):
    # bind i as a default argument; otherwise every lambda captures the final value of i
    series_dict[f'ab_{i}'] = lambda x, i=i: i * x['a'] * x['b']
df = df.assign(**series_dict)

However, it just hangs on the assign call itself. How can I improve it?

Note: this is a simplified lambda function, but in the real case I will have more complicated functions.

CodePudding user response:

In your example, each of your series is made up of operations on existing series, requiring fragments of the task graph to be generated for every column. You are far better off generating all of the columns in a single task per partition, operating on the contained pandas dataframes:

def create_columns(df):
    # here df is a pandas dataframe (one partition), so all columns are built in a single task
    series_dict = {}
    for i in range(10000):
        # bind i as a default argument; otherwise every lambda captures the final value of i
        series_dict[f'ab_{i}'] = lambda x, i=i: i * x['a'] * x['b']
    return df.assign(**series_dict)

new_df = df.map_partitions(create_columns)
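As a rough end-to-end sketch (the toy dataframe pdf, the partition count, and the reduced column count of 100 are all made up for illustration), the pattern above can be exercised like this:

import pandas as pd
import dask.dataframe as dd

# toy data standing in for the real dataframe, which must have columns 'a' and 'b'
pdf = pd.DataFrame({'a': range(8), 'b': range(8)})
df = dd.from_pandas(pdf, npartitions=2)

def create_columns(df):
    series_dict = {}
    for i in range(100):  # reduced from 10000 to keep the example quick
        series_dict[f'ab_{i}'] = lambda x, i=i: i * x['a'] * x['b']
    return df.assign(**series_dict)

new_df = df.map_partitions(create_columns)
print(new_df.compute().shape)  # (8, 102): the original two columns plus 100 new ones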