I have a dask dataframe and I would like to add 10000 columns to it. Below is what I tried,
series_dict = {}
for i in range(0,10000):
series_dict[f'ab_{i}'] = lambda x: i * x[f'a'] * x[f'b']
df.assign(**series_dict)
However, it just hangs before on assign itself. How to improve it?
Note: This is simplified lambda function but in real case, I will have complicated functions
CodePudding user response:
In your example, each of your series is made up or operations on existing series, requiring fragments of task graph to be generated. You are far better using a single task generator and operating no the contained pandas dataframes:
def create_columns(df):
series_dict = {}
for i in range(0,10000):
series_dict[f'ab_{i}'] = lambda x: i * x[f'a'] * x[f'b']
return df.assign(**series_dict)
new_df = df.map_partitions(create_columns)