parallelize for loop and merge pandas dataframes

Time:11-23

My script is as follows:

import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3']})

def make_df(year):
    df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})
    return df

for year in range(2020, 2015, -1):
    df = pd.merge(df, make_df(year), on=['key'], how='left')

The final df will be:

  key   A  2020  2019  2018  2017  2016
0  K0  A0  2020  2019  2018  2017  2016
1  K1  A1  2021  2020  2019  2018  2017
2  K2  A2  2022  2021  2020  2019  2018
3  K3  A3  2023  2022  2021  2020  2019

My actual version of make_df(year) is much more complex and takes too long to run.

How can I parallelize the for-loop (for year in range(2020, 2015, -1):) and shorten the processing time?

CodePudding user response:

I'm not sure you need threading in this case. You can create a list of DataFrames and join them all at once, getting rid of your function:

import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3']})

nb_row = len(df.index)
df_list = [pd.DataFrame(range(year, year + nb_row), columns=[str(year)]) for year in range(2020, 2015, -1)]
df = df.join(df_list)

Edit: keeping make_df:

def make_df(year):
    df = pd.DataFrame({str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})
    return df

df_list = [make_df(year) for year in range(2020, 2015, -1)]
df = df.join(df_list)
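
Note that the join above aligns rows by position, because make_df no longer returns a 'key' column. If your real make_df does return 'key', as in the question, a minimal sketch of a key-based version could look like the following. It assumes the keys are unique within each frame, and the make_df body is only the toy one from the question:

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                   'A': ['A0', 'A1', 'A2', 'A3']})

def make_df(year):
    # Toy stand-in for the real function: a 'key' column plus one year column.
    return pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                         str(year): [str(year + i) for i in range(4)]})

# Index every frame by 'key' so rows are matched by key, not by position.
frames = [make_df(year).set_index('key') for year in range(2020, 2015, -1)]

# A single left join on the index replaces the five successive merges.
merged = df.set_index('key').join(frames).reset_index()
print(merged)
```

This keeps the how='left' behaviour of the original loop: keys that only appear in a year frame are dropped, and keys missing from a year frame get NaN.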

CodePudding user response:

After reading your comments it seems that you want a thread for each year:

import threading
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3']})

year_range = range(2020, 2015, -1)
df_threads = []
results = {}  # year -> DataFrame, filled in by the worker threads

def make_df(year):
    df = pd.DataFrame({str(year): [str(year), str(year + 1), str(year + 2), str(year + 3)]})
    # Store by year so the column order does not depend on which thread finishes first
    results[year] = df

for year in year_range:
    t = threading.Thread(target=make_df, args=[year])
    t.start()
    df_threads.append(t)

# Wait for all threads to finish:
for t in df_threads:
    t.join()

df = df.join([results[year] for year in year_range])
print(df)
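
The same idea can probably be written more compactly with concurrent.futures from the standard library. This is only a sketch: make_df is the toy function from above standing in for your real one, and max_workers=5 is an arbitrary choice. executor.map returns results in the order of the input years, so no bookkeeping is needed to keep the columns in order:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                   'A': ['A0', 'A1', 'A2', 'A3']})

def make_df(year):
    # Toy stand-in for the real, expensive function.
    return pd.DataFrame({str(year): [str(year + i) for i in range(4)]})

years = range(2020, 2015, -1)

# The pool runs make_df for several years concurrently; map() yields the
# results in the same order as `years`, whichever call finishes first.
with ThreadPoolExecutor(max_workers=5) as executor:
    df_list = list(executor.map(make_df, years))

result = df.join(df_list)
print(result)
```

Threads mainly help when make_df spends its time waiting (files, databases, network). If it is CPU-bound Python code, the GIL will limit the gain; in that case ProcessPoolExecutor offers the same map interface, but make_df must then be defined at module level and the script needs an if __name__ == '__main__': guard on platforms that spawn processes.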