Combine Pandas DataFrames when using multiprocessing

I am using multiprocessing and generating a pandas DataFrame in each process. I would like to merge them together and output the combined data. The following strategy almost seems to work, but when reading the data back with pd.read_csv() only the first name is used as a column header.

from multiprocessing import Process, Lock

import pandas as pd

def foo(name, lock):
    d = {f'{name}': [1, 2]}
    df = pd.DataFrame(data=d)

    lock.acquire()
    try:
        df.to_csv('output.txt', mode='a')
    finally:
        lock.release()

if __name__ == '__main__':
    lock = Lock()

    processes = []
    for name in ['bob', 'steve']:
        p = Process(target=foo, args=(name, lock))
        p.start()
        processes.append(p)
    for p in processes:  # join every process, not just the last one
        p.join()
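A minimal sequential sketch of why the appended file reads back badly (no multiprocessing, same to_csv calls): each to_csv(..., mode='a') call emits its own header row, so the file contains one header line per process, and pd.read_csv() treats only the very first line as column names.

```python
import pandas as pd

# What the two processes effectively write: each call appends its own
# header row. The first call uses mode='w' here only to start clean.
pd.DataFrame({'bob': [1, 2]}).to_csv('output.txt', mode='w')
pd.DataFrame({'steve': [1, 2]}).to_csv('output.txt', mode='a')

df = pd.read_csv('output.txt')
print(df.columns.tolist())  # 'steve' never appears as a column header
```

The second header line (",steve") is parsed as an ordinary data row, which is why only the first name shows up as a column.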

CodePudding user response:

You can use multiprocessing.Pool:

import multiprocessing
import pandas as pd

def foo(name):
    d = {f'{name}': [1, 2]}
    df = pd.DataFrame(data=d)
    return df

if __name__ == '__main__':
    data = ['bob', 'steve']
    with multiprocessing.Pool(2) as pool:
        data = pool.map(foo, data)
    pd.concat(data, axis=1).to_csv('output.csv')

Output:

>>> pd.concat(data, axis=1)
   bob  steve
0    1      1
1    2      2
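Note that to_csv() also writes the DataFrame index as the first (unnamed) column of output.csv. A minimal sketch of the round trip, rebuilding the same combined frame without multiprocessing and restoring the index with index_col=0:

```python
import pandas as pd

# Same combined frame the Pool example produces
df = pd.concat(
    [pd.DataFrame({'bob': [1, 2]}), pd.DataFrame({'steve': [1, 2]})],
    axis=1,
)
df.to_csv('output.csv')

# index_col=0 restores the saved index instead of creating
# a spurious 'Unnamed: 0' column
df2 = pd.read_csv('output.csv', index_col=0)
print(df2.columns.tolist())  # ['bob', 'steve']
```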