Why doesn't a pandas DataFrame change when I use it as the input of a function with multiprocessing?

Time:11-21

I have code like this:

import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    }
)

def changeDF(df):
    df['Signal'] = 0

changeDF(df1)
changeDF(df2)

When I run the above, changeDF adds a column named 'Signal' with value 0 to df1 and df2. But when I run changeDF through multiprocessing like below instead of calling it directly, neither DataFrame changes.

import multiprocessing

s = [df1, df2]
with multiprocessing.Pool(processes=2) as pool:
    res = pool.map(changeDF, s)

What's wrong with my code?

CodePudding user response:

Serializing df1 and df2 to send them to the worker processes means you're making copies. Each child process mutates its own copy, so the parent's DataFrames never see the change.

Return your dataframe from the function and it'll work fine.

def changeDF(df):
    df['Signal'] = 0
    return df

with multiprocessing.Pool(processes=2) as pool:
    df1, df2 = pool.map(changeDF, [df1, df2])

I would warn you that for an operation this cheap, the serialization costs will almost certainly be higher than any benefit you get from multiprocessing.
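The copy-on-pickle behavior is easy to observe directly. A minimal sketch (the function and variable names here are illustrative, not from the question): the parent's DataFrame stays unchanged, while only the returned copy carries the new column.

```python
import multiprocessing

import pandas as pd


def change_df(df):
    # Runs in a worker process on a pickled copy of the caller's frame.
    df['Signal'] = 0
    return df


def demo():
    df = pd.DataFrame({"A": ["A0", "A1"]})
    with multiprocessing.Pool(processes=1) as pool:
        (result,) = pool.map(change_df, [df])
    # The parent's frame is untouched; only the returned copy has the column.
    return 'Signal' in df.columns, 'Signal' in result.columns


if __name__ == '__main__':
    print(demo())  # (False, True)
```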

CodePudding user response:

Change your function changeDF to be like this:

def changeDF(df):
    df['Signal'] = 0
    return df
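Returning the DataFrame is only half the fix: the caller still has to reassign the results that pool.map collects, since the originals in the parent process are never touched. A minimal sketch, assuming smaller stand-in frames than the question's:

```python
import multiprocessing

import pandas as pd


def changeDF(df):
    df['Signal'] = 0
    return df


def run(frames):
    # Reassign from pool.map's result list; the inputs themselves stay unchanged.
    with multiprocessing.Pool(processes=2) as pool:
        return pool.map(changeDF, frames)


if __name__ == '__main__':
    df1 = pd.DataFrame({"A": ["A0", "A1"]})
    df2 = pd.DataFrame({"A": ["A4", "A5"]})
    df1, df2 = run([df1, df2])
    print(list(df1.columns))  # ['A', 'Signal']
```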