I have code like this:
import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    }
)
def changeDF(df):
    df['Signal'] = 0

changeDF(df1)
changeDF(df2)
When I run the above, the changeDF function adds a column named 'Signal' with 0 values to both df1 and df2. But when I run changeDF through multiprocessing instead, like below, it doesn't change either DataFrame.
import multiprocessing

s = [df1, df2]
with multiprocessing.Pool(processes=2) as pool:
    res = pool.map(changeDF, s)
What's wrong with my code?
CodePudding user response:
Serializing df1 and df2 for multiprocessing means that you're making a copy.
Return your dataframe from the function and it'll work fine.
def changeDF(df):
    df['Signal'] = 0
    return df

with multiprocessing.Pool(processes=2) as pool:
    df1, df2 = pool.map(changeDF, [df1, df2])
I would warn you that the serialization costs of this will certainly be higher than the benefit you get from multiprocessing.
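For what it's worth, here is a minimal, self-contained sketch of the full fix (columns trimmed for brevity). The if __name__ == "__main__": guard is my addition; it is needed when the pool uses the spawn start method (the default on Windows and recent macOS) so the worker processes can re-import the module safely:

import multiprocessing

import pandas as pd

def changeDF(df):
    # Runs in a worker process on a pickled copy of the DataFrame,
    # so the modified copy has to be returned to the parent.
    df['Signal'] = 0
    return df

if __name__ == "__main__":
    df1 = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"]})
    df2 = pd.DataFrame({"A": ["A4", "A5", "A6", "A7"]})

    with multiprocessing.Pool(processes=2) as pool:
        # pool.map returns the modified copies; rebind the names to them
        df1, df2 = pool.map(changeDF, [df1, df2])

    print(df1)  # now has the 'Signal' column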
CodePudding user response:
Change your function changeDF to be like this:
def changeDF(df):
    df['Signal'] = 0
    return df
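As in the first answer, you then need to capture the copies that pool.map returns and rebind your names to them, e.g.:

with multiprocessing.Pool(processes=2) as pool:
    # each worker returns its modified copy of the DataFrame
    df1, df2 = pool.map(changeDF, [df1, df2])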