Fastest way to multiply multiple columns in a DataFrame based on conditions

import pandas as pd

data = [{'a': 12, 'b': 23, 'c': 34, 'd': 0.1, 'e': 25},
        {'a': 13, 'b': 26, 'c': 38, 'd': 0.02, 'e': 26},
        {'a': 19, 'b': 28, 'c': 31, 'd': 0.04, 'e': 22}]

# Create the DataFrame.
df = pd.DataFrame(data)

    a   b   c     d   e
0  12  23  34  0.10  25
1  13  26  38  0.02  26
2  19  28  31  0.04  22

I have a very large DataFrame with 20 columns and 20 million rows, and I would like to multiply certain columns by column d.

For example, in this case I want to multiply columns a, c, and e by the percentage in column d. What is the quickest way to do this?

CodePudding user response:

Select the columns with a list of column names and multiply them with DataFrame.mul along axis=0, which is fast:

cols = ['a','c','e']
df[cols] = df[cols].mul(df['d'], axis=0)
print (df)
      a   b     c     d     e
0  1.20  23  3.40  0.10  2.50
1  0.26  26  0.76  0.02  0.52
2  0.76  28  1.24  0.04  0.88
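
The axis=0 argument matters here. The following is a minimal sketch, using the question's sample data, of what happens with and without it. By default pandas matches the Series index against the DataFrame's column labels, so a plain `*` finds no matching labels and yields only NaN, while `mul(..., axis=0)` matches on the row index. The names `by_rows` and `by_columns` are only illustrative.

```python
import pandas as pd

data = [{'a': 12, 'b': 23, 'c': 34, 'd': 0.1, 'e': 25},
        {'a': 13, 'b': 26, 'c': 38, 'd': 0.02, 'e': 26},
        {'a': 19, 'b': 28, 'c': 31, 'd': 0.04, 'e': 22}]
df = pd.DataFrame(data)
cols = ['a', 'c', 'e']

# Row-wise alignment: each row of df[cols] is scaled by that row's d.
by_rows = df[cols].mul(df['d'], axis=0)

# Default alignment: the index of df['d'] (0, 1, 2) is matched against the
# column labels (a, c, e); nothing matches, so every value is NaN.
by_columns = df[cols] * df['d']

print(by_rows)
print(by_columns.isna().all().all())
```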

A NumPy alternative, which is not faster (see the timings below):

cols = ['a','c','e']
df[cols] = df[cols].to_numpy() * df['d'].to_numpy()[:, None]
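
As a quick sanity check, the sketch below (again on the question's sample data) suggests that both variants compute the same values. The `[:, None]` index turns the 1-D array of d into a column of shape (n, 1), so NumPy broadcasts it across the selected columns. The names `via_mul` and `via_numpy` are only illustrative.

```python
import numpy as np
import pandas as pd

data = [{'a': 12, 'b': 23, 'c': 34, 'd': 0.1, 'e': 25},
        {'a': 13, 'b': 26, 'c': 38, 'd': 0.02, 'e': 26},
        {'a': 19, 'b': 28, 'c': 31, 'd': 0.04, 'e': 22}]
df = pd.DataFrame(data)
cols = ['a', 'c', 'e']

# pandas variant: align on the row index.
via_mul = df[cols].mul(df['d'], axis=0)

# NumPy variant: d has shape (3,); [:, None] reshapes it to (3, 1) so it
# broadcasts across the three selected columns.
via_numpy = df[cols].to_numpy() * df['d'].to_numpy()[:, None]

print(via_numpy.shape)
print(np.allclose(via_mul.to_numpy(), via_numpy))
```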

df = pd.DataFrame(data)
# 300k rows (the 3 sample rows repeated 100,000 times)
df = pd.concat([df] * 100000, ignore_index=True)
print (df)


In [113]: %%timeit
     ...: cols = ['a','c','e']
     ...: df[cols] = df[cols].mul(df['d'], axis=0)
     ...: 
     ...: 
14.5 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [114]: %%timeit
     ...: cols = ['a','c','e']
     ...: df[cols] = df[cols].to_numpy() * df['d'].to_numpy()[:, None]
     ...: 
138 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
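
To reproduce the comparison outside IPython, here is a rough sketch using the standard-library timeit module. The helper names `with_mul` and `with_numpy` are hypothetical, and each call works on a copy so that repeated runs do not keep multiplying the same columns (the cost of the copy is included in both measurements). Absolute numbers will vary with the machine and the pandas version, so treat the output as indicative only.

```python
import timeit

import pandas as pd

data = [{'a': 12, 'b': 23, 'c': 34, 'd': 0.1, 'e': 25},
        {'a': 13, 'b': 26, 'c': 38, 'd': 0.02, 'e': 26},
        {'a': 19, 'b': 28, 'c': 31, 'd': 0.04, 'e': 22}]

# 300k rows, as in the benchmark above.
big = pd.concat([pd.DataFrame(data)] * 100000, ignore_index=True)
cols = ['a', 'c', 'e']


def with_mul(frame):
    # Hypothetical helper: the DataFrame.mul variant on a fresh copy.
    out = frame.copy()
    out[cols] = out[cols].mul(out['d'], axis=0)
    return out


def with_numpy(frame):
    # Hypothetical helper: the NumPy broadcasting variant on a fresh copy.
    out = frame.copy()
    out[cols] = out[cols].to_numpy() * out['d'].to_numpy()[:, None]
    return out


for func in (with_mul, with_numpy):
    # Average over 10 calls; the lambda binds the current func immediately.
    seconds = timeit.timeit(lambda: func(big), number=10) / 10
    print(f"{func.__name__}: {seconds * 1000:.1f} ms per call")
```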