How to add two derived fields in a single statement?-CodePudding

I have a dataframe, and based on certain condition, I need to add two calculated fields to the dataframe. I can do this in two statements, each one at a time. Is there a way to add more than one fields at the same time? Is there any performance difference in these two approaches?

import pandas as pd

df=pd.DataFrame([{"Name":"Axe","x":10,"y":20},{"Name":"Tree","x":50,"y":15},{"Name":"Sand","x":-10,"y":-15}])
df.loc[df["x"] > 0, "SUM"] = df["x"]   df["y"]
df.loc[df["x"] > 0, "DIFF"] = df["x"] - df["y"]
df.head()

    Name    x   y   SUM DIFF
0   Axe 10  20  30.0    -10.0
1   Tree    50  15  65.0    35.0
2   Sand    -10 -15 NaN NaN

CodePudding user response：

You could use numpy.c_ to concatenate two series along the second axis while using a list of column names in loc method:

import pandas as pd
import numpy as np

df.loc[df['x'] > 0, ['sum', 'diff']] = pd.DataFrame(np.c_[df['x']   df['y'], df['x'] - df['y']], columns = ['sum', 'diff'])

   Name   x   y   sum  diff
0   Axe  10  20  30.0 -10.0
1  Tree  50  15  65.0  35.0
2  Sand -10 -15   NaN   NaN

CodePudding user response：

You can do this.

df=pd.DataFrame([{"Name":"Axe","x":10,"y":20, },{"Name":"Tree","x":50,"y":15},{"Name":"Sand","x":-10,"y":-15}])
start_time = time.time()
df.loc[df["x"] > 0, ["SUM", "DIFF"]] = [(df.loc[df["x"] > 0]['x']   df.loc[df["x"] > 0]['y']),(df.loc[df["x"] > 0]['x'] - df.loc[df["x"] > 0]['y'])]
print("--- %s seconds ---" % (time.time() - start_time))
df.head()

Your dataset doesn't allow to determine which approach is best so I created a dataset for testing. This is the result:

sampl_x = np.random.uniform(low=-10, high=30, size=1000)
sampl_y = np.random.uniform(low=-10, high=30, size=1000)
df=pd.DataFrame(data={"x":sampl_x, "y":sampl_y })

Single line:

start_time = time.time()
df.loc[df["x"] > 0, ["SUM", "DIFF"]] = np.array([df.loc[df["x"] > 0]['x']   df.loc[df["x"] > 0]['y'] , df.loc[df["x"] > 0]['x'] - df.loc[df["x"] > 0]['y']]).T
print("--- %s seconds ---" % (time.time() - start_time))
df.head()

--- 0.006986141204833984 seconds ---

Two line:

start_time = time.time()
df.loc[df["x"] > 0, "SUM"] = df["x"]   df["y"]
df.loc[df["x"] > 0, "DIFF"] = df["x"] - df["y"]
print("--- %s seconds ---" % (time.time() - start_time))

--- 0.003948688507080078 seconds ---

I tried to run the test several times but the second is always more efficient.