I have a dataframe, and based on certain condition, I need to add two calculated fields to the dataframe. I can do this in two statements, each one at a time. Is there a way to add more than one fields at the same time? Is there any performance difference in these two approaches?
import pandas as pd
df=pd.DataFrame([{"Name":"Axe","x":10,"y":20},{"Name":"Tree","x":50,"y":15},{"Name":"Sand","x":-10,"y":-15}])
df.loc[df["x"] > 0, "SUM"] = df["x"] df["y"]
df.loc[df["x"] > 0, "DIFF"] = df["x"] - df["y"]
df.head()
Name x y SUM DIFF
0 Axe 10 20 30.0 -10.0
1 Tree 50 15 65.0 35.0
2 Sand -10 -15 NaN NaN
CodePudding user response:
You could use numpy.c_ to concatenate two series along the second axis while using a list of column names in loc
method:
import pandas as pd
import numpy as np
df.loc[df['x'] > 0, ['sum', 'diff']] = pd.DataFrame(np.c_[df['x'] df['y'], df['x'] - df['y']], columns = ['sum', 'diff'])
Name x y sum diff
0 Axe 10 20 30.0 -10.0
1 Tree 50 15 65.0 35.0
2 Sand -10 -15 NaN NaN
CodePudding user response:
You can do this.
df=pd.DataFrame([{"Name":"Axe","x":10,"y":20, },{"Name":"Tree","x":50,"y":15},{"Name":"Sand","x":-10,"y":-15}])
start_time = time.time()
df.loc[df["x"] > 0, ["SUM", "DIFF"]] = [(df.loc[df["x"] > 0]['x'] df.loc[df["x"] > 0]['y']),(df.loc[df["x"] > 0]['x'] - df.loc[df["x"] > 0]['y'])]
print("--- %s seconds ---" % (time.time() - start_time))
df.head()
Your dataset doesn't allow to determine which approach is best so I created a dataset for testing. This is the result:
sampl_x = np.random.uniform(low=-10, high=30, size=1000)
sampl_y = np.random.uniform(low=-10, high=30, size=1000)
df=pd.DataFrame(data={"x":sampl_x, "y":sampl_y })
Single line:
start_time = time.time()
df.loc[df["x"] > 0, ["SUM", "DIFF"]] = np.array([df.loc[df["x"] > 0]['x'] df.loc[df["x"] > 0]['y'] , df.loc[df["x"] > 0]['x'] - df.loc[df["x"] > 0]['y']]).T
print("--- %s seconds ---" % (time.time() - start_time))
df.head()
--- 0.006986141204833984 seconds ---
Two line:
start_time = time.time()
df.loc[df["x"] > 0, "SUM"] = df["x"] df["y"]
df.loc[df["x"] > 0, "DIFF"] = df["x"] - df["y"]
print("--- %s seconds ---" % (time.time() - start_time))
--- 0.003948688507080078 seconds ---
I tried to run the test several times but the second is always more efficient.