Most efficient for a big dataset in pandas:
I would like to add a new column Z taking the value from X if there is a value, if not, I want to take the value from Y.
Another thing, it there a possibility to use ternary operations to add a new column Z based on, if column Y exist then column Y - column X, if not, then only X.
I'm looking for the most efficient way in both cases.
Thank you
CodePudding user response:
Use numpy.where
:
np.random.seed(123)
N = 10000
df = pd.DataFrame({'X':np.random.choice([np.nan, 1], size=N),
'Y':np.random.choice([3,4,6], size=N)})
df['Z1'] = np.where(df['X'].isna(), df['Y'],df['X'])
if 'Y' in df.columns:
df['Z2'] = np.where(df['X'] - df['Y'], df['Y'],df['X'])
else:
df['Z2'] = df['X']
print (df)
X Y Z1 Z2
0 NaN 6 6.0 6.0
1 1.0 4 1.0 4.0
2 NaN 6 6.0 6.0
3 NaN 3 3.0 3.0
4 NaN 3 3.0 3.0
... .. ... ...
9995 1.0 6 1.0 6.0
9996 1.0 6 1.0 6.0
9997 NaN 6 6.0 6.0
9998 1.0 4 1.0 4.0
9999 1.0 6 1.0 6.0
[10000 rows x 4 columns]