Trying to get rid of for loop to speedup the execution in filling values in Column 'C' based on if, elif conditions involving multiple columns and rows. Not able to find a proper solution.
tried applying np.where with conditions, choices and default values. But failed to get expected results as i was unable to extract individual values from pandas series object.
df = pd.DataFrame()
df['A']=['Yes','Yes','No','No','Yes','No','Yes','Yes','Yes','Yes']
df['B']=[1,1,0,1,1,0,1,0,0,1]
df['C']=None
df['D']=['xyz','Yes','No','xyz','Yes','No','xyz','Yes','Yes','Yes']
df['C'][0]='xyz'
for i in range(0,len(df)-1):
if (df.iloc[1 i, 1]==1) & (df.iloc[i, 2]=="xyz") & (df.iloc[1 i, 0]=="No"):
df.iloc[1 i, 2] = "Minus"
elif (df.iloc[1 i, 1]==1) & (df.iloc[i, 2]=="xyz") & (df.iloc[1 i, 0]=="Yes"):
df.iloc[1 i, 2] = "Plus"
elif (df.iloc[i, 3]!="xyz") or ((df.iloc[1 i, 1]==0) & (df.iloc[i, 2]=="xyz")):
df.iloc[1 i, 2] = "xyz"
elif (df.iloc[1 i, 0]=="Yes") & (df.iloc[i, 2]=="xyz"):
df.iloc[1 i, 2] = "Plus"
elif (df.iloc[1 i, 0]=="No") & (df.iloc[i, 2]=="xyz"):
df.iloc[1 i, 2] = "Minus"
else:
df.iloc[1 i, 2] = df.iloc[i, 2]
df
Expecting help from community in modifying the above code in to a better one with less execution time. Preferably with numpy Vectorization.
CodePudding user response:
The loop can certainly not be efficiently vectorized using Numpy or Pandas because there is a loop carried data dependency on df['C']
. The loop is very slow because of Pandas direct indexing and string comparisons. Hopefully, you can use Numba to solve this problem efficiently. You first need to convert the columns into strongly-typed Numpy arrays so Numba can be useful. Note that Numba is pretty slow to deal with strings so it is better to perform vectorized check directly with Numpy.
Here is the resulting code:
import numpy as np
import numba as nb
@nb.njit('UnicodeCharSeq(8)[:](bool_[:], int64[:], bool_[:])')
def compute(a, b, d):
n = a.size
c = np.empty(n, dtype='U8')
c[0] = 'xyz'
for i in range(0, n-1):
prev_is_xyz = c[i] == 'xyz'
if b[i 1]==1 and prev_is_xyz and not a[i 1]:
c[i 1] = 'Minus'
elif b[i 1]==1 and prev_is_xyz and a[i 1]:
c[i 1] = 'Plus'
elif d[i] or (b[i 1]==0 and prev_is_xyz):
c[i 1] = 'xyz'
elif a[i 1] and prev_is_xyz:
c[i 1] = 'Plus'
elif not a[i 1] and prev_is_xyz:
c[i 1] = 'Minus'
else:
c[i 1] = c[i]
return c
# Convert the dataframe columns to fast Numpy arrays and precompute some check
a = df['A'].values.astype('U8') == 'Yes'
b = df['B'].values.astype(np.int64)
d = df['D'].values.astype('U8') != 'xyz'
# Compute the result very quickly with Numba
c = compute(a, b, d)
# Store the result back
df['C'].values[:] = c.astype(object)
Here is the resulting performance on my machine:
Basic Pandas loops: 2510 us
This Numba code: 20 us
Thus, the Numba implementation is 125 times faster. In fact, most of the time is spent in the Numpy conversion code and not even in compute
. The gap should be even bigger on large dataframes.
Note that the line df['C'].values[:] = c.astype(object)
is much faster than the equivalent expression df['C'] = c
(about 16 times).