I have a Pandas dataframe with 3 columns. There is a series of booleans, a series of values, and a column that I want to fill, C. I also have an initial value for C.
A B C
----------------------
True 10 100
False 20 NaN
True 25 NaN
True 28 NaN
...
I want the values of column C (for C[1:]) to follow this rule:
if A[i - 1]:
    C[i] = C[i - 1] * B[i] / B[i - 1]
else:
    C[i] = C[i - 1]
Of course this formula cannot determine C[0], but C[0] is provided.
Is there a way to do this efficiently using vectorized operations?
What I've tried:
The following command doesn't account for the sequential nature of the operation.
df.loc[df.A , 'C'] = df.C.shift(1) * df.B / df.B.shift(1)
df.loc[df.A == 0, 'C'] = df.C.shift(1)
If I were to use an apply function to compute this, I would probably have to make new shifted columns like the following, and then only run the apply for rows [1:]. But how do I get the updated previous value of C?
df["s_A"] = df.A.shift(1)
df["s_B"] = df.B.shift(1)
df["s_C"] = df.C.shift(1)
df.loc[0, "s_A"] = False  # this assumption is okay for my purposes
Would this work, and is there a faster way? There may be up to 400,000 rows in total across multiple dataframes, but it is not very time-sensitive.
For clarity's sake I will mention that there are around 12 columns total, but only these three are relevant to this operation.
Is it possible to vectorize this operation? Is there another way it can be solved?
Thanks.
CodePudding user response:
IIUC, you can try something like this:
df['C'] = np.where(df['A'].shift().bfill(),
                   df['C'].shift().bfill()*df['B']/df['B'].shift().bfill(),
                   df['C'].shift().bfill())
CodePudding user response:
I think it is difficult to vectorize a recurrence like this directly, because each C[i] depends on the already-updated C[i - 1].
The general way is to compute it iteratively with a loop:
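import numpy as np

A = df['A'].to_numpy()
B = df['B'].to_numpy()
C = df['C'].to_numpy(dtype=float, copy=True)  # work on a copy, then write back
for i in range(1, len(A)):
    # carry C forward; scale by B[i] / B[i-1] only when the previous A is True
    C[i] = C[i-1] * B[i] / B[i-1] if A[i-1] else C[i-1]
df['C'] = C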
Or, after analyzing your case, it can be worked out as a cumulative product problem: each row multiplies the previous C by a factor that is B[i] / B[i-1] when A[i-1] is True and 1 otherwise. That can be solved by:
df['C'] = df['C'].fillna(
    df['A'].shift(1) * df['B'] / df['B'].shift(1)
    + (1 - df['A'].shift(1))
).cumprod()
Both ways will yield the same result.
A B C
0 True 10 100.0
1 False 20 200.0
2 True 25 200.0
3 True 28 224.0
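To make the cumulative-product idea easier to check, here is a minimal sketch that builds the per-row factor explicitly with np.where and compares the result with the loop version. The helper names (fill_c, fill_c_loop, a_prev, ratio, factor) are made up for illustration, and it assumes B never contains zeros and that C[0] holds the initial value.

```python
import numpy as np
import pandas as pd

# Sample data from the question; only C[0] is known up front.
df = pd.DataFrame({
    "A": [True, False, True, True],
    "B": [10, 20, 25, 28],
    "C": [100.0, np.nan, np.nan, np.nan],
})


def fill_c(frame):
    """Hypothetical helper: fill C as C[0] times a running product of factors."""
    # A[i-1] for each row; row 0 has no predecessor, so treat it as False.
    a_prev = frame["A"].shift(1, fill_value=False).to_numpy(dtype=bool)
    # B[i] / B[i-1]; NaN in row 0, which np.where never selects below.
    ratio = (frame["B"] / frame["B"].shift(1)).to_numpy(dtype=float)
    # Factor is the B ratio where the previous A is True, otherwise 1.
    factor = np.where(a_prev, ratio, 1.0)
    return frame["C"].iloc[0] * pd.Series(factor, index=frame.index).cumprod()


def fill_c_loop(frame):
    """Hypothetical reference: the same rule written as an explicit loop."""
    a = frame["A"].to_numpy()
    b = frame["B"].to_numpy()
    c = frame["C"].to_numpy(dtype=float, copy=True)
    for i in range(1, len(c)):
        c[i] = c[i - 1] * b[i] / b[i - 1] if a[i - 1] else c[i - 1]
    return pd.Series(c, index=frame.index)


vectorized = fill_c(df)
looped = fill_c_loop(df)
df["C"] = vectorized
```

The two versions can differ in the last floating-point digits, because the running product rounds at different points than the loop does, so compare them with np.allclose rather than ==. With several dataframes, the same helper could be called on each one in turn.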