I have created a loop which does the job but is very inefficient in execution.
Example Dataset:
import pandas as pd
df = pd.DataFrame([[6039,'ABC',2],[1319,'DEF',2],[6039,'ABC',2],[2811,'DEF',2],[9256223,'XYZ',2],[7021,'ABC',3],[1302,'DEF',3],[3344,'ABC',4],[7648411,'XYZ',4],[1190,'DEF',4],[7648411,'XYZ',4],[1318,'DEF',4],[7648411,'XYZ',4],[2932,'DEF',4],[1318,'DEF',4],[7648411,'XYZ',4],[2932,'DEF',4],[1318,'DEF',4],[7648411,'XYZ',4],[2932,'XYZ',5],[1318,'DEF',5],[7648411,'XYZ',5],[2932,'DEF',5]
], columns=['A', 'B','C'])
df
A B C
0 6039 ABC 2
1 1319 DEF 2
2 6039 ABC 2
3 2811 DEF 2
4 9256223 XYZ 2
5 7021 ABC 3
6 1302 DEF 3
7 3344 ABC 4
8 7648411 XYZ 4
9 1190 DEF 4
10 7648411 XYZ 4
11 1318 DEF 4
12 7648411 XYZ 4
13 2932 DEF 4
14 1318 DEF 4
15 7648411 XYZ 4
16 2932 DEF 4
17 1318 DEF 4
18 7648411 XYZ 4
19 2932 XYZ 5
20 1318 DEF 5
21 7648411 XYZ 5
22 2932 DEF 5
The code I am applying:
df['D']=0
for i in range(len(df)):
if i==0 and df["B"][i]!="XYZ":
df.loc[i,'D']=df["A"][i]
else:
if df["C"][i]!=df["C"][i-1] and df["B"][i]!="XYZ":
df['D'][i]=df["A"][i]
else:
df['D'][i]=df['D'][i-1]
Result:
A B C D
0 6039 ABC 2 6039
1 1319 DEF 2 6039
2 6039 ABC 2 6039
3 2811 DEF 2 6039
4 9256223 XYZ 2 6039
5 7021 ABC 3 7021
6 1302 DEF 3 7021
7 3344 ABC 4 3344
8 7648411 XYZ 4 3344
9 1190 DEF 4 3344
10 7648411 XYZ 4 3344
11 1318 DEF 4 3344
12 7648411 XYZ 4 3344
13 2932 DEF 4 3344
14 1318 DEF 4 3344
15 7648411 XYZ 4 3344
16 2932 DEF 4 3344
17 1318 DEF 4 3344
18 7648411 XYZ 4 3344
19 2932 XYZ 5 3344
20 1318 DEF 5 3344
21 7648411 XYZ 5 3344
22 2932 DEF 5 3344
I am getting the correct result in column D, but the process is highly inefficient over more than 80K rows.
Is there a better way of accomplishing this? Maybe making a function and using apply method to it?
something in this format maybe?:
df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2, x.col_3), axis=1)
I am not sure, how to apply this?
CodePudding user response:
EDIT: with new input:
df["D"] = df.groupby(((df.C != df.C.shift()) & (df["B"] != "XYZ")).cumsum())["A"].transform("first")
print(df)
Prints:
A B C D
0 6039 ABC 2 6039
1 1319 DEF 2 6039
2 6039 ABC 2 6039
3 2811 DEF 2 6039
4 9256223 XYZ 2 6039
5 7021 ABC 3 7021
6 1302 DEF 3 7021
7 3344 ABC 4 3344
8 7648411 XYZ 4 3344
9 1190 DEF 4 3344
10 7648411 XYZ 4 3344
11 1318 DEF 4 3344
12 7648411 XYZ 4 3344
13 2932 DEF 4 3344
14 1318 DEF 4 3344
15 7648411 XYZ 4 3344
16 2932 DEF 4 3344
17 1318 DEF 4 3344
18 7648411 XYZ 4 3344
19 2932 XYZ 5 3344
20 1318 DEF 5 3344
21 7648411 XYZ 5 3344
22 2932 DEF 5 3344
CodePudding user response:
Edit: I think this will do it.
df.groupby(((df.C != df.C.shift()) & (df.B != 'XYZ')).cumsum()).apply(lambda frame: frame.assign(D=frame.A.iloc[0]))
The complex condition sets boolean flags at every position where a new C
value occurs as long as B
is not XYZ
. We turn that into groups with cumsum
and et voila. For the .apply()
part, Andrej Keseley's .transform("first")
solution is much more elegant.