Pandas: More responsive syntax or function which compares value of a cell in column with previous ce-CodePudding

I have created a loop which does the job but is very inefficient in execution.

Example Dataset:

import pandas as pd

df = pd.DataFrame([[6039,'ABC',2],[1319,'DEF',2],[6039,'ABC',2],[2811,'DEF',2],[9256223,'XYZ',2],[7021,'ABC',3],[1302,'DEF',3],[3344,'ABC',4],[7648411,'XYZ',4],[1190,'DEF',4],[7648411,'XYZ',4],[1318,'DEF',4],[7648411,'XYZ',4],[2932,'DEF',4],[1318,'DEF',4],[7648411,'XYZ',4],[2932,'DEF',4],[1318,'DEF',4],[7648411,'XYZ',4],[2932,'XYZ',5],[1318,'DEF',5],[7648411,'XYZ',5],[2932,'DEF',5]
], columns=['A', 'B','C'])
df


     A       B  C
0   6039    ABC 2
1   1319    DEF 2
2   6039    ABC 2
3   2811    DEF 2
4   9256223 XYZ 2
5   7021    ABC 3
6   1302    DEF 3
7   3344    ABC 4
8   7648411 XYZ 4
9   1190    DEF 4
10  7648411 XYZ 4
11  1318    DEF 4
12  7648411 XYZ 4
13  2932    DEF 4
14  1318    DEF 4
15  7648411 XYZ 4
16  2932    DEF 4
17  1318    DEF 4
18  7648411 XYZ 4
19  2932    XYZ 5
20  1318    DEF 5
21  7648411 XYZ 5
22  2932    DEF 5

The code I am applying:

    df['D']=0
for i in range(len(df)):
        if i==0 and df["B"][i]!="XYZ":
            df.loc[i,'D']=df["A"][i]
        else:
            if df["C"][i]!=df["C"][i-1] and df["B"][i]!="XYZ":
                df['D'][i]=df["A"][i] 
            else:
                df['D'][i]=df['D'][i-1]

Result:

      A      B  C   D
0   6039    ABC 2   6039
1   1319    DEF 2   6039
2   6039    ABC 2   6039
3   2811    DEF 2   6039
4   9256223 XYZ 2   6039
5   7021    ABC 3   7021
6   1302    DEF 3   7021
7   3344    ABC 4   3344
8   7648411 XYZ 4   3344
9   1190    DEF 4   3344
10  7648411 XYZ 4   3344
11  1318    DEF 4   3344
12  7648411 XYZ 4   3344
13  2932    DEF 4   3344
14  1318    DEF 4   3344
15  7648411 XYZ 4   3344
16  2932    DEF 4   3344
17  1318    DEF 4   3344
18  7648411 XYZ 4   3344
19  2932    XYZ 5   3344
20  1318    DEF 5   3344
21  7648411 XYZ 5   3344
22  2932    DEF 5   3344

I am getting the correct result in column D, but the process is highly inefficient over more than 80K rows.

Is there a better way of accomplishing this? Maybe making a function and using apply method to it?

something in this format maybe?:

df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2, x.col_3), axis=1)

I am not sure, how to apply this?

CodePudding user response：

EDIT: with new input:

df["D"] = df.groupby(((df.C != df.C.shift()) & (df["B"] != "XYZ")).cumsum())["A"].transform("first")
print(df)

Prints:

          A    B  C     D
0      6039  ABC  2  6039
1      1319  DEF  2  6039
2      6039  ABC  2  6039
3      2811  DEF  2  6039
4   9256223  XYZ  2  6039
5      7021  ABC  3  7021
6      1302  DEF  3  7021
7      3344  ABC  4  3344
8   7648411  XYZ  4  3344
9      1190  DEF  4  3344
10  7648411  XYZ  4  3344
11     1318  DEF  4  3344
12  7648411  XYZ  4  3344
13     2932  DEF  4  3344
14     1318  DEF  4  3344
15  7648411  XYZ  4  3344
16     2932  DEF  4  3344
17     1318  DEF  4  3344
18  7648411  XYZ  4  3344
19     2932  XYZ  5  3344
20     1318  DEF  5  3344
21  7648411  XYZ  5  3344
22     2932  DEF  5  3344

CodePudding user response：

Edit: I think this will do it.

df.groupby(((df.C != df.C.shift()) & (df.B != 'XYZ')).cumsum()).apply(lambda frame: frame.assign(D=frame.A.iloc[0]))

The complex condition sets boolean flags at every position where a new C value occurs as long as B is not XYZ. We turn that into groups with cumsum and et voila. For the .apply() part, Andrej Keseley's .transform("first") solution is much more elegant.