Here is my data example:
SN | test case | measure | test result |
---|---|---|---|
SN1 | TC1 | | Fail |
SN1 | TC1 | value1 | Pass |
SN1 | TC2 | value2 | Pass |
SN2 | TC1 | value1 | Pass |
SN2 | TC2 | | Fail |
SN2 | TC2 | value2 | Pass |
You can see in the data that the failed cases pass on a retest. What I want to do is identify which passed cases are retested cases.
So I want to add another column named 'retest' and mark only the retested cases as 1, the others as 0.
SN | test case | measure | test result | retest |
---|---|---|---|---|
SN1 | TC1 | | Fail | 0 |
SN1 | TC1 | value1 | Pass | 1 |
SN1 | TC2 | value2 | Pass | 0 |
SN2 | TC1 | value1 | Pass | 0 |
SN2 | TC2 | | Fail | 0 |
SN2 | TC2 | value2 | Pass | 1 |
So the values in the 'retest' column depend on other rows and other columns.
I've tried df.apply(), but it can only pass one row at a time to the function. A for loop isn't efficient either, since there are more than 56M rows in the database.
Is there an efficient way to achieve this with pandas?
CodePudding user response:
```python
failed_before = (df["test result"]
                 .eq("Fail")
                 .groupby([df["SN"], df["test case"]])
                 .transform("any"))

passed = df["test result"].eq("Pass")
df["retest"] = (passed & failed_before).astype(int)
```
- check for each SN & test case pair to see if they have any failures
- with `transform`, this is spread to each row of the unique SN & test case pairs
- with `passed`, check if a given row has status "Pass"
- both steps give a True/False result
- since "retested" = "passed now but also had a failure", we AND the two conditions
- the result is True/False values; cast them to integers at the end

to get:
```
>>> df
    SN test case measure test result  retest
0  SN1       TC1     NaN        Fail       0
1  SN1       TC1  value1        Pass       1
2  SN1       TC2  value2        Pass       0
3  SN2       TC1  value1        Pass       0
4  SN2       TC2     NaN        Fail       0
5  SN2       TC2  value2        Pass       1
```
(This assumes a case never fails again once it has passed within a given SN & test case pair.)
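If that assumption doesn't hold (i.e., a Fail can appear *after* a Pass in a group), one possible order-aware variant marks a Pass only when a failure occurred strictly earlier in that group. This is just a sketch, not part of the original answer: it assumes rows are already in chronological order, and the tiny sample frame below (a hypothetical Pass-Fail-Pass sequence) exists only to show where the two approaches differ.

```python
import pandas as pd

# Hypothetical sequence where SN1/TC1 passes, then fails, then passes again
df = pd.DataFrame(
    {"SN":          ["SN1",    "SN1",  "SN1"],
     "test case":   ["TC1",    "TC1",  "TC1"],
     "measure":     ["value1", None,   "value2"],
     "test result": ["Pass",   "Fail", "Pass"]})

grp = [df["SN"], df["test case"]]

# 1 from the first Fail onward within each group...
seen_fail = df["test result"].eq("Fail").astype(int).groupby(grp).cummax()
# ...shifted down one row per group, so each row only "sees" failures
# that happened strictly before it
failed_earlier = seen_fail.groupby(grp).shift(fill_value=0).astype(bool)

df["retest"] = (df["test result"].eq("Pass") & failed_earlier).astype(int)
# first Pass -> 0 (no failure before it), last Pass -> 1;
# the transform("any") approach would mark both Passes with 1
```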
When you have a GroupBy object, it doesn't really do anything on its own; you need to do something further to generate results out of it. `.agg` is one way of doing that: it aggregates per-group results (`.apply` is another similar one). There's also `.transform`. Let's see what `.agg` does on your data after a groupby operation:
```
# without transform, but agg
>>> df["test result"].eq("Fail").groupby([df["SN"], df["test case"]]).agg("any")
SN   test case
SN1  TC1           True
     TC2          False
SN2  TC1          False
     TC2           True
Name: test result, dtype: bool
```
- so we first checked equality against "Fail" as before
- then grouped that result over the "SN" and "test case" pairs
- at this point the groupby object is "lazy", so to speak (see the sketch after this list)
- now comes the action...
- ...we aggregate each group's result to produce a single value out of it
- in this case we do that with "any"
- i.e., "if there is at least one True in a given group, give True for this group (e.g., for SN1-TC1); otherwise (when all are False), give False for the group (e.g., SN1-TC2)"
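To make the "lazy" point concrete, here's a tiny sketch (rebuilding the sample frame from the question) showing that the GroupBy object alone computes nothing:

```python
import pandas as pd

df = pd.DataFrame(
    {"SN": ["SN1", "SN1", "SN1", "SN2", "SN2", "SN2"],
     "test case": ["TC1", "TC1", "TC2", "TC1", "TC2", "TC2"],
     "measure": [None, "value1", "value2", "value1", None, "value2"],
     "test result": ["Fail", "Pass", "Pass", "Pass", "Fail", "Pass"]})

gb = df["test result"].eq("Fail").groupby([df["SN"], df["test case"]])
print(gb)             # just a SeriesGroupBy object; nothing computed yet
print(gb.agg("any"))  # only now does the "any" aggregation actually run
```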
Now, you see that we have a 4-row result at the end. Why? Because that's how many unique SN-test_case pairs there are: "SN1-TC1", "SN1-TC2", "SN2-TC1" and "SN2-TC2". That's fine: we got a single result for each group.
However, what if you want to repeat that single result for each occurrence of a given group in the original dataframe? As we saw, the original dataframe has 6 rows: this implies some groups appear more than once (namely "SN1-TC1" and "SN2-TC2"). Hmm... so we need a result that has the same number of rows as the original dataframe, repeating the aggregation result for each group as many times as needed. That's where transform comes into play:
df["test result"].eq("Fail").groupby([df["SN"], df["test case"]]).transform("any")
0 True
1 True
2 False
3 False
4 True
5 True
Name: test result, dtype: bool
You see how we got a 6-row result? Previously, we got a True value for "SN1-TC1" with `.agg`, but it showed up only once. Now it repeats that True value for each "SN1-TC1" row in the original dataframe, i.e., at indexes 0 and 1 here. The same goes for all the groups, and we get a result of length len(df).
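In fact, `transform("any")` here is equivalent to computing the 4-row `.agg("any")` result and broadcasting it back onto the 6 rows yourself. A sketch of that equivalence (the `failed_any` name is just for illustration), using the same sample frame as above:

```python
import pandas as pd

df = pd.DataFrame(
    {"SN": ["SN1", "SN1", "SN1", "SN2", "SN2", "SN2"],
     "test case": ["TC1", "TC1", "TC2", "TC1", "TC2", "TC2"],
     "test result": ["Fail", "Pass", "Pass", "Pass", "Fail", "Pass"]})

# 4-row per-group result, one value per SN & test case pair
agg_res = (df["test result"].eq("Fail")
           .groupby([df["SN"], df["test case"]])
           .agg("any")
           .rename("failed_any"))

# broadcast it back to the 6 original rows by joining on the group keys
broadcast = df.join(agg_res, on=["SN", "test case"])["failed_any"]

transformed = (df["test result"].eq("Fail")
               .groupby([df["SN"], df["test case"]])
               .transform("any"))

assert (broadcast.to_numpy() == transformed.to_numpy()).all()
```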
Now comes "why" do we need to transform? Because as you see in the code above further, we needed to compare passed
with failed_before
. Now, passed
is nothing but df["test result"].eq("Pass")
-- this is a True/False Series of the same length as df
. It telss for each row of "test result" that whether it had a "Pass" status or not. Cool. We want to compare this to failed_before
values: if we had .agg
and produce a single result for unique group members, there would be a length mismatch! It has 4 rows because 4 unique values, but we need full 6 rows to compare against passed
-- and that's where transform
helps us by repeating the aggregated values as necessary to enable comparison.
Lastly, both `agg` and `transform` take an arbitrary function to perform on groups; however, some functions are so common that they have string shortcuts like "any", "all", "mean", "sum", "unique" etc. In other words, we could have passed `lambda gr: gr.any()` instead of `"any"` above in both `agg` and `transform`, but "any" is at our disposal as a convenience. (As an aside, `GroupBy.agg` can take a list of functions etc. as mentioned in its documentation, but that's not really related to the distinction here.)