Here is my data example:
SN | test case | measure | test result |
---|---|---|---|
SN1 | TC1 | | Fail |
SN1 | TC1 | value1 | Pass |
SN1 | TC2 | value2 | Pass |
SN2 | TC1 | value1 | Pass |
SN2 | TC2 | | Fail |
SN2 | TC2 | value2 | Pass |
You can see in the data that the failed cases pass on a retest. What I want to do is identify which passed cases are retested cases.
So I want to add another column named 'retest' and mark only the retested cases as 1, the others as 0.
SN | test case | measure | test result | retest |
---|---|---|---|---|
SN1 | TC1 | | Fail | 0 |
SN1 | TC1 | value1 | Pass | 1 |
SN1 | TC2 | value2 | Pass | 0 |
SN2 | TC1 | value1 | Pass | 0 |
SN2 | TC2 | | Fail | 0 |
SN2 | TC2 | value2 | Pass | 1 |
So the values in the 'retest' column depend on other rows and other columns.
I've tried df.apply(), but it can only pass one row at a time to the function. A for loop isn't efficient either, since there are more than 56M rows in the database.
Is there an efficient way to achieve this with pandas?
CodePudding user response:
```python
failed_before = (df["test result"]
                 .eq("Fail")
                 .groupby([df["SN"], df["test case"]])
                 .transform("any"))

passed = df["test result"].eq("Pass")
df["retest"] = (passed & failed_before).astype(int)
```
- check for each SN & test case pair to see if they have any failures
- with `transform`, this is spread to each row of the unique SN & test case pairs
- with `passed`, check if a given row has status "Pass"
- both steps give a True/False result
- since "retested" = "passed now but also had a failure", we AND the two conditions
- the result is True/False values; cast them to integers at the end

to get:
```
>>> df
    SN test case measure test result  retest
0  SN1       TC1     NaN        Fail       0
1  SN1       TC1  value1        Pass       1
2  SN1       TC2  value2        Pass       0
3  SN2       TC1  value1        Pass       0
4  SN2       TC2     NaN        Fail       0
5  SN2       TC2  value2        Pass       1
```
(This assumes a case never fails again once it has passed within a given SN & test case pair.)
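If that assumption doesn't hold (i.e., a Fail can appear *after* a Pass in a group), one possible order-aware variant marks a Pass only when a failure occurred strictly earlier in that group. This is just a sketch, not part of the original answer: it assumes rows are already in chronological order, and the tiny sample frame below (a hypothetical Pass-Fail-Pass sequence) exists only to show where the two approaches differ.

```python
import pandas as pd

# Hypothetical sequence where SN1/TC1 passes, then fails, then passes again
df = pd.DataFrame(
    {"SN":          ["SN1",    "SN1",  "SN1"],
     "test case":   ["TC1",    "TC1",  "TC1"],
     "measure":     ["value1", None,   "value2"],
     "test result": ["Pass",   "Fail", "Pass"]})

grp = [df["SN"], df["test case"]]

# 1 from the first Fail onward within each group...
seen_fail = df["test result"].eq("Fail").astype(int).groupby(grp).cummax()
# ...shifted down one row per group, so each row only "sees" failures
# that happened strictly before it
failed_earlier = seen_fail.groupby(grp).shift(fill_value=0).astype(bool)

df["retest"] = (df["test result"].eq("Pass") & failed_earlier).astype(int)
# first Pass -> 0 (no failure before it), last Pass -> 1;
# the transform("any") approach would mark both Passes with 1
```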
When you have a GroupBy object, it doesn't really do anything on its own; you need to do something further to generate results out of it. `.agg` is one way of doing that: it aggregates per-group results (`.apply` is another similar one). There's also `.transform`. Let's see what `.agg` does on your data after a groupby operation:
```
# without transform, but agg
>>> df["test result"].eq("Fail").groupby([df["SN"], df["test case"]]).agg("any")
SN   test case
SN1  TC1           True
     TC2          False
SN2  TC1          False
     TC2           True
Name: test result, dtype: bool
```
- so we first checked equality against "Fail" as before
- then grouped that result over the "SN" and "test case" pairs
- at this point the groupby object is "lazy", so to speak (see the sketch after this list)
- now comes the action...
- ...we aggregate each group's result to produce a single value out of it
- in this case we do that with "any"
- i.e., "if there is at least one True in a given group, give True for this group (e.g., for SN1-TC1); otherwise (when all are False), give False for the group (e.g., SN1-TC2)"
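To make the "lazy" point concrete, here's a tiny sketch (rebuilding the sample frame from the question) showing that the GroupBy object alone computes nothing:

```python
import pandas as pd

df = pd.DataFrame(
    {"SN": ["SN1", "SN1", "SN1", "SN2", "SN2", "SN2"],
     "test case": ["TC1", "TC1", "TC2", "TC1", "TC2", "TC2"],
     "measure": [None, "value1", "value2", "value1", None, "value2"],
     "test result": ["Fail", "Pass", "Pass", "Pass", "Fail", "Pass"]})

gb = df["test result"].eq("Fail").groupby([df["SN"], df["test case"]])
print(gb)             # just a SeriesGroupBy object; nothing computed yet
print(gb.agg("any"))  # only now does the "any" aggregation actually run
```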
Now, you see that we have a 4-row result at the end. Why? Because that's how many unique SN-test_case pairs there are: "SN1-TC1", "SN1-TC2", "SN2-TC1" and "SN2-TC2". That's fine: we got a single result for each group.
However, what if you want to repeat that single result for each occurrence of a given group in the original dataframe? As we saw, the original dataframe has 6 rows: this implies some groups appear more than once (namely "SN1-TC1" and "SN2-TC2"). Hmm... so we need a result that has the same number of rows as the original dataframe, repeating the aggregation result for each group as many times as needed. That's where transform comes into play:
df["test result"].eq("Fail").groupby([df["SN"], df["test case"]]).transform("any")
0 True
1 True
2 False
3 False
4 True
5 True
Name: test result, dtype: bool
You see how we got a 6-row result? Previously, we got a True value for "SN1-TC1" with `.agg`, but it showed up only once. Now it repeats that True value for each "SN1-TC1" row in the original dataframe, i.e., at indexes 0 and 1 here. The same goes for all the groups, and we get a result of length len(df).
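In fact, `transform("any")` here is equivalent to computing the 4-row `.agg("any")` result and broadcasting it back onto the 6 rows yourself. A sketch of that equivalence (the `failed_any` name is just for illustration), using the same sample frame as above:

```python
import pandas as pd

df = pd.DataFrame(
    {"SN": ["SN1", "SN1", "SN1", "SN2", "SN2", "SN2"],
     "test case": ["TC1", "TC1", "TC2", "TC1", "TC2", "TC2"],
     "test result": ["Fail", "Pass", "Pass", "Pass", "Fail", "Pass"]})

# 4-row per-group result, one value per SN & test case pair
agg_res = (df["test result"].eq("Fail")
           .groupby([df["SN"], df["test case"]])
           .agg("any")
           .rename("failed_any"))

# broadcast it back to the 6 original rows by joining on the group keys
broadcast = df.join(agg_res, on=["SN", "test case"])["failed_any"]

transformed = (df["test result"].eq("Fail")
               .groupby([df["SN"], df["test case"]])
               .transform("any"))

assert (broadcast.to_numpy() == transformed.to_numpy()).all()
```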
Now comes "why" do we need to transform? Because as you see in the code above further, we needed to compare passed
with failed_before
. Now, passed
is nothing but df["test result"].eq("Pass")
-- this is a True/False Series of the same length as df
. It telss for each row of "test result" that whether it had a "Pass" status or not. Cool. We want to compare this to failed_before
values: if we had .agg
and produce a single result for unique group members, there would be a length mismatch! It has 4 rows because 4 unique values, but we need full 6 rows to compare against passed
-- and that's where transform
helps us by repeating the aggregated values as necessary to enable comparison.
Lastly, both `agg` and `transform` take an arbitrary function to perform on groups; however, some functions are so common that they have string shortcuts like "any", "all", "mean", "sum", "unique" etc. In other words, we could have passed `lambda gr: gr.any()` instead of `"any"` above in both `agg` and `transform`, but "any" is at our disposal as a convenience. (As an aside, `GroupBy.agg` can take a list of functions etc. as mentioned in its documentation, but that's not really related to the distinction here.)