I'm searching for the most efficient way to find if True
value in column A
is the first occurrence since last True
value in column B
.
In this examples the expected output would be column C
.
Example 1:
df = pd.DataFrame({
'A': [False, False, True, False, True, False, True, False, True],
'B': [True, False, False, False, False, True, False, False, False],
'C': [False, False, True, False, False, False, True, False, False]
})
A | B | C | |
---|---|---|---|
0 | False | True | False |
1 | False | False | False |
2 | True | False | True |
3 | False | False | False |
4 | True | False | False |
5 | False | True | False |
6 | True | False | True |
7 | False | False | False |
8 | True | False | False |
Example 2:
df = pd.DataFrame({
'A': [True, False, False, True, False, True, False, True, False],
'B': [False, True, False, False, False, False, True, False, False],
'C': [False, False, False, True, False, False, False, True, False]
})
A | B | C | |
---|---|---|---|
0 | True | False | False |
1 | False | True | False |
2 | False | False | False |
3 | True | False | True |
4 | False | False | False |
5 | True | False | False |
6 | False | True | False |
7 | True | False | True |
8 | False | False | False |
Here you can find a .csv file with a bigger example
CodePudding user response:
You can use a groupby
operation on the cumulative sum of column "B" to group your dataframe how you described. Then you can use idxmax
to get the index where each of those first occurrences exist within column "A". Once you have those indices, you can create your new column "C".
Using idxmax
is a little trick because we're not actually interested in the maximum value since column "A" only ever has True
and False
as its values. idxmax
will return the index of the first occurrence of the maximum (in this case, the first occurrence of True
within each group), which is what we're specifically interested in.
df = pd.DataFrame({
'A': [False, False, True, False, True, False, True, False, True],
'B': [True, False, False, False, False, True, False, False, False],
})
indices = df["A"].groupby(df["B"].cumsum()).idxmax()
# corrects for example 2 (maintains compatibility with example 1)
indices = indices.loc[1:]
df["C"] = False
df.loc[indices, "C"] = True
print(df)
A B C
0 False True False
1 False False False
2 True False True
3 False False False
4 True False False
5 False True False
6 True False True
7 False False False
8 True False False
Updated for example 2.
We can resolve this issue by slicing our indices Series to exclude any entry whose index is 0 (e.g. label slicing from 1 to the end). This works because of our groupby
operation assigns integer based labels according to the .cumsum
. In example 1, the smallest index label will be 1 (since the first value in column "B" is True). Whereas in example 2, the smallest index label will be 0. Since we don't want the 0 to effect our results, we can simply slice it away from our indices
.
When we assign "C" after performing the slicing on our indices
Series, we will appropriately ignore all of the values from before the first occurrence of True in column "B".
Enough text though, lets see some code.
Example 1
print(indices)
1 2
2 6
# Slicing here doesn't change anything, since indices does not have
# a value corresponding to label position 0
indices = indices.loc[1:]
print(indices)
1 2
2 6
Example 2
print(indices)
0 0
1 3
2 7
# we don't want to include the value from label position 0 in `indices`
# so we can use slicing to remove it
indices = indices.loc[1:]
print(indices)
1 3
2 7
CodePudding user response:
This is a way to do it, maybe not the best one.
is_occurred = False
def is_first_occurrence_since(column_to_check, column_occurence):
global is_occurred
if is_occurred and column_to_check == True:
is_occurred = False
return True
elif not is_occurred and column_occurence == True:
is_occurred = True
return False
df.apply(lambda row: is_first_occurrence_since(row['A'], row['B']), axis=1)