Enumerate rows based on condition in python-CodePudding

How can I enumerate rows in a pandas dataframe based on a condition?

In this example i want to enumerate all rows where column "Check" = "True". So row 2 and 3 should be enumerated with 1 and row 9 to 23 should be enumerated with 2.

Thank you very much for your ideas!

CodePudding user response：

Let's say we had the following dataframe:

>>> df
                  data  check
0  3588408447395168256  False
1  3611346577085431808   True
2   749264861627678720   True
3  8639996896158416896  False
4  2566054320101392384  False
5  2206545325982744576   True
6  1391007265699397632   True
7  1445594781760618496   True
8  9025069881367724032  False

df['check'].diff() gives us whether each value in the check column differs from the previous one. .ne(False) gives us the ones that do not differ. .cumsum() then gives us a sequentially increasing values we can use to group the rows with:

groups = list(df.groupby(df['check'].diff().ne(False).cumsum()))

This gives us a list of tuples of type (int, DataFrame):

1:
                  data  check
0  3588408447395168256  False

2:
                  data  check
1  3611346577085431808   True
2   749264861627678720   True

3:
                  data  check
3  8639996896158416896  False
4  2566054320101392384  False

4:
                  data  check
5  2206545325982744576   True
6  1391007265699397632   True
7  1445594781760618496   True

5:
                  data  check
8  9025069881367724032  False

You can get every one where check is True by taking every second element, either starting at 0 or 1 depending on what the value of check in the first group is:

groups[not groups[0][1]['check'][0]::2]

Then if you want to re-enumerate them with indices that only count the groups where check is True, you can do like with Python's built-in enumerate function:

enumerate(x[1] for x in groups[not groups[0][1]['check'][0]::2])

CodePudding user response：

I made pdrle package that can help with this. Basically, you identify runs of identical values and use the length of those runs to assign id.

import pdrle


df
#                   data  check
# 0  3588408447395168256  False
# 1  3611346577085431808   True
# 2   749264861627678720   True
# 3  8639996896158416896  False
# 4  2566054320101392384  False
# 5  2206545325982744576   True
# 6  1391007265699397632   True
# 7  1445594781760618496   True
# 8  9025069881367724032  False

rle = pdrle.encode(df.check)
rle
#         vals  runs
# check             
# 0      False     1
# 1       True     2
# 2      False     2
# 3       True     3
# 4      False     1

df["rleid"] = rle.vals.cumsum().repeat(rle.runs).reset_index(drop=True) * df.check
df
#                   data  check  rleid
# 0  3588408447395168256  False      0
# 1  3611346577085431808   True      1
# 2   749264861627678720   True      1
# 3  8639996896158416896  False      0
# 4  2566054320101392384  False      0
# 5  2206545325982744576   True      2
# 6  1391007265699397632   True      2
# 7  1445594781760618496   True      2
# 8  9025069881367724032  False      0