How to create a new pandas column with increasing sequence id, but retain same value within each group

I have a pandas dataframe that looks like the one below:

import pandas as pd

df = pd.DataFrame({'hourOfDay': [5, 5, 8, 8, 13, 13],
                   'category': ['pageA', 'pageB', 'pageA', 'pageB', 'pageA', 'pageB'],
                  })

    hourOfDay   category
0   5           pageA
1   5           pageB
2   8           pageA
3   8           pageB
4   13          pageA
5   13          pageB

Now, I want to create a new column with a monotonically increasing id. The id should have the same value within each group (hourOfDay). The expected dataframe is shown below.

    hourOfDay   category    index
0           5   pageA       1
1           5   pageB       1
2           8   pageA       2
3           8   pageB       2
4          13   pageA       3
5          13   pageB       3

For simplicity, we can assume for now that the category column has only two values, but this may be extended later. If I group by hourOfDay, every page category within that group should get the same value. I can do it by splitting the main dataframe into two separate dataframes (filtered by category), sorting each one, creating a new column with df.groupby("hourOfDay").cumcount(), and finally merging the two dataframes. But this approach seems far too convoluted, so I was wondering if there's a simpler way of achieving the same thing.

User response:

Try:

>>> df['index'] = df['hourOfDay'].ne(df['hourOfDay'].shift()).cumsum()
>>> df
  hourOfDay category  index
0         5    pageA      1
1         5    pageB      1
2         8    pageA      2
3         8    pageB      2
4        13    pageA      3
5        13    pageB      3

Use ne and shift to check whether the current value differs from the previous value (the first row always counts as a change, since it is compared with NaN), then use cumsum to count the changes seen so far. This assumes the rows of each hourOfDay are adjacent.
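As a quick check that a previous-row comparison does not depend on every group having exactly two rows, here is a minimal sketch with made-up data where the group sizes differ. The names prev and starts_new_group are only for illustration, and the rows are assumed to be already ordered by hourOfDay.

```python
import pandas as pd

# Made-up data: groups of size 3, 1 and 2, already ordered by hourOfDay.
df = pd.DataFrame({
    'hourOfDay': [5, 5, 5, 8, 13, 13],
    'category': ['pageA', 'pageB', 'pageC', 'pageA', 'pageA', 'pageB'],
})

# Value of the previous row (NaN for the first row).
prev = df['hourOfDay'].shift()

# True wherever a row starts a new run of hourOfDay values.
starts_new_group = df['hourOfDay'].ne(prev)

# Running count of run starts gives an id that is constant within a run.
df['index'] = starts_new_group.cumsum()

print(df)
```

Each True in starts_new_group bumps the id by one, so the id stays constant for as long as hourOfDay repeats.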

User response:

Use diff and cumsum:

df['index'] = df['hourOfDay'].diff().ne(0).cumsum()
print(df)

# Output:
  hourOfDay category  index
0         5    pageA      1
1         5    pageB      1
2         8    pageA      2
3         8    pageB      2
4        13    pageA      3
5        13    pageB      3
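One thing to keep in mind with diff and cumsum is that the result depends on row order, and diff needs a numeric column. A minimal sketch, using made-up unsorted data, of how the ids come out with and without sorting first (run_id and df_sorted are illustrative names):

```python
import pandas as pd

# Made-up data where rows of the same hourOfDay are NOT adjacent.
df = pd.DataFrame({
    'hourOfDay': [8, 5, 8, 5, 13],
    'category': ['pageA', 'pageA', 'pageB', 'pageB', 'pageA'],
})

# Without sorting, every row differs from the one before it,
# so every row starts a new run and gets its own id.
run_id = df['hourOfDay'].diff().ne(0).cumsum()

# Sorting first (stable, to keep the original order within each hour)
# makes equal values adjacent, so each hourOfDay gets a single id.
df_sorted = df.sort_values('hourOfDay', kind='stable')
df_sorted['index'] = df_sorted['hourOfDay'].diff().ne(0).cumsum()

# Restore the original row order if it matters.
df_sorted = df_sorted.sort_index()

print(run_id.tolist())
print(df_sorted)
```

If the data is not guaranteed to be sorted, a groupby-based approach avoids the extra sort and restore steps.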

User response:

If you need the same index for each hourOfDay, use GroupBy.ngroup:

df['index'] = df.groupby('hourOfDay', sort=True).ngroup() + 1

Or factorize:

df = df.sort_values('hourOfDay')
df['index'] = pd.factorize(df['hourOfDay'])[0] + 1
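To make the difference between the two options concrete, here is a small sketch on made-up unsorted data. The column names by_ngroup, by_first_seen and by_factorize_sorted are only for illustration. It also uses the sort=True option of pd.factorize as an alternative to sorting the dataframe first.

```python
import pandas as pd

# Made-up data where hourOfDay is not sorted.
df = pd.DataFrame({
    'hourOfDay': [8, 5, 8, 5, 13],
    'category': ['pageA', 'pageA', 'pageB', 'pageB', 'pageA'],
})

# ngroup with sort=True numbers the groups in sorted key order (5, 8, 13).
df['by_ngroup'] = df.groupby('hourOfDay', sort=True).ngroup() + 1

# factorize without sorting numbers values in order of first appearance (8, 5, 13).
codes, uniques = pd.factorize(df['hourOfDay'])
df['by_first_seen'] = codes + 1

# factorize with sort=True numbers values in sorted order, matching ngroup.
codes_sorted, _ = pd.factorize(df['hourOfDay'], sort=True)
df['by_factorize_sorted'] = codes_sorted + 1

print(df)
```

Both approaches give one id per distinct hourOfDay regardless of row order; they differ only in which value gets which number unless sorting is requested.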