I have a pandas dataframe that looks like the one below:
import pandas as pd

df = pd.DataFrame({'hourOfDay': [5, 5, 8, 8, 13, 13],
                   'category': ['pageA', 'pageB', 'pageA', 'pageB', 'pageA', 'pageB']})
   hourOfDay category
0          5    pageA
1          5    pageB
2          8    pageA
3          8    pageB
4         13    pageA
5         13    pageB
Now I want to create a new column with a monotonically increasing id that has the same value within each group (hourOfDay). The expected dataframe is shown below.
   hourOfDay category  index
0          5    pageA      1
1          5    pageB      1
2          8    pageA      2
3          8    pageB      2
4         13    pageA      3
5         13    pageB      3
For simplicity, we can assume for now that the category column has only two values, but this may be extended later. If I group by hourOfDay, every page category within that group should get the same value. I can do it by splitting the main dataframe into two dataframes (filtered by category), sorting each, creating a new column with df.groupby("hourOfDay").cumcount(), and finally merging the two dataframes. But this approach seems far too convoluted, so I was wondering whether there is a simpler way to achieve the same thing.
CodePudding user response:
Try:
>>> df['index'] = df['hourOfDay'].eq(df['hourOfDay'].shift(-1)).cumsum()
>>> df
   hourOfDay category  index
0          5    pageA      1
1          5    pageB      1
2          8    pageA      2
3          8    pageB      2
4         13    pageA      3
5         13    pageB      3
Use eq and shift(-1) to check whether the current value is the same as the next value, then use cumsum to cumulatively sum the resulting True and False values. This works here because each hourOfDay appears in exactly two consecutive rows.
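To see what happens with more than two rows per hour, here is a minimal sketch. The data with a third category (pageC) and the names df3, next_based and prev_based are assumptions made up for illustration, not part of the question. Comparing with the previous row via ne and shift() is a swapped-in variant, not the code from the answer above.

```python
import pandas as pd

# Hypothetical data: three categories per hour instead of two.
df3 = pd.DataFrame({
    'hourOfDay': [5, 5, 5, 8, 8, 8],
    'category': ['pageA', 'pageB', 'pageC'] * 2,
})
hours = df3['hourOfDay']

# Comparing with the NEXT row counts every match, so ids drift inside a group.
next_based = hours.eq(hours.shift(-1)).cumsum()
print(next_based.tolist())  # [1, 2, 2, 3, 4, 4]

# Comparing with the PREVIOUS row marks only the first row of each run.
prev_based = hours.ne(hours.shift()).cumsum()
print(prev_based.tolist())  # [1, 1, 1, 2, 2, 2]
```

The previous-row variant still assumes that rows with the same hourOfDay are adjacent.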
CodePudding user response:
Use diff and cumsum:
df['index'] = df['hourOfDay'].diff().ne(0).cumsum()
print(df)
# Output:
   hourOfDay category  index
0          5    pageA      1
1          5    pageB      1
2          8    pageA      2
3          8    pageB      2
4         13    pageA      3
5         13    pageB      3
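Note that diff starts a new id whenever the value changes from one row to the next, so this one-liner relies on rows with the same hourOfDay being adjacent (and on the column being numeric). A minimal sketch of the difference, assuming hypothetical unsorted data (df_unsorted, naive and sorted_hours are illustrative names only):

```python
import pandas as pd

# Hypothetical data where rows of the same hour are NOT adjacent.
df_unsorted = pd.DataFrame({
    'hourOfDay': [5, 8, 5, 8],
    'category': ['pageA', 'pageA', 'pageB', 'pageB'],
})

# Applied directly, every row differs from its predecessor and gets a new id.
naive = df_unsorted['hourOfDay'].diff().ne(0).cumsum()
print(naive.tolist())  # [1, 2, 3, 4]

# Sorting first makes equal hours adjacent; assigning the result back
# aligns on the original row labels, so the row order of the frame is kept.
sorted_hours = df_unsorted['hourOfDay'].sort_values()
df_unsorted['index'] = sorted_hours.diff().ne(0).cumsum()
print(df_unsorted['index'].tolist())  # [1, 2, 1, 2]
```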
CodePudding user response:
If you need the same index per hourOfDay, use GroupBy.ngroup:
df['index'] = df.groupby('hourOfDay', sort=True).ngroup() + 1
Or factorize:
df = df.sort_values('hourOfDay')
df['index'] = pd.factorize(df['hourOfDay'])[0] + 1
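The sort_values step matters because factorize numbers values in order of first appearance, while ngroup with sort=True numbers them in sorted key order. A small sketch of the difference, assuming hypothetical unsorted data (df_u and the by_* names are illustrative only). The sort=True argument of pd.factorize is shown as an alternative to sorting the frame first.

```python
import pandas as pd

# Hypothetical data with hours in arbitrary order.
df_u = pd.DataFrame({'hourOfDay': [13, 5, 8, 5, 13, 8]})

# ngroup with sort=True: ids follow the sorted keys (5 -> 1, 8 -> 2, 13 -> 3).
by_ngroup = df_u.groupby('hourOfDay', sort=True).ngroup() + 1
print(by_ngroup.tolist())  # [3, 1, 2, 1, 3, 2]

# Plain factorize: ids follow order of first appearance (13 -> 1, 5 -> 2, 8 -> 3).
by_factorize = pd.factorize(df_u['hourOfDay'])[0] + 1
print(by_factorize.tolist())  # [1, 2, 3, 2, 1, 3]

# factorize with sort=True matches ngroup without reordering the rows.
by_factorize_sorted = pd.factorize(df_u['hourOfDay'], sort=True)[0] + 1
print(by_factorize_sorted.tolist())  # [3, 1, 2, 1, 3, 2]
```

Unlike the shift and diff approaches, both of these assign ids by value, so they do not depend on equal hours sitting in adjacent rows.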