I have a the below dataset:
timestamp conversationId UserId MessageId tpMessage Message
1614578324 ceb9004ae9d3 1c376ef 5bbd34859329 question Where do you live?
1614578881 ceb9004ae9d3 1c376ef d3b5d3884152 answer Brooklyn
1614583764 ceb9004ae9d3 1c376ef 0e4501fcd61f question What's your name?
1614590885 ceb9004ae9d3 1c376ef 97d841b79ff7 answer Phill
1614594952 ceb9004ae9d3 1c376ef 11ed3fd24767 question What's your gender?
1614602036 ceb9004ae9d3 1c376ef 601538860004 answer Male
1614602581 ceb9004ae9d3 1c376ef 8bc8d9089609 question How old are you?
1614606219 ceb9004ae9d3 1c376ef a2bd45e64b7c answer 35
1614606240 jto9034pe0i5 1c489rl o6bd35e64b5j question What's your name?
1614606250 jto9034pe0i5 1c489rl 96jd89i55b7t answer Robert
and I'm trying to use a similar ROW_NUMBER function in pandas
ROW_NUMBER() OVER(PARTITION BY userId ORDER BY UserId,timestamp,conversationId ASC) AS num_Row
I tried some aproachs so far, none worked as intended:
df['row_number'] = df.groupby(['userId','timestamp','conversationId']).cumcount() 1
or
df['row_number'] = df.sort_values(['userId','timestamp','conversationId'], ascending=[True,False]) \
.groupby(['userId']) \
.cumcount() 1
print(df)
my disered output is as folows :
timestamp conversationId UserId MessageId tpMessage Message num_row
1614578324 ceb9004ae9d3 1c376ef 5bbd34859329 question Where do you live? 1
1614578881 ceb9004ae9d3 1c376ef d3b5d3884152 answer Brooklyn 2
1614583764 ceb9004ae9d3 1c376ef 0e4501fcd61f question What's your name? 3
1614590885 ceb9004ae9d3 1c376ef 97d841b79ff7 answer Phill 4
1614594952 ceb9004ae9d3 1c376ef 11ed3fd24767 question What's your gender? 5
1614602036 ceb9004ae9d3 1c376ef 601538860004 answer Male 6
1614602581 ceb9004ae9d3 1c376ef 8bc8d9089609 question How old are you? 7
1614606219 ceb9004ae9d3 1c376ef a2bd45e64b7c answer 35 8
1614606240 jto9034pe0i5 1c489rl o6bd35e64b5j question What's your name? 1
1614606250 jto9034pe0i5 1c489rl 96jd89i55b7t answer Robert 2
could you guys help on that?
CodePudding user response:
Variation of you last attempt, which gives the provided output and matches the logic:
df['num_row'] = (df
.sort_values(by=['timestamp', 'conversationId'],
ascending=True) # this is the default
.groupby('UserId', sort=False).cumcount().add(1)
)
Output:
timestamp conversationId UserId MessageId tpMessage Message num_row
0 1614578324 ceb9004ae9d3 1c376ef 5bbd34859329 question Where do you live? 1
1 1614578881 ceb9004ae9d3 1c376ef d3b5d3884152 answer Brooklyn 2
2 1614583764 ceb9004ae9d3 1c376ef 0e4501fcd61f question What's your name? 3
3 1614590885 ceb9004ae9d3 1c376ef 97d841b79ff7 answer Phill 4
4 1614594952 ceb9004ae9d3 1c376ef 11ed3fd24767 question What's your gender? 5
5 1614602036 ceb9004ae9d3 1c376ef 601538860004 answer Male 6
6 1614602581 ceb9004ae9d3 1c376ef 8bc8d9089609 question How old are you? 7
7 1614606219 ceb9004ae9d3 1c376ef a2bd45e64b7c answer 35 8
8 1614606240 jto9034pe0i5 1c489rl o6bd35e64b5j question What's your name? 1
9 1614606250 jto9034pe0i5 1c489rl 96jd89i55b7t answer Robert 2