Home > Net >  Row Number using multiple columns in a Pandas Dataframe
Row Number using multiple columns in a Pandas Dataframe

Time:02-05

I have a the below dataset:

timestamp   conversationId   UserId  MessageId       tpMessage   Message 
1614578324  ceb9004ae9d3    1c376ef 5bbd34859329    question    Where do you live?
1614578881  ceb9004ae9d3    1c376ef d3b5d3884152    answer      Brooklyn
1614583764  ceb9004ae9d3    1c376ef 0e4501fcd61f    question    What's your name?
1614590885  ceb9004ae9d3    1c376ef 97d841b79ff7    answer      Phill
1614594952  ceb9004ae9d3    1c376ef 11ed3fd24767    question    What's your gender?
1614602036  ceb9004ae9d3    1c376ef 601538860004    answer      Male
1614602581  ceb9004ae9d3    1c376ef 8bc8d9089609    question    How old are you?
1614606219  ceb9004ae9d3    1c376ef a2bd45e64b7c    answer      35
1614606240  jto9034pe0i5    1c489rl o6bd35e64b5j    question    What's your name?
1614606250  jto9034pe0i5    1c489rl 96jd89i55b7t    answer      Robert  

and I'm trying to use a similar ROW_NUMBER function in pandas

ROW_NUMBER() OVER(PARTITION BY userId ORDER BY UserId,timestamp,conversationId    ASC) AS num_Row

I tried some aproachs so far, none worked as intended:

df['row_number'] = df.groupby(['userId','timestamp','conversationId']).cumcount()   1

or

df['row_number'] = df.sort_values(['userId','timestamp','conversationId'], ascending=[True,False]) \
             .groupby(['userId']) \
             .cumcount()   1
print(df)

my disered output is as folows :

timestamp   conversationId   UserId  MessageId       tpMessage   Message                num_row     
1614578324  ceb9004ae9d3    1c376ef 5bbd34859329    question    Where do you live?  1
1614578881  ceb9004ae9d3    1c376ef d3b5d3884152    answer      Brooklyn            2
1614583764  ceb9004ae9d3    1c376ef 0e4501fcd61f    question    What's your name?   3
1614590885  ceb9004ae9d3    1c376ef 97d841b79ff7    answer      Phill               4
1614594952  ceb9004ae9d3    1c376ef 11ed3fd24767    question    What's your gender? 5
1614602036  ceb9004ae9d3    1c376ef 601538860004    answer      Male                6
1614602581  ceb9004ae9d3    1c376ef 8bc8d9089609    question    How old are you?    7
1614606219  ceb9004ae9d3    1c376ef a2bd45e64b7c    answer      35                  8
1614606240  jto9034pe0i5    1c489rl o6bd35e64b5j    question    What's your name?   1
1614606250  jto9034pe0i5    1c489rl 96jd89i55b7t    answer      Robert              2

could you guys help on that?

CodePudding user response:

Variation of you last attempt, which gives the provided output and matches the logic:

df['num_row'] = (df
  .sort_values(by=['timestamp', 'conversationId'],
               ascending=True) # this is the default
  .groupby('UserId', sort=False).cumcount().add(1)
)

Output:

    timestamp conversationId   UserId     MessageId tpMessage              Message  num_row
0  1614578324   ceb9004ae9d3  1c376ef  5bbd34859329  question   Where do you live?        1
1  1614578881   ceb9004ae9d3  1c376ef  d3b5d3884152    answer             Brooklyn        2
2  1614583764   ceb9004ae9d3  1c376ef  0e4501fcd61f  question    What's your name?        3
3  1614590885   ceb9004ae9d3  1c376ef  97d841b79ff7    answer                Phill        4
4  1614594952   ceb9004ae9d3  1c376ef  11ed3fd24767  question  What's your gender?        5
5  1614602036   ceb9004ae9d3  1c376ef  601538860004    answer                 Male        6
6  1614602581   ceb9004ae9d3  1c376ef  8bc8d9089609  question     How old are you?        7
7  1614606219   ceb9004ae9d3  1c376ef  a2bd45e64b7c    answer                   35        8
8  1614606240   jto9034pe0i5  1c489rl  o6bd35e64b5j  question    What's your name?        1
9  1614606250   jto9034pe0i5  1c489rl  96jd89i55b7t    answer               Robert        2
  • Related