How to create an ordinal sequence column (rank) based on two columns in Pandas dataframe

Time:09-10

I have an Amazon reviews dataset that includes reviewer_id and review_date. For each reviewer, I need to rank their reviews by arrival date.

I will explain the problem with the example below. Reviewer aaa posted 5 reviews, two of which arrived on the same date. Reviews that arrived on the same date should share a rank (2 in this example). The next review, which arrived on 1/3/2000, should be ranked 4, not 3, because it is review #4 for reviewer aaa. Everything I have tried continues the ranking from the last rank number (giving 3 here), which is not what I need.

The same applies to reviewer bbb: the two reviews posted on 9/10/2010 should both rank 1, and the next review, posted on 11/10/2010, should rank 3.

Thank you in advance!

reviewer_id  review_date  rank
aaa          1/1/2000     1
aaa          1/2/2000     2
aaa          1/2/2000     2
aaa          1/3/2000     4
aaa          5/9/2002     5
bbb          9/10/2010    1
bbb          9/10/2010    1
bbb          11/10/2010   3
bbb          12/10/2010   4

Here is the code I tried, with no success:

df['rank'] = df.groupby('reviewer_id').review_date.transform(lambda x: pd.factorize(x)[0] + 1)

This code will produce the following rank column which is not what I need:

reviewer_id  review_date  rank
aaa          1/1/2000     1
aaa          1/2/2000     2
aaa          1/2/2000     2
aaa          1/3/2000     3
aaa          5/9/2002     4
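
A minimal sketch of why this attempt falls short, assuming one reviewer's dates are plain strings as in the example: pd.factorize numbers each distinct value in order of first appearance, so the resulting codes never skip a position after a tie.

```python
import pandas as pd

# Dates for reviewer aaa, copied from the example above.
dates = pd.Series(['1/1/2000', '1/2/2000', '1/2/2000', '1/3/2000', '5/9/2002'])

# factorize labels each distinct value 0, 1, 2, ... in order of first appearance,
# so tied dates share a code but the next distinct date gets the very next code.
codes, uniques = pd.factorize(dates)
dense_rank = (codes + 1).tolist()
print(dense_rank)  # [1, 2, 2, 3, 4]
```

The fourth review gets 3 rather than 4 because only three distinct dates have been seen by that point.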

CodePudding user response:

You can use groupby.rank with the method='min' parameter:

df['rank'] = (pd.to_datetime(df['review_date'])
                .groupby(df['reviewer_id']).rank(method='min')
                .convert_dtypes()
             )

output:

  reviewer_id review_date  rank
0         aaa    1/1/2000     1
1         aaa    1/2/2000     2
2         aaa    1/2/2000     2
3         aaa    1/3/2000     4
4         aaa    5/9/2002     5
5         bbb   9/10/2010     1
6         bbb   9/10/2010     1
7         bbb  11/10/2010     3
8         bbb  12/10/2010     4
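
For a self-contained check, the sketch below rebuilds the sample frame from the question and compares method='min' with method='dense'. The way df is constructed here is an assumption (dates stored as month/day/year strings), and the dense_rank column is added only for illustration.

```python
import pandas as pd

# Sample data copied from the question; dates are assumed to be month/day/year strings.
df = pd.DataFrame({
    'reviewer_id': ['aaa'] * 5 + ['bbb'] * 4,
    'review_date': ['1/1/2000', '1/2/2000', '1/2/2000', '1/3/2000', '5/9/2002',
                    '9/10/2010', '9/10/2010', '11/10/2010', '12/10/2010'],
})

# Parse the strings so ranking follows calendar order rather than string order.
dates = pd.to_datetime(df['review_date'])
by_reviewer = dates.groupby(df['reviewer_id'])

# 'min': tied rows share the lowest position in the tie, and the positions
# consumed by the tie are skipped afterwards.
df['rank'] = by_reviewer.rank(method='min').astype(int)

# 'dense': tied rows share a rank and nothing is skipped, which is what the
# factorize attempt in the question produces.
df['dense_rank'] = by_reviewer.rank(method='dense').astype(int)

print(df)
```

With the sample data, the rank column matches the table requested in the question, while dense_rank reproduces the unwanted 1, 2, 2, 3, 4 sequence for reviewer aaa.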

CodePudding user response:

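df['review_date'] = pd.to_datetime(df['review_date'])   # needed if the dates are strings, so they sort chronologically
df = df.sort_values(['reviewer_id', 'review_date'])
df['rank'] = 1 + df.groupby('reviewer_id').cumcount()
# blank out rows that repeat a (reviewer_id, review_date) pair, then fill forward
repeated = df.duplicated(['reviewer_id', 'review_date'])
df['rank'] = df['rank'].mask(repeated).ffill().astype(int)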

The logic of the code above:

  1. Sort by reviewer_id and review_date so that each reviewer's reviews are in date order and equal dates sit next to each other.
  2. Assign a running index, starting at 1, within each reviewer_id group.
  3. Mark as NaN the indexes of rows whose date repeats the date of the previous row for the same reviewer.
  4. Fill each NaN forward from the last valid index above it.
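
The four steps can be sketched end to end as follows. This is an illustration under stated assumptions, not necessarily the answerer's exact code: the sample rows are the question's data in a deliberately shuffled order, the dates are assumed to be month/day/year strings, and the names repeated and expected are introduced here only for the example.

```python
import pandas as pd

# The question's rows in shuffled order, to show that step 1 (sorting) matters.
df = pd.DataFrame({
    'reviewer_id': ['bbb', 'aaa', 'aaa', 'bbb', 'aaa', 'bbb', 'aaa', 'bbb', 'aaa'],
    'review_date': ['11/10/2010', '1/2/2000', '5/9/2002', '9/10/2010', '1/1/2000',
                    '12/10/2010', '1/2/2000', '9/10/2010', '1/3/2000'],
})
df['review_date'] = pd.to_datetime(df['review_date'])

# 1. Sort so each reviewer's rows are contiguous and in date order.
df = df.sort_values(['reviewer_id', 'review_date'])

# 2. Number each reviewer's rows 1, 2, 3, ...
df['rank'] = 1 + df.groupby('reviewer_id').cumcount()

# 3. Blank out rows that repeat an earlier (reviewer_id, review_date) pair.
repeated = df.duplicated(['reviewer_id', 'review_date'])
df['rank'] = df['rank'].mask(repeated)

# 4. Fill forward. The first row of each reviewer is never blanked,
#    so a value cannot leak from one reviewer into the next.
df['rank'] = df['rank'].ffill().astype(int)

# Cross-check against groupby.rank(method='min'); the comparison aligns on the index.
expected = df['review_date'].groupby(df['reviewer_id']).rank(method='min').astype(int)
print(df)
print((df['rank'] == expected).all())
```

Checking duplicates on the reviewer and date together, rather than on the date alone, keeps a reviewer's first review from being blanked when it happens to share a date with the previous reviewer's last review.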