I have a dataset, where authors are ranked by the order of authorship (1, 2, 3, etc).
Authorid Author Article Articleid Rank
1 John article 1 1 1
1 John article 2 2 2
1 John article 3 3 3
1 John article 4 4 3
2 Mary article 5 5 1
2 Mary article 6 6 2
2 Mary article 7 7 1
2 Mary article 8 8 8
I want to create three more Boolean columns If_first
, If_second
, If_last
.
The purpose of this - I want to show if the author is ranked 1, 2, or last in the article.
The last
means the maximum number in Rank
column (the maximum number for this Authorid
in the column Rank
).
I can do If_first
and If_second
, that is pretty easy, but not sure how to resolve If_last
.
df.loc[df['Rank'] == 1, 'If_first'] = 1
df.loc[df['Rank'] != 1, 'If_first'] = 0
df.loc[df['Rank'] == 2, 'If_second'] = 1
df.loc[df['Rank'] != 2, 'If_second'] = 0
Two rules here
If_first
=if_last
- treat him asif_first
If_second
=if_last
- treat him asif_second
Expected output:
Authorid Author Article Articleid Rank If_first If_second If_last
1 John article 1 1 1 1 0 0
1 John article 2 2 2 0 1 0
1 John article 3 3 3 0 0 1 (third is the last here)
2 Mary article 5 5 1 1 0 0
2 Mary article 6 6 2 0 1 0
2 Mary article 7 7 3 0 0 0 (third is not the last here, because of the fourth below, all zeros)
2 Mary article 8 8 4 0 0 1 (fourth is the last here)
CodePudding user response:
Try this:
df = df.reset_index(drop=True)
res = df.groupby('Authorid')['Rank'].apply(lambda x: [x.idxmin(), x.drop_duplicates()[1:].nsmallest(1).index[0], x.idxmax()])
df[['If_first', 'If_second', 'If_last']] = 0
df.loc[res.str[0].tolist(), 'If_first'] = 1
df.loc[res.str[1].tolist(), 'If_second'] = 1
df.loc[res.str[2].tolist(), 'If_last'] = 1
Output:
>>> df
Authorid Author Article Articleid Rank If_first If_second If_last
0 John article 1 1 1 1 0 0
1 John article 2 2 2 0 1 0
2 John article 3 3 3 0 0 1
3 John article 4 4 3 0 0 0
4 Mary article 5 5 1 1 0 0
5 Mary article 6 6 2 0 1 0
6 Mary article 7 7 1 0 0 0
7 Mary article 8 8 8 0 0 1
CodePudding user response:
One approach might be to create a second dataframe grouped by Articleid
collecting the statistic you're interested in:
df2 = df.groupby('Articleid').agg(mxrank=('Rank', 'max'))
then add the new column by merging the dataframes:
dfm = df.merge(df2, how='left', on='Articleid')
With example result (with some added rows to demonstrate an article "article4" with multiple ranks):
Authorid Author Article Articleid Rank mxrank
0 1 John article1 1 1 1
1 1 John article2 2 2 2
2 1 John article3 3 3 3
3 1 John article4 4 3 4
4 1 Foo article4 4 1 4
5 1 Bar article4 4 2 4
6 1 Baz article4 4 4 4
7 2 Mary article5 5 1 1
8 2 Mary article6 6 2 2
9 2 Mary article7 7 1 1
10 2 Mary article8 8 8 8
Then compare the mxrank
column to Rank
to determine the flag for each row.