Create multiple boolean columns in pandas dataframe based on multiple conditions-CodePudding

I have a dataset, where authors are ranked by the order of authorship (1, 2, 3, etc).

Authorid    Author  Article Articleid   Rank
1            John   article 1   1        1
1            John   article 2   2        2
1            John   article 3   3        3
1            John   article 4   4        3
2            Mary   article 5   5        1
2            Mary   article 6   6        2
2            Mary   article 7   7        1
2            Mary   article 8   8        8

I want to create three more Boolean columns If_first, If_second, If_last. The purpose of this - I want to show if the author is ranked 1, 2, or last in the article. The last means the maximum number in Rank column (the maximum number for this Authorid in the column Rank).

I can do If_first and If_second, that is pretty easy, but not sure how to resolve If_last.

df.loc[df['Rank'] == 1, 'If_first'] = 1
df.loc[df['Rank'] != 1, 'If_first'] = 0
df.loc[df['Rank'] == 2, 'If_second'] = 1
df.loc[df['Rank'] != 2, 'If_second'] = 0

Two rules here

If_first = if_last - treat him as if_first
If_second = if_last - treat him as if_second

Expected output:

Authorid    Author  Article Articleid   Rank    If_first    If_second   If_last
1            John   article 1   1        1       1              0         0
1            John   article 2   2        2       0              1         0
1            John   article 3   3        3       0              0         1 (third is the last here)
2            Mary   article 5   5        1       1              0         0
2            Mary   article 6   6        2       0              1         0
2            Mary   article 7   7        3       0              0         0 (third is not the last here, because of the fourth below, all zeros)
2            Mary   article 8   8        4       0              0         1 (fourth is the last here)

CodePudding user response：

Try this:

df = df.reset_index(drop=True)
res = df.groupby('Authorid')['Rank'].apply(lambda x: [x.idxmin(), x.drop_duplicates()[1:].nsmallest(1).index[0], x.idxmax()])

df[['If_first', 'If_second', 'If_last']] = 0
df.loc[res.str[0].tolist(), 'If_first'] = 1
df.loc[res.str[1].tolist(), 'If_second'] = 1
df.loc[res.str[2].tolist(), 'If_last'] = 1

Output:

>>> df
  Authorid   Author  Article  Articleid  Rank  If_first  If_second  If_last
0     John  article        1          1     1         1          0        0
1     John  article        2          2     2         0          1        0
2     John  article        3          3     3         0          0        1
3     John  article        4          4     3         0          0        0
4     Mary  article        5          5     1         1          0        0
5     Mary  article        6          6     2         0          1        0
6     Mary  article        7          7     1         0          0        0
7     Mary  article        8          8     8         0          0        1

CodePudding user response：

One approach might be to create a second dataframe grouped by Articleid collecting the statistic you're interested in:

df2 = df.groupby('Articleid').agg(mxrank=('Rank', 'max'))

then add the new column by merging the dataframes:

dfm = df.merge(df2, how='left', on='Articleid')

With example result (with some added rows to demonstrate an article "article4" with multiple ranks):

   Authorid Author   Article Articleid Rank mxrank
0         1   John  article1         1    1      1
1         1   John  article2         2    2      2
2         1   John  article3         3    3      3
3         1   John  article4         4    3      4
4         1    Foo  article4         4    1      4
5         1    Bar  article4         4    2      4
6         1    Baz  article4         4    4      4
7         2   Mary  article5         5    1      1
8         2   Mary  article6         6    2      2
9         2   Mary  article7         7    1      1
10        2   Mary  article8         8    8      8

Then compare the mxrank column to Rank to determine the flag for each row.