Home > database >  nlargest(2) at row level to retrieve columns names & equal values
nlargest(2) at row level to retrieve columns names & equal values

Time:06-23

I have this example df:

df = pd.DataFrame({
                  'Name': ['Sara', 'Adam','Ciara', 'John','Paul'],
                  'UK': [1,2,0,4,2],
                  'US': [1,3,0,1,2],
                  'IN': [1,5,0,1,0],
                  'CA': [0,0,0,7,1],
                  'JP': [0,0,0,0,1]
                   })

df

Gives:

    Name    UK  US  IN  CA  JP
0   Sara    1   1   1   0   0
1   Adam    2   3   5   0   0
2   Ciara   0   0   0   0   0
3   John    4   1   1   7   0
4   Paul    2   2   0   1   1

I would like to get the top 2 visited countries (Columns names) for each row, some rows have all 0 and others could have equal numbers 1,1,1

I already started to get some results but the output was not as expected:

io = df[['UK','US','IN','CA','JP']] #getting only countries columns as a df

Then

pd.DataFrame({n: io.T[col].astype(int).nlargest(2).index.tolist() for n, col in enumerate(io.T)}).T.join(df)

This is what I got:

    0   1   Name    UK  US  IN  CA  JP
0   UK  US  Sara    1   1   1   0   0
1   IN  US  Adam    2   3   5   0   0
2   UK  US  Ciara   0   0   0   0   0
3   CA  UK  John    4   1   1   7   0
4   UK  US  Paul    2   2   0   1   1

For the first row UK,US,IN columns all have equal values, so I am confused how to handle equal results and how nlargest(2) selected US,UK and not IN?

Since nlargest(2) would not be applicable for ties, I would like to include all the columns names

for the Third row with all 0,0,0,0,0 it still shows UK,US I tried to convert all 0 to NaN but got lots of errors

here is an example of the expected output - the first row is an example of ties:

    0   1   2       Name    UK  US  IN  CA  JP
0   UK  US  IN      Sara    1   1   1   0   0
1   IN  US  0       Adam    2   3   5   0   0
2   0   0   0       Ciara   0   0   0   0   0
3   CA  UK  0       John    4   1   1   7   0
4   UK  US  0       Paul    2   2   0   1   1

CodePudding user response:

Here is a unpolished answer with dot product and rank along axis=1

tmp = df.set_index("Name")
out = (pd.DataFrame(tmp.rank(axis=1,ascending=False).le(2).dot(tmp.columns ',')
                    .str.split(",")
                    .to_list()).iloc[:,:-1]
      .replace('',np.nan))

print(out)

     0     1     2
0   UK    US    IN
1   US    IN   NaN
2  NaN  None  None
3   UK    CA   NaN
4   UK    US   NaN
  • Related