I have this example df:
df = pd.DataFrame({
'Name': ['Sara', 'Adam','Ciara', 'John','Paul'],
'UK': [1,2,0,4,2],
'US': [1,3,0,1,2],
'IN': [1,5,0,1,0],
'CA': [0,0,0,7,1],
'JP': [0,0,0,0,1]
})
df
Gives:
Name UK US IN CA JP
0 Sara 1 1 1 0 0
1 Adam 2 3 5 0 0
2 Ciara 0 0 0 0 0
3 John 4 1 1 7 0
4 Paul 2 2 0 1 1
I would like to get the top 2 visited countries (Columns names) for each row, some rows have all 0
and others could have equal numbers 1,1,1
I already started to get some results but the output was not as expected:
io = df[['UK','US','IN','CA','JP']] #getting only countries columns as a df
Then
pd.DataFrame({n: io.T[col].astype(int).nlargest(2).index.tolist() for n, col in enumerate(io.T)}).T.join(df)
This is what I got:
0 1 Name UK US IN CA JP
0 UK US Sara 1 1 1 0 0
1 IN US Adam 2 3 5 0 0
2 UK US Ciara 0 0 0 0 0
3 CA UK John 4 1 1 7 0
4 UK US Paul 2 2 0 1 1
For the first row UK,US,IN
columns all have equal values, so I am confused how to handle equal results and how nlargest(2) selected US,UK
and not IN
?
Since nlargest(2) would not be applicable for ties, I would like to include all the columns names
for the Third row with all 0,0,0,0,0
it still shows UK,US
I tried to convert all 0 to NaN but got lots of errors
here is an example of the expected output - the first row is an example of ties:
0 1 2 Name UK US IN CA JP
0 UK US IN Sara 1 1 1 0 0
1 IN US 0 Adam 2 3 5 0 0
2 0 0 0 Ciara 0 0 0 0 0
3 CA UK 0 John 4 1 1 7 0
4 UK US 0 Paul 2 2 0 1 1
CodePudding user response:
Here is a unpolished answer with dot product and rank along axis=1
tmp = df.set_index("Name")
out = (pd.DataFrame(tmp.rank(axis=1,ascending=False).le(2).dot(tmp.columns ',')
.str.split(",")
.to_list()).iloc[:,:-1]
.replace('',np.nan))
print(out)
0 1 2
0 UK US IN
1 US IN NaN
2 NaN None None
3 UK CA NaN
4 UK US NaN