It was a bit hard to write a more explanatory title, so here is a more detailed explanation.
I have a square DataFrame that looks like the one below. The index names are the same as the column names, and each cell holds the similarity score calculated for that pair of images. For each image name, I need to extract the top n most similar images (ranked by score) and store them in a DataFrame that does not take up so much space.
name_A.jpg name_B.jpg name_C.jpg name_D.jpg ...
name_A.jpg 1.000000 0.725098 0.291748 0.444336
name_B.jpg 0.725098 1.000000 0.255371 0.482178
name_C.jpg 0.291748 0.255371 1.000000 0.382812
name_D.jpg 0.444336 0.482178 0.382812 1.000000
name_E.jpg 0.197998 0.276611 0.183594 0.242065
name_F.jpg 0.309570 0.292236 0.327148 0.387695
name_G.jpg 0.302490 0.280273 0.339844 0.377197
name_H.jpg 0.261475 0.278076 0.258301 0.323975
name_J.jpg 0.243164 0.261963 0.304932 0.314453
name_K.jpg 0.269043 0.254639 0.247681 0.259766
name_L.jpg 0.251465 0.238892 0.227539 0.233887
name_M.jpg 0.287354 0.299805 0.216553 0.259766
name_N.jpg 0.413818 0.460938 0.239136 0.358398
name_O.jpg 0.394043 0.489258 0.293701 0.526855
name_P.jpg 0.262451 0.235229 0.224487 0.210083
name_Q.jpg 0.124634 0.137695 0.095032 0.142944
name_R.jpg 0.173218 0.187134 0.203491 0.194092
...
So the desired output is something like this:
0 1 2 3 .... n
name_A.jpg name_B.jpg name_D.jpg name_N.jpg name_O.jpg
name_B.jpg
name_C.jpg
name_D.jpg
...
So if I look at name_A.jpg on a website, the recommended products are name_B.jpg, name_D.jpg, name_N.jpg, name_O.jpg, and so on. In my case I've got around 300,000 images and I want to display the top n = 50 most similar images, so the desired output DataFrame will have dimensions 300,000 x 50.
Of course I could just subset each column, use sort() in descending order, and keep the top 50 rows. However, this requires a for loop doing the same thing 300,000 times. Is there a faster way of doing this?
Answer:
Convert the DataFrame values and the column names to NumPy arrays, then get the positions of the values in descending order by negating the array and calling np.argsort. Skip the first position in each row (it is the image itself, because its score is always 1) and use the next N positions to look up the column names for a new DataFrame:
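import numpy as np
import pandas as pd

arr = df.to_numpy()
cols = df.columns.to_numpy()
N = 3
# argsort of the negated scores gives descending order per row;
# position 0 is the image itself (score 1), so take positions 1..N
df = pd.DataFrame(cols[np.argsort(-arr)[:, 1:N + 1]], index=df.index)
print(df.head(4))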
0 1 2
name_A.jpg name_B.jpg name_D.jpg name_C.jpg
name_B.jpg name_A.jpg name_D.jpg name_C.jpg
name_C.jpg name_D.jpg name_A.jpg name_B.jpg
name_D.jpg name_B.jpg name_A.jpg name_C.jpg
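Since only the top n = 50 out of roughly 300,000 columns are needed, a full sort of every row does more work than necessary. The following is a minimal sketch of a variant that uses np.argpartition to pick the n largest scores per row and then sorts only those n candidates. The function name top_n_similar and the 4 x 4 sample frame are made up for illustration (the scores are copied from the question), and the sketch assumes the column labels are unique. It also masks each image's own column explicitly instead of relying on the score of 1 sorting first.

```python
import numpy as np
import pandas as pd


def top_n_similar(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """For each row, return the n column names with the highest scores,
    excluding the column whose label equals the row label.

    Hypothetical helper; assumes unique column labels and n < number of columns.
    """
    arr = df.to_numpy(dtype=float, copy=True)
    cols = df.columns.to_numpy()

    # Mask self-similarity so an image is never recommended to itself.
    row_pos = np.arange(len(df))
    col_pos = df.columns.get_indexer(df.index)  # -1 where a row label is not a column
    found = col_pos >= 0
    arr[row_pos[found], col_pos[found]] = -np.inf

    # argpartition moves the n largest values of each row into the last n slots,
    # in no particular order.
    candidates = np.argpartition(arr, -n, axis=1)[:, -n:]

    # Sort only those n candidates by score, descending.
    candidate_scores = np.take_along_axis(arr, candidates, axis=1)
    order = np.argsort(-candidate_scores, axis=1)
    top = np.take_along_axis(candidates, order, axis=1)

    return pd.DataFrame(cols[top], index=df.index)


# Small sample built from the first four rows and columns of the question.
names = ["name_A.jpg", "name_B.jpg", "name_C.jpg", "name_D.jpg"]
sample = pd.DataFrame(
    [
        [1.000000, 0.725098, 0.291748, 0.444336],
        [0.725098, 1.000000, 0.255371, 0.482178],
        [0.291748, 0.255371, 1.000000, 0.382812],
        [0.444336, 0.482178, 0.382812, 1.000000],
    ],
    index=names,
    columns=names,
)

result = top_n_similar(sample, 3)
print(result)
```

Because the self column is looked up by label, the same function can be applied to a block of rows such as df.iloc[start:stop] (all columns kept), which may help if the full score matrix is too large to process in one go.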