Data frame transformation - create a ranked data frame based on score values

Time:12-02

It was a bit hard to write a more explanatory problem title, so here is a more detailed explanation.

I have a square DataFrame that looks like the one below. The index labels are the same as the column labels, and each cell holds the similarity score calculated for that pair of images. For each image name, I need to extract the top n most similar images (ranked by score) and store them in a data frame that does not take up much space.

               name_A.jpg     name_B.jpg     name_C.jpg     name_D.jpg   ...

name_A.jpg     1.000000       0.725098       0.291748       0.444336
name_B.jpg     0.725098       1.000000       0.255371       0.482178
name_C.jpg     0.291748       0.255371       1.000000       0.382812
name_D.jpg     0.444336       0.482178       0.382812       1.000000
name_E.jpg     0.197998       0.276611       0.183594       0.242065
name_F.jpg     0.309570       0.292236       0.327148       0.387695
name_G.jpg     0.302490       0.280273       0.339844       0.377197
name_H.jpg     0.261475       0.278076       0.258301       0.323975
name_J.jpg     0.243164       0.261963       0.304932       0.314453
name_K.jpg     0.269043       0.254639       0.247681       0.259766
name_L.jpg     0.251465       0.238892       0.227539       0.233887
name_M.jpg     0.287354       0.299805       0.216553       0.259766
name_N.jpg     0.413818       0.460938       0.239136       0.358398
name_O.jpg     0.394043       0.489258       0.293701       0.526855
name_P.jpg     0.262451       0.235229       0.224487       0.210083
name_Q.jpg     0.124634       0.137695       0.095032       0.142944
name_R.jpg     0.173218       0.187134       0.203491       0.194092
...

So the desired output is something like this:

               0              1              2              3              ....          n
name_A.jpg     name_B.jpg     name_D.jpg     name_N.jpg     name_O.jpg     
name_B.jpg
name_C.jpg
name_D.jpg
...

So if I look at name_A.jpg on a website, the recommended products are name_B.jpg, name_D.jpg, name_N.jpg, name_O.jpg, and so on. In my case I have around 300,000 images and I want to display the top n = 50 most similar images, so the desired output data frame will have dimensions 300,000 x 50.

Of course I could just take each column, sort it in descending order and keep the top 50 rows. However, this requires a for loop that does the same thing 300,000 times. Is there a faster way of doing this?
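For reference, the loop-based approach described above might look like the following sketch. The function name top_n_loop and the 4 x 4 sample (the first four rows of the table above) are illustrative assumptions, not code from the original question.

```python
import pandas as pd

def top_n_loop(sim: pd.DataFrame, n: int) -> pd.DataFrame:
    """Naive baseline: sort each column separately and keep the n best labels."""
    rows = {}
    for name in sim.columns:
        # Drop the image's own score, then rank the remaining scores.
        scores = sim[name].drop(name).sort_values(ascending=False)
        rows[name] = scores.index[:n].to_list()
    # One row per image, one column per rank (0 = most similar).
    return pd.DataFrame.from_dict(rows, orient="index")

# Hypothetical 4 x 4 sample built from the first rows of the table above.
names = ["name_A.jpg", "name_B.jpg", "name_C.jpg", "name_D.jpg"]
sim = pd.DataFrame(
    [[1.000000, 0.725098, 0.291748, 0.444336],
     [0.725098, 1.000000, 0.255371, 0.482178],
     [0.291748, 0.255371, 1.000000, 0.382812],
     [0.444336, 0.482178, 0.382812, 1.000000]],
    index=names, columns=names,
)
print(top_n_loop(sim, n=2))
```

This does one pandas sort per image, which is the per-column Python loop the question wants to avoid.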

CodePudding user response:

Convert the DataFrame values and the column names to NumPy arrays, then get the positions of the values in descending order by calling np.argsort on the negated array. Skip the first position in each row (it is the image itself, whose score is always 1, provided no other image ties at 1) and map the next N positions to column names in a new DataFrame:

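import numpy as np
import pandas as pd

# Similarity scores and column labels as NumPy arrays
arr = df.to_numpy()
cols = df.columns.to_numpy()

N = 3
# Negating makes argsort rank in descending order; position 0 is the image itself, so keep positions 1..N
df = pd.DataFrame(cols[np.argsort(-arr)[:, 1:N + 1]], index=df.index)
print(df.head(4))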
                     0           1           2
name_A.jpg  name_B.jpg  name_D.jpg  name_C.jpg
name_B.jpg  name_A.jpg  name_D.jpg  name_C.jpg
name_C.jpg  name_D.jpg  name_A.jpg  name_B.jpg
name_D.jpg  name_B.jpg  name_A.jpg  name_C.jpg
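
As a variation on the answer above: for n = 50 out of roughly 300,000 columns, a full sort of every row does more work than needed. The sketch below uses np.argpartition to select the top n per row and then sorts only those n candidates. It assumes the frame is square, with the index in the same order as the columns. The function name top_n_similar and the 4 x 4 sample are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

def top_n_similar(sim: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the n most similar column labels per row, best first."""
    arr = sim.to_numpy(dtype=float, copy=True)
    cols = sim.columns.to_numpy()
    # Exclude each image's own score explicitly instead of relying on it sorting first.
    np.fill_diagonal(arr, -np.inf)
    # Partial selection: indices of the n largest scores per row, in arbitrary order.
    part = np.argpartition(-arr, n - 1, axis=1)[:, :n]
    # Sort only those n candidates per row by descending score.
    order = np.argsort(-np.take_along_axis(arr, part, axis=1), axis=1)
    top = np.take_along_axis(part, order, axis=1)
    return pd.DataFrame(cols[top], index=sim.index)

# Hypothetical 4 x 4 sample built from the first rows of the question's table.
names = ["name_A.jpg", "name_B.jpg", "name_C.jpg", "name_D.jpg"]
sim = pd.DataFrame(
    [[1.000000, 0.725098, 0.291748, 0.444336],
     [0.725098, 1.000000, 0.255371, 0.482178],
     [0.291748, 0.255371, 1.000000, 0.382812],
     [0.444336, 0.482178, 0.382812, 1.000000]],
    index=names, columns=names,
)
print(top_n_similar(sim, n=3))
```

Setting the diagonal to negative infinity also avoids dropping a genuine neighbour when another image happens to score exactly 1.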