Home > Back-end >  Ranking columns with respect to a midpoint in a data.table
Ranking columns with respect to a midpoint in a data.table

Time:05-08

In the data.table below there are 4 columns. I want to set the ranking criteria of these columns in the following way -

Column a - The higher the value from midpoint value 1, the lower is the rank and vice versa
Column b - The higher the value from midpoint value 1, the higher is the rank and vice versa
Column x - The higher the value from midpoint value 50, the higher is the rank and vice versa
Column y - The higher the value from midpoint value 0, the lower is the rank and vice versa

Here is the solution, I have tried.

library(data.table)
set.seed(123)
dt = data.table(index=1:20, 
                a = c(NA, rnorm(19, mean = 1, sd = 0.5)), 
                b = c(rnorm(10, mean = 1, sd = 0.5), NA, NA, rnorm(8, mean = 1, sd = 0.5)) ,
                x = rnorm(20, mean = 50, sd = 20), 
                y = c(rnorm(19, mean = 0, sd = 10), NA))
dt

dt[, a_rank := round(scales::rescale_mid(a, mid = 1, to = c(2, 0)), 2)]
dt[, b_rank := round(scales::rescale_mid(b, mid = 1, to = c(0, 2)), 2)]
dt[, x_rank := round(scales::rescale_mid(x, mid = 50, to = c(0, 100)), 2)]
dt[, y_rank := round(scales::rescale_mid(y, mid = 0, to = c(20, -20)), 2)]
dt[, rank :=  sum(a_rank, b_rank, x_rank, y_rank, na.rm = T), by = index]
    index         a           b           x           y a_rank b_rank x_rank y_rank   rank
 1:     1        NA  1.75960886  10.1450302 -16.5104890     NA   1.78  11.25  18.44  31.47
 2:     2 0.8239768  1.18869399  41.4544143  -4.6398724   1.21   1.21  41.69   5.18  49.29
 3:     3 1.3517620 -0.02611141  52.3327457   8.2537986   0.58   0.00  52.27  -9.22  43.63
 4:     4 0.9471643  0.31798127  32.1358486   5.1013255   1.06   0.34  32.63  -5.70  28.33
 5:     5 0.3706757  0.89960949  56.6780588  -5.8948104   1.75   0.93  56.49   6.59  65.76
 6:     6 1.8422179  1.43288970  58.2285984  -9.9678074   0.00   1.46  58.00  11.14  70.60
 7:     7 1.4556956  0.94905837  49.3392768   1.4447570   0.46   0.97  49.36  -1.61  49.18
 8:     8 1.1187151  1.31209374   0.6820361  -0.1430741   0.86   1.34   2.05   0.16   4.41
 9:     9 1.6090543  1.47950269 101.4291629 -17.9028124   0.28   1.50 100.00  20.00 121.78
10:    10 0.3306129  1.83552741  45.8940149   0.3455107   1.79   1.86  46.01  -0.39  49.27
11:    11 1.3304101          NA  63.0238656   1.9023032   0.61     NA  62.66  -2.13  61.14
12:    12 0.7385438          NA  55.4753298   1.7472640   1.31     NA  55.32  -1.95  54.68
13:    13 1.3418728  1.02800837  70.4934647 -10.5501704   0.59   1.05  69.92  11.79  83.35
14:    14 0.9695890  0.97400905  66.3531889   4.7613328   1.04   1.00  65.90  -5.32  62.62
15:    15 1.3164804  0.12338132  45.8041366  13.7857014   0.62   0.15  45.92 -15.40  31.29
16:    16 1.6677588  1.04966380  57.5633554   4.5623640   0.21   1.08  57.35  -5.10  53.54
17:    17 1.0036450  0.71407497  31.0918234 -11.3558847   1.00   0.74  31.62  12.69  46.05
18:    18 1.5087793  0.51299521  67.1384602  -4.3564547   0.40   0.54  66.66   4.87  72.47
19:    19 0.4057830  0.91004688  40.7792332   3.4610362   1.71   0.94  41.04  -3.87  39.82
20:    20 0.6391978  1.50747159  98.3354671          NA   1.43   1.53  96.99     NA  99.95

Please suggest a solution to the following issues -

  1. Is there a way to impute the missing values?
  2. Is there a way to assign weightage to rank columns so that the final rank can be more balanced?

CodePudding user response:

You could do something like this; although without context there is no way to know if the imputation approach (below is just filling missing with mean) is advisable or not:

# Create a function that returns the directional rank of x, given a midpoint m
f <- function(x,m,o) {
  x[is.na(x)] <- mean(x,na.rm=T)
  frankv(abs(x-m),order=o,na.last="keep")
} 

# apply the function to a,b,x,y, and set overall rank as the mean ranks
dt[, `:=`(
  a_rank = f(a,1,1),
  b_rank = f(b,1,-1),
  x_rank = f(x,50,1),
  y_rank = f(y,50,-1)
)][,rank:=mean(c(a_rank, b_rank,x_rank,y_rank)), by=index][order(rank)]

Output:

    index         a         b        x          y a_rank b_rank x_rank y_rank   rank
    <int>     <num>     <num>    <num>      <num>  <num>  <num>  <num>  <num>  <num>
 1:     2 0.8576135 2.0936665 60.38814 -20.532472      7    1.0     12      1  5.250
 2:    10 0.8897567 0.5241907 40.18885 -15.721442      5    6.0     11      3  6.250
 3:     1        NA 0.6998702 48.88876  -4.224968      4   11.0      2     10  6.750
 4:    14 0.8370342 0.6075478 54.70773 -14.617556      8    8.0      7      5  7.000
 5:     4 1.0906517 0.8821498 52.11352 -14.606401      3   17.0      5      6  7.750
 6:    17 1.2741985 1.4594983 48.57384 -12.870305     13    7.0      3      8  7.750
 7:     5 0.9305543 0.4867895 37.18588   7.399475      2    5.0     13     15  8.750
 8:    20 1.6803262 0.1910586 50.82466         NA     20    3.0      1     12  9.000
 9:    15 1.5744038 0.1660290 51.55922   6.879168     18    2.0      4     13  9.250
10:    11 1.1658910        NA 44.87816 -15.146677      9   18.5      8      4  9.875
11:     6 1.0028821 0.6447967 33.00591  19.091036      1    9.0     15     19 11.000
12:     8 0.8146700 0.8766541 52.35293   7.017843     10   16.0      6     14 11.500
13:     3 0.3896411 1.7663053 56.02307  11.313372     19    4.0      9     18 12.500
14:    19 0.6860470 1.3039822 59.03008   7.690422     14   10.0     10     16 12.500
15:     7 1.1926402 1.1284419 29.51742 -14.438932     11   15.0     18      7 12.750
16:    18 1.1193659 0.7123265 78.89102   7.877388      6   12.0     19     17 13.500
17:    13 1.2175907 0.9774861 36.96100  -5.309065     12   20.0     14      9 13.750
18:     9 1.3221883 0.8262287 31.05051  -2.621975     15   14.0     16     11 14.000
19:    12 1.5484195        NA 86.87724 -16.015362     17   18.5     20      2 14.375
20:    16 1.4967519 0.8098867 30.76287  21.001089     16   13.0     17     20 16.500
    index         a         b        x          y a_rank b_rank x_rank y_rank   rank

An alternative imputation approach might be knn imputation. You could try something like (you might want to consider the defaults and make changes to fit your use case):

library(caret)
dt_imputed = predict(preProcess(dt[, .(a,b,x,y)],method=c("knnImpute")),dt)

and then run a ranking algorithm on dt_imputed

CodePudding user response:

I don't really see what you're doing. but I would try something like this ang go from there...

library(data.table)
library(tidyverse)
set.seed(123)
dt = data.table(index=1:20, 
                a = c(NA, rnorm(19, mean = 1, sd = 0.5)), 
                b = c(rnorm(10, mean = 1, sd = 0.5), NA, NA, rnorm(8, mean = 1, sd = 0.5)) ,
                x = rnorm(20, mean = 50, sd = 20), 
                y = c(rnorm(19, mean = 0, sd = 10), NA))
dt %>% mutate(
  a_rank = rank(-(a-1)^2),
  b_rank = rank((b-1)^2),
  x_rank = rank((x-50)^2),
  y_rank = rank(-(y-0)^2),
  rank = (2*a_rank 1*b_rank x_rank y_rank)
  )

  • Related