In the data.table
below there are 4 columns. I want to set the ranking criteria of these columns in the following way -
Column a - The higher the value from midpoint value 1, the lower is the rank and vice versa
Column b - The higher the value from midpoint value 1, the higher is the rank and vice versa
Column x - The higher the value from midpoint value 50, the higher is the rank and vice versa
Column y - The higher the value from midpoint value 0, the lower is the rank and vice versa
Here is the solution, I have tried.
library(data.table)
set.seed(123)
dt = data.table(index=1:20,
a = c(NA, rnorm(19, mean = 1, sd = 0.5)),
b = c(rnorm(10, mean = 1, sd = 0.5), NA, NA, rnorm(8, mean = 1, sd = 0.5)) ,
x = rnorm(20, mean = 50, sd = 20),
y = c(rnorm(19, mean = 0, sd = 10), NA))
dt
dt[, a_rank := round(scales::rescale_mid(a, mid = 1, to = c(2, 0)), 2)]
dt[, b_rank := round(scales::rescale_mid(b, mid = 1, to = c(0, 2)), 2)]
dt[, x_rank := round(scales::rescale_mid(x, mid = 50, to = c(0, 100)), 2)]
dt[, y_rank := round(scales::rescale_mid(y, mid = 0, to = c(20, -20)), 2)]
dt[, rank := sum(a_rank, b_rank, x_rank, y_rank, na.rm = T), by = index]
index a b x y a_rank b_rank x_rank y_rank rank
1: 1 NA 1.75960886 10.1450302 -16.5104890 NA 1.78 11.25 18.44 31.47
2: 2 0.8239768 1.18869399 41.4544143 -4.6398724 1.21 1.21 41.69 5.18 49.29
3: 3 1.3517620 -0.02611141 52.3327457 8.2537986 0.58 0.00 52.27 -9.22 43.63
4: 4 0.9471643 0.31798127 32.1358486 5.1013255 1.06 0.34 32.63 -5.70 28.33
5: 5 0.3706757 0.89960949 56.6780588 -5.8948104 1.75 0.93 56.49 6.59 65.76
6: 6 1.8422179 1.43288970 58.2285984 -9.9678074 0.00 1.46 58.00 11.14 70.60
7: 7 1.4556956 0.94905837 49.3392768 1.4447570 0.46 0.97 49.36 -1.61 49.18
8: 8 1.1187151 1.31209374 0.6820361 -0.1430741 0.86 1.34 2.05 0.16 4.41
9: 9 1.6090543 1.47950269 101.4291629 -17.9028124 0.28 1.50 100.00 20.00 121.78
10: 10 0.3306129 1.83552741 45.8940149 0.3455107 1.79 1.86 46.01 -0.39 49.27
11: 11 1.3304101 NA 63.0238656 1.9023032 0.61 NA 62.66 -2.13 61.14
12: 12 0.7385438 NA 55.4753298 1.7472640 1.31 NA 55.32 -1.95 54.68
13: 13 1.3418728 1.02800837 70.4934647 -10.5501704 0.59 1.05 69.92 11.79 83.35
14: 14 0.9695890 0.97400905 66.3531889 4.7613328 1.04 1.00 65.90 -5.32 62.62
15: 15 1.3164804 0.12338132 45.8041366 13.7857014 0.62 0.15 45.92 -15.40 31.29
16: 16 1.6677588 1.04966380 57.5633554 4.5623640 0.21 1.08 57.35 -5.10 53.54
17: 17 1.0036450 0.71407497 31.0918234 -11.3558847 1.00 0.74 31.62 12.69 46.05
18: 18 1.5087793 0.51299521 67.1384602 -4.3564547 0.40 0.54 66.66 4.87 72.47
19: 19 0.4057830 0.91004688 40.7792332 3.4610362 1.71 0.94 41.04 -3.87 39.82
20: 20 0.6391978 1.50747159 98.3354671 NA 1.43 1.53 96.99 NA 99.95
Please suggest a solution to the following issues -
- Is there a way to impute the missing values?
- Is there a way to assign weightage to rank columns so that the final rank can be more balanced?
CodePudding user response:
You could do something like this; although without context there is no way to know if the imputation approach (below is just filling missing with mean) is advisable or not:
# Create a function that returns the directional rank of x, given a midpoint m
f <- function(x,m,o) {
x[is.na(x)] <- mean(x,na.rm=T)
frankv(abs(x-m),order=o,na.last="keep")
}
# apply the function to a,b,x,y, and set overall rank as the mean ranks
dt[, `:=`(
a_rank = f(a,1,1),
b_rank = f(b,1,-1),
x_rank = f(x,50,1),
y_rank = f(y,50,-1)
)][,rank:=mean(c(a_rank, b_rank,x_rank,y_rank)), by=index][order(rank)]
Output:
index a b x y a_rank b_rank x_rank y_rank rank
<int> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: 2 0.8576135 2.0936665 60.38814 -20.532472 7 1.0 12 1 5.250
2: 10 0.8897567 0.5241907 40.18885 -15.721442 5 6.0 11 3 6.250
3: 1 NA 0.6998702 48.88876 -4.224968 4 11.0 2 10 6.750
4: 14 0.8370342 0.6075478 54.70773 -14.617556 8 8.0 7 5 7.000
5: 4 1.0906517 0.8821498 52.11352 -14.606401 3 17.0 5 6 7.750
6: 17 1.2741985 1.4594983 48.57384 -12.870305 13 7.0 3 8 7.750
7: 5 0.9305543 0.4867895 37.18588 7.399475 2 5.0 13 15 8.750
8: 20 1.6803262 0.1910586 50.82466 NA 20 3.0 1 12 9.000
9: 15 1.5744038 0.1660290 51.55922 6.879168 18 2.0 4 13 9.250
10: 11 1.1658910 NA 44.87816 -15.146677 9 18.5 8 4 9.875
11: 6 1.0028821 0.6447967 33.00591 19.091036 1 9.0 15 19 11.000
12: 8 0.8146700 0.8766541 52.35293 7.017843 10 16.0 6 14 11.500
13: 3 0.3896411 1.7663053 56.02307 11.313372 19 4.0 9 18 12.500
14: 19 0.6860470 1.3039822 59.03008 7.690422 14 10.0 10 16 12.500
15: 7 1.1926402 1.1284419 29.51742 -14.438932 11 15.0 18 7 12.750
16: 18 1.1193659 0.7123265 78.89102 7.877388 6 12.0 19 17 13.500
17: 13 1.2175907 0.9774861 36.96100 -5.309065 12 20.0 14 9 13.750
18: 9 1.3221883 0.8262287 31.05051 -2.621975 15 14.0 16 11 14.000
19: 12 1.5484195 NA 86.87724 -16.015362 17 18.5 20 2 14.375
20: 16 1.4967519 0.8098867 30.76287 21.001089 16 13.0 17 20 16.500
index a b x y a_rank b_rank x_rank y_rank rank
An alternative imputation approach might be knn imputation. You could try something like (you might want to consider the defaults and make changes to fit your use case):
library(caret)
dt_imputed = predict(preProcess(dt[, .(a,b,x,y)],method=c("knnImpute")),dt)
and then run a ranking algorithm on dt_imputed
CodePudding user response:
I don't really see what you're doing. but I would try something like this ang go from there...
library(data.table)
library(tidyverse)
set.seed(123)
dt = data.table(index=1:20,
a = c(NA, rnorm(19, mean = 1, sd = 0.5)),
b = c(rnorm(10, mean = 1, sd = 0.5), NA, NA, rnorm(8, mean = 1, sd = 0.5)) ,
x = rnorm(20, mean = 50, sd = 20),
y = c(rnorm(19, mean = 0, sd = 10), NA))
dt %>% mutate(
a_rank = rank(-(a-1)^2),
b_rank = rank((b-1)^2),
x_rank = rank((x-50)^2),
y_rank = rank(-(y-0)^2),
rank = (2*a_rank 1*b_rank x_rank y_rank)
)