Using simulations, I want to test and demonstrate the effects of "censored" data, where some cases are unavailable to us or have values outside the measurement range of our instruments.
Here, I want to label cases as "observed" or "unobserved" based on the rank of a numeric variable.
My attempts so far confuse tables with element values, and I don't know what to try next. I'm sure it will be head-smackingly simple when I see some suggestions.
## generate some data
n_rows <- 20
x <- rnorm(n_rows)
status <- rep("unobserved", n_rows)
data <- data.frame(x, status)
library(dplyr)
## how many observed cases?
n_observed <- 5
## Failure #1
data$status[data$x == dplyr::top_n(data$x, n_observed)] <- "observed"
#> Error in UseMethod("tbl_vars"): no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"
## Failure #2
data$status[which((data$x == dplyr::top_n(data, x, n_observed)))] <- "observed"
#> Warning in if (n > 0) {: the condition has length > 1 and only the first element will be used
## Failure #3
data$status[top_n(data, x, n_observed) %in% data] <- "observed"
#> Warning in if (n > 0) {: the condition has length > 1 and only the first element will be used
CodePudding user response:
If you want ranks, then use rank! Here are two examples, separately assigning observed status to the five lowest-ranked (status1) and five highest-ranked (status2) values of x.
data <- data.frame(x = sample(20), status1 = "unobserved", status2 = "unobserved")
data$status1[rank(data$x) <= 5] <- "observed"
data$status2[rank(-data$x) <= 5] <- "observed"
data
    x    status1    status2
1   2   observed unobserved
2  11 unobserved unobserved
3   3   observed unobserved
4   4   observed unobserved
5  14 unobserved unobserved
6  15 unobserved unobserved
7   1   observed unobserved
8   8 unobserved unobserved
9   7 unobserved unobserved
10 20 unobserved   observed
11 13 unobserved unobserved
12 16 unobserved   observed
13  9 unobserved unobserved
14 10 unobserved unobserved
15 17 unobserved   observed
16  5   observed unobserved
17 18 unobserved   observed
18 19 unobserved   observed
19 12 unobserved unobserved
20  6 unobserved unobserved
You'll have to be slightly more careful if you expect x to contain duplicates. rank has an optional argument ties.method that you can use to specify behaviour in that case.
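To make the effect of ties concrete, here is a small sketch with a made-up vector (the values and the variable names x_tied, by_average, by_first are purely illustrative). With the default ties.method = "average", tied values share a rank, so a cutoff can label more cases than intended; "first" breaks ties by position and gives exactly the requested number.

```r
## Illustrative vector with a three-way tie at the largest value
x_tied <- c(3, 1, 3, 2, 3)
n_observed <- 2

## Default ("average"): the three 3s share rank 2, so all three pass the cutoff
by_average <- rank(-x_tied) <= n_observed
sum(by_average)

## "first": ties are broken by order of appearance, so exactly n_observed pass
by_first <- rank(-x_tied, ties.method = "first") <= n_observed
sum(by_first)
```

Which behaviour is appropriate depends on the simulation: "first" fixes the number of observed cases, while "min" or "average" treat tied cases identically.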
As you have probably deduced from the warnings, dplyr::top_n is intended for "data frame in, data frame out". It should not be used for indexing.
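If you would rather stay within dplyr, one possible sketch is below. It assumes a reasonably recent dplyr (1.0.0 or later, where slice_max() is available and top_n() is marked as superseded); the seed and the name top_rows are arbitrary choices for illustration. The labelling is done with a ranking helper inside mutate(), while slice_max() is used the way top_n() was meant to be used: a data frame goes in and a data frame comes out.

```r
library(dplyr)

set.seed(1)        # arbitrary seed so the sketch is reproducible
n_rows <- 20
n_observed <- 5
data <- data.frame(x = rnorm(n_rows))

## Label the n_observed largest values of x; min_rank(desc(x)) gives rank 1
## to the largest value. Draws from rnorm() are continuous, so ties are not
## expected here.
data <- data %>%
  mutate(status = if_else(min_rank(desc(x)) <= n_observed,
                          "observed", "unobserved"))
table(data$status)

## "Data frame in, data frame out": keep only the rows with the largest x
top_rows <- slice_max(data, x, n = n_observed)
top_rows
```

The base-R rank() approach above needs no packages; this version is only a convenience if the rest of the simulation already uses dplyr pipelines.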