Change variable character value based on numeric rank of another variable

Using simulations, I want to test and demonstrate the effects of "censored" data, where some cases are unavailable to us, or where cases have values outside the measurement range of our instruments.

Here, I want to label cases as "observed" or "unobserved" based on the rank score of a numeric variable.

My attempts so far confuse tables with element values, but I don't know what to try next. I'm sure it will be head-smackingly simple when I see some suggestions.

## generate some data
n_rows <- 20

x <- rnorm(n_rows)
status <- rep("unobserved", n_rows)
data <- data.frame(x, status)

library(dplyr)

## how many observed cases?
n_observed <- 5


## Failure #1
data$status[data$x == dplyr::top_n(data$x, n_observed)] <- "observed"

#> Error in UseMethod("tbl_vars"): no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"


## Failure #2
data$status[which((data$x == dplyr::top_n(data, x, n_observed)))] <- "observed"

#> Warning in if (n > 0) {: the condition has length > 1 and only the first element will be used


## Failure #3
data$status[top_n(data, x, n_observed) %in% data] <- "observed"

#> Warning in if (n > 0) {: the condition has length > 1 and only the first element will be used

User response:

If you want ranks, then use rank! Here are two examples: status1 marks the five smallest values of x as observed, and status2 marks the five largest.

data <- data.frame(x = sample(20), status1 = "unobserved", status2 = "unobserved")
data$status1[rank(data$x)  <= 5] <- "observed"
data$status2[rank(-data$x) <= 5] <- "observed"
data

    x    status1    status2
1   2   observed unobserved
2  11 unobserved unobserved
3   3   observed unobserved
4   4   observed unobserved
5  14 unobserved unobserved
6  15 unobserved unobserved
7   1   observed unobserved
8   8 unobserved unobserved
9   7 unobserved unobserved
10 20 unobserved   observed
11 13 unobserved unobserved
12 16 unobserved   observed
13  9 unobserved unobserved
14 10 unobserved unobserved
15 17 unobserved   observed
16  5   observed unobserved
17 18 unobserved   observed
18 19 unobserved   observed
19 12 unobserved unobserved
20  6 unobserved unobserved

You'll have to be slightly more careful if you expect x to contain duplicates. rank has an optional argument ties.method that you can use to specify behaviour in that case.
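To illustrate the point about duplicates, here is a minimal sketch on a made-up vector (the values are hypothetical, chosen so that a tie straddles the cutoff). With the default ties.method = "average", tied values share a fractional rank, so the number of cases labelled can differ from n_observed; ties.method = "first" breaks ties by position and always labels exactly n_observed cases.

```r
# Toy vector with a tie at the cutoff (values are made up for illustration)
x <- c(1, 2, 3, 4, 5, 5, 7, 8)
n_observed <- 5

# Default ties.method = "average": both 5s get rank 5.5,
# so only four values satisfy rank <= 5
n_default <- sum(rank(x) <= n_observed)
n_default

# ties.method = "first" breaks ties by position in x,
# so exactly n_observed values are labelled
status <- rep("unobserved", length(x))
status[rank(x, ties.method = "first") <= n_observed] <- "observed"
table(status)
```

Whether breaking ties by position is acceptable depends on the simulation; ties.method = "random" is another option if the choice among tied cases should not depend on row order.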

As you have probably deduced from the error and warnings, dplyr::top_n is intended for "data frame in, data frame out": it takes a data frame and returns a data frame, so it should not be used for indexing.
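If you would rather stay within dplyr, one possible sketch is to compute the label inside mutate, which keeps to the "data frame in, data frame out" pattern. This is only an illustration, not the approach from the answer above: the seed is arbitrary, and it assumes "observed" should mean the n_observed largest values of x, as in the question's top_n attempts.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n_rows <- 20
n_observed <- 5
data <- data.frame(x = rnorm(n_rows))

# min_rank(desc(x)) assigns rank 1 to the largest x;
# rows ranked within the top n_observed are labelled "observed"
data <- data %>%
  mutate(status = if_else(min_rank(desc(x)) <= n_observed,
                          "observed", "unobserved"))

table(data$status)
```

Note that min_rank gives tied values the same rank, so with duplicates in x this can label more than n_observed rows. In recent dplyr versions top_n is superseded by slice_max and slice_min, which likewise return a data frame rather than an index.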