How can I randomly modify a fraction of the values within a column to a new value based on a condition

I would like to know if there is a way in R to negate a random fraction of the values in one column, based on the values in another column. In the example data frame below, I'd like to randomly select 10% of the exposure values and make them negative while keeping their magnitude, but only among the rows that have "Toy" as the object.

df <- data.frame(ChildID=c("M1", "F1", "F1", "F2", "M2", "M3", "M3", "M3", "M3", "F3", "F1", "F2", "M2", "M3"),
                object=c("Mouth", "Toy", "Mouth", "Toy", "Toy", "Toy", "Mouth", "Toy", "Toy", "Mouth", "Toy", "Toy", "Toy", "Toy"),
                exposure=c(0.1, 0.2, 0.1, 0.05, 0.6, 0.1, 0.4, 0.1, 1.0, 0.5, 0.1, 0.4, 0.1, 1.0))

Here's what I would like the result to look like, for example.

ChildID	object	exposure
M1	Mouth	0.1
F1	Toy	0.2
F1	Mouth	0.1
F2	Toy	0.05
M2	Toy	-0.6
M3	Toy	0.1
M3	Mouth	0.4
M3	Toy	0.1
M3	Toy	1.0
F3	Mouth	0.5
F1	Toy	0.1
F2	Toy	0.4
M2	Toy	0.1
M3	Toy	1.0

I tried using dplyr, but I can't use filter because that removes the other rows, which I want to keep unchanged. I realize this is a basic question, but I'm pulling my hair out trying to find the right workaround. Thanks so much!

CodePudding user response：

You can get the indices of the rows with "Toy" using == and which(), sample from them, and use the sampled indices to flip the sign of exposure.

i <- which(df$object == "Toy")          # indices of the Toy rows
i <- sample(i, round(length(i) / 10))   # sample size: 10% of the Toy rows
#i <- sample(i, round(nrow(df) / 10))   # alternative: 10% of all rows
df$exposure[i] <- -df$exposure[i]       # flip the sign of the sampled rows

i
#[1] 12

df
#   ChildID object exposure
#1       M1  Mouth     0.10
#2       F1    Toy     0.20
#3       F1  Mouth     0.10
#4       F2    Toy     0.05
#5       M2    Toy     0.60
#6       M3    Toy     0.10
#7       M3  Mouth     0.40
#8       M3    Toy     0.10
#9       M3    Toy     1.00
#10      F3  Mouth     0.50
#11      F1    Toy     0.10
#12      F2    Toy    -0.40
#13      M2    Toy     0.10
#14      M3    Toy     1.00
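
The indexing approach above can also be wrapped in a small function. This is only a sketch: negate_fraction and its arguments value and frac are hypothetical names, not part of the original answer, and the seed is arbitrary. It samples positions with sample.int() because sample() treats a single number x as the range 1:x, which could matter if only one index were ever passed in.

```r
# Hypothetical helper (illustration only): negate a random fraction of
# `exposure` among the rows whose `object` equals `value`.
negate_fraction <- function(data, value = "Toy", frac = 0.1) {
  idx <- which(data$object == value)
  n <- round(length(idx) * frac)
  # Sample positions within idx, not idx itself, so that a single
  # candidate index is never expanded to 1:idx by sample().
  pick <- idx[sample.int(length(idx), n)]
  data$exposure[pick] <- -data$exposure[pick]
  data
}

df <- data.frame(
  ChildID  = c("M1", "F1", "F1", "F2", "M2", "M3", "M3", "M3", "M3", "F3", "F1", "F2", "M2", "M3"),
  object   = c("Mouth", "Toy", "Mouth", "Toy", "Toy", "Toy", "Mouth", "Toy", "Toy", "Mouth", "Toy", "Toy", "Toy", "Toy"),
  exposure = c(0.1, 0.2, 0.1, 0.05, 0.6, 0.1, 0.4, 0.1, 1.0, 0.5, 0.1, 0.4, 0.1, 1.0)
)

set.seed(42)  # arbitrary seed, only to make the draw repeatable
out <- negate_fraction(df)

sum(out$exposure < 0)                        # 1: one of the ten Toy rows
all(out$object[out$exposure < 0] == "Toy")   # TRUE: Mouth rows untouched
```

Because the function returns a modified copy, the original df stays unchanged and the call can be repeated with a different value or frac.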

Benchmark

library(tidyverse)
bench::mark(check=FALSE,
GKi = {i <- which(df$object == "Toy")
  i <- sample(i, round(length(i) / 10)) #In case 10% of Toy
  df$exposure[i] <- -df$exposure[i]
  df},
tmfmnk = {df %>%
            mutate(rowid = 1:n(),
                   exposure_new = if_else(rowid %in% sample(rowid[object == "Toy"], floor((n()*10)/100)), -exposure, exposure)) %>%
            select(-rowid)},
AndS. = {df |>
  mutate(id = row_number()) |>
  filter(object == "Toy") |>
  slice_sample(prop = 0.1) |>
  mutate(exposure = -exposure) |>
  (\(d) bind_rows(d, filter(df, !row_number() %in% d$id)))()|>
  select(-id)})
#  expression      min  median itr/s…¹ mem_a…² gc/se…³ n_itr  n_gc total…⁴ result
#  <bch:expr> <bch:tm> <bch:t>   <dbl> <bch:b>   <dbl> <int> <dbl> <bch:t> <list>
#1 GKi         11.23µs 12.99µs  74916.  2.49KB    60.0  9992     8   133ms <NULL>
#2 tmfmnk       1.49ms  1.55ms    639. 12.52KB    40.6   268    17   419ms <NULL>
#3 AndS.        4.93ms  5.02ms    198. 21.88KB    50.9    70    18   353ms <NULL>

By median time, GKi is roughly 120 times faster than tmfmnk and roughly 390 times faster than AndS. It also allocates less memory and uses 125 characters, compared to 166 (tmfmnk) and 206 (AndS.).

CodePudding user response：

You could sample the rows you want with filter and slice_sample, negate them, and then bind them to the original data while removing the original versions of those rows. Note that the modified row ends up at the top of the result.

library(tidyverse)


df |>
  mutate(id = row_number()) |>
  filter(object == "Toy") |>
  slice_sample(prop = 0.1) |>
  mutate(exposure = -exposure) |>
  (\(d) bind_rows(d, filter(df, !row_number() %in% d$id)))()|>
  select(-id)
#>    ChildID object exposure
#> 1       M3    Toy    -1.00
#> 2       M1  Mouth     0.10
#> 3       F1    Toy     0.20
#> 4       F1  Mouth     0.10
#> 5       F2    Toy     0.05
#> 6       M2    Toy     0.60
#> 7       M3    Toy     0.10
#> 8       M3  Mouth     0.40
#> 9       M3    Toy     0.10
#> 10      M3    Toy     1.00
#> 11      F3  Mouth     0.50
#> 12      F1    Toy     0.10
#> 13      F2    Toy     0.40
#> 14      M2    Toy     0.10
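
If the original row order matters, one possible variation is to keep the id column on both pieces and sort by it before dropping it. This is a sketch under that assumption, not part of the original answer; df_id, flipped and res are illustrative names, and the seed is arbitrary.

```r
library(dplyr)

df <- data.frame(
  ChildID  = c("M1", "F1", "F1", "F2", "M2", "M3", "M3", "M3", "M3", "F3", "F1", "F2", "M2", "M3"),
  object   = c("Mouth", "Toy", "Mouth", "Toy", "Toy", "Toy", "Mouth", "Toy", "Toy", "Mouth", "Toy", "Toy", "Toy", "Toy"),
  exposure = c(0.1, 0.2, 0.1, 0.05, 0.6, 0.1, 0.4, 0.1, 1.0, 0.5, 0.1, 0.4, 0.1, 1.0)
)

set.seed(1)  # arbitrary seed, only to make the draw repeatable

# Tag every row with its position so the order can be restored later
df_id <- df |> mutate(id = row_number())

# Sample 10% of the Toy rows and negate their exposure
flipped <- df_id |>
  filter(object == "Toy") |>
  slice_sample(prop = 0.1) |>
  mutate(exposure = -exposure)

# Swap the sampled rows back in, then restore the original order
res <- df_id |>
  filter(!id %in% flipped$id) |>
  bind_rows(flipped) |>
  arrange(id) |>
  select(-id)

identical(res$ChildID, df$ChildID)  # TRUE: rows are back in their original order
```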

CodePudding user response：

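One option, which takes 10% of all rows as the sample size and stores the result in a new column exposure_new, might be: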

df %>%
 mutate(rowid = 1:n(),
        exposure_new = if_else(rowid %in% sample(rowid[object == "Toy"], floor((n()*10)/100)), -exposure, exposure)) %>%
 select(-rowid)

   ChildID object exposure exposure_new
1       M1  Mouth     0.10         0.10
2       F1    Toy     0.20         0.20
3       F1  Mouth     0.10         0.10
4       F2    Toy     0.05         0.05
5       M2    Toy     0.60         0.60
6       M3    Toy     0.10         0.10
7       M3  Mouth     0.40         0.40
8       M3    Toy     0.10         0.10
9       M3    Toy     1.00         1.00
10      F3  Mouth     0.50         0.50
11      F1    Toy     0.10         0.10
12      F2    Toy     0.40        -0.40
13      M2    Toy     0.10         0.10
14      M3    Toy     1.00         1.00

If the proportion should instead be computed from the "Toy" rows only:

df %>%
 mutate(rowid = 1:n(),
        exposure_new = if_else(rowid %in% sample(rowid[object == "Toy"], floor((sum(object == "Toy")*10)/100)), -exposure, exposure)) %>%
 select(-rowid)
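
For this particular data frame the two sample sizes happen to coincide, which the short sketch below makes explicit. It also shows one way to overwrite exposure directly instead of adding exposure_new, which is closer to the output the question asks for. The names n_all, n_toy, toy_rows, flip and res are illustrative, and the seed is arbitrary.

```r
library(dplyr)

df <- data.frame(
  ChildID  = c("M1", "F1", "F1", "F2", "M2", "M3", "M3", "M3", "M3", "F3", "F1", "F2", "M2", "M3"),
  object   = c("Mouth", "Toy", "Mouth", "Toy", "Toy", "Toy", "Mouth", "Toy", "Toy", "Mouth", "Toy", "Toy", "Toy", "Toy"),
  exposure = c(0.1, 0.2, 0.1, 0.05, 0.6, 0.1, 0.4, 0.1, 1.0, 0.5, 0.1, 0.4, 0.1, 1.0)
)

# Sample size under each definition of "10%"
n_all <- floor(nrow(df) * 10 / 100)                 # 14 rows in total -> 1
n_toy <- floor(sum(df$object == "Toy") * 10 / 100)  # 10 Toy rows      -> 1

set.seed(2024)  # arbitrary seed, only to make the draw repeatable

# Draw the row numbers to flip outside of mutate(), sampling positions
# so that a single candidate row is not expanded to 1:x by sample()
toy_rows <- which(df$object == "Toy")
flip <- toy_rows[sample.int(length(toy_rows), n_toy)]

# Overwrite exposure in place; all other rows and columns are kept as they are
res <- df %>%
  mutate(exposure = if_else(row_number() %in% flip, -exposure, exposure))

sum(res$exposure < 0)  # 1
```

On larger data the two definitions will generally give different sample sizes, so it is worth deciding which one is intended before choosing between the two variants.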