How can I randomly modify a fraction of the values within a column to a new value based on a conditi

Time:10-26

I would like to know if there is a way in R to negate a random fraction of the values in one column, based on the values in another column. In the example data frame below, I'd like to randomly select 10% of the exposure values and flip their sign (same magnitude, but negative), but only among the rows that have "Toy" listed as the object.

df <- data.frame(ChildID=c("M1", "F1", "F1", "F2", "M2", "M3", "M3", "M3", "M3", "F3", "F1", "F2", "M2", "M3"),
                object=c("Mouth", "Toy", "Mouth", "Toy", "Toy", "Toy", "Mouth", "Toy", "Toy", "Mouth", "Toy", "Toy", "Toy", "Toy"),
                exposure=c(0.1, 0.2, 0.1, 0.05, 0.6, 0.1, 0.4, 0.1, 1.0, 0.5, 0.1, 0.4, 0.1, 1.0))

Here's what I would like the result to look like, for example.

ChildID  object  exposure
M1       Mouth    0.1
F1       Toy      0.2
F1       Mouth    0.1
F2       Toy      0.05
M2       Toy     -0.6
M3       Toy      0.1
M3       Mouth    0.4
M3       Toy      0.1
M3       Toy      1.0
F3       Mouth    0.5
F1       Toy      0.1
F2       Toy      0.4
M2       Toy      0.1
M3       Toy      1.0

I tried using dplyr, but I can't use filter because that removes the other rows, which I want to keep unchanged. I realize this is a basic question, but I'm pulling my hair out trying to find the right workaround. Thanks so much!

CodePudding user response:

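You can get the indices of the rows with Toy using == and which, sample from those indices, and use the sampled indices to flip the sign of exposure.

# indices of all rows whose object is "Toy"
i <- which(df$object == "Toy")
# keep a random 10% of the Toy rows
i <- sample(i, round(length(i) / 10))
# alternative: sample as many Toy rows as 10% of all rows
#i <- sample(i, round(nrow(df) / 10))
# negate exposure for the sampled rows only
df$exposure[i] <- -df$exposure[i]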

i
#[1] 12

df
#   ChildID object exposure
#1       M1  Mouth     0.10
#2       F1    Toy     0.20
#3       F1  Mouth     0.10
#4       F2    Toy     0.05
#5       M2    Toy     0.60
#6       M3    Toy     0.10
#7       M3  Mouth     0.40
#8       M3    Toy     0.10
#9       M3    Toy     1.00
#10      F3  Mouth     0.50
#11      F1    Toy     0.10
#12      F2    Toy    -0.40
#13      M2    Toy     0.10
#14      M3    Toy     1.00
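
The same idea can be wrapped in a small function and made repeatable. The following is only a sketch: negate_fraction is a hypothetical helper name (not from any package), the seed is arbitrary, and the column names object and exposure are taken from the question. It samples positions with sample.int rather than passing the indices to sample directly, because sample(x, n) treats a single number x as 1:x, which could pick the wrong rows if only one row matched.

```r
# Example data from the question.
df <- data.frame(
  ChildID  = c("M1", "F1", "F1", "F2", "M2", "M3", "M3", "M3", "M3", "F3", "F1", "F2", "M2", "M3"),
  object   = c("Mouth", "Toy", "Mouth", "Toy", "Toy", "Toy", "Mouth", "Toy", "Toy", "Mouth", "Toy", "Toy", "Toy", "Toy"),
  exposure = c(0.1, 0.2, 0.1, 0.05, 0.6, 0.1, 0.4, 0.1, 1.0, 0.5, 0.1, 0.4, 0.1, 1.0)
)

# negate_fraction() is a hypothetical helper, not part of any package.
negate_fraction <- function(data, value = "Toy", prop = 0.1) {
  idx <- which(data$object == value)          # rows that match the condition
  n   <- round(length(idx) * prop)            # how many of them to negate
  # Sample positions within idx, not idx itself, to avoid the
  # sample(single_number, n) pitfall described above.
  pick <- idx[sample.int(length(idx), n)]
  data$exposure[pick] <- -data$exposure[pick]
  data
}

set.seed(42)  # arbitrary seed, only so that the draw is repeatable
res <- negate_fraction(df)
res
```

With the 10 Toy rows in the example, one value is negated and all other rows are returned unchanged and in their original order.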

Benchmark

library(tidyverse)
bench::mark(check=FALSE,
GKi = {i <- which(df$object == "Toy")
  i <- sample(i, round(length(i) / 10)) #In case 10% of Toy
  df$exposure[i] <- -df$exposure[i]
  df},
tmfmnk = {df %>%
            mutate(rowid = 1:n(),
                   exposure_new = if_else(rowid %in% sample(rowid[object == "Toy"], floor((n()*10)/100)), -exposure, exposure)) %>%
            select(-rowid)},
AndS. = {df |>
  mutate(id = row_number()) |>
  filter(object == "Toy") |>
  slice_sample(prop = 0.1) |>
  mutate(exposure = -exposure) |>
  (\(d) bind_rows(d, filter(df, !row_number() %in% d$id)))()|>
  select(-id)})
#  expression      min  median itr/s…¹ mem_a…² gc/se…³ n_itr  n_gc total…⁴ result
#  <bch:expr> <bch:tm> <bch:t>   <dbl> <bch:b>   <dbl> <int> <dbl> <bch:t> <list>
#1 GKi         11.23µs 12.99µs  74916.  2.49KB    60.0  9992     8   133ms <NULL>
#2 tmfmnk       1.49ms  1.55ms    639. 12.52KB    40.6   268    17   419ms <NULL>
#3 AndS.        4.93ms  5.02ms    198. 21.88KB    50.9    70    18   353ms <NULL>

By median time, GKi is roughly 120 times faster than tmfmnk and roughly 390 times faster than AndS.; it also allocates less memory and uses 125 characters of code, compared to 166 (tmfmnk) and 206 (AndS.).

CodePudding user response:

You could sample the rows you want with filter and slice_sample and then bind them to the original data, while removing the original rows.

library(tidyverse)


df |>
  mutate(id = row_number()) |>
  filter(object == "Toy") |>
  slice_sample(prop = 0.1) |>
  mutate(exposure = -exposure) |>
  (\(d) bind_rows(d, filter(df, !row_number() %in% d$id)))()|>
  select(-id)
#>    ChildID object exposure
#> 1       M3    Toy    -1.00
#> 2       M1  Mouth     0.10
#> 3       F1    Toy     0.20
#> 4       F1  Mouth     0.10
#> 5       F2    Toy     0.05
#> 6       M2    Toy     0.60
#> 7       M3    Toy     0.10
#> 8       M3  Mouth     0.40
#> 9       M3    Toy     0.10
#> 10      M3    Toy     1.00
#> 11      F3  Mouth     0.50
#> 12      F1    Toy     0.10
#> 13      F2    Toy     0.40
#> 14      M2    Toy     0.10
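
In the output above, the negated row has moved to the top, because the sampled rows are bound in front of the remaining ones. If the original row order matters, one possible variation (a sketch, not the only way) is to keep the helper id column a little longer and sort by it before dropping it. The intermediate name flipped and the seed are assumptions made for illustration.

```r
library(dplyr)

# Example data from the question.
df <- data.frame(
  ChildID  = c("M1", "F1", "F1", "F2", "M2", "M3", "M3", "M3", "M3", "F3", "F1", "F2", "M2", "M3"),
  object   = c("Mouth", "Toy", "Mouth", "Toy", "Toy", "Toy", "Mouth", "Toy", "Toy", "Mouth", "Toy", "Toy", "Toy", "Toy"),
  exposure = c(0.1, 0.2, 0.1, 0.05, 0.6, 0.1, 0.4, 0.1, 1.0, 0.5, 0.1, 0.4, 0.1, 1.0)
)

set.seed(1)  # arbitrary seed, only so that the draw is repeatable

# Sample 10% of the Toy rows and negate their exposure.
flipped <- df |>
  mutate(id = row_number()) |>
  filter(object == "Toy") |>
  slice_sample(prop = 0.1) |>
  mutate(exposure = -exposure)

# Put the negated rows back and restore the original order via id.
res <- df |>
  mutate(id = row_number()) |>
  filter(!id %in% flipped$id) |>
  bind_rows(flipped) |>
  arrange(id) |>
  select(-id)

res
```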

CodePudding user response:

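One option might be the following, which negates as many "Toy" rows as 10% of all rows (rounded down) and stores the result in a new column: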

df %>%
 mutate(rowid = 1:n(),
        exposure_new = if_else(rowid %in% sample(rowid[object == "Toy"], floor((n()*10)/100)), -exposure, exposure)) %>%
 select(-rowid)

   ChildID object exposure exposure_new
1       M1  Mouth     0.10         0.10
2       F1    Toy     0.20         0.20
3       F1  Mouth     0.10         0.10
4       F2    Toy     0.05         0.05
5       M2    Toy     0.60         0.60
6       M3    Toy     0.10         0.10
7       M3  Mouth     0.40         0.40
8       M3    Toy     0.10         0.10
9       M3    Toy     1.00         1.00
10      F3  Mouth     0.50         0.50
11      F1    Toy     0.10         0.10
12      F2    Toy     0.40        -0.40
13      M2    Toy     0.10         0.10
14      M3    Toy     1.00         1.00

If the 10% should instead be computed from the number of "Toy" rows only:

df %>%
 mutate(rowid = 1:n(),
        exposure_new = if_else(rowid %in% sample(rowid[object == "Toy"], floor((sum(object == "Toy")*10)/100)), -exposure, exposure)) %>%
 select(-rowid)
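
Both snippets above write the result to exposure_new. If exposure itself should be overwritten, as in the desired output in the question, a minimal sketch could draw the row numbers first and then use them inside mutate. The names toy_rows, n_flip and flip_rows, as well as the seed, are assumptions made for illustration.

```r
library(dplyr)

# Example data from the question.
df <- data.frame(
  ChildID  = c("M1", "F1", "F1", "F2", "M2", "M3", "M3", "M3", "M3", "F3", "F1", "F2", "M2", "M3"),
  object   = c("Mouth", "Toy", "Mouth", "Toy", "Toy", "Toy", "Mouth", "Toy", "Toy", "Mouth", "Toy", "Toy", "Toy", "Toy"),
  exposure = c(0.1, 0.2, 0.1, 0.05, 0.6, 0.1, 0.4, 0.1, 1.0, 0.5, 0.1, 0.4, 0.1, 1.0)
)

set.seed(123)  # arbitrary seed, only so that the draw is repeatable

toy_rows  <- which(df$object == "Toy")               # candidate row numbers
n_flip    <- floor(length(toy_rows) * 10 / 100)      # 10% of the Toy rows, rounded down
flip_rows <- toy_rows[sample.int(length(toy_rows), n_flip)]

# Overwrite exposure in place; every other row keeps its value and position.
res <- df %>%
  mutate(exposure = if_else(row_number() %in% flip_rows, -exposure, exposure))

res
```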