I'm pretty sure there's a way to do this, but I can't find it.
I have a data frame with several columns. I want to add new columns to it that indicate (0/1) which rows were picked when sampling from each of these columns. I have a tidy solution with across that works if I want to sample the same number of elements from each column. I also have an (even uglier) solution with across for sampling a different number of elements from each column, but I was hoping for an easier solution with purrr, where I just provide the column names as one argument and the number of elements to be sampled as another argument and then get my new columns.
Any ideas?
Data
library(tidyverse)  # dplyr, purrr and stringr are used below
df <- data.frame(x = runif(10),
                 y = runif(10),
                 z = runif(10))
df[1, 1] <- NA
df[2, 2] <- NA
df[3, 3] <- NA
sampling <- c(2, 3, 4)
names(sampling) <- c("random_x", "random_y", "random_z")
Solution for sampling the same number of elements
df %>%
  mutate(across(everything(),
                ~ if_else(is.na(.),
                          NA_integer_,
                          as.integer(row_number() %in% sample(which(!is.na(.)), size = 3))),
                .names = "{.col}_random"))
Solution for sampling a different number of elements
df %>%
  mutate(across(everything(),
                ~ if_else(is.na(.),
                          NA_integer_,
                          as.integer(row_number() %in% sample(which(!is.na(.)), size = sampling[str_detect(names(sampling), paste0(cur_column(), "$"))]))),
                .names = "{.col}_random"))
Desired purrr way
Something along these lines, maybe?
df %>%
  map2(.x = c("x", "y", "z"),
       .y = sampling,
       .f = ~ if_else(is.na(.x),
                      NA_integer_,
                      as.integer(row_number() %in% sample(which(!is.na(.x)), size = .y))))
The problem with the purrr version is obviously that I'm not using the right syntax: I'm passing a character vector as .x rather than columns from df.
Desired output
(leaving the randomness of the results aside)
           x          y         z x_random y_random z_random
1         NA 0.06686268 0.7663706       NA        0        0
2  0.7551366         NA 0.5550793        0       NA        1
3  0.7437531 0.61971712        NA        0        0       NA
4  0.5238451 0.57510689 0.7637622        1        0        0
5  0.9593917 0.17481769 0.4443493        0        0        0
6  0.2821633 0.86972254 0.2284449        0        0        0
7  0.3941531 0.61981285 0.8202302        0        0        1
8  0.1473573 0.58482156 0.9078447        0        1        1
9  0.7063327 0.77550907 0.9271699        1        1        1
10 0.6320678 0.06011700 0.2139956        0        1        0
CodePudding user response:
- You should not use df %>% map2(...) if you are passing .x and .y separately to map2.
- is.na(.x) is not correct, since .x holds character values (like "x", "y" and "z"). I have used df[[.x]] to subset the values.
- Since we are not using df %>% ..., row_number() would not work, hence I changed it to seq_along.
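To make the second and third points concrete, here is a minimal base R sketch. The data frame toy and the variable col_name are made up for illustration and are not part of the question's data.

```r
# Toy data frame, assumed for illustration only
toy <- data.frame(x = c(NA, 0.5, 0.2), y = c(0.1, NA, 0.9))

# This is what map2() hands to the lambda as .x: a string, not a column
col_name <- "x"

# Testing the string itself says nothing about the column's contents
name_is_na <- is.na(col_name)          # FALSE

# Subsetting with [[ ]] gives the actual column values
col_is_na <- is.na(toy[[col_name]])    # TRUE FALSE FALSE

# seq_along() yields row positions without needing a mutate() context
positions <- seq_along(toy[[col_name]])  # 1 2 3

print(name_is_na)
print(col_is_na)
print(positions)
```

So inside the lambda, every place that needs the data has to go through df[[.x]], and every place that needs row positions has to go through seq_along(df[[.x]]).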
Here is an approach that uses map2_dfc to create the new columns and bind_cols to bind them to the original data frame.
library(dplyr)
library(purrr)
bind_cols(df, map2_dfc(.x = c("x", "y", "z"),
                       .y = sampling,
                       .f = ~ tibble(!!paste0(.x, "_random") :=
                                       if_else(is.na(df[[.x]]), NA_integer_,
                                               as.integer(seq_along(df[[.x]]) %in% sample(which(!is.na(df[[.x]])), size = .y))))))
#            x          y           z x_random y_random z_random
#1          NA 0.02358698 0.222022714       NA        0        1
#2  0.15099912         NA 0.878007560        0       NA        0
#3  0.20228598 0.92222805          NA        0        0       NA
#4  0.10955137 0.68713928 0.485866574        1        1        1
#5  0.57361508 0.56205208 0.367087414        1        1        0
#6  0.30534642 0.75997029 0.006055428        0        0        1
#7  0.76949447 0.78142772 0.279323093        0        0        0
#8  0.07178739 0.73181961 0.187739444        0        0        1
#9  0.52645525 0.48321814 0.213029355        0        1        0
#10 0.30858707 0.20973381 0.450931534        0        0        0
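If the nested lambda gets hard to read, the same idea can be pulled into a small named function. This is only a sketch: sample_flags is a hypothetical helper name, the seed is arbitrary, and sampling is named by column here (x, y, z) rather than random_x etc. as in the question. As far as I know, recent purrr releases mark map2_dfc as superseded, so the sketch uses plain map2 and binds the result with base cbind.

```r
library(purrr)

set.seed(1)  # assumed seed, only to make the sketch reproducible
df <- data.frame(x = runif(10), y = runif(10), z = runif(10))
df[1, 1] <- NA
df[2, 2] <- NA
df[3, 3] <- NA
sampling <- c(x = 2, y = 3, z = 4)  # named by column, unlike the question

# Hypothetical helper: flag `size` randomly chosen non-NA positions with 1,
# the other non-NA positions with 0, and keep NA where the input is NA.
sample_flags <- function(values, size) {
  candidates <- which(!is.na(values))
  # Index into candidates so that a single candidate k is not read as 1:k,
  # which is what sample(k, ...) would do
  picked <- candidates[sample.int(length(candidates), size)]
  flags <- as.integer(seq_along(values) %in% picked)
  flags[is.na(values)] <- NA_integer_
  flags
}

# Iterate over the columns (as a plain list) and the sample sizes in parallel
new_cols <- map2(as.list(df[names(sampling)]), sampling, sample_flags)
names(new_cols) <- paste0(names(sampling), "_random")

result <- cbind(df, as.data.frame(new_cols))
print(result)
```

A quick sanity check is colSums(result[names(new_cols)], na.rm = TRUE), which should reproduce the sizes 2, 3 and 4.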
CodePudding user response:
Another possible solution, using purrr::imap_dfc:
library(tidyverse)
mutate(df, imap_dfc(sampling, ~ (1:nrow(df) %in% sample(setdiff(1:nrow(df),
which(is.na(df[, str_sub(.y, nchar(.y))]))), .x))) * ifelse(is.na(df), NA, 1))
#>              x         y          z random_x random_y random_z
#> 1           NA 0.5784770 0.87429843       NA        0        0
#> 2  0.483728093        NA 0.87502533        0       NA        0
#> 3  0.294748405 0.3057474         NA        0        0       NA
#> 4  0.993350082 0.4282864 0.02936437        0        0        1
#> 5  0.344054454 0.4872465 0.65317911        0        0        1
#> 6  0.465265657 0.6721587 0.77952998        0        1        1
#> 7  0.659649583 0.9923243 0.01262495        1        1        0
#> 8  0.314616988 0.7686583 0.99389609        1        0        0
#> 9  0.009670492 0.1558185 0.73083388        0        0        0
#> 10 0.102769163 0.1543078 0.84348806        0        1        1
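One thing to watch in this version: str_sub(.y, nchar(.y)) recovers the column name by taking the last character of the name in sampling, which works for x, y and z but not for longer column names. A small base R sketch of the difference follows; the name random_xy is a made-up example, and stripping the random_ prefix is just one possible alternative, not what the answer above does.

```r
sampling <- c(random_x = 2, random_y = 3, random_z = 4)
nm <- names(sampling)

# Last character of each name (base R equivalent of str_sub(.y, nchar(.y)))
last_char <- substr(nm, nchar(nm), nchar(nm))

# Alternative: strip the assumed "random_" prefix instead
stripped <- sub("^random_", "", nm)

print(last_char)  # "x" "y" "z"
print(stripped)   # "x" "y" "z"

# With a multi-character column name the two approaches diverge
long_name <- "random_xy"  # hypothetical name, not in the question
last_long <- substr(long_name, nchar(long_name), nchar(long_name))
stripped_long <- sub("^random_", "", long_name)

print(last_long)      # "y"
print(stripped_long)  # "xy"
```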