purrr approach for creating new columns through function with two arguments


I'm pretty sure there's a way to get there, but I'm not able to find it.

I have a data frame with several columns. I want to add new 0/1 indicator columns that mark which rows were picked when sampling from each of these columns. I have a tidy solution with across that works if I sample the same number of elements from each column. I also have an (even uglier) solution with across for sampling a different number of elements from each column, but I was hoping for an easier solution with purrr, where I provide the column names as one argument and the number of elements to sample as another, and get my new columns back.

Any ideas?


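library(dplyr)    # mutate(), across(), if_else(), bind_cols(), tibble()
library(purrr)    # map2_dfc(), imap_dfc()
library(stringr)  # str_detect(), str_sub()

df <- data.frame(x = runif(10),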
                 y = runif(10),
                 z = runif(10))

df[1, 1] <- NA
df[2, 2] <- NA
df[3, 3] <- NA

sampling <- c(2, 3, 4)
names(sampling) <- c("random_x", "random_y", "random_z")

Solution for sampling the same number of elements

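df %>%
  mutate(across(everything(),  # everything() is assumed; the original selection is missing from the post
                ~if_else(is.na(.), NA_integer_,
                         as.integer(row_number() %in% sample(which(!is.na(.)), size = 3))),
                .names = "{.col}_random"))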

Solution for sampling a different number of elements

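df %>%
  mutate(across(everything(),  # everything() is assumed; the original selection is missing from the post
                ~if_else(is.na(.), NA_integer_,
                         as.integer(row_number() %in% sample(which(!is.na(.)), size = sampling[str_detect(names(sampling), paste0(cur_column(), "$"))]))),
                .names = "{.col}_random"))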

Desired purrr way

Something along these lines, maybe?

df %>%
  map2(.x = c("x", "y", "z"),
       .y = sampling,
       .f = ~if_else(is.na(.x),
                     as.integer(row_number() %in% sample(which(!is.na(.x)), size = .y))))

The problem with the purrr way, obviously, is that the syntax is wrong: I'm passing a character vector as .x rather than the columns of df.

Desired output

(leaving the randomness of the results aside)

           x          y         z x_random y_random z_random
1         NA 0.06686268 0.7663706       NA        0        0
2  0.7551366         NA 0.5550793        0       NA        1
3  0.7437531 0.61971712        NA        0        0       NA
4  0.5238451 0.57510689 0.7637622        1        0        0
5  0.9593917 0.17481769 0.4443493        0        0        0
6  0.2821633 0.86972254 0.2284449        0        0        0
7  0.3941531 0.61981285 0.8202302        0        0        1
8  0.1473573 0.58482156 0.9078447        0        1        1
9  0.7063327 0.77550907 0.9271699        1        1        1
10 0.6320678 0.06011700 0.2139956        0        1        0

CodePudding user response:

  • You should not use df %>% map2(...) when you pass .x and .y to map2 explicitly.
  • is.na(.x) is not correct because .x holds character values ("x", "y" and "z"), not the columns themselves. I have used df[[.x]] to extract the column values.
  • Since we are not using df %>% ..., row_number() would not work, so I changed it to seq_along().
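To make the points above concrete, here is a minimal base R sketch of the logic for a single column. The vector col and the count n_pick are made-up illustration values, not taken from the question's data.

```r
set.seed(1)  # only so the sketch is reproducible

# Hypothetical single column with one missing value
col    <- c(NA, 0.4, 0.7, 0.1, 0.9)
n_pick <- 2

# Positions that may be sampled: everything that is not NA
eligible <- which(!is.na(col))

# Draw n_pick of those positions
picked <- sample(eligible, size = n_pick)

# 1 where the position was drawn, 0 otherwise; seq_along() stands in
# for row_number(), which is only available inside dplyr verbs
flag <- as.integer(seq_along(col) %in% picked)

# Rows that were NA in the source column stay NA in the indicator
flag[is.na(col)] <- NA_integer_

flag
```

The map2-based code does the same thing once per column, with df[[.x]] in place of col and .y in place of n_pick.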

Here is an approach that uses map2_dfc to create the new columns and bind_cols to attach them to the original data frame.


bind_cols(df, map2_dfc(.x = c("x", "y", "z"),
                       .y = sampling,
                       .f = ~tibble(!!paste0(.x, "_random") :=
                                      if_else(is.na(df[[.x]]), NA_integer_,
                                              as.integer(seq_along(df[[.x]]) %in%
                                                           sample(which(!is.na(df[[.x]])), size = .y))))))

#            x          y           z x_random y_random z_random
#1          NA 0.02358698 0.222022714       NA        0        1
#2  0.15099912         NA 0.878007560        0       NA        0
#3  0.20228598 0.92222805          NA        0        0       NA
#4  0.10955137 0.68713928 0.485866574        1        1        1
#5  0.57361508 0.56205208 0.367087414        1        1        0
#6  0.30534642 0.75997029 0.006055428        0        0        1
#7  0.76949447 0.78142772 0.279323093        0        0        0
#8  0.07178739 0.73181961 0.187739444        0        0        1
#9  0.52645525 0.48321814 0.213029355        0        1        0
#10 0.30858707 0.20973381 0.450931534        0        0        0
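As a variation on the same idea, the per-column logic can be pulled out into a small function and mapped over the columns themselves rather than over their names. This is only a sketch: sample_flag, cols and sizes are hypothetical names introduced here, and it uses map2() plus set_names() instead of map2_dfc(), which as far as I know is marked as superseded since purrr 1.0.0.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: 0/1 indicator for n sampled non-NA positions,
# NA wherever the source column is NA
sample_flag <- function(col, n) {
  flag <- as.integer(seq_along(col) %in% sample(which(!is.na(col)), size = n))
  flag[is.na(col)] <- NA_integer_
  flag
}

set.seed(42)  # only so the sketch is reproducible
df <- data.frame(x = runif(10), y = runif(10), z = runif(10))
df[1, 1] <- NA
df[2, 2] <- NA
df[3, 3] <- NA

cols  <- c("x", "y", "z")  # columns to sample from
sizes <- c(2, 3, 4)        # how many rows to flag per column

# df[cols] is a list of columns, so map2() pairs each column with its size
flags <- map2(df[cols], sizes, sample_flag) %>%
  set_names(paste0(cols, "_random"))

result <- bind_cols(df, as_tibble(flags))
result
```

Passing df[cols] means .x is the column itself, so nothing inside the function needs to refer back to df.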

CodePudding user response:

Another possible solution, using purrr::imap_dfc:


mutate(df, imap_dfc(sampling,
                    ~ (1:nrow(df) %in% sample(setdiff(1:nrow(df),
                                                      which(is.na(df[, str_sub(.y, nchar(.y))]))),
                                              .x))) *
             ifelse(is.na(df), NA, 1))

#>              x         y          z random_x random_y random_z
#> 1           NA 0.5784770 0.87429843       NA        0        0
#> 2  0.483728093        NA 0.87502533        0       NA        0
#> 3  0.294748405 0.3057474         NA        0        0       NA
#> 4  0.993350082 0.4282864 0.02936437        0        0        1
#> 5  0.344054454 0.4872465 0.65317911        0        0        1
#> 6  0.465265657 0.6721587 0.77952998        0        1        1
#> 7  0.659649583 0.9923243 0.01262495        1        1        0
#> 8  0.314616988 0.7686583 0.99389609        1        0        0
#> 9  0.009670492 0.1558185 0.73083388        0        0        0
#> 10 0.102769163 0.1543078 0.84348806        0        1        1
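Note that this answer finds the source column from the last character of each name in sampling (str_sub(.y, nchar(.y))). That works for x, y and z but would presumably break for longer column names. A small base R sketch of a more general lookup, assuming every name in sampling carries the random_ prefix used in the question:

```r
sampling <- c(random_x = 2, random_y = 3, random_z = 4)

# Strip the assumed "random_" prefix to recover the source column names
cols <- sub("^random_", "", names(sampling))

# Plain vector of sample sizes, in the same order as cols
sizes <- unname(sampling)

cols
sizes
```

Inside the imap_dfc call, sub("^random_", "", .y) could then take the place of str_sub(.y, nchar(.y)).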