How do I loop through dummy data in R to randomly replace a random amount of values with NA using th-CodePudding

So in the biotech company I work for, we scan for small biomarkers that are present in various sample types from experiments being performed by research institutions. For example, some research institution/department is studying what biomarkers may be present in certain individuals of some population of interest. Perhaps over time, they do this kind of a study three times (not necessarily with the same individuals, though some of the same individuals may be present in the future studies -- this is case by case dependent), and then they want to merge the data into one single, finalized data frame. Sounds easy enough, right? Well, it can be a bit more complicated than just stacking the data sets on top of each other.

As you might imagine, sometimes a biomarker shows up in some individuals and not in others. Furthermore, sometimes, a biomarker shows up solely in one of the sampled populations and not any of the other ones sampled. I'm working on an internal R&D project for my company, and am trying to simulate data that we might get from three separate experiments (which are stored in the three "toy" data frames I've created in the code provided below). If you run what I've created below, it will result in one "all" data frame at the end that consists of 30 observations from 30 (fake) individuals, where each biomarker is a column labeled "x1", "x2", etc. Again, the point here is to try and simulate real data for an internal research project I'm working on. I've tried to simulate the fact that sometimes, a biomarker is present in one set and not all the others. This is why the column names aren't all the same and some have names that aren't present in the others.

# bringning in the best data management package there is!
library(dplyr)

# making a couple toy data frames used to try and create dummy examples of all the merged files
set.seed(42)
toy_df1 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df2 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df3 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))

names(toy_df1) <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10")
names(toy_df2) <- c("x1", "x2", "x3", "x5", "x6", "x7", "x8", "x9", "x10", "x11")
names(toy_df3) <- c("x1", "x3", "x4", "x5", "x7", "x8", "x9", "x10", "x11", "x13")

# adding a dummy SSID to each toy dataframe
toy_df1$SSID <- as.numeric(rep(24001, nrow(toy_df1))) # Sample set ID from the first study
toy_df2$SSID <- as.numeric(rep(24002, nrow(toy_df2))) # Sample set ID from the second study
toy_df3$SSID <- as.numeric(rep(24003, nrow(toy_df3))) # Sample set ID from the third study

# inserting each dummy dataframe into a list object for storage (might be useful later)
toy_data_list <- list(toy_df1, toy_df2, toy_df3)

# merging the toy data sheets into the "Data All"-esque file; this takes each dataframe and stacks them
# and stacks them on top of each other in sequential order of the SSIDs. 
toy_data_all <- bind_rows(toy_df1, toy_df2, toy_df3)

My issue is this: I'm having trouble trying to simulate the randomness of the NA occurrences within each "toy" dataframe. For instance, in toy_df_1, perhaps biomarker x1 was observed in all but the first, third, and seventh individuals. Also still in toy_df_1, perhaps x3 was only observed in only the even numbered observations, so observations 1, 3, 5, 7, and 9, have NA values. Etc.

I don't know how to best simulate this random data in each data frame, per each biomarker, and this is what I'm seeking help/input on here. I have provided below the currently (very rough) code ideas I'm working with. My current idea/thought process is this: I want to loop through each toy data frame stored in this toy dataframe list, which I have named toy_data_list, and then loop through each column within each of these toy data frames and take a random sample of the observations, and replace them with NA. This is the part I am having trouble with the most I think. I don't want the randomly selected amount of observations to always be 3 or 4. Ideally, sometimes it'd be all 10, 0, 1, 5, 3, 2, etc., see what I mean? The below code is my best attempt at doing this but I'm running into issues with that sample() function . If anybody has any better ideas or ways of doing this I'm all ears. Thank you!

# adding NA values to the original toy dataframes to simulate real data
amount_sampled <- c(0:10)

for (i in 1:length(df_list)){
  for (j in 1:nrow(df_list[[i]])){
    
    df_list[[i]][,j][sample(df_list[[i]][,j], size = sample(amount_sampled, 1))] <- NA
    
  }
}

CodePudding user response：

Here's one possibility, using the tidyverse, as you've already referenced the dplyr package.

lapply(
  toy_data_list,
  function(x) {
    x %>% mutate(
            across(
              starts_with("x"), 
              function(.x, probMiss) {
                ifelse(runif(nrow(.)) < probMiss, NA, .x)
              },
              probMiss=0.1
            )
          )
  }
)
[[1]]
           x1          x2          x3          x4          x5         x6         x7         x8         x9        x10  SSID
1   0.3126459 -0.22142515  1.46026211 -1.62140875 -0.38502931 -0.6746660  0.1067005  0.2994059 -0.9105534 -0.7903800 24001
2   1.1208476 -0.96215911  2.00945844 -0.06857533  1.37178507 -0.7397778 -0.5422158 -0.4377205  0.1139766  0.5933915 24001
3   0.4718309  0.18746019 -0.07793822 -0.06298292  0.15898320  1.1972270 -1.4657003 -0.4179386 -1.0482775 -0.1309050 24001
4   1.1276853  0.01224463  1.35434208  1.07570771 -0.20485705  0.3777403 -0.1340144         NA  0.2833195  1.4239725 24001
5   0.6777231  0.71228374 -1.02718379  1.12910075 -0.85209216  0.2747821  1.2911999  0.4103885  0.1970954         NA 24001
6          NA -0.75372581 -1.04117532  0.45014673 -0.80947218 -0.9039600 -0.4470329  1.2823981         NA -0.1676203 24001
7   2.0610343  0.03818993 -0.87062302  0.05058644  0.53339896  0.4048332 -0.6301996  1.4470047  0.9986759 -1.3834170 24001
8  -0.6503159  1.17842106  0.85313022 -0.12433022  0.06873847 -0.2391438 -0.5809780  3.1036738 -0.1078761 -0.2894947 24001
9  -1.3242377 -0.15320385 -0.62276019  0.91201146 -0.73678148  1.0743194 -2.3201595  0.5837672  2.2087385 -0.9970005 24001
10  0.6594713 -0.16169753  1.21570987  0.20997680 -0.46916210 -0.3147506 -0.6459044 -0.2545514 -1.6828389 -0.3871874 24001

[[2]]
            x1           x2          x3         x5          x6           x7         x8          x9        x10        x11  SSID
1  -0.80952120 -0.852213775 -0.10225035         NA  0.35699328 -1.023753807 -1.4154533  0.65442205 -0.6537130  0.8973321 24002
2  -0.86600469  0.899019904 -0.05125182 -1.4588262 -0.94233711 -0.782264488  0.5288418  0.18005542  1.4591325 -2.2149217 24002
3   0.70240734 -0.007170461          NA  1.7886008 -1.15256861 -2.689387176 -2.4385595  0.90978210 -1.5683487  0.4901286 24002
4  -0.16801229 -0.815786121          NA -0.1338524 -0.38385549 -0.329542362         NA          NA  0.5224013  0.5429560 24002
5  -0.75408520  0.421949205          NA  2.0317077  0.01031431  0.009952132  0.1447827  0.73121298  1.1986778  0.1234124 24002
6  -0.05423047  0.563885485  1.54842553  0.3135906 -1.04593866 -1.908321655  0.7276715 -0.06894155  1.2526561  0.1046509 24002
7  -0.63881539  1.291600125  0.45648495  0.1310152 -0.70227040  0.666267145 -1.9790600 -0.03900499 -0.2058987  0.4406734 24002
8  -0.15457321 -0.522821584 -1.13708264 -0.9432088 -0.29729621  0.366913581 -0.3035615          NA  0.4077156  0.2557932 24002
9   1.57088231  1.506271800  0.64951476  2.9289056  0.64291036 -0.825838160         NA -1.03556291 -0.7573311 -0.5181969 24002
10  1.20115726  0.552862461 -0.21124735  1.1622098          NA  1.932775511  0.7907731 -1.45997079 -1.5325551  0.1591070 24002

[[3]]
            x1          x3          x4          x5          x7         x8         x9        x10        x11        x13  SSID
1   0.36152389 -1.25924885 -0.95076857  0.06161209 -0.54000591 -0.1575174 -0.2598904 -1.7000069 -1.2643910  0.7219130 24003
2  -2.37950462 -0.09150055  0.77677357          NA -0.83614442 -0.3279649  1.2053670         NA  1.8507856 -0.1508595 24003
3   0.75848553 -1.32246030  1.11150668 -0.73996106          NA -1.0307599 -0.3888235 -0.4143549  0.2517389         NA 24003
4   0.61761461  1.09019804 -0.05473478  0.12734300  0.07943265  0.6863735 -0.5544003  1.7183085 -1.3484155 -1.5051942 24003
5           NA -0.40610207 -0.62706647          NA -1.34442562  0.8155673  0.4449365 -0.5956177  0.9458541 -0.9271672 24003
6  -0.03330626 -0.12127479 -1.00838960          NA  1.61370473  1.5947937 -1.6966612  0.1043491 -0.8805581 -0.2534976 24003
7           NA  0.06427669          NA -1.34485334  0.67379844 -1.3158128 -0.7097424  1.1209104  0.5570678 -0.8930275 24003
8  -1.33089411 -2.25094930  0.54535480          NA  1.42531584  0.7146470 -0.5501727 -0.6699045  0.1830541         NA 24003
9   0.56486217  0.63718879  3.40638747  0.57378460 -0.35473065  0.6521152 -0.1936541 -0.9496021 -0.9153089  1.0442554 24003
10 -1.40741364 -0.47946600  0.54382789  0.71474178          NA  0.3009699  0.5828164 -0.4312363 -1.8665087         NA 24003

probMiss is the desired probability that any one sample is missing.

The inner function,

              function(.x, probMiss) {
                ifelse(runif(nrow(.)) < probMiss, NA, .x)
              },

generates an (unnamed) vector of U(0,1) random variables and then replaces the existing value in the current column with NA if the corresponding U(0, 1) variate is less than probMiss.

across(starting_with("x"), ... applies this function to all columns of the current data frame whose names start with "x".

lapply(toy_data_list, ... applies this function to each data frame in the toy_data_list.