Home > Software design >  How should I programmatically change only certain NA values to a specified string of my choosing in
How should I programmatically change only certain NA values to a specified string of my choosing in

Time:06-29

So for part of the internal R&D project I'm working on for work, I need to efficiently and programmatically assign certain NA values to the string, BMNDITS (which stands for "Biomarker Not Detected in this Set"). For context, I work at a small biotech company where the service we provide is that we scan for small biomarkers present in various sample types from experiments being run by clients (which each have a unique sample set ID associated with them). So, they'll send us the samples, we scan the data for the various biomarkers, and then we return to them a heatmap and the actual data itself.

Oftentimes, clients run multiple experiments over time so they can eventually acquire enough relevant data. Well, if they collect enough samples from their various populations of interest, they'll want to have us merge and stack the data so all the data is stored in one nice, finalized, merged data frame. Sounds easy enough, right? Well, the issue is that because not all biomarkers are always present in each study, a lot of NAs get introduced. It's true that in any given study, one individual may have a biomarker present and another won't have it detected in their donated sample, so for that particular individual for that particular biomarker, it'll just be a single NA entry (sometimes a couple may occur in a row though) -- and that's fine because obviously we can't control when a biomarker will be present in a given individual since it's completely random.

The problem though is that when we stack the data on top of each other to create this final merged data frame, currently, if a biomarker is not observed in a given population/sample set ID, it'll just be a large amount of sequential NA values in a given column. This isn't very descriptive, in my opinion, and so I'm trying to create an R function that will go in and change those values from just being a regular old NA value to saying BMNDITS, just so that way when the researchers are looking at the actual data itself and want to do things with it, they can filter out what are true missing values and values that don't exist solely because they weren't observed for that given population.

So, I've created some fake data I'm using to simulate data that we might get from three separate experiments (which are stored in the three "toy" data frames I've created in the code provided below). If you run what I've created below, it will result in one "all" data frame at the end that consists of 30 observations from 30 (fake) individuals, where each biomarker is a column labeled "x1", "x2", etc. Again, since the point here is to try and simulate real data, I've made it so that sometimes a biomarker is present in one set and not all the others. This is why the column names aren't all the same and some have names that aren't present in the others.

# loading dplyr
library(dplyr)

# making a couple toy data frames
set.seed(42)
toy_df1 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df2 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))
toy_df3 <- as.data.frame(matrix(data = rnorm(n = 100, mean = 0, sd = 1), nrow = 10, ncol = 10))

# assigning the names of the various "biomarkers" for this fake data
names(toy_df1) <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10")
names(toy_df2) <- c("x1", "x2", "x3", "x5", "x6", "x7", "x8", "x9", "x10", "x11")
names(toy_df3) <- c("x1", "x3", "x4", "x5", "x7", "x8", "x9", "x10", "x11", "x13")

# adding a dummy SSID to each toy dataframe
toy_df1$SSID <- as.numeric(rep(24001, nrow(toy_df1))) # Sample set ID from the first study
toy_df2$SSID <- as.numeric(rep(24002, nrow(toy_df2))) # Sample set ID from the second study
toy_df3$SSID <- as.numeric(rep(24003, nrow(toy_df3))) # Sample set ID from the third study

# Creating the NA insertion/MakeNA() function I'll need
# to help simulate the randomness that the NA values have
# regarding where they exist in the data
NA_Insert_Inator <- function(x) {
  x %>% mutate(
    across(
      starts_with("x"), 
      function(.x, probMiss) {
        ifelse(runif(nrow(.)) < probMiss, NA, .x)
      },
      probMiss=0.1
    )
  )
}

# Using the above function to randomly replace values in each toy dataframe with NA
toy_df1 <- NA_Insert_Inator(toy_df1)
toy_df2 <- NA_Insert_Inator(toy_df2)
toy_df3 <- NA_Insert_Inator(toy_df3)

# merging the toy data sheets into the "Data All"-esque file; 
# this takes each dataframe and stacks  
# them on top of each other in sequential order of the SSIDs. 
# (Also, lastly I move the SSID columns to be the last columns in the toy_data_all dataframe)
toy_data_all <- bind_rows(toy_df1, toy_df2, toy_df3)
toy_data_all <- toy_data_all %>% select(-SSID, SSID)

So if you run the above code you should end up getting something that looks similar to this: See those long streaks of NA values? Those are the ones I'm trying to change

I've created the following R function to try and change these long streaks of NA values but I can't get it to work. I can initiate the function fine, but when I try to apply it to my toy_data_all data frame I just get a value of NULL in return. What I was expecting though was those long streaks of (specifically 10 since that's the number of fake participants in each study) NA values would be changed to the specified string of BMNDITS.

The way I have tried going about it is based off of using the SSID for each individual data frame. Specifically, if for any given column, if the values for a specific SSID are all equal to NA, change them to say BMNDITS. I'm not sure what's going wrong here and perhaps there is a better and more efficient way of going about this. Attempt here:

BMNDITS_Inator <- function(freshly_merged_df){
  some_new_df <- freshly_merged_df
  for (i in unique(some_new_df[['SSID']])){
    for (j in 1:ncol(some_new_df)){
      if (all(is.na(some_new_df[which(some_new_df[['SSID']] == i), j]))){
        some_new_df[which(some_new_df[['SSID']] == i), j] <- "BMNDITS"
      }
    }
  }

But yeah, this is where I'm stuck and would greatly appreciate anybody's help or input. Many thanks!

CodePudding user response:

We may use a group by approach - grouped by 'SSID', loop over all the columns (everything()) in across, then check if, all the values are NA, then replace with "BMNDITS" or else return the character converted value (as the example showed the columns are numeric class)

library(dplyr)
toy_data_all %>%
   group_by(SSID) %>% 
   mutate(across(everything(), ~ if(all(is.na(.x))) "BMNDITS" else 
           as.character(.x))) %>%
   ungroup
  • Related