Home > Software engineering >  Is there a simple (dplyr-friendly) method to mask a subset of subject IDs in a data frame?
Is there a simple (dplyr-friendly) method to mask a subset of subject IDs in a data frame?

Time:09-24

I sense there's a simple solution to this question but I can't quite find it.

Let's say I have a data frame with data from three studies each containing three subjects.

df <- tibble(ID = c("STUDY1_ID1001", "STUDY1_ID1002", "STUDY1_ID1003", 
                    "STUDY2_ID2003", "STUDY2_ID2005", "STUDY2_ID2008",
                    "STUDY3_ID3002", "STUDY3_ID3003", "STUDY3_ID3007"),
             DV = seq(from = 10, to = 50, length.out = 9))

> df
# A tibble: 9 x 2
  ID               DV
  <chr>         <dbl>
1 STUDY1_ID1001    10
2 STUDY1_ID1002    15
3 STUDY1_ID1003    20
4 STUDY2_ID2001    25
5 STUDY2_ID2002    30
6 STUDY2_ID2003    35
7 STUDY3_ID3001    40
8 STUDY3_ID3002    45
9 STUDY3_ID3003    50

I want to mask IDs from the second study (Study 2) without affecting the IDs from the other studies and while maintaining the original order of the subjects in Study 2, achieving something like the below:

df_mask <- tibble(ID = c("STUDY1_ID1001", "STUDY1_ID1002", "STUDY1_ID1003", 
                         "STUDY2_MASK1", "STUDY2_MASK2", "STUDY2_MASK3",
                         "STUDY3_ID3002", "STUDY3_ID3003", "STUDY3_ID3007"),
                  DV = seq(from = 10, to = 50, length.out = 9))

> df_mask
# A tibble: 9 x 2
  ID               DV
  <chr>         <dbl>
1 STUDY1_ID1001    10
2 STUDY1_ID1002    15
3 STUDY1_ID1003    20
4 STUDY2_MASK1     25
5 STUDY2_MASK2     30
6 STUDY2_MASK3     35
7 STUDY3_ID3001    40
8 STUDY3_ID3002    45
9 STUDY3_ID3003    50

Anyone have any suggestions?

CodePudding user response:

You can do this in fewer steps, but I find something like this more clear and easy to follow.

The key step is that I would split the ID into two fields: STUDY and ID. You can then search for the STUDY numbers that you want to mask and replace their value.

Then just paste the fields back together (or don't), ungroup, and select the columns you want.

library(dplyr)
library(tidyr)

studies_to_mask <- c("STUDY2")

df %>% 
  separate(ID, c("STUDY", "ID"), sep = "_") %>% 
  group_by(STUDY) %>% 
  mutate(ID_MASK = if_else(STUDY %in% studies_to_mask, paste0("MASK", 1:n()), ID),
         ID = paste(STUDY, ID_MASK, sep = "_")) %>% 
  ungroup() %>% 
  select(ID, DV)

Results in:

# A tibble: 9 x 2
  ID               DV
  <chr>         <dbl>
1 STUDY1_ID1001    10
2 STUDY1_ID1002    15
3 STUDY1_ID1003    20
4 STUDY2_MASK1     25
5 STUDY2_MASK2     30
6 STUDY2_MASK3     35
7 STUDY3_ID3002    40
8 STUDY3_ID3003    45
9 STUDY3_ID3007    50

CodePudding user response:

A base R option -

inds <- grepl('STUDY2_', df$ID)
df$ID[inds] <- paste0(sub('_.*', '_MASK', df$ID[inds]), seq(sum(inds)))
df

#   ID               DV
#  <chr>         <dbl>
#1 STUDY1_ID1001    10
#2 STUDY1_ID1002    15
#3 STUDY1_ID1003    20
#4 STUDY2_MASK1     25
#5 STUDY2_MASK2     30
#6 STUDY2_MASK3     35
#7 STUDY3_ID3002    40
#8 STUDY3_ID3003    45
#9 STUDY3_ID3007    50
  • Related