I sense there's a simple solution to this question but I can't quite find it.
Let's say I have a data frame with data from three studies each containing three subjects.
df <- tibble(ID = c("STUDY1_ID1001", "STUDY1_ID1002", "STUDY1_ID1003",
"STUDY2_ID2003", "STUDY2_ID2005", "STUDY2_ID2008",
"STUDY3_ID3002", "STUDY3_ID3003", "STUDY3_ID3007"),
DV = seq(from = 10, to = 50, length.out = 9))
> df
# A tibble: 9 x 2
ID DV
<chr> <dbl>
1 STUDY1_ID1001 10
2 STUDY1_ID1002 15
3 STUDY1_ID1003 20
4 STUDY2_ID2001 25
5 STUDY2_ID2002 30
6 STUDY2_ID2003 35
7 STUDY3_ID3001 40
8 STUDY3_ID3002 45
9 STUDY3_ID3003 50
I want to mask IDs from the second study (Study 2) without affecting the IDs from the other studies and while maintaining the original order of the subjects in Study 2, achieving something like the below:
df_mask <- tibble(ID = c("STUDY1_ID1001", "STUDY1_ID1002", "STUDY1_ID1003",
"STUDY2_MASK1", "STUDY2_MASK2", "STUDY2_MASK3",
"STUDY3_ID3002", "STUDY3_ID3003", "STUDY3_ID3007"),
DV = seq(from = 10, to = 50, length.out = 9))
> df_mask
# A tibble: 9 x 2
ID DV
<chr> <dbl>
1 STUDY1_ID1001 10
2 STUDY1_ID1002 15
3 STUDY1_ID1003 20
4 STUDY2_MASK1 25
5 STUDY2_MASK2 30
6 STUDY2_MASK3 35
7 STUDY3_ID3001 40
8 STUDY3_ID3002 45
9 STUDY3_ID3003 50
Anyone have any suggestions?
CodePudding user response:
You can do this in fewer steps, but I find something like this more clear and easy to follow.
The key step is that I would split the ID into two fields: STUDY and ID. You can then search for the STUDY numbers that you want to mask and replace their value.
Then just paste the fields back together (or don't), ungroup, and select the columns you want.
library(dplyr)
library(tidyr)
studies_to_mask <- c("STUDY2")
df %>%
separate(ID, c("STUDY", "ID"), sep = "_") %>%
group_by(STUDY) %>%
mutate(ID_MASK = if_else(STUDY %in% studies_to_mask, paste0("MASK", 1:n()), ID),
ID = paste(STUDY, ID_MASK, sep = "_")) %>%
ungroup() %>%
select(ID, DV)
Results in:
# A tibble: 9 x 2
ID DV
<chr> <dbl>
1 STUDY1_ID1001 10
2 STUDY1_ID1002 15
3 STUDY1_ID1003 20
4 STUDY2_MASK1 25
5 STUDY2_MASK2 30
6 STUDY2_MASK3 35
7 STUDY3_ID3002 40
8 STUDY3_ID3003 45
9 STUDY3_ID3007 50
CodePudding user response:
A base R option -
inds <- grepl('STUDY2_', df$ID)
df$ID[inds] <- paste0(sub('_.*', '_MASK', df$ID[inds]), seq(sum(inds)))
df
# ID DV
# <chr> <dbl>
#1 STUDY1_ID1001 10
#2 STUDY1_ID1002 15
#3 STUDY1_ID1003 20
#4 STUDY2_MASK1 25
#5 STUDY2_MASK2 30
#6 STUDY2_MASK3 35
#7 STUDY3_ID3002 40
#8 STUDY3_ID3003 45
#9 STUDY3_ID3007 50