I am trying to create a set of variables that prints out non-redundant values from another series of column if the values match the condition.
For example, my database looks something like this (but with many more columns):
Var_1 Var_2 Var_3
C21 S066X0A S069X9A
I618 D06 I629
H2710 J1029 C71
S066X9D S066X9D I618
If a value starts with [S][0][6][4-6]
or [I][6]
then I would like them printed, provided that the same value has not been printed before. So the output would look like:
Var_1 Var_2 Var_3 Out_1 Out_2
C21 S066X0A S069X9A S066X0A
I618 D06 I629 I618 I629
H2710 J1029 C71
S066X9D S066X9D I618 S066X9D I618
CodePudding user response:
I have taken the liberty of adding a third output column Out_3
in order to handle the scenario where there are 3 unprinted matches in a row
df <- textConnection('
Var_1,Var_2,Var_3
C21,S066X0A,S069X9A
I618,D06,I629
H2710,J1029,C71
S066X9D,S066X9D,I618
') |>
read.csv(header = TRUE)
df$Out_1 <- ''
df$Out_2 <- ''
df$Out_3 <- ''
lapply(seq_along(df), function(x) {
print(df[ x, ])
matches <- str_extract_all(df[ x, ], '^S06[4-6].*|^I6.*') |>
unlist() |>
na.omit() |>
as.character()
idx <- which((matches %in% c(df$Out_1, df$Out_2, df$Out_3)) == FALSE)
if (length(idx) > 0) {
matches <- matches[ idx ]
lapply(seq_along(matches), function(i) {
cn <- sprintf('Out_%s', i)
df[ x, cn ] <<- matches[ i ]
})
}
}) |> invisible()
print(df)
Var_1 Var_2 Var_3 Out_1 Out_2 Out_3
1 C21 S066X0A S069X9A S066X0A
2 I618 D06 I629 I618 I629
3 H2710 J1029 C71
4 S066X9D S066X9D I618 S066X9D S066X9D
CodePudding user response:
I think this will be way easier to deal with in a long format, as you can avoid looping over variables or needing to deal with a changing data structure and potentially lots of empty cells in the wide format.
longdat <- data.frame(id=seq(nrow(dat)), values=unlist(dat), row.names=NULL)
longdat <- longdat[order(longdat$id),]
## id values
##1 1 C21
##5 1 S066X0A
##9 1 S069X9A
##2 2 I618
##6 2 D06
##10 2 I629
##3 3 H2710
##7 3 J1029
##11 3 C71
##4 4 S066X9D
##8 4 S066X9D
##12 4 I618
Flagging particular values or determining uniqueness in groups is also then very straight-forward:
longdat$flag <- grepl("S06[4-6]|I6", longdat$values)
unique(longdat[longdat$flag,])
## id values flag
##Var_21 1 S066X0A TRUE
##Var_12 2 I618 TRUE
##Var_32 2 I629 TRUE
##Var_14 4 S066X9D TRUE
##Var_34 4 I618 TRUE
Where dat
was:
dat <- read.table(text="Var_1 Var_2 Var_3
C21 S066X0A S069X9A
I618 D06 I629
H2710 J1029 C71
S066X9D S066X9D I618", header=TRUE)