R: Create set of variables that print all matched values in a series of column only if condition is-CodePudding

I am trying to create a set of variables that prints out non-redundant values from another series of column if the values match the condition.

For example, my database looks something like this (but with many more columns):

Var_1   Var_2   Var_3
C21    S066X0A  S069X9A
I618    D06     I629
H2710   J1029   C71
S066X9D S066X9D I618

If a value starts with [S][0][6][4-6] or [I][6] then I would like them printed, provided that the same value has not been printed before. So the output would look like:

Var_1   Var_2   Var_3     Out_1    Out_2    
C21     S066X0A S069X9A  S066X0A
I618    D06     I629     I618      I629
H2710   J1029   C71
S066X9D S066X9D I618     S066X9D   I618

CodePudding user response：

I have taken the liberty of adding a third output column Out_3 in order to handle the scenario where there are 3 unprinted matches in a row

df <- textConnection('
Var_1,Var_2,Var_3
C21,S066X0A,S069X9A
I618,D06,I629
H2710,J1029,C71
S066X9D,S066X9D,I618
') |> 
  read.csv(header = TRUE)

df$Out_1 <- ''
df$Out_2 <- ''
df$Out_3 <- ''

lapply(seq_along(df), function(x) {
  print(df[ x, ])
  matches <- str_extract_all(df[ x, ], '^S06[4-6].*|^I6.*') |>
    unlist() |>
    na.omit() |>
    as.character()
  idx <- which((matches %in% c(df$Out_1, df$Out_2, df$Out_3)) == FALSE)
  if (length(idx) > 0) {
    matches <- matches[ idx ]
    lapply(seq_along(matches), function(i) {
      cn <- sprintf('Out_%s', i)
      df[ x, cn ] <<- matches[ i ]
    })
  }
}) |> invisible()

print(df)

    Var_1   Var_2   Var_3   Out_1   Out_2 Out_3
1     C21 S066X0A S069X9A S066X0A              
2    I618     D06    I629    I618    I629      
3   H2710   J1029     C71                      
4 S066X9D S066X9D    I618 S066X9D S066X9D

CodePudding user response：

I think this will be way easier to deal with in a long format, as you can avoid looping over variables or needing to deal with a changing data structure and potentially lots of empty cells in the wide format.

longdat <- data.frame(id=seq(nrow(dat)), values=unlist(dat), row.names=NULL)
longdat <- longdat[order(longdat$id),]
##   id  values
##1   1     C21
##5   1 S066X0A
##9   1 S069X9A
##2   2    I618
##6   2     D06
##10  2    I629
##3   3   H2710
##7   3   J1029
##11  3     C71
##4   4 S066X9D
##8   4 S066X9D
##12  4    I618

Flagging particular values or determining uniqueness in groups is also then very straight-forward:

longdat$flag <- grepl("S06[4-6]|I6", longdat$values)
unique(longdat[longdat$flag,])
##       id  values flag
##Var_21  1 S066X0A TRUE
##Var_12  2    I618 TRUE
##Var_32  2    I629 TRUE
##Var_14  4 S066X9D TRUE
##Var_34  4    I618 TRUE

Where dat was:

dat <- read.table(text="Var_1   Var_2   Var_3
C21    S066X0A  S069X9A
I618    D06     I629
H2710   J1029   C71
S066X9D S066X9D I618", header=TRUE)