Create a column based on elements on different columns that start with the same word-CodePudding

I have a long data frame that looks like this

df <- data.frame(col1=c(1,2,3), col2=c(3,4,5), col3=c("s1:i:2", "s1:i:3", "dv:i:2"), col4=c("dv:i:4", "dv:i:4", "NA"))


>df      
  col1 col2   col3   col4
1    1    3 s1:i:2 dv:i:4
2    2    4 s1:i:3 dv:i:4
3    3    5 dv:i:2     NA

I am interested in the values that start with "dv". However, in my table, those values could be on different columns according to the presence or absence of other values. In this example, because of the lack of one s1 value, the dv value is not anymore in col4 but it is on col3. Do you know how I could write all the dv values on a single column? I guess using something like grepl could help, but I am unsure how to do it.

I forgot to add that "dv" could be almost in any column, not only col3 or col4, sorry about that.

CodePudding user response：

Could something simple like this work?

dat$col5 <- ifelse(grepl("dv", dat$col3), dat$col3,
                   ifelse(grepl("dv", dat$col4), dat$col4, NA))

  col1 col2   col3   col4   col5
1    1    3 s1:i:2 dv:i:4 dv:i:4
2    2    4 s1:i:3 dv:i:4 dv:i:4
3    3    5 dv:i:2   <NA> dv:i:2

CodePudding user response：

Here is a tidyverse solution:

We first mutate across col3 and col4

check if they contain the string dv,

if so then we put in the value of this row and

if not we put in a NA.

Then with using the .names argument we unite these columns:

library(dplyr)
library(tidyr)

df %>% 
  mutate(across(c(col3, col4), ~ifelse(str_detect(., "dv"), ., NA_character_),  .names = 'new_{col}')) %>% 
  unite(Col_dv, starts_with('new'), na.rm = TRUE, sep = ' ')

  col1 col2   col3   col4 Col_dv
1    1    3 s1:i:2 dv:i:4 dv:i:4
2    2    4 s1:i:3 dv:i:4 dv:i:4
3    3    5 dv:i:2     NA dv:i:2

CodePudding user response：

If you want to get rid of the initial "dv" from the value you can combine all the columns and then use a regex to identify the text that follows dv: - two ways of doing this.

Option 1: stringi

library(stringi)

df$dv <- stri_match_first_regex(paste(df$col3, df$col4, sep = "_"),
                                "dv:(.*\\d)_*")[, 2]

df

Option 2: tidyr dplyr

library(dplyr)
library(tidyr)

df |>
  mutate(all = paste(col3, col4, sep= "_")) |> 
  extract(all, "dv", "dv:(.*\\d)_*")

Output

#>   col1 col2   col3   col4  dv
#> 1    1    3 s1:i:2 dv:i:4 i:4
#> 2    2    4 s1:i:3 dv:i:4 i:4
#> 3    3    5 dv:i:2     NA i:2

^{Created on 2022-06-22 by the reprex package (v2.0.1)}