I have a long data frame that looks like this
df <- data.frame(col1=c(1,2,3), col2=c(3,4,5), col3=c("s1:i:2", "s1:i:3", "dv:i:2"), col4=c("dv:i:4", "dv:i:4", "NA"))
>df
col1 col2 col3 col4
1 1 3 s1:i:2 dv:i:4
2 2 4 s1:i:3 dv:i:4
3 3 5 dv:i:2 NA
I am interested in the values that start with "dv". However, in my table, those values could be on different columns according to the presence or absence of other values. In this example, because of the lack of one s1 value, the dv value is not anymore in col4 but it is on col3. Do you know how I could write all the dv values on a single column? I guess using something like grepl could help, but I am unsure how to do it.
I forgot to add that "dv" could be almost in any column, not only col3 or col4, sorry about that.
CodePudding user response:
Could something simple like this work?
dat$col5 <- ifelse(grepl("dv", dat$col3), dat$col3,
ifelse(grepl("dv", dat$col4), dat$col4, NA))
col1 col2 col3 col4 col5
1 1 3 s1:i:2 dv:i:4 dv:i:4
2 2 4 s1:i:3 dv:i:4 dv:i:4
3 3 5 dv:i:2 <NA> dv:i:2
CodePudding user response:
Here is a tidyverse solution:
We first mutate
across
col3
and col4
check if they contain the string dv
,
if so then we put in the value of this row and
if not we put in a NA.
Then with using the .names
argument we unite
these columns:
library(dplyr)
library(tidyr)
df %>%
mutate(across(c(col3, col4), ~ifelse(str_detect(., "dv"), ., NA_character_), .names = 'new_{col}')) %>%
unite(Col_dv, starts_with('new'), na.rm = TRUE, sep = ' ')
col1 col2 col3 col4 Col_dv
1 1 3 s1:i:2 dv:i:4 dv:i:4
2 2 4 s1:i:3 dv:i:4 dv:i:4
3 3 5 dv:i:2 NA dv:i:2
CodePudding user response:
If you want to get rid of the initial "dv" from the value you can combine all the columns and then use a regex to identify the text that follows dv:
- two ways of doing this.
Option 1: stringi
library(stringi)
df$dv <- stri_match_first_regex(paste(df$col3, df$col4, sep = "_"),
"dv:(.*\\d)_*")[, 2]
df
Option 2: tidyr
dplyr
library(dplyr)
library(tidyr)
df |>
mutate(all = paste(col3, col4, sep= "_")) |>
extract(all, "dv", "dv:(.*\\d)_*")
Output
#> col1 col2 col3 col4 dv
#> 1 1 3 s1:i:2 dv:i:4 i:4
#> 2 2 4 s1:i:3 dv:i:4 i:4
#> 3 3 5 dv:i:2 NA i:2
Created on 2022-06-22 by the reprex package (v2.0.1)