Creating new columns with combinations of string patterns in R

Time:06-22

I have a data frame with a column that holds a long string of events separated by _. I want to count certain patterns, and the possible combinations of them, within that string. In the example below I want to count occurrences of events A and B only, and nothing else.

Whether A and B occur once as A_B or B_A, or keep alternating any number of times (for example A_B_A or B_A_B_A), I want to count each such combination, including the case where a combination occurs several times in one string.

Example data frame:

participant <- c("A", "B", "C")
trial <- c(1,1,2)
string_pattern <- c("A_B_A_C_A_B", "B_A_B_A_C_D_A_B", "A_B_C_A_B")

df <- data.frame(participant, trial, string_pattern)

Expected output:

   participant   trial  string_pattern   A_B  B_A  A_B_A  B_A_B B_A_B_A 
1. A               1    A_B_A_C_A_B      2    1    1      0     0
2. B               1    B_A_B_A_C_D_A_B  2    2    1      1     1
3. C               2    A_B_C_A_B        2    0    0      0     0

My code:


library(dplyr)  # attaches the %>% pipe

revised_df <- df %>%
  dplyr::mutate(A_B = stringr::str_count(string_pattern, "A_B"),
                B_A = stringr::str_count(string_pattern, "B_A"),
                B_A_B = stringr::str_count(string_pattern, "B_A_B"))

My approach gets complicated as the number of combinations increases, so I am looking for a better solution.

CodePudding user response:

Try this. It creates one placeholder column per pattern, then counts, for each row's string, how often the current column's name occurs:


library(dplyr)
library(stringr)  # for str_count()

df |> 
  mutate(A_B = 0, B_A = 0, A_B_A = 0, B_A_B = 0, B_A_B_A = 0) |> 
  mutate(across(A_B:B_A_B_A, ~ str_count(string_pattern, cur_column())))
  participant trial  string_pattern A_B B_A A_B_A B_A_B B_A_B_A
1           A     1     A_B_A_C_A_B   2   1     1     0       0
2           B     1 B_A_B_A_C_D_A_B   2   2     1     1       1
3           C     2       A_B_C_A_B   2   0     0     0       0
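
A variation on the same idea, as a rough sketch: keep the patterns in a character vector and build the count columns from it, so no placeholder columns are needed. The vector name `patterns` and the use of a lookahead are assumptions of this sketch, not part of the answer above. `str_count()` counts non-overlapping matches; wrapping each pattern in `(?=...)` counts overlapping occurrences too, which gives the same numbers for the example data but would differ for a string such as A_B_A_B_A.

```r
library(stringr)

df <- data.frame(
  participant = c("A", "B", "C"),
  trial = c(1, 1, 2),
  string_pattern = c("A_B_A_C_A_B", "B_A_B_A_C_D_A_B", "A_B_C_A_B")
)

# Patterns to count (hypothetical vector name; extend as needed)
patterns <- c("A_B", "B_A", "A_B_A", "B_A_B", "B_A_B_A")

# One column of counts per pattern; the lookahead makes matches zero-width,
# so overlapping occurrences are counted as well
counts <- sapply(patterns, function(p) {
  str_count(df$string_pattern, paste0("(?=", p, ")"))
})

# sapply() names the matrix columns after the patterns
revised_df <- cbind(df, counts)
revised_df
```

Dropping the `paste0("(?=", p, ")")` wrapper and passing `p` directly should reproduce the non-overlapping counts of the `across()` version.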

CodePudding user response:

You could write a function to solve this:

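# All substrings of length >= 2 of a compact run such as "ABA",
# returned with the underscores restored ("A_B", "A_B_A", "B_A")
m <- function(s){
  a <- seq(nchar(s)-1)
  start <- rep(a, rev(a))
  stop <- ave(start, start, FUN = \(x)seq_along(x) + x)
  b <- substring(s, start, stop)
  gsub('(?<=\\B)|(?=\\B)', '_', b, perl = TRUE)
}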

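n <- function(x){
  names(x) <- x
  # replace each stretch of events other than A/B with ':', drop "_", split into runs
  a <- strsplit(gsub("_", '', gsub("_[^AB]+_", ':', x)), ':')
  # tabulate every sub-pattern of every run, one row per input string
  b <- t(table(stack(lapply(a, \(y)unlist(sapply(y, m))))))
  data.frame(pattern=x, as.data.frame.matrix(b), row.names = NULL)
}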
  

n(string_pattern)
          pattern A_B A_B_A B_A B_A_B B_A_B_A
1     A_B_A_C_A_B   2     1   1     0       0
2 B_A_B_A_C_D_A_B   2     1   2     1       1
3       A_B_C_A_B   2     0   0     0       0
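
To see how the two helpers work, here is a small sketch that runs their main steps in isolation, assuming the regex in n() is "_[^AB]+_" and the index arithmetic in m() is seq_along(x) + x. The variable names `runs`, `compact`, `start_pos` and `stop_pos` are only for illustration.

```r
x <- c("A_B_A_C_A_B", "B_A_B_A_C_D_A_B", "A_B_C_A_B")

# Step 1 (from n): every stretch of non-A/B events becomes a ':' separator
runs <- gsub("_[^AB]+_", ":", x)
runs      # "A_B_A:A_B" "B_A_B_A:A_B" "A_B:A_B"

# Step 2 (from n): drop the underscores and split into pure A/B runs
compact <- strsplit(gsub("_", "", runs), ":")
compact[[2]]   # "BABA" "AB"

# Step 3 (from m): enumerate every substring of length >= 2 of one run
s <- "ABA"
a <- seq(nchar(s) - 1)                  # 1 2
start_pos <- rep(a, rev(a))             # 1 1 2
stop_pos <- ave(start_pos, start_pos,
                FUN = function(v) seq_along(v) + v)   # 2 3 3
substring(s, start_pos, stop_pos)       # "AB" "ABA" "BA"
```

Each of those substrings is then counted per input string by table(), which is what produces one column per combination that actually occurs in the data.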