Replace the Nth occurrence of a word (substring) in a string in R, where N is the value of an integer column

I want to find the Nth occurrence of a word in an utterance and put [brackets] around it. I have tried various things, and the closest I have got is with gsub, but I can't use {copy-1} as the repetition count in my regex. Any ideas? Can we put a variable in there? Thanks!

#creating my df
library(tidyverse)  # str_count (stringr), mutate/group_by (dplyr), uncount (tidyr)
utterance <- c("we are not who we think we are", "they know who we are")
df <- data.frame(utterance)
df$occurences = str_count(df$utterance, "we")
df <- df %>% mutate(ID = row_number())
df <- df %>% uncount(occurences) %>% group_by(ID) %>% mutate(copy = row_number())

#this is my gsub
gsub("((?:we){copy-1}.*)we", "\\[we\\]", df$utterance)

This would be my desired result:

    utterance                         ID  copy
    <chr>                          <int> <int>
1 [we] are not who we think we are     1     1
2 we are not who [we] think we are     1     2
3 we are not who we think [we] are     1     3
4 they know who [we] are               2     1

CodePudding user response：

How about just this:

library(tidyverse)

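f <- function(s, c, target) {
  # start position of the c-th match of target in s (NA if there are fewer than c matches)
  g <- gregexpr(target, s)[[1]][c]
  # no such match: return the string unchanged
  if (is.na(g) || g < 0) return(s)
  # text before the match, the bracketed target, then the text after the match
  paste0(str_sub(s, 1, g - 1), "[", target, "]", str_sub(s, g + nchar(target)))
}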

df %>% rowwise() %>% mutate(utterance = f(utterance,copy, "we"))

Output:

  utterance                           ID  copy
  <chr>                            <int> <int>
1 [we] are not who we think we are     1     1
2 we are not who [we] think we are     1     2
3 we are not who we think [we] are     1     3
4 they know who [we] are               2     1

Note that this will also find targets that are not whole words. For example, the second occurrence of "we" in "we went where we went yesterday" is the first two letters of "went", not the second occurrence of the word "we". If you want to restrict matching to whole words, you can update the gregexpr() call to this:

 g = gregexpr(paste0("\\b",target, "\\b"),s)[[1]][c]
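To see the difference concretely, here is a small base R sketch that compares the match positions with and without the word boundaries. The sentence and the variable names are only illustrative, and perl = TRUE is my own choice to make the regex flavour explicit.

```r
# Illustrative sentence (all lowercase, because the matching is case-sensitive)
s <- "we went where we went yesterday"

# Substring matching: also hits the "we" at the start of each "went"
substring_hits <- as.integer(gregexpr("we", s, perl = TRUE)[[1]])

# Whole-word matching: \\b requires a word boundary on both sides
word_hits <- as.integer(gregexpr("\\bwe\\b", s, perl = TRUE)[[1]])

print(substring_hits)  # start positions 1, 4, 15, 18
print(word_hits)       # start positions 1, 15
```

With the boundaries in place, the second match is the standalone "we" at position 15 rather than the one inside "went".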

CodePudding user response：

Here is a string splitting approach. We can split the input string on the word "we" and then piece the parts back together, using "[we]" as the nth connector and plain "we" for all the others.

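repn <- function(x, find, repl, n) {
    # split on whole-word matches of find
    parts <- strsplit(x, paste0("\\b", find, "\\b"))[[1]]
    output <- paste0(
        paste0(parts[1:n], collapse=find),                    # pieces before the nth match
        repl,                                                 # replacement for the nth match
        paste0(parts[(n + 1):length(parts)], collapse=find)   # pieces after the nth match
    )

    return(output)
}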

x <- "we are not who we think we are"
repn(x, "we", "[we]", 1)
repn(x, "we", "[we]", 2)
repn(x, "we", "[we]", 3)

[1] "[we] are not who we think we are"
[1] "we are not who [we] think we are"
[1] "we are not who we think [we] are"
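One caveat with splitting: strsplit() does not return a trailing empty piece when the match is at the very end of the string, so the index arithmetic can go wrong for an utterance that ends in the target word. As a hedged alternative, here is a minimal sketch that keeps the match positions from gregexpr() and writes the bracketed word back with base R's replacement form of regmatches(). The function name bracket_nth is hypothetical, and it assumes find is a plain word (no regex metacharacters) and n is at least 1.

```r
# Hypothetical helper: bracket the n-th whole-word match of `find` in a single string x
bracket_nth <- function(x, find, n) {
  # positions of all whole-word matches
  m <- gregexpr(paste0("\\b", find, "\\b"), x, perl = TRUE)
  # the matched substrings themselves (character(0) if there is no match)
  hits <- regmatches(x, m)[[1]]
  # fewer than n matches: leave the string unchanged
  if (n > length(hits)) return(x)
  # bracket only the n-th match and write all matches back in place
  hits[n] <- paste0("[", hits[n], "]")
  regmatches(x, m) <- list(hits)
  x
}

bracket_nth("we are not who we think we are", "we", 2)
bracket_nth("they know who we", "we", 1)
```

The second call is the case where the target is the last word of the string; it should return "they know who [we]".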

CodePudding user response：

Here's a mixed approach using a number of additional packages:

library(data.table)
library(tibble)
library(dplyr)
library(tidyr)
df %>%
  rowid_to_column() %>%
  separate_rows(utterance, sep = " ") %>%
  group_by(rowid) %>%
  mutate(wordcount = ifelse(utterance == "we", rleid(rowid), NA), # simpler: wordcount = ifelse(utterance == "we", 1, NA)
         wordcount = cumsum(!is.na(wordcount))) %>% 
  mutate(utterance = ifelse(utterance == "we" & wordcount == copy, paste0("[", utterance, "]"), utterance)) %>% 
  summarise(utterance = paste0(utterance, collapse = " ")) %>%
  bind_cols(.,df[,2:3])
# A tibble: 4 × 4
  rowid utterance                           ID  copy
  <int> <chr>                            <int> <int>
1     1 [we] are not who we think we are     1     1
2     2 we are not who [we] think we are     1     2
3     3 we are not who we think [we] are     1     3
4     4 they know who [we] are               2     1
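
On the original question of whether a variable can go inside the regex: the pattern is just a string, so the repetition count can be pasted in before the pattern is handed to sub(). The following is only a sketch under some assumptions: the function name bracket_nth_regex is made up, target is assumed to be a plain word without regex metacharacters, and each utterance is assumed to be a single line. Lazy quantifiers (.*?) are used so that the capture group skips exactly n - 1 earlier whole-word matches.

```r
# Hypothetical helper: build the pattern with the count interpolated, then replace once
bracket_nth_regex <- function(x, target, n) {
  word <- paste0("\\b", target, "\\b")
  # group 1 = everything up to (but not including) the n-th whole-word match
  pattern <- sprintf("^((?:.*?%s){%d}.*?)%s", word, as.integer(n) - 1L, word)
  # keep group 1 and put brackets around the n-th match; no match leaves x unchanged
  sub(pattern, paste0("\\1[", target, "]"), x, perl = TRUE)
}

utterance <- c("we are not who we think we are",
               "we are not who we think we are",
               "we are not who we think we are",
               "they know who we are")
copy <- c(1, 2, 3, 1)

# sub() takes one pattern at a time, so loop over the rows with mapply()
result <- mapply(bracket_nth_regex, utterance, "we", copy, USE.NAMES = FALSE)
print(result)
```

In the data frame from the question, the same call could be used inside mutate() with the utterance and copy columns.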