I want to find the Nth occurrence of a word in an utterance and put [brackets] around it. I have tried various things, and the closest I have got is with gsub, but I can't use {copy-1} as the repetition count in my regex. Any ideas? Can we put a variable in there? Thanks!
library(tidyverse) # provides str_count(), %>%, mutate() and uncount()
#creating my df
utterance <- c("we are not who we think we are", "they know who we are")
df <- data.frame(utterance)
df$occurences = str_count(df$utterance, "we")
df <- df %>% mutate(ID = row_number())
df <- df %>% uncount(occurences) %>% group_by(ID) %>% mutate(copy = row_number())
#this is my gsub
gsub("((?:we){copy-1}.*)we", "\\[we\\]", df$utterance)
This is the result I would like to get:
utterance ID copy
<chr> <int> <int>
1 [we] are not who we think we are 1 1
2 we are not who [we] think we are 1 2
3 we are not who we think [we] are 1 3
4 they know who [we] are 2 1
CodePudding user response:
How about just this:
library(tidyverse)
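f <- function(s, c, target) {
  # start position of the c-th match of target in s (NA if there are fewer than c matches)
  g = gregexpr(target, s)[[1]][c]
  # no such match: return the string unchanged
  if (is.na(g) || g < 0) return(s)
  # text before the match, the bracketed target, then everything after the match
  paste0(str_sub(s, 1, g - 1), "[", target, "]", str_sub(s, g + nchar(target)))
}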
df %>% rowwise() %>% mutate(utterance = f(utterance,copy, "we"))
Output:
utterance ID copy
<chr> <int> <int>
1 [we] are not who we think we are 1 1
2 we are not who [we] think we are 1 2
3 we are not who we think [we] are 1 3
4 they know who [we] are 2 1
Note that this will also find targets that are not whole words. For example, the second occurrence of "we" in "we went where we went yesterday" is the first two letters of the first "went", not the second occurrence of the word "we". If you want to restrict to whole words, you can update the gregexpr() call to this:
g = gregexpr(paste0("\\b",target, "\\b"),s)[[1]][c]
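On the question of putting a variable into the regex itself: the pattern is just a string, so the count can be pasted in before the regex engine sees it. Below is a minimal sketch of that idea, not a tested drop-in. The helper name bracket_nth is made up for illustration, and it assumes target contains no regex metacharacters and that n is a whole number.

```r
# Hypothetical helper: bracket the n-th whole-word occurrence of `target` in `s`.
# Assumes `target` has no regex metacharacters and `n` is a whole number >= 1.
bracket_nth <- function(s, n, target) {
  word <- paste0("\\b", target, "\\b")
  # Lazily skip n - 1 earlier occurrences, capture everything up to the n-th one,
  # then match the n-th occurrence itself.
  pattern <- sprintf("^((?:.*?%s){%d}.*?)%s", word, as.integer(n - 1), word)
  # Put the captured prefix back and add the bracketed target.
  sub(pattern, paste0("\\1[", target, "]"), s, perl = TRUE)
}

x <- "we are not who we think we are"
bracket_nth(x, 2, "we")
# [1] "we are not who [we] think we are"
bracket_nth("we went where we went yesterday", 2, "we")
# [1] "we went where [we] went yesterday"
```

sub() uses only the first pattern it is given, so for a data frame this would still need to be called once per row, for example with rowwise() as above or with mapply(). If the string has fewer than n occurrences, the pattern does not match and the string comes back unchanged.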
CodePudding user response:
Here is a string splitting approach. We can split the input string on we, and then piece together, using [we] as the nth connector.
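repn <- function(x, find, repl, n) {
  # pieces of x between whole-word occurrences of find
  # (strsplit() drops a trailing empty piece, so this assumes x does not end with find)
  parts <- strsplit(x, paste0("\\b", find, "\\b"))[[1]]
  output <- paste0(
    paste0(parts[1:n], collapse=find),                     # first n pieces, rejoined with find
    repl,                                                  # the nth connector
    paste0(parts[(n + 1):length(parts)], collapse=find)    # remaining pieces, rejoined with find
  )
  return(output)
}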
x <- "we are not who we think we are"
repn(x, "we", "[we]", 1)
repn(x, "we", "[we]", 2)
repn(x, "we", "[we]", 3)
[1] "[we] are not who we think we are"
[1] "we are not who [we] think we are"
[1] "we are not who we think [we] are"
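One caveat with splitting: strsplit() does not return a trailing empty piece, so an utterance that ends with the target word (for example "they know who we") has one piece fewer than expected. A possible way around this, swapping in a different technique (base R gregexpr() with regmatches<-) rather than splitting, is sketched below. The function name bracket_nth_rm is hypothetical.

```r
# Hypothetical helper using gregexpr() + regmatches<- instead of strsplit().
bracket_nth_rm <- function(x, find, n) {
  # locate every whole-word occurrence of find
  m <- gregexpr(paste0("\\b", find, "\\b"), x, perl = TRUE)
  hits <- regmatches(x, m)[[1]]
  # only touch the string if an n-th occurrence exists
  if (n <= length(hits)) {
    hits[n] <- paste0("[", hits[n], "]")
    regmatches(x, m) <- list(hits)   # write the modified matches back in place
  }
  x
}

bracket_nth_rm("we are not who we think we are", "we", 3)
# [1] "we are not who we think [we] are"
bracket_nth_rm("they know who we", "we", 1)
# [1] "they know who [we]"
```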
CodePudding user response:
Here's a mixed approach using a number of additional packages:
library(data.table)
library(tibble)
library(dplyr)
library(tidyr)
df %>%
rowid_to_column() %>%
separate_rows(utterance, sep = " ") %>%
group_by(rowid) %>%
mutate(wordcount = ifelse(utterance == "we", rleid(rowid), NA), # simpler: wordcount = ifelse(utterance == "we", 1, NA)
wordcount = cumsum(!is.na(wordcount))) %>%
mutate(utterance = ifelse(utterance == "we" & wordcount == copy, paste0("[", utterance, "]"), utterance)) %>%
summarise(utterance = paste0(utterance, collapse = " ")) %>%
bind_cols(.,df[,2:3])
# A tibble: 4 × 4
rowid utterance ID copy
<int> <chr> <int> <int>
1 1 [we] are not who we think we are 1 1
2 2 we are not who [we] think we are 1 2
3 3 we are not who we think [we] are 1 3
4 4 they know who [we] are 2 1
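The same token-counting idea can be sketched in base R without the extra packages. This is only an illustration: the function name bracket_nth_word is made up, and it assumes words are separated by single spaces, as in the separate_rows(sep = " ") call above.

```r
# Hypothetical base-R version of the token approach.
bracket_nth_word <- function(s, n, target) {
  # split into words on single spaces
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  # position of the n-th word equal to target (NA if there are fewer than n)
  idx <- which(words == target)[n]
  if (!is.na(idx)) words[idx] <- paste0("[", target, "]")
  paste(words, collapse = " ")
}

bracket_nth_word("we are not who we think we are", 2, "we")
# [1] "we are not who [we] think we are"
```

Because whole tokens are compared, "went" is never counted as an occurrence of "we", but a token with attached punctuation such as "we," is not counted either.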