R How to extract everything in all paragraphs after a specific word?

Time:10-05

Hello, I'm looking for R code to delete every word in a paragraph after a specific term. For example, I want to find "Talk:" and replace everything from there until a new paragraph. I tried regex and spent some time on it, but I can't get it to work ("fjeaofiz" is always still present).

library(stringi)

x <- c("12 3456 789", "Talk: zpfozefpozjgzigzehgoi oezjgzogzjgoezjgo \r fjeaofiz ", "", NA, "Talk: 667")
stri_sub_all(x, stri_locate_all_regex(x, "^Talk:.*\r", omit_no_match=TRUE)) <- "***"
print(x)

My output should be:

x <- c("12 3456 789", "***", "", NA, "***")

Any help?

CodePudding user response:

If the aim is to replace everything from the string Talk onwards (including Talk itself), then this should work:

sub("^Talk.*", "***", x)
[1] "12 3456 789" "***"         ""            NA            "***"  

CodePudding user response:

You need to use

stri_sub_all(x, stri_locate_all_regex(x, "(?s)^Talk:.*", omit_no_match=TRUE)) <- "***"

There are two changes here. First, remove \r from the pattern: your regex matched only the part of the string up to and including the CR character, so everything after it was left in place. Second, add (?s) so that .* matches the rest of the whole string. The stringi package uses the ICU regex flavor, where . does not match line break characters (such as CR and LF) by default; (?s) enables . to match line breaks.
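As a rough sketch of the difference (?s) makes, the comparison below runs the same pattern with and without the flag. It uses stri_replace_first_regex rather than the stri_sub_all assignment from the question, purely to keep the example short, and the sample vector is a shortened version of the one in the question. It assumes the stringi package is installed.

```r
library(stringi)

# Shortened version of the vector from the question
x <- c("12 3456 789", "Talk: zpf \r fjeaofiz ", "", NA, "Talk: 667")

# Without (?s), ICU's dot stops at the carriage return, so the tail survives
no_dotall <- stri_replace_first_regex(x, "^Talk:.*", "***")

# With (?s), the dot also matches line breaks and the whole string is replaced
with_dotall <- stri_replace_first_regex(x, "(?s)^Talk:.*", "***")

print(no_dotall)
print(with_dotall)
```

Only with_dotall should match the output requested in the question; in no_dotall the second element still carries the text after the CR.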

Probably a simpler approach is to use

sub("^Talk:.*", "***", x)

Here, base R's default TRE regex library is used, and in this regex flavor . matches line breaks by default, so no (?s) flag is needed.
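For anyone who wants to check this in base R without extra packages, here is a small sketch on the vector from the question. The second call is an addition not in the answers above: it switches to the PCRE engine with perl = TRUE and uses (?s) explicitly, which should give the same result.

```r
x <- c("12 3456 789",
       "Talk: zpfozefpozjgzigzehgoi oezjgzogzjgoezjgo \r fjeaofiz ",
       "", NA, "Talk: 667")

# Default engine (TRE): the dot matches the CR, so the whole string is replaced
tre_result <- sub("^Talk:.*", "***", x)

# PCRE engine: (?s) makes the dot match line breaks explicitly
pcre_result <- sub("(?s)^Talk:.*", "***", x, perl = TRUE)

print(tre_result)
print(pcre_result)
```

Both vectors should equal the output requested in the question, with the NA element left untouched.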
