R: remove every word that ends with ".htm"

Time:11-25

I have a data frame, desc, with a variable "value" that holds long text, and I would like to remove every word in that variable that ends with ".htm". I have searched here and read up on regular expressions for a long time but cannot find a solution.

Can anyone help? Thank you so much!

I tried things like:

desc <- str_replace_all(desc$value, "\*.htm*$", "") 

But I get:

Error: '\*' is an unrecognized escape in character string starting ""\*"

CodePudding user response:

This regex:

  • Catches every word that ends with .htm
  • Does not catch words that end with .html
  • Does not depend on the word being at the beginning or end of the string.
strings <- c("random text shouldbematched.htm notremoved.html matched.htm random stuff")

gsub("\\w+\\.htm\\b", "", strings)

Output:

[1] "random text  notremoved.html  random stuff"
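
As a rough sketch of how a pattern like this could be applied to a data frame column such as the one in the question: the desc below is a made-up stand-in (the real data is not shown), and the space clean-up step is an optional extra rather than part of the answer.

```r
# Hypothetical stand-in for the asker's data frame; the real `desc` is not shown.
desc <- data.frame(
  value = c("see index.htm and guide.html for details",
            "no links here"),
  stringsAsFactors = FALSE
)

# Remove whole words ending in ".htm"; the trailing \\b leaves ".html" words alone.
desc$value <- gsub("\\w+\\.htm\\b", "", desc$value)

# Optional: collapse the doubled spaces left where a word was removed.
desc$value <- trimws(gsub(" {2,}", " ", desc$value))

print(desc$value)
```

Note that \\w only covers letters, digits and underscores, so a name such as my-page.htm would only lose the "page.htm" part.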

CodePudding user response:

I am not sure what exactly you would like to accomplish, but I guess one of those is what you are looking for:

library(stringr)

words <- c("apple", "test.htm", "friend.html", "remove.htm")

# just remove the literal ".htm" from every string (note: "friend.html" becomes "friendl")
str_replace_all(words, fixed(".htm"), "")

# exclude all words that contain ".htm" anywhere
words[!grepl(pattern = ".htm", words, fixed = TRUE)]

# exclude all words that END with ".htm"
words[substr(words, nchar(words)-3, nchar(words)) != ".htm"]
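
For the last case, an anchored regular expression or base R's endsWith() may read more clearly than the substr() arithmetic. This is only a sketch using the same example vector; it assumes each element is a single word.

```r
words <- c("apple", "test.htm", "friend.html", "remove.htm")

# "\\.htm$" means: a literal dot, then "htm", then the end of the string.
ends_htm <- grepl("\\.htm$", words)

# endsWith() does the same check without any regular expression.
stopifnot(identical(ends_htm, endsWith(words, ".htm")))

# Keep only the words that do not end with ".htm".
kept <- words[!ends_htm]
print(kept)
```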

CodePudding user response:

In a regular expression, * is not a wildcard for "any value"; it means "zero or more of the preceding item", so I would first remove it. Also, in your code you assign the result to desc, which replaces the entire data frame with the modified "value" column.

So I would suggest the following:

desc$value <- str_replace_all(desc$value, "\\.htm\\b", "")

By doing so, you are telling R to remove every .htm in the desc$value variable alone. Note that this strips only the ".htm" ending and keeps the rest of the word. I hope it works!
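
To illustrate the point about overwriting the data frame, here is a small sketch. The desc below is a made-up stand-in, and base R's gsub() is used as a rough analogue of str_replace_all() so the example has no package dependency.

```r
# Hypothetical stand-in for the asker's data frame.
desc <- data.frame(value = c("page.htm", "plain text"), stringsAsFactors = FALSE)

# Replacement functions return a character vector, not a data frame.
cleaned <- gsub("\\.htm\\b", "", desc$value)
is.data.frame(cleaned)   # FALSE

# Writing `desc <- cleaned` would throw away the data frame and any other columns.
# Assigning to the column keeps the data frame intact.
desc$value <- cleaned
is.data.frame(desc)      # TRUE
```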

CodePudding user response:

Let's assume you have, as you say, a variable "value" that holds long text and you want to remove every word that ends in .html. Based on these assumptions you can use str_remove_all.

The main point here is to wrap the pattern in word boundary markers \\b:

library(stringr)
str_remove_all(value, "\\b\\w+\\.html\\b")
[1] "apple  and test2.html01" "the word  must etc. and  as well" "we want to remove .htm"

Data:

value <- c("apple test.html and test2.html01", 
           "the word friend.html must etc. and x.html as well", 
           "we want to remove .htm")

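To see what the trailing \\b contributes, the following sketch compares a pattern with and without it. It uses the ".htm" ending from the question rather than ".html", and base R's gsub() instead of stringr; the input string is made up for illustration.

```r
x <- "keep.html drop.htm"

# Without a trailing \\b, "keep.htm" matches as a prefix of "keep.html",
# so only the final "l" of that word survives.
no_boundary <- gsub("\\w+\\.htm", "", x)

# With \\b, the match must end at a word boundary, so "keep.html" is untouched.
with_boundary <- gsub("\\w+\\.htm\\b", "", x)

print(no_boundary)
print(with_boundary)
```

The first call leaves "l " and the second leaves "keep.html " (with the space that preceded the removed word).
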
CodePudding user response:

To achieve what you want, just do:

desc$value <- str_replace_all(desc$value, "\\S*\\.htm\\b", "")

Escaping the star is unnecessary, and it is also what causes the error: \* is not a valid escape sequence in an R string. Only sequences such as \n, \t, etc. are.

\. is not a valid escape in an R string either. But \\ is, and it produces a single \ in the resulting string that is passed to the regular expression engine. Therefore, when you escape something in an R regexp you have to write the backslash twice.

In my regexp, \\S* means any run of non-whitespace characters (the rest of the word), \\. means a literal dot, and \\b is a word boundary so that words ending in .html are left alone. The dot needs a double backslash because the \ itself has to be escaped in the R string first.
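
The double escaping can be checked directly in base R. This is just an illustrative sketch; the test strings are made up.

```r
# The R string "\\.htm" holds five characters: \ . h t m
pattern <- "\\.htm"
nchar(pattern)        # 5
writeLines(pattern)   # prints \.htm, which is what the regex engine receives

# An unescaped dot matches any character, so "xhtm" matches ".htm".
any_char    <- grepl(".htm", "xhtm")      # TRUE
literal_dot <- grepl("\\.htm", "xhtm")    # FALSE: there is no real dot
real_match  <- grepl("\\.htm", "a.htm")   # TRUE
```

In R 4.0 or later, a raw string such as r"(\.htm)" gives the same pattern without doubling the backslash.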
