I have a data frame, desc, with a variable "value" that holds long text, and I would like to remove every word in that variable that ends with ".htm". I have searched here and read about regular expressions for a long time but cannot find a solution.
Can anyone help? Thank you so much!
I tried things like:
desc <- str_replace_all(desc$value, "\*.htm*$", "")
But I get:
Error: '\*' is an unrecognized escape in character string starting ""\*"
CodePudding user response:
This regex:
- Will catch every word that ends with .htm
- Will not catch words that end with .html
- Does not depend on the match being at the beginning or end of the string.
strings <- c("random text shouldbematched.htm notremoved.html matched.htm random stuff")
gsub("\\w+\\.htm\\b", "", strings)
Output (the spaces around each removed word are left behind):
[1] "random text  notremoved.html  random stuff"
CodePudding user response:
I am not sure exactly what you would like to accomplish, but I guess one of these is what you are looking for:
library(stringr)
words <- c("apple", "test.htm", "friend.html", "remove.htm")
# just remove the literal ".htm" from every string
# (fixed() stops the dot from matching any character)
str_replace_all(words, fixed(".htm"), "")
# exclude all words that contain ".htm" anywhere
words[!grepl(".htm", words, fixed = TRUE)]
# exclude all words that END with ".htm"
words[substr(words, nchar(words)-3, nchar(words)) != ".htm"]
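The snippets above treat each element of the vector as a single word. For long text, as in the question, one possible approach is to split the text into words first and then filter. This is a minimal base R sketch under the assumption that words are separated by single spaces; the names text and kept are only for illustration.

```r
# Example sentence (made up) containing words that end in .htm and .html
text <- "random text shouldbematched.htm notremoved.html matched.htm random stuff"

# Split on single spaces to get one word per element
words <- strsplit(text, " ", fixed = TRUE)[[1]]

# Keep only the words that do NOT end with ".htm"
kept <- words[!endsWith(words, ".htm")]

# Glue the remaining words back into one string
result <- paste(kept, collapse = " ")
result
```

Because the words are re-joined with a single space, no doubled spaces are left behind where a word was dropped.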
CodePudding user response:
You do not need * here: in a regular expression it means "zero or more of the preceding item", not "any text", so I would remove it first. Also, your code assigns the result to desc, which replaces the entire data frame instead of just the variable "value".
So I would suggest the following:
desc$value <- str_replace_all(desc$value, "\\.htm", "")
This tells R to remove every .htm in desc$value alone, leaving the rest of the data frame untouched. Note that it strips only the ".htm" text, not the whole word it belongs to. I hope it works!
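To make the point about assignment concrete, here is a small base R sketch. The data frame contents and the extra id column are invented for illustration, and gsub() with fixed = TRUE stands in for the stringr call.

```r
# Hypothetical data frame standing in for the asker's `desc`
desc <- data.frame(
  id = 1:2,
  value = c("see page.htm for details", "nothing to change here"),
  stringsAsFactors = FALSE
)

# Assigning to the column keeps the data frame and its other columns intact.
# Writing `desc <- gsub(...)` instead would replace `desc` with a character vector.
desc$value <- gsub(".htm", "", desc$value, fixed = TRUE)

desc$value
is.data.frame(desc)
```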
CodePudding user response:
Let's assume you have, as you say, a variable "value" that holds long text and you want to remove every word that ends in .html. Based on these assumptions you can use str_remove_all. The main point here is to wrap the pattern in word boundary markers (\\b):
library(stringr)
str_remove_all(value, "\\b\\w+\\.html\\b")
[1] "apple  and test2.html01" "the word  must etc. and  as well" "we want to remove .htm"
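Removing a word leaves the spaces on both sides of it behind, so the cleaned text can contain doubled spaces. One possible cleanup is sketched here in base R rather than stringr; using gsub() with perl = TRUE is a substitution on my part, and the name cleaned is only for illustration.

```r
value <- c("apple test.html and test2.html01",
           "the word friend.html must etc. and x.html as well",
           "we want to remove .htm")

# Step 1: drop whole words ending in .html, using word boundaries
cleaned <- gsub("\\b\\w+\\.html\\b", "", value, perl = TRUE)

# Step 2: collapse runs of spaces left behind, then trim both ends
cleaned <- trimws(gsub(" {2,}", " ", cleaned))

cleaned
```

If you stay within stringr, str_squish() performs the same whitespace cleanup in one call.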
Data:
value <- c("apple test.html and test2.html01",
"the word friend.html must etc. and x.html as well",
"we want to remove .htm")
CodePudding user response:
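To get rid of the error, write the pattern as shown here. Be aware that .* is greedy, so this removes everything up to and including a .htm at the very end of the string, not just a single word: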
desc$value <- str_replace(desc$value, ".*\\.htm$", "")
Escaping the star is unnecessary. You get an error because \* does not exist in R strings; only escapes such as \n, \t, etc. do. \. does not exist in R strings either. But \\ exists, and it produces a single \ in the resulting string used for the regular expression. Therefore, when you escape something in an R regexp you have to escape it twice.
In my regexp, .* means any characters and \\. means a literal dot. I have to escape it twice because the \ needs to be escaped first in the R string.
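A quick way to see the double escaping at work is to inspect the string R actually builds. This is a minimal base R sketch; the variable name pattern and the two test strings are made up for illustration.

```r
# "\\." is a two-character string: a backslash followed by a dot
pattern <- "\\."
nchar(pattern)        # 2
writeLines(pattern)   # prints \.

# The regex engine therefore sees \. and matches only a literal dot
escaped_hits <- grepl(pattern, c("file.htm", "filexhtm"))
escaped_hits          # TRUE FALSE

# An unescaped dot matches any character, so both strings match
unescaped_hits <- grepl(".htm", c("file.htm", "filexhtm"))
unescaped_hits        # TRUE TRUE
```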