I am trying to remove urls that may or may not start with www in a large corpus with R.
For example, I would like to remove
ftse.com
My idea was to remove tokens that start with a space and end with .com, using
gsub("\\s.*\\.com"," ",text)
But this removes the entire stretch of text from the first space up to the last .com. For instance:
gsub("\\s(www.)?.*\\.com"," ","this famous url ftse.com is appreciated")
[1] "this is appreciated"
Instead of "this famous url is appreciated"
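The overshoot can be seen by extracting what the pattern actually matches, using base R's `regexpr()` and `regmatches()`:

```r
txt <- "this famous url ftse.com is appreciated"
# .* is greedy and matches spaces too, so the match runs from the
# first space in the string all the way to the last ".com":
regmatches(txt, regexpr("\\s.*\\.com", txt))
# → " famous url ftse.com"
```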
Any idea?
CodePudding user response:
How about this: "\\s(www\\.)?(\\w|\\-)+?\\.com"
This will also delete URLs containing dashes "-":
gsub(
  "\\s(www\\.)?(\\w|\\-)+?\\.com",
  "",
  c(
    "this famous url ftse.com www.test.com is appreciated",
    "this famous url www.test.com is appreciated",
    "this famous url www.test-test.com is appreciated"
  )
)
#> [1] "this famous url is appreciated" "this famous url is appreciated"
#> [3] "this famous url is appreciated"
Created on 2022-03-30 by the reprex package (v2.0.1)
Explanation
- \s match whitespace
- (www\.)? match www. zero or one time
- (\w|\-)+? match word characters or dashes one or more times, but as few times as possible
- \.com match .com
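Because the quantifier is lazy, each match stops at the first .com after a token, so only the URL itself is removed. A quick check on extra (hypothetical) inputs:

```r
# The lazy +? keeps each match to a single token, so the
# surrounding words survive:
gsub(
  "\\s(www\\.)?(\\w|\\-)+?\\.com",
  "",
  "see site1.com and www.my-site.com here"
)
# → "see and here"
```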
CodePudding user response:
I think you need a better regex for matching the URL part. Here is an example:
urlpatt <- paste0(
  "\\s(https?:\\/\\/)?",                      # match protocol
  "[-a-zA-Z0-9@:%._\\+~#=]+",                 # match (sub-)domain
  "\\.[a-zA-Z0-9()]+",                        # match top-level domain
  "\\b([-a-zA-Z0-9()!@:%_\\+.~#?&\\/\\/=]*)"  # match port & subdirectory
)
urlstr <- "this famous url ftse.com is appreciated"
gsub(urlpatt, "", urlstr)
I tried to break it down into components for you, but you will most probably need to refine it. I hope it's good enough to give you the idea, though.
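To sanity-check it, here is the pattern restated as one self-contained chunk; the second input string is a hypothetical example with a protocol and a path:

```r
# Build the URL pattern and strip matches from two sample strings:
urlpatt <- paste0(
  "\\s(https?:\\/\\/)?",                      # optional protocol
  "[-a-zA-Z0-9@:%._\\+~#=]+",                 # (sub-)domain
  "\\.[a-zA-Z0-9()]+",                        # top-level domain
  "\\b([-a-zA-Z0-9()!@:%_\\+.~#?&\\/\\/=]*)"  # port & subdirectory
)
gsub(urlpatt, "", c(
  "this famous url ftse.com is appreciated",
  "see https://www.example.com/path?q=1 here"
))
# → "this famous url is appreciated" "see here"
```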