I am trying to remove urls that may or may not start with www in a large corpus with R.
For example, I would like to remove
ftse.com
My idea was to remove tokens that start with a space and end with .com, using
gsub("\\s.*\\.com"," ",text)
But this removes the entire stretch of text from the first space up to the last .com. For instance:
gsub("\\s(www.)?.*\\.com"," ","this famous url ftse.com is appreciated")
[1] "this is appreciated"
Instead of "this famous url is appreciated"
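The overshoot can be seen by extracting what the pattern actually matches, using base R's `regexpr()` and `regmatches()`:

```r
txt <- "this famous url ftse.com is appreciated"
# .* is greedy and matches spaces too, so the match runs from the
# first space in the string all the way to the last ".com":
regmatches(txt, regexpr("\\s.*\\.com", txt))
# → " famous url ftse.com"
```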
Any idea?
CodePudding user response:
How about this: "\\s(www\\.)?(\\w|\\-)+?\\.com"
This will also delete URLs containing dashes "-":
gsub(
  "\\s(www\\.)?(\\w|\\-)+?\\.com",
  "",
  c(
    "this famous url ftse.com www.test.com is appreciated",
    "this famous url www.test.com is appreciated",
    "this famous url www.test-test.com is appreciated"
  )
)
#> [1] "this famous url is appreciated" "this famous url is appreciated"
#> [3] "this famous url is appreciated"
Created on 2022-03-30 by the reprex package (v2.0.1)
Explanation
- \s match whitespace
- (www\.)? match www. zero or one time
- (\w|\-)+? match word characters or dashes one or more times, but as few times as possible
- \.com match .com
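Because the quantifier is lazy, each match stops at the first .com after a token, so only the URL itself is removed. A quick check on extra (hypothetical) inputs:

```r
# The lazy +? keeps each match to a single token, so the
# surrounding words survive:
gsub(
  "\\s(www\\.)?(\\w|\\-)+?\\.com",
  "",
  "see site1.com and www.my-site.com here"
)
# → "see and here"
```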
CodePudding user response:
I think you need a better regex for matching the URL part. Here is an example:
urlpatt <- paste0(
  "\\s(https?:\\/\\/)?",                      # match protocol
  "[-a-zA-Z0-9@:%._\\+~#=]+",                 # match (sub-)domain
  "\\.[a-zA-Z0-9()]+",                        # match top-level domain
  "\\b([-a-zA-Z0-9()!@:%_\\+.~#?&\\/\\/=]*)"  # match port & subdirectory
)
urlstr <- "this famous url ftse.com is appreciated"
gsub(urlpatt, "", urlstr)
I tried to break it down into components for you, but you will most probably need to refine it. I hope it's good enough to give you the idea, though.
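To sanity-check it, here is the pattern restated as one self-contained chunk; the second input string is a hypothetical example with a protocol and a path:

```r
# Build the URL pattern and strip matches from two sample strings:
urlpatt <- paste0(
  "\\s(https?:\\/\\/)?",                      # optional protocol
  "[-a-zA-Z0-9@:%._\\+~#=]+",                 # (sub-)domain
  "\\.[a-zA-Z0-9()]+",                        # top-level domain
  "\\b([-a-zA-Z0-9()!@:%_\\+.~#?&\\/\\/=]*)"  # port & subdirectory
)
gsub(urlpatt, "", c(
  "this famous url ftse.com is appreciated",
  "see https://www.example.com/path?q=1 here"
))
# → "this famous url is appreciated" "see here"
```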