URL validation in R


I want a function to validate URLs, and I found this one:

is_valid_url <- function(string) {
    any(grepl("(https?|ftp)://[^\\s/$.?#].[^\\s]*", string))
}

I tested it and everything works fine, except that URLs whose domain name starts with "s" return FALSE. Here are the results using R 4.2.0:

> any(grepl("(https?|ftp)://[^\\s/$.?#].[^\\s]*", "www.example.com"))
[1] FALSE
> any(grepl("(https?|ftp)://[^\\s/$.?#].[^\\s]*", "http://example.com"))
[1] TRUE
> any(grepl("(https?|ftp)://[^\\s/$.?#].[^\\s]*", "https://example.com"))
[1] TRUE
> any(grepl("(https?|ftp)://[^\\s/$.?#].[^\\s]*", "123"))
[1] FALSE
> any(grepl("(https?|ftp)://[^\\s/$.?#].[^\\s]*", "https://science.org"))
[1] FALSE
> any(grepl("(https?|ftp)://[^\\s/$.?#].[^\\s]*", "https://stackoverflow.com"))
[1] FALSE

Does anyone know how I can fix the regular expression so that it correctly validates URLs whose domain name starts with "s"?

Thank you.

(FYI: the regular expression used is the "@stephenhay" one from https://mathiasbynens.be/demo/url-regex.)

CodePudding user response:

In a regex, [^...] is a negated character class: the character right after "https://" must not be any of the characters inside the brackets. Base R's default (POSIX) engine does not interpret \s inside a bracket expression as the whitespace class; it treats the backslash and the "s" as literal characters. So [^\\s/$.?#] excludes the letter "s" itself, which is why every URL whose host starts with "s" returns FALSE.
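A quick check makes this visible (a minimal sketch; the results assume the same default regex engine as in the question's R 4.2.0 session):

grepl("[\\s]", "s")               # TRUE: the class is read as the literal characters \ and s
grepl("[\\s]", "s", perl = TRUE)  # FALSE: PCRE interprets \s as whitespace

One fix is to replace \s with a literal space in that first character class, as in the stringr-based version below: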

is_valid_url <- function(string) {
  # Same "@stephenhay" pattern, but with a literal space instead of \s
  # in the first character class
  pattern <- "(https?|ftp)://[^ /$.?#].[^\\s]*"
  stringr::str_detect(string, pattern)
}

url <- c(
  "www.example.com",
  "http://example.com",
  "https://example.com",
  "123",
  "https://science.org",
  "https://stackoverflow.com"
)

is_valid_url(url)
#> [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE

Created on 2022-10-04 by the reprex package (v2.0.1)
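
If you prefer to stay in base R, another option (a sketch, not part of the original answer; the helper name is_valid_url_base is just for illustration) is to keep the unmodified "@stephenhay" pattern and have grepl() use the PCRE engine via perl = TRUE, where \s inside a character class keeps its whitespace meaning:

is_valid_url_base <- function(string) {
  # Original pattern, evaluated by PCRE instead of the default engine
  grepl("(https?|ftp)://[^\\s/$.?#].[^\\s]*", string, perl = TRUE)
}

is_valid_url_base(url)
# Expected: FALSE  TRUE  TRUE FALSE  TRUE  TRUE (same as above)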
