I am trying to get the base address of several webpages, and this is an example of the dataset I have:
c("https://arealdata-api.miljoeportal.dk/download/dai/BES_NATURTYPER_SHAPE.zip",
"https://download.kortforsyningen.dk/content/matrikelkortet",
"https://b0902-prod-dist-app.azurewebsites.net/geoserver/wfs",
"https://sit-ftp.statens-it.dk/main.html",
"https://arealdata.miljoeportal.dk/datasets/saerligtudpejede",
"https://miljoegis3.mim.dk/spatialmap?profile=privatskovtilskud",
"https://envs.au.dk/fileadmin/envs/Hjemmeside_2018/Zip_filer/Basemap03_public_geotiff.zip",
"https://arealdata-api.miljoeportal.dk/download/dai/BES_VANDLOEB_SHAPE.zip",
"https://wfs2-miljoegis.mim.dk/vp3basis2019/ows?service=WFS&version=1.0.0&request=GetCapabilities",
"httphttps://datasets.catalogue.data.gov.dk/dataset/ramsaromrader",
"https://ens.dk/service/statistik-data-noegletal-og-kort/download-gis-filer",
"https://miljoegis.mim.dk/cbkort?profile=miljoegis-raastofferhavet",
"https://www.marineregions.org/",
"https://CRAN.R-project.org/package=geodata>.",
"https://miljoegis3.mim.dk/spatialmap?profile=vandprojekter",
"https://landbrugsgeodata.fvm.dk/")
As an example, for the first entry I want to keep only the base address "https://arealdata-api.miljoeportal.dk/" and drop the rest, i.e. erase "download/dai/BES_NATURTYPER_SHAPE.zip".
My idea was to keep everything from "https://" up to and including the first "/" that follows it.
These are the variations I have tried so far:
# 1
URLS <- gsub(".*?//", "", URLS)
# 2
URLS <- gsub("http://", "", URLS)
# 3
URLS <- gsub(".*?//", "", URLS)
# 4
URLS <- gsub("/.*", "", URLS)
None of these works.
CodePudding user response:
We could capture ((...)) the substring by matching one or more characters that are not a : ([^:]+) from the start (^) of the string, followed by the : and two slashes (//), then one or more characters that are not a slash ([^/]+) and a slash, leaving the rest of the characters out of the group (.*), and replace with the backreference (\\1) to the captured group:
sub("^([^:] ://[^/] /).*", "\\1", URLS)
-output
[1] "https://arealdata-api.miljoeportal.dk/" "https://download.kortforsyningen.dk/"
[3] "https://b0902-prod-dist-app.azurewebsites.net/" "https://sit-ftp.statens-it.dk/"
[5] "https://arealdata.miljoeportal.dk/" "https://miljoegis3.mim.dk/"
[7] "https://envs.au.dk/" "https://arealdata-api.miljoeportal.dk/"
[9] "https://wfs2-miljoegis.mim.dk/" "httphttps://datasets.catalogue.data.gov.dk/"
[11] "https://ens.dk/" "https://miljoegis.mim.dk/"
[13] "https://www.marineregions.org/" "https://CRAN.R-project.org/"
[15] "https://miljoegis3.mim.dk/" "https://landbrugsgeodata.fvm.dk/"
CodePudding user response:
The other answer provides a better regex pattern, but I would match on https:// explicitly rather than take everything from the start of the string and count slashes (see the 10th URL, which begins with a malformed "httphttps"). I provide an alternative here, just for the fun of it.
my_ptrn <- paste(paste0("https://(.*)",
                        c(".dk", ".net", ".com", ".org")),
                 collapse = "|")
stringr::str_extract(URLS, my_ptrn)
#> [1] "https://arealdata-api.miljoeportal.dk"
#> [2] "https://download.kortforsyningen.dk"
#> [3] "https://b0902-prod-dist-app.azurewebsites.net"
#> [4] "https://sit-ftp.statens-it.dk"
#> [5] "https://arealdata.miljoeportal.dk"
#> [6] "https://miljoegis3.mim.dk"
#> [7] "https://envs.au.dk"
#> [8] "https://arealdata-api.miljoeportal.dk"
#> [9] "https://wfs2-miljoegis.mim.dk"
#> [10] "https://datasets.catalogue.data.gov.dk"
#> [11] "https://ens.dk"
#> [12] "https://miljoegis.mim.dk"
#> [13] "https://www.marineregions.org"
#> [14] "https://CRAN.R-project.org"
#> [15] "https://miljoegis3.mim.dk"
#> [16] "https://landbrugsgeodata.fvm.dk"
CodePudding user response:
Here is a solution, possible only with the help of @akrun (many thanks), using a lookaround regex:
# Split at any character that follows a word character and "/", then keep the first piece
sapply(strsplit(URLS, "(?<=\\w/).", perl = TRUE), `[`, 1)
[1] "https://arealdata-api.miljoeportal.dk/" "https://download.kortforsyningen.dk/" "https://b0902-prod-dist-app.azurewebsites.net/"
[4] "https://sit-ftp.statens-it.dk/" "https://arealdata.miljoeportal.dk/" "https://miljoegis3.mim.dk/"
[7] "https://envs.au.dk/" "https://arealdata-api.miljoeportal.dk/" "https://wfs2-miljoegis.mim.dk/"
[10] "httphttps://datasets.catalogue.data.gov.dk/" "https://ens.dk/" "https://miljoegis.mim.dk/"
[13] "https://www.marineregions.org/" "https://CRAN.R-project.org/" "https://miljoegis3.mim.dk/"
[16] "https://landbrugsgeodata.fvm.dk/"