Linux bash parsing URL


How can I parse a URL, for example:

so that only virtualbox.org/virtualbox/6.1.36 remains?

TEST_URLS=(
  https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe
  https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe
  https://downloads.sourceforge.net/project/libtirpc/libtirpc/1.3.1/libtirpc-1.3.1.tar.bz2
)

for url in "${TEST_URLS[@]}"; do
  without_proto="${url#*://}"
  without_auth="${without_proto##*@}"
  [[ $without_auth =~ ^([^:/]+)(:[[:digit:]]+/|:|/)?(.*) ]]
  PROJECT_HOST="${BASH_REMATCH[1]}"
  PROJECT_PATH="${BASH_REMATCH[3]}"

  echo "given: $url"
  echo "  -> host: $PROJECT_HOST path: $PROJECT_PATH"
done
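
For the exact string asked for, the loop body above can be extended once PROJECT_HOST and PROJECT_PATH are set; a minimal sketch, assuming every host carries exactly one subdomain label to strip (a bare host such as github.com would lose its first label) and that the path ends in a file name:

  short_host="${PROJECT_HOST#*.}"   # download.virtualbox.org -> virtualbox.org
  dir_path="${PROJECT_PATH%/*}"     # drop the trailing file name
  echo "  -> ${short_host}/${dir_path}"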

CodePudding user response:

Using sed, matching whether a subdomain is present (no matter how deep) or not:

$ sed -E 's~[^/]*//(([^.]*\.)+)?([^.]*\.[a-z]+/[^0-9]*[0-9.]+).*~\3~' <<< "${TEST_URLS[0]}"
virtualbox.org/virtualbox/6.1.36

Or in a loop

for url in "${TEST_URLS[@]}"; do
  sed -E 's~[^/]*//(([^.]*\.)+)?([^.]*\.[a-z]+/[^0-9]*[0-9.]+).*~\3~' <<< "$url"
done

virtualbox.org/virtualbox/6.1.36
github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4
sourceforge.net/project/libtirpc/libtirpc/1.3.1
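
Read piece by piece (a sketch of the same GNU sed call, with the pattern assembled in a shell variable so each part can be annotated):

re='[^/]*//'            # scheme plus "//", e.g. "https://"
re+='(([^.]*\.)+)?'     # optional subdomain labels, e.g. "download."
re+='([^.]*\.[a-z]+/'   # group 3 opens: domain.tld plus "/"
re+='[^0-9]*[0-9.]+)'   # group 3 closes: path up to the version number
re+='.*'                # trailing file name, discarded
sed -E "s~$re~\3~" <<< "$url"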

CodePudding user response:

With your shown samples, here is an awk solution, written and tested in GNU awk.

awk '
match($0,/https?:\/\/([^/]*)(\/.*)\//,arr){
  num=split(arr[1],arr1,".")           # split the host on dots
  if(num>2){                           # a subdomain is present: drop the first label
    firstVal=""
    for(i=2;i<=num;i++){
      firstVal=(firstVal?firstVal ".":"") arr1[i]
    }
  }
  else{                                # bare domain.tld: keep it whole
    firstVal=arr[1]
  }
  print firstVal arr[2]
}
' Input_file

Explanation: using awk's match function here, in its GNU awk form where a third argument stores the capturing groups in an array; that is what fills arr. The regex https?:\/\/([^/]*)(\/.*)\/ (which could also be anchored as ^https?:\/\/([^/]*)(\/.*)\/) creates two capturing groups: the host and the directory path. The host is then split on dots; when it has more than two labels, the first (subdomain) label is dropped, otherwise the bare domain is kept whole, and the result is printed in front of the path.
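
Here Input_file stands for a file holding one URL per line; to test with the sample array from the question, it could be written out first:

printf '%s\n' "${TEST_URLS[@]}" > urls.txt   # one URL per line
# then run the awk program above with urls.txt in place of Input_file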

CodePudding user response:

I thought about a regex, but cut makes this job easy: grep trims the scheme and the trailing file name, and cut drops the leading subdomain label.

url=https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe

echo "$url" | grep -Po '//\K.*(?=/)' | cut -d '.' -f 2-

Result

virtualbox.org/virtualbox/6.1.36
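
Applying the same idea to all three sample URLs needs a guard, since github.com has no subdomain to drop; a sketch combining the grep step with parameter expansion:

for url in "${TEST_URLS[@]}"; do
  hostpath=$(echo "$url" | grep -Po '//\K.*(?=/)')   # host + path, file name trimmed
  host=${hostpath%%/*}
  [[ $host == *.*.* ]] && hostpath=${hostpath#*.}    # drop first label only if a subdomain exists
  echo "$hostpath"
done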

CodePudding user response:

So, if I am correct in assuming that you need to extract a string of the form...

hostname.tld/dirname

...where tld is the top-level domain and dirname is the path to the file.

So it means filtering out any URL scheme and subdomains at the beginning, then also filtering out any file basename at the end. All solutions make assumptions; this one assumes one of the original three-letter top-level domains, i.e. .com, .org, .net, .int, .edu, .gov, .mil.

This possible solution uses sed with the -r option for extended regular expressions. It creates two filters and uses them to chop off the ends that you don't want (hopefully).

It also uses a capture group in filter_end, so as to keep the / in the filter.

test_urls=(
    'https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe'
    'https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe'
    'https://downloads.sourceforge.net/project/libtirpc/libtirpc/1.3.1/libtirpc-1.3.1.tar.bz2'
)

for url in "${test_urls[@]}"
do
    filter_start=$(
        echo "$url" | \
        sed -r 's/([^.\/][a-z]+\.[a-z]{3})\/.*//' )

    filter_end=$(
        echo "$url" | \
        sed 's/.*\(\/\)/\1/g' )

    out_string="${url#$filter_start}"
    out_string="${out_string%$filter_end}"

    echo "$out_string"
done

Output:

virtualbox.org/virtualbox/6.1.36
github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4
sourceforge.net/project/libtirpc/libtirpc/1.3.1
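
A note on the three-letter TLD assumption: hosts such as example.io fall outside [a-z]{3}, so filter_start would not match them; [a-z]{2,3} is a possible relaxation, though compound TLDs like .co.uk would still defeat any label-counting heuristic.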