How can I parse a URL, for example https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe, so that only virtualbox.org/virtualbox/6.1.36 remains?
TEST_URLS=(
  https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe
  https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe
  https://downloads.sourceforge.net/project/libtirpc/libtirpc/1.3.1/libtirpc-1.3.1.tar.bz2
)
for url in "${TEST_URLS[@]}"; do
  without_proto="${url#*://}"
  without_auth="${without_proto##*@}"
  [[ $without_auth =~ ^([^:/]+)(:[[:digit:]]+/|:|/)?(.*) ]]
  PROJECT_HOST="${BASH_REMATCH[1]}"
  PROJECT_PATH="${BASH_REMATCH[3]}"
  echo "given: $url"
  echo " -> host: $PROJECT_HOST path: $PROJECT_PATH"
done
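One way to finish the job from there: given the host and path that the loop above extracts, plain parameter expansion can drop the leading subdomain and the trailing filename. A minimal sketch, assuming exactly one leading subdomain (the sample values are hypothetical stand-ins for PROJECT_HOST / PROJECT_PATH):

```shell
# Hypothetical sample values standing in for PROJECT_HOST / PROJECT_PATH
host='download.virtualbox.org'
path='virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe'

apex=${host#*.}    # drop the leading subdomain -> virtualbox.org
dir=${path%/*}     # drop the file basename     -> virtualbox/6.1.36
echo "$apex/$dir"  # virtualbox.org/virtualbox/6.1.36
```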
Answer:
Using sed to match whether a subdomain is present (no matter how deep) or not:
$ sed -E 's~[^/]*//(([^.]*\.)+)?([^.]*\.[a-z]+/[^0-9]*[0-9.]+).*~\3~' <<< "${TEST_URLS[0]}"
virtualbox.org/virtualbox/6.1.36
Or in a loop
for url in "${TEST_URLS[@]}"; do
  sed -E 's~[^/]*//(([^.]*\.)+)?([^.]*\.[a-z]+/[^0-9]*[0-9.]+).*~\3~' <<< "$url"
done
virtualbox.org/virtualbox/6.1.36
github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4
sourceforge.net/project/libtirpc/libtirpc/1.3.1
Answer:
With your shown samples, here is an awk solution, written and tested in GNU awk.
awk '
match($0,/https?:\/\/([^/]*)(\/.*)\//,arr){
  num=split(arr[1],arr1,".")
  if(num>2){
    for(i=2;i<=num;i++){
      firstVal=(firstVal?firstVal ".":"") arr1[i]
    }
  }
  else{
    firstVal=arr[1]
  }
  print firstVal arr[2]
  firstVal=""
}
' Input_file
Explanation: This uses awk's match function, specifically the GNU awk version, which supports storing capturing groups in an array. The regex https?:\/\/([^/]*)(\/.*)\/ (which could also be anchored as ^https?:\/\/([^/]*)(\/.*)\/) creates two capturing groups and fills arr: the host in arr[1] and the directory part of the path in arr[2]. The host is then split on dots; if it has more than two labels, the first label (the subdomain) is dropped and the rest are rejoined, otherwise the host is kept as-is. Finally host and path are printed as per the requirement.
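The "split the host on dots and drop the first label" step that the awk answer performs can be sketched in plain bash as well (the host value below is just an illustration, not read from a file):

```shell
# Count the dot-separated labels of the host; with more than two,
# strip the first one (the subdomain), otherwise keep the host as-is.
host='download.virtualbox.org'
IFS=. read -ra labels <<< "$host"
if (( ${#labels[@]} > 2 )); then
  apex=${host#*.}   # drop the leading subdomain
else
  apex=$host
fi
echo "$apex"   # virtualbox.org
```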
Answer:
I thought about regex, but grep and cut make this easy.
url=https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe
echo "$url" | grep -Po '//\K.*(?=/)' | cut -d '.' -f 2-
Result
virtualbox.org/virtualbox/6.1.36
Answer:
So, if I am correct, you need to extract a string of the form...
hostname.tld/dirname
...where tld is the top-level domain and dirname is the path to the file. That means filtering out any URL scheme and subdomains at the beginning, and also filtering out any file basename at the end. All solutions make assumptions; this one assumes one of the original three-letter top-level domains, i.e. .com, .org, .net, .int, .edu, .gov, .mil.
This possible solution uses sed with the -r option to enable extended regular expressions. It creates two filters and uses them to chop off the ends that you don't want (hopefully). It also uses a capture group in filter_end, so as to keep the / in the filter.
test_urls=(
  'https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe'
  'https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe'
  'https://downloads.sourceforge.net/project/libtirpc/libtirpc/1.3.1/libtirpc-1.3.1.tar.bz2'
)

for url in "${test_urls[@]}"
do
  filter_start=$(
    echo "$url" |
      sed -r 's/([^.\/][a-z]+\.[a-z]{3})\/.*//' )

  filter_end=$(
    echo "$url" |
      sed 's/.*\(\/\)/\1/g' )

  out_string="${url#"$filter_start"}"
  out_string="${out_string%"$filter_end"}"

  echo "$out_string"
done
Output:
virtualbox.org/virtualbox/6.1.36
github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4
sourceforge.net/project/libtirpc/libtirpc/1.3.1
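For the first URL, the intermediate values the two filters produce can be traced like this (a self-contained check of the same sed expressions, with the ERE + quantifiers spelled out):

```shell
url='https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe'

# Everything before "hostname.tld/" (scheme plus subdomain) survives this filter
filter_start=$(echo "$url" | sed -r 's/([^.\/][a-z]+\.[a-z]{3})\/.*//')
echo "$filter_start"   # https://download.

# The capture group keeps the last "/" in front of the basename
filter_end=$(echo "$url" | sed 's/.*\(\/\)/\1/g')
echo "$filter_end"     # /VirtualBox-6.1.36-152435-Win.exe

out_string="${url#"$filter_start"}"
out_string="${out_string%"$filter_end"}"
echo "$out_string"     # virtualbox.org/virtualbox/6.1.36
```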