bash download the first matching regex on download page

Time:09-21

I want to get the newest (first) download link matching a regex.

URL=https://github.com/sharkdp/bat/releases/   # The links are listed on /releases/, although the downloads themselves are under /releases/download/$REL/$BAT
# wget -q (quiet) turns off wget's own output; -O names the file the page is written to, and - sends it to standard output instead.
content=$(wget -q -O - "$URL")
# Parse $content for the link ending in "_amd64.deb"
# At the moment, that will be: href="/sharkdp/bat/releases/download/v0.18.3/bat_0.18.3_amd64.deb"

Then I need to somehow grep / match strings that start with https:// and end with _amd64.deb, and pick the first one in that list. How do I grep / match / pick the first item in this way?

Once I have that, it is easy to download the latest version on the page with wget -P /tmp/ "$DL".

CodePudding user response:

With Bash, you can use

rx='href="(/sharkdp/[^"]*_amd64\.deb)"'
if [[ "$content" =~ $rx ]]; then 
    echo "${BASH_REMATCH[1]}";
else
    echo "No match";
fi
# => /sharkdp/bat/releases/download/v0.18.3/bat-musl_0.18.3_amd64.deb

The href="(/sharkdp/[^"]*_amd64\.deb)" regex matches href=", then captures into Group 1 (${BASH_REMATCH[1]}) the text /sharkdp/, followed by zero or more chars other than ", followed by _amd64.deb, and then just matches the closing ".
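To see the capture group in isolation, here is a minimal sketch that runs the same regex against a small hand-written HTML fragment instead of the live page. The `sample` string and the `first_link` variable are made up for illustration; the real release page will contain much more markup.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the downloaded page (not the real GitHub HTML).
sample='<a href="/sharkdp/bat/releases/download/v0.18.3/bat-musl_0.18.3_amd64.deb">musl</a>
<a href="/sharkdp/bat/releases/download/v0.18.3/bat_0.18.3_amd64.deb">glibc</a>'

rx='href="(/sharkdp/[^"]*_amd64\.deb)"'
first_link=""
if [[ $sample =~ $rx ]]; then
    # BASH_REMATCH[0] holds the whole match, BASH_REMATCH[1] the first capture group.
    first_link=${BASH_REMATCH[1]}
    echo "$first_link"
else
    echo "No match"
fi
```

Because [[ =~ ]] reports the leftmost match and [^"]* cannot run past a closing quote, this prints the first href in the sample, here the bat-musl one.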

With GNU grep, you can use

link=$(grep -oP 'href="\K/sharkdp/[^"]*_amd64\.deb' <<< "$content" | head -1)
echo "$link"
# => /sharkdp/bat/releases/download/v0.18.3/bat-musl_0.18.3_amd64.deb

Here,

  • href="\K/sharkdp/[^"]*_amd64\.deb - matches href=", then \K drops this text from the match, then matches /sharkdp/, any zero or more chars other than " and then _amd64.deb
  • head -1 - only keeps the first match.
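Note that the matched link is a relative path, so it needs the site prefix before it can be passed to wget. The sketch below puts the pieces together on a hand-written fragment. It assumes GNU grep built with PCRE support (for -P); the `sample` string is hypothetical, and the download itself is left commented out so the snippet runs without network access.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for $content (not the real GitHub HTML).
sample='<a href="/sharkdp/bat/releases/download/v0.18.3/bat-musl_0.18.3_amd64.deb">musl</a>
<a href="/sharkdp/bat/releases/download/v0.18.3/bat_0.18.3_amd64.deb">glibc</a>'

# -o prints only the matched text, -P enables PCRE so that \K is available.
link=$(grep -oP 'href="\K/sharkdp/[^"]*_amd64\.deb' <<< "$sample" | head -n 1)

DL=""
if [ -n "$link" ]; then
    # The href is relative, so prepend the host to get a full URL.
    DL="https://github.com$link"
    echo "$DL"
    # With a real page in $content, the download step would be:
    # wget -P /tmp/ "$DL"
else
    echo "No match" >&2
fi
```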