Find latest link from an HTML page listing download locations

I'm trying to build an equivalent of the following GitHub-specific code that finds the latest artifact available for download from https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master. The download links look something like https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5901-5db768d8bbb973ba27c81e424aea2910144a3100/fx.tar.xz.

# Working code for github.com, needs to be converted to fivem.net
LOCATION=$(curl -s https://api.github.com/repos/someuser/somerepo/releases/latest \
| grep "tag_name" \
| awk '{print "https://github.com/someuser/somerepo/archive/" substr($2, 2, length($2)-3) ".zip"}') \
; curl -L -o file.zip $LOCATION

Each link contains a version number that increases from build to build but is not sequential, followed by an effectively random hash.

How can I find the latest download link from the HTML page at https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master?

CodePudding user response：

We can build on the use of lynx -dump, as suggested in "Easiest way to extract the urls from an html page using sed or awk only":

#!/usr/bin/env bash

url_re='https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/([[:digit:]]+)-([[:xdigit:]]+)/fx.tar.xz'
newest_link_num=0
newest_link_content=
while read -r _ link; do
  [[ $link =~ $url_re ]] || continue
  if (( ${BASH_REMATCH[1]} > newest_link_num )); then
    newest_link_num=${BASH_REMATCH[1]}
    newest_link_content=$link
  fi
done < <(lynx -dump -listonly -hiddenlinks=listonly https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master)

echo "Newest link is: $newest_link_content"

As of this writing, it finishes with the following output:

Newest link is: https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5901-5db768d8bbb973ba27c81e424aea2910144a3100/fx.tar.xz
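
To get from there to the download step in the question, the same selection logic can be wrapped in a function and its result handed to curl. This is only a sketch: newest_build_link is a hypothetical helper name, the regex assumes the link layout shown above (build number, hyphen, hex hash, then fx.tar.xz), and the output filename in the usage comment is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: read lines containing URLs on stdin (plain URLs or numbered
# "lynx -dump -listonly" lines) and print the URL with the highest build number.
# newest_build_link is a hypothetical name, not part of the answer above.
newest_build_link() {
  # Group 1 is the whole URL, group 2 is the numeric build prefix.
  local re='(https?://[^[:space:]]*/([[:digit:]]+)-[[:xdigit:]]+/fx\.tar\.xz)'
  local best_num=-1 best_link='' line
  while read -r line; do
    [[ $line =~ $re ]] || continue
    # 10# forces base 10 so a leading zero is never parsed as octal.
    if (( 10#${BASH_REMATCH[2]} > best_num )); then
      best_num=$(( 10#${BASH_REMATCH[2]} ))
      best_link=${BASH_REMATCH[1]}
    fi
  done
  [[ -n $best_link ]] && printf '%s\n' "$best_link"
}

# Possible usage (needs network access and lynx):
#   page=https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master
#   link=$(lynx -dump -listonly -hiddenlinks=listonly "$page" | newest_build_link) &&
#     curl -L -o fx.tar.xz "$link"
```

The comparison is done numerically rather than by sorting the URLs as strings, so a three-digit build such as 999 would not be ranked above 5901.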

CodePudding user response：

I examined https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/ and the latest links (version 5902, i.e. the newest, and version 5848, i.e. the latest recommended) seem to have the is-active class

<a  href="./5902-3c88d7752be75493078c1da898337b0abc2652ff/fx.tar.xz" style="display: block;">

as opposed to older versions. If possible, you should use tools designed for working with HTML, for example hxselect. However, if you are not allowed to install such tools, you might use GNU AWK instead, in the following way

wget -O - https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/ | awk 'BEGIN{RS="<|>"}/is-active/{sub(/^.*href="\./,"");sub(/".*/,"");print "https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master"$0}'

to get output

https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5902-3c88d7752be75493078c1da898337b0abc2652ff/fx.tar.xz
https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/5848-4f71128ee48b07026d6d7229a60ebc5f40f2b9db/fx.tar.xz

Explanation: I inform GNU AWK that the record separator (RS) is < or >, so the inside of each starting or ending tag is treated as a single record. Then, for each record that contains is-active, I replace everything up to and including href=". with an empty string (i.e. delete it), then replace " and everything after it with an empty string (i.e. delete it), and finally print the concatenation of https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master and the extracted href value.

(tested in gawk 4.2.1)
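
The same steps can be wrapped in a small function so that fetching the page and extracting the links are separate. This is a minimal sketch, assuming the page marks the wanted links with is-active and uses relative hrefs starting with ./ as in the tag shown earlier; active_links is a hypothetical name, and the regex RS relies on an awk that supports it (GNU AWK does).

```shell
#!/usr/bin/env bash
# Sketch: read HTML on stdin and print absolute URLs for links whose tag
# mentions is-active. The base URL is passed as the first argument.
# active_links is a hypothetical helper name.
active_links() {
  awk -v base="$1" '
    BEGIN { RS = "<|>" }            # split on < or > so each tag body is one record
    /is-active/ && /href="\.\// {   # only tags flagged is-active with a ./ href
      sub(/^.*href="\./, "")        # delete everything through the leading dot of the href
      sub(/".*/, "")                # delete the closing quote and whatever follows
      print base $0                 # what is left starts with /, so append it to base
    }'
}
```

With network access, something like `wget -qO - https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master/ | active_links https://runtime.fivem.net/artifacts/fivem/build_proot_linux/master | head -n 1` would print only the first active link. In the output shown above that was the newest build, but that ordering is an observation about the page at the time, not a guarantee.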