How to download URLs from the website and save them in a file (wget, curl)?-CodePudding

EDITION.

How can I download addresses to the file from the terminal?

Can it be done with the help of WGET?

Can it be done with the help of CURL?

CodePudding user response：

To save the output of the wget command that you provided above, add the following at the end of your command line:

-O ${vidfileUniqueName}.${fileTypeSuffix}

Before that wget, you will need to define something like the following:

vidfileUniqueName=$(echo "${URL}" | cut -f10 -d\/ )

fileTypeSuffix="mp4|avi|mkv"

You need to choose only one of the suffix types from that list and remove the others.

CodePudding user response：

A couple of scripts, one reusable for other stuff.

Script #1: "normalize" lines so that they are not too long (Reformatter_01.sh):

#!/bin/sh

BASE=`basename "$0" ".sh" `
TMP="/tmp/tmp.$$.${BASE}"

{
    if [ -n "${1}" ]
    then
        cat ${1}
    else
        cat
    fi 
} |
    sed 's ^\t  ' |
    awk 'BEGIN{
        splitter=";" ;
    }{
    rem=$0 ;
    if( rem != "" || rem !~ /\s*/ ){
        n=index( rem, "{" )
        while( n != 0 ){
            beg=substr( rem, 1, n ) ;
            print beg ;
            rem=substr( rem, n 1 ) ;
            n=index( rem, "{" ) ;
        } ;
        if( rem != "" || rem !~ /\s*/ ){
            print rem ;
        } ;
    } ;
}'

Script #2: Download and filter the page for URLs:

#!/bin/sh

URL="https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-2984/585ddf5a3dde69cb58c7f42ba52790a4"

BASE=$(basename "$0" ".sh")
TMP="${BASE}.tmp"
HARVESTED="${BASE}.harvest"
DISTILLED="${BASE}.urls"

if [ ! -s "${TMP}" ]
then
    curl "${URL}" >"${TMP}"
fi

if [ ! -s "${TMP}.2" ]
then
    Reformatter_01.sh "${TMP}" >"${TMP}.2" &
    PID1=$!
    
    tail -f "${TMP}.2" &
    PID2=$!
    
    wait ${PID1}
    kill -9 ${PID2}
    
    echo "\n\n  END of REFORMATTING ...\n\n"
fi

### Separate any URLs from the rest of the HTML/CSS/JavaScript
grep 'http://' "${TMP}.2" | sed 's http\: \nhttp\: g' | sed 's https\: \nhttps\: g' >"${TMP}.3"
grep 'https://' "${TMP}.2" | sed 's https\: \nhttps\: g' | sed 's http\: \nhttp\: g' >>"${TMP}.3"

### Ignore any image URLs clearly not related to videos
grep '^http' "${TMP}.3"     |
    cut -f1-2 -d\:      |
    grep -v '\.png\"'   |
    grep -v '\.svg\"'   |
    grep -v '\.json\"'  |
    grep -v '\.jpg\"'   |
    grep -v '\.jpeg\"'  |
    sort            |
    uniq > "${HARVESTED}"

echo "\n Harvested Links (view of remaining URLs):"
more "${HARVESTED}"

### Distill for only grouping you are interesed.  This could be a command line parameter. (NOTE: not same as full page URL)
grep 'https://polsatboxgo.pl/wideo/seriale/pierwsza-milosc/5027238/sezon-44/5027472/pierwsza-milosc-odcinek-' "${HARVESTED}" >"${DISTILLED}"

echo "\n Video Links:"
more "${DISTILLED}"

### add more coding for cleanup parsing  here