How to extract href and the linked text or tag out of an <a> tag in the shell?

I have many HTML files with varied content, and I always extract one specific part of each with a command-line tool called pup. The extract sometimes contains <a> tags, which can look like this:

<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

... or like this:

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>

... or even like this:

<a class="someclasses"
    href="mailto:[email protected]" js-class>
    email

</a>

What I'm trying to do is to ...

... extract the href value and the anchor text (the text between <a ...> and </a>).
... put each extract on a separate line, but in reverse order: first the text, then the href value.
... put three characters in front of every href value: an equals sign, a greater-than sign, and a space ("=> ").

So the result looks for example like this:

Visit Duck Duck Go!
=> https://www.duckduckgo.com

I can get what I want with a few chained sed commands and some regex, by creating capture groups and switching the order in which they are printed, as long as everything is on one line, as in the first example. But I have no clue how to do it when the anchor tag is spread over several lines. I tried to achieve my goal with sed alone but had no luck. Yesterday I read about similar problems other people had, and that sed is not meant to work across line breaks. Is this true? Could awk do this? Are there any other tools I could use?

CodePudding user response：

You can try this bash script, though it may not be as efficient as the tools mentioned in the comments.

$ cat input_file
<a class="someclasses"
    href="mailto:[email protected]" js-class>
    email

</a>

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>

<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

#!/usr/bin/env bash

# Split command output on newlines only, so anchor texts containing spaces stay intact.
IFS=$'\n'
f=input_file

# Anchor texts: strip the tags from one-line anchors, then keep only lines without < or >.
first=($(sed -En 's/<.*>(.*)<.*>/\1/g;/<|>/!p' "$f" | sed '/^$/d'))
# href values, in the same document order as the texts.
second=($(sed -En 's|.*href="(.[^ ]*)".*|\1|p;' "$f"))

# Print each text followed by its href on the next line.
for i in "${!first[@]}"; do
    echo "${first[$i]}" $'\n' " => ${second[$i]}"
done

Output

email
  => mailto:[email protected]
Visit Duck Duck Go!
  => https://www.duckduckgo.com
anchor text
  => https://www.stackoverflow.com
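
On the question of whether plain shell tools can cope with anchors that span several lines: one possible approach is to join all lines first and then match one anchor at a time with bash's regex operator. The following is only a sketch under assumptions. The function name extract_links is made up for illustration, and the pattern assumes double-quoted href values and anchor text without nested tags. A real HTML parser is more robust than this.

```bash
#!/usr/bin/env bash

# Hypothetical helper: read HTML on stdin and print each anchor text
# followed by "=> " and its href value.
extract_links() {
    local html href text
    # Assumption: href is double-quoted and the anchor contains no nested tags.
    local anchor_re='<a[[:space:]][^>]*href="([^"]*)"[^>]*>([^<]*)</a>'

    # Join all lines so an anchor spread over several lines becomes one string.
    html=$(tr '\n' ' ')

    while [[ $html =~ $anchor_re ]]; do
        href=${BASH_REMATCH[1]}
        text=${BASH_REMATCH[2]}
        # Trim leading and trailing whitespace from the anchor text.
        text=${text#"${text%%[![:space:]]*}"}
        text=${text%"${text##*[![:space:]]}"}
        printf '%s\n=> %s\n' "$text" "$href"
        # Drop everything up to and including the anchor just matched.
        html=${html#*"${BASH_REMATCH[0]}"}
    done
}
```

It would be called as extract_links < input_file. Joining the lines up front is what sidesteps sed's line-by-line processing; the regex then never sees a line break.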

CodePudding user response：

This can be done by parsing the HTML fragment with xmllint and an XPath expression that selects both the text and the href attribute of every <a> element:

frag=$(cat <<EOF
<div>
<a class="someclasses"
    href="mailto:[email protected]" js-class>
    email

</a>
<a class="someclasses"
    href="http://example.com">
    URL

</a>
<a class="someclasses"
    href="http://example.com/2">
    URL 2

</a>
</div>
EOF
)


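while read -r line; do
    # xmllint prints attribute nodes as href="..." and text nodes as plain lines.
    if [ "${line%%=*}" = 'href' ]; then
        # Strip the leading href= and the quotes; this keeps URLs that contain '='.
        url=${line#href=}
        url=${url//\"/}
    elif [ -n "$line" ]; then
        echo "$line"
        echo "=> $url"
    fi
done < <(echo "$frag" | xmllint --recover --html --xpath "//a/text()| //a/@href" -)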

Result:

email
=> mailto:[email protected]
URL
=> http://example.com
URL 2
=> http://example.com/2

xmllint can also parse HTML files directly, so the here-document is only needed for the demonstration.
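
A rough sketch of the file-based variant, assuming an xmllint version that prints each matched node on its own line (the loop above already depends on that). The function names format_links and links_from_file are hypothetical and exist only for this illustration.

```bash
#!/usr/bin/env bash

# Hypothetical helper: turn xmllint's node-per-line output on stdin
# into pairs of "text" and "=> href".
format_links() {
    local line url=
    while read -r line; do
        if [[ $line == href=* ]]; then
            # Attribute node such as href="http://example.com":
            # strip the attribute name and the quotes.
            url=${line#href=}
            url=${url//\"/}
        elif [[ -n $line ]]; then
            # Non-empty text node: print it with the href seen just before it.
            printf '%s\n=> %s\n' "$line" "$url"
        fi
    done
}

# Hypothetical wrapper: query one HTML file directly.
# stderr is discarded to hide parser warnings on sloppy HTML.
links_from_file() {
    xmllint --recover --html --xpath '//a/text() | //a/@href' "$1" 2>/dev/null | format_links
}
```

For many files, something like for f in *.html; do links_from_file "$f"; done should work. Note that this pairing assumes every <a> element has an href attribute; an anchor without one would reuse the previous URL.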