I have a lot of HTML files with varied content, and I always extract a specific part of each one with a command-line tool called pup. The extract sometimes contains tags that look like this:
<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>
... or like this:
<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>
... or even like this:
<a class="someclasses"
href="mailto:[email protected]" js-class>
email
</a>
What I'm trying to do is to ...
- ... extract the href value and the anchor text (the text between <a ...> and </a>).
- ... put both extracts on separate lines, but in reverse order: first the text, then the href value.
- ... put three characters in front of every href value: "=> " (equals sign, greater-than sign, space).
So the result looks, for example, like this:
Visit Duck Duck Go!
=> https://www.duckduckgo.com
I can get what I want with a few chained sed commands and some regular expressions, by creating capture groups and swapping the order in which they are printed, as long as everything is on one line, as in the first example. But I have no idea how to do this when the anchor tag is spread over several lines. I tried to achieve my goal with sed alone, but had no luck. Yesterday I read about similar problems other people had, and that sed is not meant to work across line breaks. Is this true? Could awk do this? Are there any other tools I could use?
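On the awk question, a minimal sketch of one way it could work: with the record separator set to "<", every tag starts a new record, so line breaks inside or after the tag no longer matter. The function name anchors_awk is made up for illustration, and the sketch assumes double-quoted href values and anchors without nested tags or a ">" inside attribute values.

```shell
# Hypothetical helper: reads HTML on stdin, prints the anchor text
# followed by "=> " and the href value.
anchors_awk() {
  awk 'BEGIN { RS = "<" }            # each record starts right after a "<"
  /^a[ \t\n]/ {                      # only records that open an <a ...> tag
    gt   = index($0, ">")            # end of the opening tag
    tag  = substr($0, 1, gt - 1)     # attributes, possibly spanning lines
    text = substr($0, gt + 1)        # everything up to the next "<"
    gsub(/^[ \t\n]+|[ \t\n]+$/, "", text)   # trim surrounding whitespace
    if (match(tag, /href="[^"]*"/)) {
      # skip the 6 characters of href=" and drop the closing quote
      url = substr(tag, RSTART + 6, RLENGTH - 7)
      print text
      print "=> " url
    }
  }'
}
```

It could be called as anchors_awk < extract.html, or placed at the end of the pup pipeline.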
CodePudding user response:
You can try this bash script, though it may not be as efficient as the tools mentioned in the comments.
$ cat input_file
<a class="someclasses"
href="mailto:[email protected]" js-class>
email
</a>
<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>
<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>
#!/usr/bin/env bash
# Split command output on newlines only, so multi-word anchor texts stay intact.
IFS=$'\n'
# Anchor texts: strip the tags from one-line anchors, then keep only tag-free, non-empty lines.
first=($(sed -En 's/<.*>(.*)<.*>/\1/g;/<|>/!p' input_file | sed '/^$/d'))
# href values, in the same order as the anchor texts.
second=($(sed -En 's|.*href="(.[^ ]*)".*|\1|p' input_file))
# Print each anchor text followed by its link.
for i in "${!first[@]}"; do
  printf '%s\n=> %s\n' "${first[$i]}" "${second[$i]}"
done
Output
email
=> mailto:[email protected]
Visit Duck Duck Go!
=> https://www.duckduckgo.com
anchor text
=> https://www.stackoverflow.com
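The sed expressions above work one line at a time, so they depend on the anchor text sitting either on its own line or on the same line as the whole tag. As a hedged variant, the input can be flattened first so that every anchor ends up on a single line before sed sees it. This sketch assumes GNU sed (for \n in the replacement), double-quoted href values, and anchors without nested tags; the function name anchors_flat is made up for illustration.

```shell
# Hypothetical helper: reads HTML on stdin.
anchors_flat() {
  # 1. tr joins all lines into one.
  # 2. grep -o prints each <a ...>...</a> match on a line of its own.
  # 3. sed swaps the two captured groups: text first, then "=> " and the href.
  tr '\n' ' ' |
    grep -o '<a [^>]*>[^<]*</a>' |
    sed -E 's/.*href="([^"]*)"[^>]*>[[:space:]]*([^<]*[^<[:space:]])[[:space:]]*<\/a>/\2\n=> \1/'
}
```

Usage would be anchors_flat < input_file. Anchors with empty text are not rewritten by the sed expression and come out unchanged.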
CodePudding user response:
It could be done by parsing the HTML fragments with xmllint and XPath expressions:
frag=$(cat <<EOF
<div>
<a class="someclasses"
href="mailto:[email protected]" js-class>
email
</a>
<a class="someclasses"
href="http://example.com">
URL
</a>
<a class="someclasses"
href="http://example.com/2">
URL 2
</a>
</div>
EOF
)
while read -r line; do
if [ "${line%%=*}" == 'href' ]; then
# -f 2- keeps the whole value even if the URL itself contains "="
url=$(cut -d '=' -f 2- <<<"$line" | tr -d '"')
elif [ -n "$line" ]; then
echo "$line"
echo "=> $url"
fi
done < <(echo "$frag" | xmllint --recover --html --xpath "//a/text()| //a/@href" -)
Result:
email
=> mailto:[email protected]
URL
=> http://example.com
URL 2
=> http://example.com/2
xmllint could also be used to parse HTML files directly.
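As a rough sketch of the file-based variant, the loop can be wrapped in a function and xmllint pointed at a file instead of stdin. The names format_links and links_from_file are made up for illustration, and the sketch assumes an xmllint version that prints each matched node on its own line, as in the result shown above.

```shell
# Hypothetical helper: turns xmllint's node output (href="..." lines
# and text lines) into "text" / "=> url" pairs.
format_links() {
  url=
  while read -r line; do              # read also trims leading blanks
    if [ "${line%%=*}" = 'href' ]; then
      url=${line#href=\"}             # drop the leading href="
      url=${url%\"}                   # drop the closing quote
    elif [ -n "$line" ]; then
      printf '%s\n=> %s\n' "$line" "$url"
    fi
  done
}

# Hypothetical wrapper: same xmllint options as above, but reading a file.
links_from_file() {
  xmllint --recover --html --xpath '//a/text() | //a/@href' "$1" 2>/dev/null |
    format_links
}
```

A call such as links_from_file page.html would then print the pairs for every anchor in that file, where page.html stands for any of the extracted HTML files.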