Download web page and remove content except for one html table

Time:05-26

I am given a large html report from another department quite regularly that requires a fair amount of manual work to edit into a format that is required.

I'd like to work a bit smarter. I can download the page via:

wget -qO- https://the_page.html

However I just want to carve out a table that begins:

<!-- START Daily Keystroke

It goes on and on for many lines of html and always ends:

</table>
</div>
</div>

before the next load of data begins. I need everything between these two patterns as one chunk of text or one file.

I played around with sed and awk, which I am not really familiar with, but it seems that without knowing how many lines will be in the file each time, these tools are not appropriate for this task. Something that works on specific patterns seems more appropriate.

That being the case I can install other utilities potentially. If anyone has any experience of something that might work?

CodePudding user response:

> I played around with sed and awk

Be warned that these are best suited to input that can be described by regular expressions, and HTML cannot be. HTML parsers are the tools designed for working with HTML documents. In general, you should avoid using regular expressions on Chomsky Type-2 (context-free) languages.

> That being the case I can install other utilities potentially. If anyone has any experience of something that might work?

I suggest trying hxselect (part of the html-xml-utils package), as it allows easy extraction of elements matching a CSS selector. It reads from stdin, so you can pipe output into it. For example, to download the www.example.com page and extract its title tag, you can do:

wget -q -O - https://www.example.com | hxselect -i 'title'

If you encounter ill-formed HTML, you can use hxclean, which will try to make it acceptable to hxselect, like so:

wget -q -O - https://www.example.com | hxclean | hxselect -i 'title'

If either of the above works with your URL, you can start looking for a CSS selector that describes only the table you want to extract. See a CSS selectors reference for the available features. I am unable to craft a selector without seeing the whole source of the page.

CodePudding user response:

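I suggest using gawk to cut the input at the first multi-line record, that is, everything before the closing </table>, </div>, </div> lines. Follow it with sed to trim the head, up to and including the <!-- START Daily Keystroke line.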

gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" input.html |sed '0,/<!-- START Daily Keystroke/d'

Or without an intermediate file:

wget -qO- https://the_page.html | \
gawk 'NR==1{print}' RS="</table>\n</div>\n</div>" | \
sed '0,/<!-- START Daily Keystroke/d'
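Note that a multi-character RS is not specified by POSIX awk, and the 0,/re/ address is a GNU sed extension. If only a plain awk is available, the same idea can be sketched as below. This is a rough sketch, not a tested solution for your page: extract_table is a hypothetical helper name, and it assumes the closing </table>, </div>, </div> each sit on their own line (leading and trailing whitespace is ignored). Unlike the pipeline above, it keeps both the start marker line and the three closing lines in the output.

```shell
#!/bin/sh
# extract_table (hypothetical name): read HTML on stdin and print from the
# line containing the start marker through the first run of three
# consecutive lines </table>, </div>, </div>.
extract_table() {
  awk '
    /<!-- START Daily Keystroke/ { on = 1 }   # start copying at the marker
    on {
      print                                   # emit the line unchanged
      gsub(/^[ \t]+|[ \t\r]+$/, "")           # trim a copy for comparison
      p2 = p1; p1 = cur; cur = $0             # remember the last three lines
      if (p2 == "</table>" && p1 == "</div>" && cur == "</div>") exit
    }
  '
}
```

It could then be used as: wget -qO- https://the_page.html | extract_table > table.html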