Extract content from div with awk/grep

Time:04-01

Assume the following HTML code:

<div class='requirement'>
<div class='req-title'>
The quick brown fox jumps over the lazy dog
</div>
</div>

I want to extract The quick brown fox jumps over the lazy dog using tools like awk or sed. I'm pretty sure it can be done.

I know an HTML parser is the right tool for this job, but this is the only time I'll be dealing with HTML content.

CodePudding user response:

Assuming that there are newlines (HTML does not require them), that is, that the content of file.txt is

<div class='requirement'>
<div class='req-title'>
The quick brown fox jumps over the lazy dog
</div>
</div>

you might use GNU AWK for this task in the following way

awk '/<div class=\x27req-title\x27>/{p=1;next}/<\x2fdiv>/{p=0}p{print}' file.txt

gives output

The quick brown fox jumps over the lazy dog

Explanation: if <div class='req-title'> is encountered, set p to 1 and go to the next line; if </div> is encountered, set p to 0. If p is set, print the current line. Note that I used hexadecimal escapes for characters with special meaning (\x27 is ' and \x2f is /). Warning: this solution is fragile and might fail after even a small change, for example if " is used rather than ', or if another attribute is added to the div tag.

(tested in gawk 4.2.1)
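
To illustrate the fragility mentioned above, here is a minimal sketch of a slightly more tolerant variant. It is not part of the tested answer: the file name sample.html, the function name extract_title and the extra id attribute are made up for the example. It accepts either quote style and additional attributes on the opening tag, but it still assumes the tags and the text are on separate lines.

```shell
# Hypothetical sample input: double quotes and an extra attribute on the tag.
cat > sample.html <<'EOF'
<div class="requirement">
<div id="r1" class="req-title">
The quick brown fox jumps over the lazy dog
</div>
</div>
EOF

# extract_title is an illustrative helper name, not a standard tool.
# The regex allows other attributes around class= and accepts " or ' (\047)
# as the quote character; printing stops at the first line containing </div>.
extract_title() {
  awk '/<div[^>]*class=["\047]req-title["\047][^>]*>/ {p=1; next} /<\/div>/ {p=0} p' "$1"
}

extract_title sample.html
```

This is still pattern matching on text, so nested div elements inside req-title or tags sharing a line with the content would break it.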

Regarding "I know an HTML parser is the right tool for this job, but this is the only time I'll be dealing with HTML content":

If you are allowed to install tools, please consider hxselect (part of html-xml-utils). It extracts the elements matching a CSS selector, or only their content, so in this case it would be as simple as

cat file.txt | hxselect -i -c div.req-title

-i means case-insensitive matching (HTML tag names are case-insensitive), -c means content only (do not include the start and end tags), and div.req-title is a CSS selector meaning a div that has the class req-title. This should be more robust than the GNU AWK solution.

CodePudding user response:

Assuming the part you want to print is a single line:

$ awk 'f{print; exit} $0=="<div class=\047req-title\047>"{f=1}' file
The quick brown fox jumps over the lazy dog

otherwise:

$ awk 'f{if ($0=="</div>") exit; print} $0=="<div class=\047req-title\047>"{f=1}' file
The quick brown fox jumps over the lazy dog
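
Both commands above compare whole lines with ==, so they assume the tags are not indented. If the markup is indented, one possible adjustment is to compare a whitespace-trimmed copy of each line. The sketch below assumes that layout; the file name indented.html and the function name extract_exact are invented for the example.

```shell
# Hypothetical indented variant of the input.
cat > indented.html <<'EOF'
<div class='requirement'>
  <div class='req-title'>
    The quick brown fox jumps over the lazy dog
  </div>
</div>
EOF

# extract_exact is an illustrative helper name.
extract_exact() {
  awk '
    # Work on a copy of the line with leading and trailing blanks removed.
    { line = $0; gsub(/^[ \t]+|[ \t]+$/, "", line) }
    # Once the opening tag has been seen, print until the closing tag.
    f { if (line == "</div>") exit; print line }
    # Exact match on the trimmed opening tag (\047 is a single quote).
    line == "<div class=\047req-title\047>" { f = 1 }
  ' "$1"
}

extract_exact indented.html
```

The comparison is still exact, so a different quote style or an extra attribute on the tag would not match.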