Assume the following HTML code:
<div class='requirement'>
<div class='req-title'>
The quick brown fox jumps over the lazy dog
</div>
</div>
I want to extract "The quick brown fox jumps over the lazy dog" using tools like awk or sed. I'm pretty sure it can be done.
I know an HTML parser is the right tool for this job, but this is the only time I'll be dealing with HTML content.
CodePudding user response:
Assuming that there are newlines (HTML does not require them), that is, that the content of file.txt is
<div class='requirement'>
<div class='req-title'>
The quick brown fox jumps over the lazy dog
</div>
</div>
you might use GNU AWK for this task in the following way
awk '/<div class=\x27req-title\x27>/{p=1;next}/<\x2fdiv>/{p=0}p{print}' file.txt
gives output
The quick brown fox jumps over the lazy dog
Explanation: if <div class='req-title'> is encountered, set p to 1 and go to the next line; if </div> is encountered, set p to 0. If p is nonzero, print the current line. Note that I used hexadecimal escapes for the characters with special meaning (\x27 for ' and \x2f for /). Warning: this solution is fragile and might fail after even a small change, for example if " is used rather than ', or if another attribute is added to the div tag, etc.
(tested in gawk 4.2.1)
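To illustrate the fragility mentioned above, here is a slightly more tolerant variant, offered only as a sketch. It assumes the tags still sit on their own lines, accepts either quote character, and allows extra attributes around class. The variable names q, tmp and result are mine, and the sample input is a modified copy of the question's HTML. It would still miss a class attribute that lists several classes.

```shell
# Sample input: double quotes and an extra attribute (hypothetical variation).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<div class="requirement">
<div id="x" class="req-title">
The quick brown fox jumps over the lazy dog
</div>
</div>
EOF

# q holds a bracket expression matching either " or '.
# [^>]* tolerates other attributes before and after class=...
result=$(awk -v q="[\"']" '
  $0 ~ ("<div[^>]*class=" q "req-title" q "[^>]*>") { p = 1; next }  # start printing after this line
  /<\/div>/ { p = 0 }                                                # stop at the closing tag
  p                                                                  # print while the flag is set
' "$tmp")
echo "$result"
rm -f "$tmp"
```

Passing the quote characters in through -v avoids having to escape a single quote inside the single-quoted awk program.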
I know an HTML parser is the right tool for this job, but this is the only time I'll be dealing with HTML content.
If you are allowed to install tools, please consider using hxselect, which extracts the tags, or their content, that match a CSS selector. In this case it would be as simple as
cat file.txt | hxselect -i -c div.req-title
-i means case insensitive (HTML tag names are case insensitive), -c means content only (do not include the start and end tags), and div.req-title is a CSS selector meaning a div which has the class req-title. This should be more robust than the GNU AWK solution.
CodePudding user response:
Assuming the part you want to print is a single line:
$ awk 'f{print; exit} $0=="<div class=\047req-title\047>"{f=1}' file
The quick brown fox jumps over the lazy dog
otherwise:
$ awk 'f{if ($0=="</div>") exit; print} $0=="<div class=\047req-title\047>"{f=1}' file
The quick brown fox jumps over the lazy dog
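Since the question also mentions sed, the same idea could be sketched with a sed address range. This is only an illustration under the same assumptions as the awk commands above (each tag on its own line, single quotes, no extra attributes); the names tmp and sed_result are mine.

```shell
# Sample input from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<div class='requirement'>
<div class='req-title'>
The quick brown fox jumps over the lazy dog
</div>
</div>
EOF

# Select the range from the opening req-title tag to the first </div>,
# delete the two delimiter lines, and print whatever is left in between.
sed_result=$(sed -n "/<div class='req-title'>/,/<\/div>/{
/<div class='req-title'>/d
/<\/div>/d
p
}" "$tmp")
echo "$sed_result"
rm -f "$tmp"
```

The script is double-quoted so that the single quotes in the tag can appear literally. It is just as fragile as the awk versions if the markup changes.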