Using this command:
sed -n '/<article class.*article--nyheter/,/<\/article>/p' news2.html > onlyArticles.html
I get all the matching article elements from my HTML document. There are about 50 of them.
Sample input:
<article class="article column large-12 small-12 article--nyheter">
... variable number of lines of data
</article>
<article class="article column large-12 small-12 article--nyheter">
... variable number of lines of data
</article>
<article class="article column large-12 small-12 article--nyheter">
... variable number of lines of data
</article>
<article class="article column large-12 small-12 article--nyheter">
... variable number of lines of data
</article>
I only want the first x articles, for example just the top 2.
Output:
<article class="article column large-12 small-12 article--nyheter">
... variable number of lines of data
</article>
<article class="article column large-12 small-12 article--nyheter">
... variable number of lines of data
</article>
This is just an example. What I am trying to achieve is to select only the first x matching nodes.
Is there any way to do it? I cannot simply use head or tail, because I need to extract whole matching elements, not just a fixed number of lines.
CodePudding user response:
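xmllint with an XPath expression can select tags by position:
xmllint --html --recover --xpath '(//article[contains(@class, "article--nyheter")])[position()<=2]' tmp.html 2>/dev/null
The parentheses make position() count across the whole document. Without them, //article[position()<=2] is evaluated relative to each parent element, so it only gives the first two articles overall when all articles share one parent.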
CodePudding user response:
This might work for you (GNU sed):
sed -En '/<article/{:a;p;n;/<\/article>/!ba;p;x;s/^/x/;/x{2}/{x;q};x}' file
Turn off implicit printing and turn on extended regexps with -En.
Match and print the lines between <article and </article>, then increment a counter in the hold space and quit once the required number of occurrences has been reached.
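The count is hard-coded as x{2} in the one-liner. A minimal sketch of how it could be turned into a parameter follows. The function name first_articles is made up for illustration, and it assumes GNU sed and that <article ...> and </article> are on separate lines, as in the sample input.

```shell
# Hypothetical wrapper around the sed one-liner (GNU sed assumed).
# $1 = number of <article> blocks to keep, $2 = input file.
first_articles() {
    # One "x" is added to the hold space for each completed block;
    # sed quits as soon as the hold space contains $1 of them.
    sed -En '/<article/{:a;p;n;/<\/article>/!ba;p;x;s/^/x/;/x{'"$1"'}/{x;q};x}' "$2"
}
```

It would be called as: first_articles 2 news2.html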
Alternative, with the same script written one command per line and passed to sed on standard input:
cat <<\! | sed -Enf - file
/<article/{
:a
p
n
/<\/article>/!ba
p
x
s/^/x/
/x{2}/{
x
q
}
x
}
!
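Where GNU sed is not available, the same counting idea can be sketched with awk instead of sed. This swaps in a different tool: a flag marks when the input is inside a block, and a counter stops processing after the requested number of closing tags. The function name first_articles_awk is made up for illustration, and it again assumes the opening and closing tags are on separate lines.

```shell
# Hypothetical awk-based variant (not sed): keep the first N article--nyheter blocks.
# $1 = number of blocks to keep, $2 = input file.
first_articles_awk() {
    awk -v max="$1" '
        # An opening tag with the wanted class starts a block.
        /<article class.*article--nyheter/ { inside = 1 }
        # Print every line while inside a block.
        inside { print }
        # A closing tag ends the block; stop after max blocks.
        inside && /<\/article>/ { inside = 0; if (++count >= max) exit }
    ' "$2"
}
```

It would be called as: first_articles_awk 2 news2.html > onlyArticles.html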