Select only n matched elements from an HTML file using bash

Using this command:

sed -n '/<article class.*article--nyheter/,/<\/article>/p' news2.html > onlyArticles.html

This gives me all the article tags in my HTML document, about 50 articles in total.

Sample input:

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

I only want x of these articles, for example just the top 2.

Output:

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

<article class="article column large-12 small-12 article--nyheter">
    ... variable number of lines of data
</article>

This is just an example. What I am trying to achieve is to select only the first x matching nodes.

Is there any way to do it? I cannot simply use head or tail, because I need to extract whole matching elements, not a fixed number of lines.

CodePudding user response：

xmllint's --xpath option can be used to request tags by position:

xmllint --html --recover --xpath '//article[position()<=2]' tmp.html 2>/dev/null
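
One caveat worth noting: in XPath, a predicate written as //article[position()<=2] is evaluated per parent, so it returns the first two articles under each parent element rather than the first two in the whole document. When all articles are siblings, as the sample input suggests, the result is the same. The following is a minimal sketch that wraps the expression in parentheses, filters on the class from the question, and takes the count as an argument. The function name first_articles_xpath is made up for illustration, and it assumes xmllint (libxml2) is installed.

```shell
#!/usr/bin/env bash
# first_articles_xpath N FILE   (hypothetical helper name)
# Prints the first N <article> elements whose class contains "article--nyheter".
first_articles_xpath() {
    local n=$1 file=$2
    # The parentheses make position() count over the whole node set
    # in document order instead of per parent element.
    xmllint --html --recover \
        --xpath "(//article[contains(@class,'article--nyheter')])[position()<=$n]" \
        "$file" 2>/dev/null
}
```

Possible usage: first_articles_xpath 2 news2.html > onlyArticles.html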

CodePudding user response：

This might work for you (GNU sed):

sed -En '/<article/{:a;p;n;/<\/article>/!ba;p;x;s/^/x/;/x{2}/{x;q};x}' file

Turn off implicit printing (-n) and turn on extended regular expressions (-E).

Match and print the lines from <article to </article>, then increment a counter in the hold space and quit once the required number of blocks has been printed.

Alternative:

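cat <<\! | sed -Enf - file
# on the opening tag, start printing the block
/<article/{
:a
p
n
# keep looping until the closing tag is reached
/<\/article>/!ba
p
# add one x to the counter kept in the hold space
x
s/^/x/
# after 2 blocks, restore the line and quit
/x{2}/{
x
q
}
x
}
!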
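
Both forms hard-code the count as 2 in /x{2}/. Here is a minimal sketch that takes the count from a shell variable instead. It assumes GNU sed, and it rejects a count below 1 because x{0} would match immediately and still print one block. The function name first_articles is hypothetical.

```shell
#!/usr/bin/env bash
# first_articles N FILE   (hypothetical helper name)
# Prints the first N <article>...</article> blocks of FILE using GNU sed.
first_articles() {
    local n=$1 file=$2
    # Accept only a positive integer, since N is spliced into the sed script.
    case $n in
        ''|*[!0-9]*) echo "usage: first_articles N FILE" >&2; return 2 ;;
    esac
    [ "$n" -ge 1 ] || { echo "N must be at least 1" >&2; return 2; }
    # Same script as the one-liner above, with the interval taken from $n.
    sed -En '/<article/{:a;p;n;/<\/article>/!ba;p;x;s/^/x/;/x{'"$n"'}/{x;q};x}' "$file"
}
```

Possible usage: first_articles 2 news2.html > onlyArticles.html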