I have an HTML file that I am trying to extract data from. The website is https://www.tv2.no/nyheter, and I want to get all the news articles from it.
First I download the page:
wget -O news.html https://www.tv2.no/nyheter
This creates a local file for me.
Then I try to get all the articles that have the class article--nyheter by running this command:
tr '\n' ' ' < news.html | grep -E "^<article >.*$"
but I did not get any result. The HTML structure looks like this:
<body>
<div>
<article class="article column large-4 small-12">
hello
</article>
</div>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336304/">
<figure class="image image__responsive" style="padding-bottom:51.312%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t27 tm26">IEA: Mulig å nå 2-gradersmålet om løftene fra Glasgow holdes</h2>
</div>
</a>
</article>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336420/">
<figure class="image image__responsive" style="padding-bottom:115.452%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t26 tm20">Italienske jegere stoppet på vei ut av landet med 2.027 nedfryste
troster</h2>
</div>
</a>
</article>
Sample output, since both of the articles below contain the class name article--nyheter:
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336420/">
<figure class="image image__responsive" style="padding-bottom:115.452%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t26 tm20">Italienske jegere stoppet på vei ut av landet med 2.027 nedfryste
troster</h2>
</div>
</a>
</article>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336304/">
<figure class="image image__responsive" style="padding-bottom:51.312%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t27 tm26">IEA: Mulig å nå 2-gradersmålet om løftene fra Glasgow holdes</h2>
</div>
</a>
</article>
I have to use grep, sed, curl, or awk for this; I cannot use any other parser.
My expected output is every article tag that has a specific class, including everything inside those article tags.
CodePudding user response:
Assumptions:
- there is some valid reason why an HTML-centric tool is not being used to parse out the desired sections
- input is formatted as in the question, otherwise the proposed sed solution will likely not work correctly
- the goal is to extract the <article> ... </article> pairs where the article class entry contains the string article--nyheter
- OP's expected output has the two article--nyheter sections listed in reverse order; for now I'm going to assume that was some sort of typo and that there are no requirements to sort the two sections
One sed idea using ranges to extract the desired data:
sed -n '/<article class.*article--nyheter/,/<\/article>/p' news.html
This generates:
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336304/">
<figure class="image image__responsive" style="padding-bottom:51.312%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177"
data-srcset="https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=354&compression=92 2x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=265.5&compression=92 1.5x,https://www.cdn.tv2.no/images/14336482.jpg?imageId=14336482&panox=0&panoy=0&panow=100&panoh=50.993377483444&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=177&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t27 tm26">IEA: Mulig å nå 2-gradersmålet om løftene fra Glasgow holdes</h2>
</div>
</a>
</article>
<article class="article column large-4 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14336420/">
<figure class="image image__responsive" style="padding-bottom:115.452%;">
<img class="image__img lazyload" itemprop="image" title="" alt=""
src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI py 0Po5yUFQA7"
data-src="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398"
data-srcset="https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=688&height=796&compression=92 2x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=516&height=597&compression=92 1.5x,https://www.cdn.tv2.no/images/14336464.jpg?imageId=14336464&panox=0&panoy=0&panow=100&panoh=100&heighty=0&heightx=0&heightw=100&heighth=100&width=344&height=398&compression=92 1x">
</figure>
<div class="article__content">
<h2 class="article__title t26 tm20">Italienske jegere stoppet på vei ut av landet med 2.027 nedfryste
troster</h2>
</div>
</a>
</article>
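To see how the range behaves on a smaller input, here is a minimal sketch that feeds the same sed command a made-up stand-in for news.html through standard input. The sample text and the variable names sample and result are illustrative assumptions, not part of the real page.

```shell
# Small stand-in for news.html (made-up content, same line layout as the question).
sample='<div>
<article class="article column">
skip me
</article>
</div>
<article class="article column article--nyheter">
keep me
</article>'

# Same address range as above, reading from standard input instead of a file.
# The range opens on a line matching the first pattern and closes on the next
# line matching the second, so the first article is never printed.
result=$(printf '%s\n' "$sample" | sed -n '/<article class.*article--nyheter/,/<\/article>/p')
printf '%s\n' "$result"
```

Because sed reads standard input when no file is named, the same command should also work at the end of a pipeline.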
If the input data is not formatted as presented in the question (e.g., carriage returns/linefeeds are missing), then this sed solution likely will not work; a more 'robust' parser would need to be built (e.g., via awk).
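One way such an awk approach might look is sketched below. It is only a sketch under stated assumptions: article elements are not nested, attribute values in the opening tag contain no ">" character, and the class is matched as a plain substring of the opening tag. The function name extract_articles is hypothetical.

```shell
# extract_articles CLASS [FILE]
# Hypothetical helper: print every <article ...>...</article> element whose
# opening tag contains CLASS, wherever the line breaks fall in the input.
extract_articles() {
  cls=$1; shift
  awk -v cls="$cls" '
    { doc = doc $0 "\n" }                    # slurp the whole input, keeping line breaks
    END {
      while ((s = index(doc, "<article")) > 0) {
        doc = substr(doc, s)                 # drop everything before the opening tag
        e = index(doc, "</article>")
        if (e == 0) break                    # unterminated element: stop
        block = substr(doc, 1, e + 9)        # "</article>" is 10 characters long
        tag = substr(block, 1, index(block, ">"))   # opening tag only
        if (index(tag, cls)) print block     # plain substring test on the class
        doc = substr(doc, e + 10)            # continue after the closing tag
      }
    }
  ' "$@"
}
```

It could then be called as extract_articles article--nyheter news.html, or with no file name to read from a pipe. Since it uses only index() and substr(), it should not depend on GNU-specific awk features, but it is still string matching rather than real HTML parsing.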