How to extract some digits text, between many texts from a file in bash?

Time:10-22

I want to scrape a .mhtml file using bash. Originally I only used curl and xidel to scrape the HTML file, but now the website has "something" that prevents me from scraping it.

This is some of the content:

QuoteStrip-watchLiveLink">LIVE<img src=3D"https://static-redesign.cnbcfm.co=
m/dist/4db8932b7ac3e84e3f64.svg" alt=3D"Watch live logo" class=3D"QuoteStri=
p-watchLiveLogo"></a><a href=3D"https://www.cnbc.com/live-tv/" style=3D"col=
or: rgb(0, 47, 108);">SHARK TANK</a></div></div></div><div class=3D"QuoteSt=
rip-quoteStripSubHeader"><span>RT Quote</span><span> | <!-- -->Exchange</sp=
an><span> | <!-- -->USD</span></div><div class=3D"QuoteStrip-dataContainer"=
><div class=3D"QuoteStrip-lastTimeAndPriceContainer"><div class=3D"QuoteStr=
ip-lastTradeTime">Last | 11:46 PM EDT</div><div class=3D"QuoteStrip-lastPri=
ceStripContainer"><span class=3D"QuoteStrip-lastPrice">1,621.41</span><span=
 class=3D"QuoteStrip-changeDown"><img class=3D"QuoteStrip-changeIcon" src=
=3D"https://static-redesign.cnbcfm.com/dist/4ee243ff052e81044388.svg" alt=
=3D"quote price arrow down"><span>-6.2537</span><span> (<!-- -->-0.3842%<!-=
- -->)</span></span></div></div></div></div><div class=3D"PhoenixChartWrapp=

Question: how can I get only 1,621.41 as output in bash?

My usual script:

#!/bin/bash
curl -s -o ~/Desktop/xau.html -- https://www.cnbc.com/quotes/XAU=
gold=$(xidel -se /html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1] ~/Desktop/xau.html | sed 's/\,//g')
echo $gold
exit 0

output: some numbers

CodePudding user response:

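One difficulty is that the file is quoted-printable encoded: lines are broken at arbitrary positions with soft line breaks (= followed by a newline), and every literal = is written as =3D. First join the lines, then extract what you are looking for: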

$ sed -En ':a
s/=\n//g
ta
s!.*<span class=3D"QuoteStrip-lastPrice">([^<]*)</span>.*!\1!p
tb
N
ba
:b
q' file
1,621.41

Or, with GNU sed and its -z option:

$ sed -Ez 's!=\n!!g;s!.*<span class=3D"QuoteStrip-lastPrice">([^<]*)</span>.*!\1!' file
1,621.41
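
Both commands match the still-encoded text, which is why the pattern contains class=3D. As a rough sketch of the same idea in separate steps, the snippet below first undoes the encoding (soft line breaks and =3D) and only then extracts the price. The sample string and the decode_qp helper are made up for illustration, and only the =3D escape is decoded, on the assumption that no other escape matters for this match.

```shell
#!/bin/bash
# Hypothetical sample imitating the quoted-printable body of the .mhtml file;
# the span of interest is deliberately split by a soft line break.
sample='<div class=3D"QuoteStr=
ip-lastPriceStripContainer"><span class=3D"QuoteStrip-lastPri=
ce">1,621.41</span><span='

# decode_qp (illustrative helper): drop carriage returns, join every line that
# ends in "=" with the following line, then turn "=3D" back into "=".
decode_qp() {
  tr -d '\r' |
    awk '{ if (sub(/=$/, "")) printf "%s", $0; else print }' |
    sed 's/=3D/=/g'
}

# Extract the text of the lastPrice span from the decoded HTML.
price=$(printf '%s\n' "$sample" | decode_qp |
  sed -n 's/.*<span class="QuoteStrip-lastPrice">\([^<]*\)<\/span>.*/\1/p')

echo "$price"
```

On the real file the pipeline would start with decode_qp < file instead of the printf of the sample.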

CodePudding user response:

I only use curl xidel to scrape the html file

xidel can open URLs by itself, so there is no need for curl.

/html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1]
                                            ^

The div[2] marked above doesn't exist; there is only one div at that level. So this should work:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e '
  /html/body/div[2]/div/div[1]/div[3]/div/div/div[1]/div[2]/div[3]/div/div[2]/span[1]
'

Also please be sure to quote the extraction-query. This will prevent situations where you'd otherwise have to escape lots of characters.
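
To illustrate why the quoting matters, here is a small sketch that does not involve xidel at all. An unquoted [2] is a glob pattern to the shell, so the query can be rewritten before the command ever sees it if a matching file name happens to exist. The scratch directory and the file name div2 are invented for the demonstration.

```shell
#!/bin/bash
# Work in a scratch directory that contains a file named "div2".
scratch=$(mktemp -d)
touch "$scratch/div2"

# Unquoted: the shell treats [2] as a glob and replaces div[2] with div2.
unquoted=$(cd "$scratch" && echo div[2])

# Quoted: the text reaches the command unchanged.
quoted=$(cd "$scratch" && echo 'div[2]')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
rm -r "$scratch"
```

With a long absolute path a match is unlikely, but quoting removes the possibility altogether.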

The website's HTML-source is minified. To have a better overview of all the HTML element-nodes I suggest you prettify the source again:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e . \
  --output-format=html --output-node-indent > ~/Desktop/xau.html

And that way you can see the query can be simplified to:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e '
  //span[@class="QuoteStrip-lastPrice"]
'

Or alternatively from one of the JSONs in the <head>-node:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e '
  parse-json(//script[@type="application/ld+json"][2])/price
'
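
Putting this together, the script from the question could shrink to something like the sketch below. The function names strip_commas and fetch_gold are made up here, the class-based query assumes the page still uses the QuoteStrip-lastPrice class seen in the question, and nothing is fetched until fetch_gold is actually called.

```shell
#!/bin/bash
# strip_commas (illustrative helper): remove thousands separators so the
# value can be used as a plain number, e.g. 1,621.41 becomes 1621.41.
strip_commas() {
  printf '%s\n' "$1" | tr -d ','
}

# fetch_gold (illustrative helper): let xidel download the page itself and
# return the text of the lastPrice span. Needs xidel and network access.
fetch_gold() {
  xidel -s "https://www.cnbc.com/quotes/XAU=" -e '//span[@class="QuoteStrip-lastPrice"]'
}

# Usage:
#   gold=$(strip_commas "$(fetch_gold)")
#   echo "$gold"
```

This replaces the separate curl download and the sed call of the original script.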