Hello, I'm trying to get the content of my class top. All I need is the link (without any tags) and the value of the span with class title, in bash. I tried something like this (as a test), but it does not return anything. What am I doing wrong?
curl -s https://www.website.com/q?search=violet | grep -e "^<span class=\"top\">(.*?)</span>"
<div >
<span class="age0" title="0"></span>
<span class="hsa" title="tex"></span>
<span class="Encour" title="test"></span>
<a href="https://www.website.com/a/1973">
<img width="100" height="40"
data-original="https://img.com/i?jpg=123">
</a>
<span class="top">
<a href="https://www.website.com/a/1973">
<span class="title">Violet test</span>
</a>
<span class="episode"> 250
</span>
<a ></a>
</span>
<span class="info"> 2017</span>
</div>
<div id="n" >
<h5>Letter n</h5>
</div>
CodePudding user response:
As mentioned in the comments, regular expressions are the wrong tool for working with HTML. One approach is to use an XSLT stylesheet with xsltproc:
example.xslt:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:template match="/">
    <!-- One output line per span with class "top" -->
    <xsl:for-each select="//span[@class='top']">
      <xsl:value-of select="a[@href]/@href" />
      <!-- Tab between the link and the title -->
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="a[@href]/span[@class='title']" />
      <!-- Newline after each record -->
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
Usage:
$ curl -s 'https://www.website.com/q?search=violet' | xsltproc --html example.xslt -
https://www.website.com/a/1973 Violet test
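As for why the original grep prints nothing: without -E or -P, grep uses basic regular expressions, where ( ) and ? are ordinary characters, and the ^ anchor also requires the tag to start the line. A minimal sketch of the difference, using a hypothetical one-line sample (the real page spreads the span over several lines, so even the -P form would not match it as is):

```shell
#!/usr/bin/env bash
# Hypothetical one-line sample, made up for illustration only.
sample='<span class="top"><a href="https://www.website.com/a/1973">x</a></span>'

# Basic regex (plain grep -e): "(" ")" and "?" are literal characters,
# so grep looks for a real "(" right after the opening tag and finds none.
bre_count=$(printf '%s\n' "$sample" | grep -c -e '^<span class="top">(.*?)</span>')

# Perl-compatible regex (-P): "(...)" is a group and ".*?" is a lazy match.
pcre_count=$(printf '%s\n' "$sample" | grep -c -P '^<span class="top">(.*?)</span>')

echo "BRE matches: $bre_count, PCRE matches: $pcre_count"
```

grep still works line by line in both modes, which is one more reason a real HTML parser is the safer choice here.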
CodePudding user response:
Here is a suggested regex pattern that extracts the class of each span (it assumes class is the first attribute in the tag):
grep -oP '(?<=<span class=")[^"]+'
Tested for your sample:
age0
hsa
Encour
top
title
episode
info
Not sure if that was your intention.
If you only need the classes of spans that are opened and closed on the same line:
grep -oP '(?<=<span class=")[^"]+(?=".*</span>)' input.1.txt
Tested for your sample:
age0
hsa
Encour
title
info
CodePudding user response:
Thanks, everyone. I did this and it works. Maybe it's a bad idea, but I'll see later:
(?<=<span >).*?(?=<\/span>)
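For anyone trying the same thing, here is a rough sketch of how a pattern like this can be run with grep. It assumes the span being matched is the one with class="title" (as in the XSLT answer), and the sample variable is a trimmed copy of the markup from the question, not the live page:

```shell
#!/usr/bin/env bash
# Assumption: the target span carries class="title"; sample is a trimmed excerpt.
sample='<span class="top">
<a href="https://www.website.com/a/1973">
<span class="title">Violet test</span>
</a>
</span>'

# -o prints only the matched text; -P enables lookbehind and lookahead.
title=$(printf '%s\n' "$sample" | grep -oP '(?<=<span class="title">).*?(?=</span>)')

# Same idea for the link: everything between href=" and the next quote.
link=$(printf '%s\n' "$sample" | grep -oP '(?<=<a href=")[^"]+' | head -n 1)

# Print link and title separated by a tab.
printf '%s\t%s\n' "$link" "$title"
```

This only works while the opening tag, the text, and the closing tag stay on one line, so it may break if the site changes its markup.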