Get span content value with regex (bash)

Hello, I'm trying to get the content of my class top. All I need, in bash, is the link (without any tags) and the value of the span with class title. I tried something like this as a test, but it does not print anything. What am I doing wrong?

curl -s  https://www.website.com/q?search=violet | grep -e "^<span class=\"top\">(.*?)</span>"


                        <div >
                            <span class="age0" title="0"></span>
                            <span class="hsa" title="tex"></span>
                            <span class="Encour" title="test"></span>
                            <a href="https://www.website.com/a/1973">
                                <img  width="100" height="40"
                                    data-original="https://img.com/i?jpg=123">
                            </a>
                            <span class="top">
                                <a href="https://www.website.com/a/1973">
                                    <span class="title">Violet test</span>
                                </a>
                                <span class="episode"> 250
                                </span>
                                <a ></a>
                            </span>
                            <span class="info"> 2017</span>
                        </div>
                        <div id="n" >
                            <h5>Letter n</h5>
                        </div>

CodePudding user response：

As mentioned in the comments, regular expressions are the wrong tool for working with HTML. One approach uses an XSLT stylesheet and xsltproc:

example.xslt:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:template match="/">
    <xsl:for-each select="//span[@class='top']">
      <xsl:value-of select="a[@href]/@href" />
      <xsl:text>&#09;</xsl:text>
      <xsl:value-of select="a[@href]/span[@class='title']" />
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Usage:

$ curl -s  https://www.website.com/q?search=violet | xsltproc --html example.xslt -
https://www.website.com/a/1973  Violet test
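
As for why the grep in the question prints nothing, here is a small sketch on a trimmed, hypothetical copy of the markup (the class names are assumed from the question text). Three things work against the pattern: `^` requires the tag at the very start of the line while the real lines are indented, `grep -e` uses basic regular expressions where `(`, `)` and `?` are literal characters, and grep matches one line at a time while the closing `</span>` of the top span is several lines below its opening tag.

```shell
# Trimmed sample; class="top" and class="title" are assumed from the question.
html='    <span class="top">
        <a href="https://www.website.com/a/1973">
            <span class="title">Violet test</span>
        </a>
    </span>'

# The pattern from the question, unchanged. grep -c counts matching lines;
# "|| true" only hides grep's non-zero exit status when the count is 0.
original_count=$(printf '%s\n' "$html" | grep -c -e "^<span class=\"top\">(.*?)</span>" || true)

# Dropping the anchor and the group: the opening tag alone is found...
open_tag_count=$(printf '%s\n' "$html" | grep -c '<span class="top">' || true)

# ...but no single line holds both the opening tag and its closing tag.
same_line_count=$(printf '%s\n' "$html" | grep -c '<span class="top">.*</span>' || true)

echo "original pattern:           $original_count"
echo "opening tag only:           $open_tag_count"
echo "open and close on one line: $same_line_count"
```

The counts come out as 0, 1 and 0, which is why a parser-based approach such as the stylesheet above is the more robust choice here.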

CodePudding user response：

Here is a regex pattern that matches only the class name of each span (the first attribute after the tag name).

grep -oP '(?<=<span class=")[^"]*'

Tested for your sample:

age0
hsa
Encour
top
title
episode
info

Not sure if that was your intention.

If you only need the classes of spans whose closing tag is on the same line:

grep -oP '(?<=<span class=")[^"]*(?=".*</span>)' input.1.txt

Tested for your sample:

age0
hsa
Encour
title
info
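
Both variants can be checked without a network request. The sketch below uses a hypothetical, shortened sample whose class names are taken from the outputs listed above, and it writes the two patterns out in full as a best guess at the intended expressions. GNU grep built with `-P` (PCRE) support is assumed.

```shell
# Hypothetical sample; class names are inferred from the outputs listed above.
sample=$(cat <<'EOF'
<span class="age0" title="0"></span>
<span class="top">
    <a href="https://www.website.com/a/1973">
        <span class="title">Violet test</span>
    </a>
    <span class="episode"> 250
    </span>
</span>
<span class="info"> 2017</span>
EOF
)

# Every class value that directly follows '<span class="'.
all_classes=$(printf '%s\n' "$sample" | grep -oP '(?<=<span class=")[^"]*')

# Only spans that also have a closing </span> later on the same line.
closed_classes=$(printf '%s\n' "$sample" | grep -oP '(?<=<span class=")[^"]*(?=".*</span>)')

echo "All span classes:"
printf '%s\n' "$all_classes"
echo "Closed on the same line:"
printf '%s\n' "$closed_classes"
```

The second list drops top and episode because their closing tags sit on later lines, which grep never sees together with the opening tag.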

CodePudding user response：

Thanks everyone. I ended up using the pattern below and it works. Maybe it's a bad idea, but I'll find out later.

(?<=<span class="title">).*?(?=<\/span>)
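
A minimal sketch of how a lookaround pattern of this shape behaves with GNU grep, assuming the span of interest carries `class="title"` as described in the question. The sample input and the `episode` class name are hypothetical. It also shows the main caveat: a span whose closing tag is on another line is skipped without any error.

```shell
# Hypothetical input; class="title" and class="episode" are assumed names.
page='<a href="https://www.website.com/a/1973">
    <span class="title">Violet test</span>
</a>
<span class="episode"> 250
</span>'

# -o prints only the matched text; -P enables the lookarounds (GNU grep).
title=$(printf '%s\n' "$page" | grep -oP '(?<=<span class="title">).*?(?=<\/span>)')

# This span closes on the next line, so grep finds nothing;
# "|| true" only hides grep's non-zero exit status.
episode=$(printf '%s\n' "$page" | grep -oP '(?<=<span class="episode">).*?(?=<\/span>)' || true)

echo "title:   '$title'"
echo "episode: '$episode'"
```

If the site ever reflows its markup so that the title span wraps across lines, the match will quietly disappear, so it is worth checking for an empty result before using it.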