How to extract substring between two strings, INCLUDING the matching/enclosing strings?

Time:05-31

I have an XML file and need to extract an element from it on a Linux machine. Unfortunately, the machine doesn't have the "xsltproc" command (and I cannot get it installed), so I am trying to figure out how to do the extraction using other tools that are available (e.g., sed).

Here's an example of the XML:

<l7:List xmlns:l7="http://ns.l7tech.com/2010/04/gateway-management">
    <l7:Name>REVOCATION_CHECK_POLICY List</l7:Name>
    <l7:Type>List</l7:Type>
    <l7:TimeStamp>2022-05-30T12:36:16.994Z</l7:TimeStamp>
    <l7:Link rel="self" uri="https://myhost02.xxxx.com:8443/restman/1.0/revocationCheckingPolicies"/>
    <l7:Link rel="template" uri="https://myhost02.xxxx.com:8443/restman/1.0/revocationCheckingPolicies/template"/>
    <l7:Item>
        <l7:Name>OCSPREVOCATIONVALIDATION</l7:Name>
        <l7:Id>a60c5a8714b2e519a6c23192cf09ded5</l7:Id>
        <l7:Type>REVOCATION_CHECK_POLICY</l7:Type>
        <l7:TimeStamp>2022-05-30T12:36:16.994Z</l7:TimeStamp>
        <l7:Link rel="self" uri="https://myhost02.xxxx.com:8443/restman/1.0/revocationCheckingPolicies/a60c5a8714b2e519a6c23192cf09ded5"/>
        <l7:Resource>
            <l7:RevocationCheckingPolicy id="a60c5a8714b2e519a6c23192cf09ded5" version="20">
                <l7:Name>OCSPREVOCATIONVALIDATION</l7:Name>
                <l7:DefaultPolicy>true</l7:DefaultPolicy>
                <l7:ContinueOnServerUnavailable>false</l7:ContinueOnServerUnavailable>
                <l7:DefaultSuccess>false</l7:DefaultSuccess>
                <l7:RevocationCheckItems>
                    <l7:Type>OCSP from URL</l7:Type>
                    <l7:Url>http://foo.west.dev.xxxx.com:80</l7:Url>
                    <l7:AllowIssuerSignature>true</l7:AllowIssuerSignature>
                    <l7:TrustedSigners>3c880ed5addceb2e9ef308074f2c353f</l7:TrustedSigners>
                </l7:RevocationCheckItems>
            </l7:RevocationCheckingPolicy>
        </l7:Resource>
    </l7:Item>
</l7:List>    

I need to extract the following XML into a separate variable or file:

            <l7:RevocationCheckingPolicy id="a60c5a8714b2e519a6c23192cf09ded5" version="20">
                <l7:Name>OCSPREVOCATIONVALIDATION</l7:Name>
                <l7:DefaultPolicy>true</l7:DefaultPolicy>
                <l7:ContinueOnServerUnavailable>false</l7:ContinueOnServerUnavailable>
                <l7:DefaultSuccess>false</l7:DefaultSuccess>
                <l7:RevocationCheckItems>
                    <l7:Type>OCSP from URL</l7:Type>
                    <l7:Url>http://foo.west.dev.xxxx.com:80</l7:Url>
                    <l7:AllowIssuerSignature>true</l7:AllowIssuerSignature>
                    <l7:TrustedSigners>3c880ed5addceb2e9ef308074f2c353f</l7:TrustedSigners>
                </l7:RevocationCheckItems>
            </l7:RevocationCheckingPolicy>

I am able to "flatten" the file into a single string using:

sed ':a;N;$!ba;s/\n//g' response.xml

but then I have to extract the string that I need, between the:

<l7:RevocationCheckingPolicy

and the:

</l7:RevocationCheckingPolicy>

INCLUSIVE of the two matching strings.

I can extract the substring WITHOUT the matching strings:

sed ':a;N;$!ba;s/\n//g' response.xml | sed -e 's/.*<l7\:RevocationCheckingPolicy\(.*\)<\/l7\:RevocationCheckingPolicy>.*/\1/'

which gives me:

 id="a60c5a8714b2e519a6c23192cf09ded5" version="20">                            <l7:Name>OCSPREVOCATIONVALIDATION</l7:Name>                    <l7:DefaultPolicy>true</l7:DefaultPolicy>                    <l7:ContinueOnServerUnavailable>false</l7:ContinueOnServerUnavailable>                    <l7:DefaultSuccess>false</l7:DefaultSuccess>                    <l7:RevocationCheckItems>                        <l7:Type>OCSP from URL</l7:Type>                        <l7:Url>http://foo.west.dev.xxxx.com:80</l7:Url>                        <l7:AllowIssuerSignature>true</l7:AllowIssuerSignature>                        <l7:TrustedSigners>3c880ed5addceb2e9ef308074f2c353f</l7:TrustedSigners>                    </l7:RevocationCheckItems>

But that XML is missing the enclosing strings:

<l7:RevocationCheckingPolicy

at the beginning and the:

</l7:RevocationCheckingPolicy>

at the end.

I've seen suggestions to just include the enclosing strings before and after the:

.*

in the 2nd sed, but when I tried that, the 2nd sed no longer seemed to match at all.

Can someone tell me how to include the enclosing strings?

Thanks, Jim

CodePudding user response:

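Using sed with a range address, which prints every line from the opening tag through the closing tag, inclusive: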

$ sed -n '/<l7:RevocationCheckingPolicy/,\|</l7:RevocationCheckingPolicy>|p' input_file > outfile
$ cat outfile
            <l7:RevocationCheckingPolicy id="a60c5a8714b2e519a6c23192cf09ded5" version="20">
                <l7:Name>OCSPREVOCATIONVALIDATION</l7:Name>
                <l7:DefaultPolicy>true</l7:DefaultPolicy>
                <l7:ContinueOnServerUnavailable>false</l7:ContinueOnServerUnavailable>
                <l7:DefaultSuccess>false</l7:DefaultSuccess>
                <l7:RevocationCheckItems>
                    <l7:Type>OCSP from URL</l7:Type>
                    <l7:Url>http://foo.west.dev.xxxx.com:80</l7:Url>
                    <l7:AllowIssuerSignature>true</l7:AllowIssuerSignature>
                    <l7:TrustedSigners>3c880ed5addceb2e9ef308074f2c353f</l7:TrustedSigners>
                </l7:RevocationCheckItems>
            </l7:RevocationCheckingPolicy>
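
If the flattened, single-line form from the question is preferred over the line range, one option is to widen the capture group so that it spans both tags. This is only a sketch: the sample file is a shortened, hypothetical stand-in for response.xml, tr -d '\n' is swapped in for the sed flattening loop, and it assumes the element occurs exactly once, because .* is greedy.

```shell
#!/bin/sh
# Shortened, hypothetical stand-in for response.xml.
tmpdir=$(mktemp -d)
cat > "$tmpdir/response.xml" <<'EOF'
<l7:List xmlns:l7="http://ns.l7tech.com/2010/04/gateway-management">
    <l7:Item>
        <l7:Resource>
            <l7:RevocationCheckingPolicy id="a60c5a8714b2e519a6c23192cf09ded5" version="20">
                <l7:Name>OCSPREVOCATIONVALIDATION</l7:Name>
            </l7:RevocationCheckingPolicy>
        </l7:Resource>
    </l7:Item>
</l7:List>
EOF

# Flatten to one line, then keep only the capture group.
# The \( ... \) encloses the opening and closing tags, so \1 includes them.
# Using | as the s command delimiter avoids escaping the / in the closing tag.
flat=$(tr -d '\n' < "$tmpdir/response.xml" |
    sed 's|.*\(<l7:RevocationCheckingPolicy.*</l7:RevocationCheckingPolicy>\).*|\1|')

printf '%s\n' "$flat"
rm -rf "$tmpdir"
```

The result lands in the flat variable as one line that starts with the opening tag and ends with the closing tag. If the file could contain several such elements, the line-range form is probably the safer starting point.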

CodePudding user response:

For what it's worth, you can get the required element with an XML tool such as xmllint.
Advantages: it works regardless of the namespace prefix and of how the XML file is formatted.

xmllint --xpath '//*[local-name()="RevocationCheckingPolicy"]' test.xml

Result

<l7:RevocationCheckingPolicy id="a60c5a8714b2e519a6c23192cf09ded5" version="20">
        <l7:Name>OCSPREVOCATIONVALIDATION</l7:Name>
        <l7:DefaultPolicy>true</l7:DefaultPolicy>
        <l7:ContinueOnServerUnavailable>false</l7:ContinueOnServerUnavailable>
        <l7:DefaultSuccess>false</l7:DefaultSuccess>
        <l7:RevocationCheckItems>
          <l7:Type>OCSP from URL</l7:Type>
          <l7:Url>http://foo.west.dev.xxxx.com:80</l7:Url>
          <l7:AllowIssuerSignature>true</l7:AllowIssuerSignature>
          <l7:TrustedSigners>3c880ed5addceb2e9ef308074f2c353f</l7:TrustedSigners>
        </l7:RevocationCheckItems>
      </l7:RevocationCheckingPolicy>
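
Since the question asks for the element in a variable or file, here is a hedged sketch of how the xmllint call could be wrapped. It assumes xmllint (which ships with libxml2, not with xsltproc's libxslt) may or may not be installed, so it checks first; the sample file and the policy.xml output name are hypothetical.

```shell
#!/bin/sh
# Shortened, hypothetical stand-in for test.xml.
tmpdir=$(mktemp -d)
cat > "$tmpdir/test.xml" <<'EOF'
<l7:List xmlns:l7="http://ns.l7tech.com/2010/04/gateway-management">
    <l7:Resource>
        <l7:RevocationCheckingPolicy id="a60c5a8714b2e519a6c23192cf09ded5" version="20">
            <l7:Name>OCSPREVOCATIONVALIDATION</l7:Name>
        </l7:RevocationCheckingPolicy>
    </l7:Resource>
</l7:List>
EOF

policy=""
if command -v xmllint >/dev/null 2>&1; then
    # local-name() ignores the namespace prefix, so the match does not depend on "l7:".
    policy=$(xmllint --xpath '//*[local-name()="RevocationCheckingPolicy"]' "$tmpdir/test.xml")
    # Keep a copy in a file as well (hypothetical output name).
    printf '%s\n' "$policy" > "$tmpdir/policy.xml"
    cat "$tmpdir/policy.xml"
else
    echo "xmllint not found; fall back to the sed or awk approach" >&2
fi
rm -rf "$tmpdir"
```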

CodePudding user response:

Suggested awk script, using a range pattern that prints from the opening tag through the closing tag:

awk '/<l7:RevocationCheckingPolicy /,/l7:RevocationCheckingPolicy>/' input.xml
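
A range pattern prints every matching block in the file. If only the first element is wanted, a small variation with a flag and an early exit might do; this is a sketch under that assumption, and the two-element sample file with its "first" and "second" ids is made up for illustration.

```shell
#!/bin/sh
# Hypothetical input with two policy elements.
tmpdir=$(mktemp -d)
cat > "$tmpdir/input.xml" <<'EOF'
<l7:List xmlns:l7="http://ns.l7tech.com/2010/04/gateway-management">
    <l7:RevocationCheckingPolicy id="first" version="1">
        <l7:Name>A</l7:Name>
    </l7:RevocationCheckingPolicy>
    <l7:RevocationCheckingPolicy id="second" version="1">
        <l7:Name>B</l7:Name>
    </l7:RevocationCheckingPolicy>
</l7:List>
EOF

first=$(awk '
    # Opening tag: start printing. [ >] keeps the closing tag from matching here.
    /<l7:RevocationCheckingPolicy[ >]/ { inside = 1 }
    # Print every line while inside the element, tags included.
    inside { print }
    # Closing tag: stop reading after the first element.
    inside && /<\/l7:RevocationCheckingPolicy>/ { exit }
' "$tmpdir/input.xml")

printf '%s\n' "$first"
rm -rf "$tmpdir"
```

With this input, only the three lines of the element with id="first" are printed, and the variable first holds them for later use.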