Using /bin/bash on macOS, how do I use regex to extract a sentence before a word and after a word?

I'm trying to get just the text "Passwords do not match" between <Description> and </Description> from the variable $webout using regex. I'm brand new to regex, so please explain in detail the solution and how to format it within the bash script so I can learn.

Text from $webout Variable:

<?xml version="1.0" encoding="utf-16"?><interface-response><Command>SETDNSHOST</Command><Language>eng</Language><ErrCount>1</ErrCount><errors><Err1>Passwords do not match</Err1></errors><ResponseCount>1</ResponseCount><responses><response><Description>Passwords do not match</Description><ResponseNumber>304156</ResponseNumber><ResponseString>Validation error; invalid ; password</ResponseString></response></responses><Done>true</Done><debug><![CDATA[]]></debug></interface-response>

Script:

#!/bin/bash
url=ifconfig.me
pip=$(curl -s ${url})
upip="https://dynamicdns.park-your-domain.com/update?host=[hostname]&domain=[domain.com]&password=[password]&ip=${pip}"
webout=$(curl -s $upip)
echo $webout(<Description>(.*?)<)
#echo $(date  '%D %H:%M') $pip >> /users/username/documents/itworks.txt

I believe the problems I've run into are caused by the "/" in </Description>. That, and I'm having a very difficult time grasping regex formatting.

Thank you

CodePudding user response：

With xmlstarlet

# this is a placeholder for your curl call
webout='<?xml version="1.0" encoding="utf-16"?><interface-response><Command>SETDNSHOST</Command><Language>eng</Language><ErrCount>1</ErrCount><errors><Err1>Passwords do not match</Err1></errors><ResponseCount>1</ResponseCount><responses><response><Description>Passwords do not match</Description><ResponseNumber>304156</ResponseNumber><ResponseString>Validation error; invalid ; password</ResponseString></response></responses><Done>true</Done><debug><![CDATA[]]></debug></interface-response>'

desc=$(
    echo "$webout" \
    | iconv -f utf-8 -t utf-16 \
    | xmlstarlet sel -t -v //Description
)

declare -p desc

outputs

declare -- desc="Passwords do not match"

iconv was needed to avoid a "Document labelled UTF-16 but has UTF-8 content" error. That error came from copy-pasting your sample data, so YMMV with the real curl output.

CodePudding user response：

Here are two bash examples:

webout='<?xml version="1.0" encoding="utf-16"?><interface-response><Command>SETDNSHOST</Command><Language>eng</Language><ErrCount>1</ErrCount><errors><Err1>Passwords do not match</Err1></errors><ResponseCount>1</ResponseCount><responses><response><Description>Passwords do not match</Description><ResponseNumber>304156</ResponseNumber><ResponseString>Validation error; invalid ; password</ResponseString></response></responses><Done>true</Done><debug><![CDATA[]]></debug></interface-response>'


sed -n "s:.*<Description>\(.*\)</Description>.*:\1:p" <<< "$webout"


grep -oP '(?<=<Description>).*(?=</Description>)' <<< "$webout"

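To be honest, the second command (grep) uses a syntax that I'm not very familiar with (I picked it up here: https://unix.stackexchange.com/questions/13466/can-grep-output-only-specified-groupings-that-match ). Note that -P (Perl-compatible regex) is a GNU grep option; the BSD grep that ships with macOS does not support it, so on a Mac that command needs GNU grep installed separately.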
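Since the question asks how to do this inside the bash script itself, here is a minimal sketch using bash's built-in regex operator, which needs no external tool and should also work with the older bash that macOS ships. The sample XML is a trimmed stand-in for the real curl output, and the variable names re and desc are just illustrative. Two things worth knowing: the "/" in </Description> has no special meaning in a regex (it only matters in sed, where it is the default delimiter, which is why the sed command above uses ":" instead), and bash's regex flavor has no non-greedy .*?, so [^<]* ("any run of characters that are not <") is used to stop at the closing tag.

```shell
#!/bin/bash
# Trimmed sample standing in for the real $webout from curl (assumption).
webout='<?xml version="1.0" encoding="utf-16"?><interface-response><errors><Err1>Passwords do not match</Err1></errors><responses><response><Description>Passwords do not match</Description><ResponseNumber>304156</ResponseNumber></response></responses></interface-response>'

# Keep the pattern in a variable so the shell does not try to interpret
# the < and > characters, and leave it unquoted on the right of =~.
# The parentheses form capture group 1.
re='<Description>([^<]*)</Description>'

desc=''
if [[ $webout =~ $re ]]; then
    # BASH_REMATCH[0] is the whole match, BASH_REMATCH[1] is group 1.
    desc=${BASH_REMATCH[1]}
fi

echo "$desc"
```

If the pattern does not match, desc stays empty, so the script can test for that before logging anything.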

However, when you are parsing XML, you are better off using an XML parser rather than regex.

Here's a third option using an XML parser (xmllint):

xmllint --xpath '//Description/text()' - <<< "$webout"

Note: I had to change utf-16 to utf-8 to make xmllint happy.
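One way to make that utf-16 to utf-8 change inside the script, rather than by hand, is a bash parameter substitution on the XML declaration. This is only a sketch: the sample XML is trimmed, the variable name fixed is illustrative, and it assumes the first "utf-16" in the string is the one in the declaration. The xmllint call is skipped if xmllint is not installed.

```shell
#!/bin/bash
# Trimmed sample standing in for the real $webout from curl (assumption).
webout='<?xml version="1.0" encoding="utf-16"?><interface-response><responses><response><Description>Passwords do not match</Description></response></responses></interface-response>'

# ${var/pattern/replacement} replaces only the first occurrence,
# which here is the encoding label in the XML declaration.
fixed=${webout/utf-16/utf-8}

# Only run the parser if it is available on this machine.
if command -v xmllint >/dev/null 2>&1; then
    xmllint --xpath '//Description/text()' - <<< "$fixed"
fi
```

This changes the label to match the bytes, whereas iconv changes the bytes to match the label; either should resolve the mismatch.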

After reading the comments and other answers on this page, I learned that iconv can convert from UTF-8 to UTF-16. Here's an improved version:

xmllint --xpath '//Description/text()' <( iconv -f utf-8 -t utf-16 <<< "$webout" )