Get the lines of a file between two matching patterns which contains a specific pattern

Time:01-18

I'm trying to work with an XML file such as this one:

<clients>
    <client>
        <name>Bob</name>
        <age>18</age>
    </client>
    <client>
        <name>Alice</name>
        <age>12</age>
    </client>
    <client>
        <name>Carlos</name>
        <age>28</age>
    </client>
</clients>

I want to keep only the client tag whose age equals 18. I'm using a command I found while googling around that gets all the "client" tags.

sed -n '/client>/,/<\/client/p' test.xml

This results in:

    <client>
        <name>Bob</name>
        <age>18</age>
    </client>
    <client>
        <name>Alice</name>
        <age>12</age>
    </client>
    <client>
        <name>Carlos</name>
        <age>28</age>
    </client>

I thought I could do something like this, but it doesn't work as I expected.

sed -n '/client>/(<age>18</age>)/<\/client/p' test.xml

Since downloading an external tool is not an option, I'm trying to use only core shell commands.

I'm expecting this result:

    <client>
        <name>Bob</name>
        <age>18</age>
    </client>

CodePudding user response:

Here's an awk idea:

awk -v age=18 '
    /<client>/   { in_client = 1; node = "" }
    in_client    { node = node $0 ORS }
    /<\/client>/ { in_client = 0; if ( node ~ "<age>" age "</age>" ) printf "%s", node }
' test.xml

The method is to "slurp" each client node into a string variable and check whether that string contains an age node matching the value passed in the "age" variable.

limitations

As is, it doesn't support:

  • multiple client nodes in a single line
  • nested client nodes
  • any additional character in the <client>, </client>, <age> and </age> tags.
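
The third limitation can be loosened a little by relaxing the regexes. What follows is only a sketch, assuming attributes may appear on the <client> tag and blanks may surround the age value; the function name filter_clients_loose is made up for illustration, and this is still pattern matching, not XML parsing.

```shell
# Hypothetical helper: print <client> nodes whose <age> equals $1.
# Reads the file named in $2, or stdin when no file is given.
filter_clients_loose() {
    awk -v age="$1" '
        # matches <client> as well as <client attr="...">, but not <clients>
        /<client[ >]/ { in_client = 1; node = "" }
        in_client     { node = node $0 ORS }
        /<\/client>/  {
            in_client = 0
            # tolerate optional blanks around the value inside <age>...</age>
            if ( node ~ ("<age>[ \t]*" age "[ \t]*</age>") ) printf "%s", node
        }
    ' "${2:--}"
}
```

Usage would be filter_clients_loose 18 test.xml. Multiple client nodes on one line and nested client nodes remain unsupported.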

edit: some explanations

awk processes the input file(s) line by line; when a line matches a condition, it executes the associated block of code. That is what each condition { code } pair is.

  1. /<client>/ { in_client = 1; node = "" }
    

    A condition of the form /regex/ is a regex match against the whole line (it is equivalent to $0 ~ /regex/). Therefore, /<client>/ means « when the current line contains <client> ». The corresponding code then sets the flag in_client to true (to indicate that we're in a client node) and initializes the node variable, which will store the entire client node string.

  2. in_client { node = node $0 ORS }
    

    When we're in a client node, append the content of the current line (i.e. $0) to the node variable. Note: by default, ORS is a newline character.

  3. /<\/client>/ {
        in_client = 0
        if ( node ~ "<age>" age "</age>" )
            printf "%s", node
    }
    

    When the current line contains the closing </client> tag, the whole client node should now be stored in the node variable, so we check whether that client has the age we're looking for (stored in the age variable); if so, we output the whole client node string. Note: the form var ~ "re" "g" "ex" (instead of var ~ /regex/) is used when you need to build up the regex from multiple strings.
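
To illustrate the note about building a regex from several strings, here is a possible generalization where the tag name is a variable too. It is a sketch under assumptions: the function name filter_nodes and the awk variables tag and value are invented for this example, and value is interpreted as a regex, not as a literal string.

```shell
# Hypothetical helper: filter_nodes TAG VALUE [FILE]
# Prints every <client> node that contains <TAG>VALUE</TAG>.
filter_nodes() {
    awk -v tag="$1" -v value="$2" '
        /<client>/   { in_client = 1; node = "" }
        in_client    { node = node $0 ORS }
        /<\/client>/ {
            in_client = 0
            # the regex is assembled at run time by string concatenation
            if ( node ~ ("<" tag ">" value "</" tag ">") ) printf "%s", node
        }
    ' "${3:--}"
}
```

With the sample file from the question, filter_nodes age 18 test.xml should print Bob's node and filter_nodes name Alice test.xml should print Alice's.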

CodePudding user response:

Of course, using an XPath or XQuery tool is highly preferred over doing pattern matching. However, since the OP asked for an approach with only core tools, you can write a small parser that might work if the input is indeed as simple as in this question.

The OP didn't mention which shell they use, but here is an approach in Bash that reads the file line by line, starts collecting a record when it reads <client>, and keeps reading until it finds </client>. At that point it checks whether the search string is found in the record; if so, the record is printed.

Code:

query="<$1>$2</$1>"

parsestatus=0
while read -r line; do
if [ $parsestatus -eq 0 ]; then
   # status 0, check for <client>
   if echo "$line"|grep -q "<client>"; then
      # client found, start a new record
      result="$line\n"
      parsestatus=1
   fi
elif [ $parsestatus -eq 1 ]; then
   # status 1, check for </client>
   if echo "$line"|grep -q "</client>"; then
      # </client> found, append it and check for query
      result+="$line\n"
      if echo "$result"|grep -q "$query"; then
         # %b expands the \n separators stored in result
         printf '%b' "$result"
      fi
      parsestatus=0
   else
      # line in between <client> and </client>, add to result
      result+="$line\n"
   fi
fi
done

Example run:

$ bash queryscript.sh age 18 < data.xml
<client>
<name>Bob</name>
<age>18</age>
</client>

Because the script reads from stdin, you can pipe the results of one query into a second query. For example, to get all records with age=18 that also have name=Bob:

bash queryscript.sh age 18 < data.xml|bash queryscript.sh name Bob
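
The same idea can be written without starting grep for every input line. The following is a rough sketch, not a drop-in replacement: the function name query_clients is hypothetical, it uses shell pattern matching (case) so the query is matched literally instead of as a regex, and IFS= is set so that the indentation of the input is preserved.

```shell
# Hypothetical function form: query_clients TAG VALUE < file.xml
query_clients() {
    query="<$1>$2</$1>"
    in_record=0
    record=""
    # IFS= keeps leading whitespace, -r keeps backslashes literal;
    # the extra test handles a last line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *"<client>"*) in_record=1; record="" ;;   # start a new record
        esac
        if [ "$in_record" -eq 1 ]; then
            # append the line followed by a real newline
            record="$record$line
"
        fi
        case $line in
            *"</client>"*)
                # record complete: print it if it contains the query string
                case $record in
                    *"$query"*) printf '%s' "$record" ;;
                esac
                in_record=0
                record=""
                ;;
        esac
    done
}
```

Since it is a function reading stdin, it can be chained the same way: query_clients age 18 < data.xml | query_clients name Bob.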

CodePudding user response:

Generally speaking, it is impossible to reliably parse arbitrary XML with regex engines.
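
A small illustration of the problem, assuming the same data is serialized with two clients on one line (which is still well-formed XML). The function name line_based_filter is made up here; its body is the line-based awk filter shown in the first answer.

```shell
# The line-based awk filter, wrapped in a hypothetical function reading stdin.
line_based_filter() {
    awk -v age="$1" '
        /<client>/   { in_client = 1; node = "" }
        in_client    { node = node $0 ORS }
        /<\/client>/ { in_client = 0; if ( node ~ "<age>" age "</age>" ) printf "%s", node }
    '
}

# Same data as in the question, but both clients share a single line.
one_line='<clients><client><name>Bob</name><age>18</age></client><client><name>Alice</name><age>12</age></client></clients>'

# The filter works on lines, not on elements, so the whole line is printed,
# Alice's record included.
printf '%s\n' "$one_line" | line_based_filter 18
```

A real XML parser does not care how the document is laid out in lines, which is why a dedicated tool is the more robust choice when one is available.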

You could use Xidel (available for a few OSs) and a standard XPath expression:

$ xidel --output-node-format=xml --xpath '//client[age=18]' file.xml
<client>
        <name>Bob</name>
        <age>18</age>
    </client>