I'm trying to work with an XML file such as this one:
<clients>
<client>
<name>Bob</name>
<age>18</age>
</client>
<client>
<name>Alice</name>
<age>12</age>
</client>
<client>
<name>Carlos</name>
<age>28</age>
</client>
</clients>
I want to keep only the client element whose age equals 18. I'm using a command I found by googling around that prints all the "client" elements:
sed -n '/client>/,/<\/client/p' test.xml
This results in:
<client>
<name>Bob</name>
<age>18</age>
</client>
<client>
<name>Alice</name>
<age>12</age>
</client>
<client>
<name>Carlos</name>
<age>28</age>
</client>
I thought I could do something like this, but it doesn't work as I expected:
sed -n '/client>/(<age>18</age>)/<\/client/p' test.xml
Since downloading an external tool is not an option, I'm trying to use only core shell commands.
I'm expecting this result:
<client>
<name>Bob</name>
<age>18</age>
</client>
CodePudding user response:
Here's an awk idea:
awk -v age=18 '
/<client>/ { in_client = 1; node = "" }
in_client { node = node $0 ORS }
/<\/client>/ { in_client = 0; if ( node ~ "<age>" age "</age>" ) printf "%s", node }
' test.xml
The method is to "slurp" each client node into a string variable and check whether that string contains an age node matching the "age" value passed as an argument.
limitations
As is, it doesn't support:
- multiple client nodes on a single line
- nested client nodes
- any additional characters in the <client>, </client>, <age> and </age> tags
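If the input may deviate slightly from the sample, the patterns can be loosened a little. The following is only a sketch under assumptions that are not stated in the question: the opening tag may carry attributes (e.g. <client id="1">) and the age value may be surrounded by spaces or tabs. The function name filter_clients_by_age is made up for illustration, and the other limitations (several nodes on one line, nested nodes) still apply.

```shell
# Hypothetical helper: filter_clients_by_age AGE FILE
# Assumes one tag per line, as in the sample file.
filter_clients_by_age() {
    awk -v age="$1" '
        # "<client>" or "<client attr=...>", but not "<clients>"
        /<client[ >]/ { in_client = 1; node = "" }
        in_client { node = node $0 ORS }
        /<\/client[ \t]*>/ {
            in_client = 0
            # allow blanks around the age value
            if (node ~ "<age>[ \t]*" age "[ \t]*</age>") printf "%s", node
        }
    ' "$2"
}
```

It would be called as: filter_clients_by_age 18 test.xml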
edit: some explanations
awk processes the input file(s) line by line; when a line matches a condition, it executes the associated block of code. That's what the condition { code } pairs are.
- /<client>/ { in_client = 1; node = "" }
  A condition of the form /regex/ is a regex match against the whole line (it is equivalent to $0 ~ /regex/). Therefore, /<client>/ means « when the current line contains <client> ». The corresponding code then sets the flag in_client to true (to indicate that we're in a client node) and initializes the node variable, which will store the entire client node string.
- in_client { node = node $0 ORS }
  When we're in a client node, append the content of the current line (i.e. $0) to the node variable. Note: by default, ORS is a newline character.
- /<\/client>/ { in_client = 0; if ( node ~ "<age>" age "</age>" ) printf "%s", node }
  When the current line contains the closing </client> tag, the whole client node should now be stored in the node variable, so we check whether that client has the age we're looking for (stored in the age variable); if so, we output the whole client node string. Note: the form var ~ "re" "g" "ex" (instead of var ~ /regex/) is used when you need to build up the regex from multiple strings.
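One consequence of building the regex from strings is that the value of age is itself interpreted as a regular expression. Here is a small sketch to illustrate the difference; the function names match_age_regex and match_age_literal are hypothetical. The first uses the dynamic regex form from above, the second uses awk's index() for a literal substring test.

```shell
# Dynamic regex: the value of "age" is treated as a regular expression.
match_age_regex() {
    awk -v age="$1" '$0 ~ "<age>" age "</age>"'
}

# Literal substring test: index() returns 0 (false) when the text is absent.
match_age_literal() {
    awk -v age="$1" 'index($0, "<age>" age "</age>")'
}
```

Given the two input lines <age>18</age> and <age>180</age>, match_age_regex 18 prints only the first line. Passing 1. to match_age_regex also prints <age>18</age>, because . matches any character, whereas match_age_literal 1. prints nothing.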
CodePudding user response:
Of course, using an XPath or XQuery tool is highly preferable to pattern matching. However, since the OP asked for an approach with only core tools, you can write a small parser that might work if the input is really as simple as in this question.
The OP didn't mention which shell they use, but here is an approach in Bash that reads the file line by line, starts collecting a record when it reads <client>, keeps reading until it finds </client>, and at that point checks whether the search string occurs in the record; if it does, the record is printed.
Code:
# Usage: bash queryscript.sh TAG VALUE < file.xml
query="<$1>$2</$1>"
parsestatus=0
# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line; do
    if [ "$parsestatus" -eq 0 ]; then
        # status 0, check for <client>
        if printf '%s\n' "$line" | grep -q "<client>"; then
            # client found, start a new record
            result="$line"$'\n'
            parsestatus=1
        fi
    elif [ "$parsestatus" -eq 1 ]; then
        # status 1, check for </client>
        if printf '%s\n' "$line" | grep -q "</client>"; then
            # </client> found, append it and check for query
            result+="$line"$'\n'
            if printf '%s' "$result" | grep -q "$query"; then
                printf '%s' "$result"
            fi
            parsestatus=0
        else
            # line in between <client> and </client>, add to result
            result+="$line"$'\n'
        fi
    fi
done
Example run:
$ bash queryscript.sh age 18 < data.xml
<client>
<name>Bob</name>
<age>18</age>
</client>
Because the script reads from stdin and writes to stdout, you can pipe the results of the first query into a second one. For example, to get all records with age=18 that also have name=Bob:
bash queryscript.sh age 18 < data.xml | bash queryscript.sh name Bob
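The same "age and name" filter could also be done in a single pass. The following is only a sketch that borrows the awk slurping idea from the other answer rather than extending the Bash script; the function name filter_clients_by is hypothetical, and it makes the same assumption of one tag per line. It uses index() so that both values are compared literally.

```shell
# Hypothetical helper: filter_clients_by NAME AGE < file.xml
filter_clients_by() {
    awk -v name="$1" -v age="$2" '
        /<client>/ { in_client = 1; node = "" }
        in_client { node = node $0 ORS }
        /<\/client>/ {
            in_client = 0
            # print the record only if both elements are present
            if (index(node, "<name>" name "</name>") &&
                index(node, "<age>" age "</age>"))
                printf "%s", node
        }
    '
}
```

It would be called as: filter_clients_by Bob 18 < data.xml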
CodePudding user response:
Generally speaking, regex engines cannot parse arbitrary XML correctly.
You could use Xidel (available for several operating systems) with a standard XPath expression:
xidel --output-node-format=xml --xpath '//client[age=18]' file.xml
<client>
<name>Bob</name>
<age>18</age>
</client>