Grep: count occurrences of a match in a curl body, excluding matches inside <!-- --> comments


I am very new to Linux and Bash scripting. I'm trying to read an XML file using curl and count the number of occurrences of the string </entity> in it.

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text/xml;charset=utf-8" | grep '</entity>' -oP | wc -l

This works, but the XML file contains comments like the ones below, which results in a wrong count.

Sample XML file


The expected output is 2, since one of the matches is inside a comment block.

CodePudding user response:

Since you're using GNU grep, here is a PCRE solution for your problem:

curl -s "https://server:port/app/collection/admin/file?wt=xml&_=12334343432&file=samplefile.xml&contentType=text/xml;charset=utf-8" |
grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
tr '\0' '\n' |
wc -l



RegEx Details:

  • (?s): Enable DOTALL mode so that dot matches line breaks also
  • <!--.*?-->: Match a commented block
  • (*SKIP)(*F): skips and fails this commented block
  • |: OR
  • </entity>: Match </entity> outside commented block
  • tr '\0' '\n': Converts NUL bytes to line break
  • wc -l: Counts number of lines
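
To try this pipeline without the server, here is a minimal sketch that wraps it in a function and feeds it a small stand-in document. The sample XML and the function name count_outside_comments are illustrative assumptions, not taken from the question, and the sketch needs GNU grep built with PCRE support (-P).

```shell
# Count </entity> occurrences on stdin, skipping anything inside <!-- ... -->.
# Assumes GNU grep with PCRE (-P) support.
count_outside_comments() {
  grep -ZzoP '(?s)<!--.*?-->(*SKIP)(*F)|</entity>' |
    tr '\0' '\n' |
    wc -l
}

# Hypothetical stand-in sample: two live entities, one commented out.
# Expected count for this sample: 2 (the commented-out match is skipped).
count_outside_comments <<'EOF'
<entities>
  <entity>one</entity>
  <!--
  <entity>commented out</entity>
  -->
  <entity>two</entity>
</entities>
EOF
```

With the real endpoint, pipe the curl -s "..." command into count_outside_comments instead of the here-document.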

CodePudding user response:

As usual when dealing with XML, regular expressions are the wrong tool for the job. Use something aware of the format. For example, using xmllint and some XPath:

curl ... | xmllint --xpath 'count(//entity)' -

(Note the trailing -; unlike many programs, xmllint won't automatically read from standard input if not given a filename on the command line)
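
A small sketch of the same idea against a local stand-in document, in case it helps to test before pointing it at the server. The sample XML and the function name count_entity_elements are assumptions for illustration. Note that this counts <entity> elements rather than closing-tag strings, so a self-closing <entity/> would be counted too, and the input has to be well-formed XML.

```shell
# Count <entity> elements with an XML parser. Comments are never elements,
# so commented-out markup is ignored. Requires xmllint (part of libxml2).
count_entity_elements() {
  xmllint --xpath 'count(//entity)' -
}

if command -v xmllint >/dev/null 2>&1; then
  # Hypothetical stand-in sample: two live entities, one commented out.
  count_entity_elements <<'EOF'
<entities>
  <entity>one</entity>
  <!--
  <entity>commented out</entity>
  -->
  <entity>two</entity>
</entities>
EOF
else
  echo "xmllint not found; install the libxml2 utilities to run this" >&2
fi
```

With the real endpoint, the usage would be curl ... | count_entity_elements.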

CodePudding user response:

With your shown samples, please try the following awk code. Written and tested in GNU awk.

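your_curl_command |
awk -v RS="" '
match($0,/(^|\n)<!--[^-]*-->/){ val=substr($0,RSTART,RLENGTH); gsub(val,"") }
END{
  while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){ count++; $0=substr($0,RSTART+RLENGTH) }
  print count
}
'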

Explanation: a detailed, line-by-line explanation of the code above.

your_curl_command |                ##Running curl command and sending its output to awk command.
awk -v RS="" '                     ##Setting RS to the empty string (paragraph mode: records are separated by blank lines).
match($0,/(^|\n)<!--[^-]*-->/){    ##Using match function of awk with regex (^|\n)<!--[^-]*--> (explained below).
  val=substr($0,RSTART,RLENGTH)    ##If the regex matches, assigning the matched substring to val.
  gsub(val,"")                     ##Using gsub (global substitution) to replace every occurrence of val with an empty string in the current record.
}
END{                               ##Starting END block of this awk program from here.
  while(match($0,/(\n|^)[[:space:]]*<entity>[^<]*<\/entity>/)){  ##Using while loop with match and regex (\n|^)[[:space:]]*<entity>[^<]*<\/entity> to visit every match and count it.
    count++                        ##Adding 1 to count variable here.
    $0=substr($0,RSTART+RLENGTH)   ##Assigning the text after the match to $0 so the same match is not found again.
  }
  print count                      ##Printing count value here.
}
'

Explanation of 1st regex((^|\n)<!--[^-]*-->):

(^|\n)    ##Matching either the start of the record OR a newline.
<!--[^-]* ##Followed by <!-- and then any run of characters other than -.
-->       ##Followed by --> here.

Explanation of 2nd regex((\n|^)[[:space:]]*<entity>[^<]*<\/entity>):

(\n|^)                ##Matching a newline OR the start of the record.
[[:space:]]*<entity>  ##Followed by whitespace (0 or more occurrences), then <entity>.
[^<]*                 ##Followed by any run of characters other than <.
<\/entity>            ##Followed by </entity> here.
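
Because [^-]* in the first regex stops at any hyphen, a comment whose body contains a - would not be matched by it. As a variation on the same "remove comments, then count" idea, here is a hedged sketch that uses index() and substr() instead of a regex to cut out comment blocks. It is not the answer's code; the function name count_without_comments and the sample input are assumptions for illustration, and it sticks to POSIX awk features.

```shell
# Strip every <!-- ... --> block, then count </entity> in what is left.
# Uses only POSIX awk features (index, substr, gsub).
count_without_comments() {
  awk '
    { doc = doc $0 "\n" }                      # collect the whole input
    END {
      while ((s = index(doc, "<!--")) > 0) {   # position of next comment opener
        rest = substr(doc, s + 4)              # text after the opener
        e = index(rest, "-->")                 # position of the closer, if any
        if (e == 0) {                          # unterminated comment: drop the tail
          doc = substr(doc, 1, s - 1)
          break
        }
        doc = substr(doc, 1, s - 1) substr(rest, e + 3)
      }
      n = gsub(/<\/entity>/, "&", doc)         # gsub returns the number of matches
      print n
    }'
}

# Hypothetical stand-in sample: the comment body contains a hyphen.
count_without_comments <<'EOF'
<entities>
  <entity>one</entity>
  <!-- old - <entity>commented out</entity> -->
  <entity>two</entity>
</entities>
EOF
```

With the real endpoint, the usage would be your_curl_command | count_without_comments.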

CodePudding user response:

gawk/mawk/mawk2/nawk '
 1      FS = RS = "^$"
 1      _____ = "[<][\\/]entity[>]"
 1      ____ = "\23\4"
 1      ___ =   "\32"
 1      __ = ("[\\n][<][!]")(_="[-][-][\\n]")
 1      sub("......","[\\n]&[>]",_)

# Rule(s)

 1  ($!-_=gsub(_____,"&",
     $((  gsub(__,____)*gsub(_, ___)*\
