sed - specific regex(s) based on prior given user "options"

Time:08-11

I am currently trying to merge several xml files via the following code:

rex_xh="-e '/^ *<\?xml[^>]*>$/d' -e 's/^ *<\?xml[^>]+>//'"
rex_el="-e '/^[[:space:]]*$/d'"
rex_ts="-e "'s/^[ \t]*//'
while read xmldat ; do
        cat $xmldat | sed $rex_xh $rex_el $rex_ts >> "$OUTDIR/$OUTFILE" ;
done <<< "$files"

which should essentially be executed (for each file) as:
cat $xmldat | sed -e '/^ *<\?xml[^>]*>$/d' -e 's/^ *<\?xml[^>]+>//' -e '/^[[:space:]]*$/d' -e 's/^[ \t]*//' >> "$OUTDIR/$OUTFILE"

However, when trying to execute this, I get this error message: sed: -e expression #1, char 1: unknown command: `'

If I execute the command without the variables and instead enter the sed commands directly, it works fine. What am I missing? Am I doing something wrong with the variable expansion?

Based on user input (to be added later), all three, only two, or only one of the given regular expressions should be applied to the file(s). The current setup should remove XML headers, remove empty lines, and remove leading tabs and spaces from each line.

Input Example

<?xml version="1.0" encoding="ISO-8859-15" standalone="no"?>
<RootNode xmlns="http://stub/example">
        
    <ExampleBase someattr="val">
                
        <InnerNode>Example</InnerNode>

    <ExampleBase someattr="val">

</RootNode>
                

Expected result (when header removal, space removal and empty line removal are wanted)

<RootNode xmlns="http://stub/example">  
<ExampleBase someattr="val">
<InnerNode>Example</InnerNode>
<ExampleBase someattr="val">
</RootNode>
                

Expected result (when only space removal and empty line removal are wanted)

<?xml version="1.0" encoding="ISO-8859-15" standalone="no"?>
<RootNode xmlns="http://stub/example">  
<ExampleBase someattr="val">
<InnerNode>Example</InnerNode>
<ExampleBase someattr="val">
</RootNode>
                

Input Example 2

<?xml version="1.0" encoding="ISO-8859-15" standalone="no"?><RootNode xmlns="http://stub/example"><ExampleBase someattr="val"><InnerNode>Example
                              </InnerNode>

    <ExampleBase someattr="val">

</RootNode>
                

(And yes, we really do get XML that is formatted this weirdly)

Expected result (when header removal, space removal and empty line removal are wanted)

<RootNode xmlns="http://stub/example"><ExampleBase someattr="val"><InnerNode>Example
</InnerNode>
<ExampleBase someattr="val">
</RootNode>
                

Notes:

  • The files are not always valid XML files, hence I can't use xmllint or other XML tools
    • e.g. no closing tag
  • The header is not always alone on the first line; sometimes it is not followed by a line break.
  • The different regexes (e.g. rex_xh) will later be optional and controlled by user input, hence the "necessity" to wrap them in variables
  • In the future it should be easy to add new "options", which is another reason for keeping the "options" in variables.

Can anyone help me out here?

CodePudding user response:

Please try the following awk code, which deals with the edge cases the OP has now added to the question. Written and tested with the shown samples only, in GNU awk.

awk -v RS="^$" '
match($0,/^<\?xml version="[^"]*" encoding="[^"]*" standalone="[^"]*"\?>/){
  val=substr($0,RSTART+RLENGTH)
  gsub(/\n/,"",val)
  gsub(/>[[:space:]]*</,">\n<",val)
  gsub(/[[:space:]]+</,"<",val)
  gsub(/>[[:space:]]*</,">\n<",val)
  print val
}
'  Input_file

Explanation: RS="^$" makes GNU awk read the whole file as a single record. If that record starts with an XML header (matched by the regex ^<\?xml version="[^"]*" encoding="[^"]*" standalone="[^"]*"\?>), substr() puts everything after the header into the variable val. The gsub function calls then remove all newlines from val, strip the whitespace in front of each <, and put a line break between a closing > and the < that follows it. Finally val is printed.
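
On the sed error from the question itself, here is a minimal sketch (assuming bash) of what probably happens. Quotes stored inside a variable are ordinary characters: after an unquoted expansion the shell only splits the value on whitespace and does not parse the quotes again, so the first expression sed receives starts with a literal ' character. The array name args is only used for this illustration.

```shell
#!/usr/bin/env bash
# Same shape as the first expression of rex_xh in the question:
# the single quotes are ordinary characters inside the double-quoted string.
rex_xh="-e '/^ *<\?xml[^>]*>$/d'"

set -f                # no filename globbing while the value is split
args=( $rex_xh )      # unquoted on purpose: split on whitespace, quotes are kept
set +f

# Print each resulting word in brackets.
printf '[%s]\n' "${args[@]}"
# [-e]
# ['/^]
# [*<\?xml[^>]*>$/d']
```

The second word, '/^, is what sed sees as expression #1, which matches the reported "char 1: unknown command" message.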



EDIT by OP - Implemented Solution: After some fiddling around, and thanks to the help, comments and answer from @RavinderSingh13, the following code is the final solution (snippet of the important part):

rm_xmlhead=1;  # Option given via user input (later)
rm_tabspac=1;  # Option given via user input (later)
rm_emptyln=1;  # Option given via user input (later)
while IFS= read -r xmldat ; do
  # Each removal only runs when its flag is non-zero.
  cat "$xmldat" | awk -v rem_xh="$rm_xmlhead" -v rem_ts="$rm_tabspac" -v rem_el="$rm_emptyln" ' {
          if(rem_xh) { sub(/^ *<\?xml[^>]+>/,"") }   # drop the XML header
          if(rem_ts) { sub(/^[[:space:]]+/,"") }     # drop leading spaces and tabs
          if(rem_el && $0 == "") { next }            # skip lines that are now empty
          print
      }' >> "$OUTPUT"
done <<< "$files"

This removes empty lines, leading spaces and tabs, and XML headers, and it is easily extended if any "new" requirements arise. It also lets me make each of the removals optional later.
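
If sed is preferred over awk, the optional expressions can also be kept in a bash array instead of plain string variables, so that each expression stays a single word when it is expanded. The following is only a sketch under that assumption: the function name clean_xml and the array name sed_args are made up here, and the three flags are the ones from the snippet above.

```shell
#!/usr/bin/env bash
# Flags as in the snippet above; later these would come from user input.
rm_xmlhead=1
rm_tabspac=1
rm_emptyln=1

# Collect one "-e expression" pair per enabled option.
sed_args=()
if (( rm_xmlhead )); then sed_args+=( -e 's/^[[:space:]]*<?xml[^>]*>//' ); fi
if (( rm_tabspac )); then sed_args+=( -e 's/^[[:space:]]*//' ); fi
if (( rm_emptyln )); then sed_args+=( -e '/^[[:space:]]*$/d' ); fi

# Hypothetical helper: clean one file and write the result to stdout.
clean_xml() {
  if (( ${#sed_args[@]} )); then
    sed "${sed_args[@]}" "$1"   # quoted array: every expression is passed intact
  else
    cat "$1"                    # no option enabled: pass the file through
  fi
}
```

Inside the existing loop this could be called as clean_xml "$xmldat" >> "$OUTPUT". A new option then only needs one more sed_args+=( ... ) line.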
