Home > OS >  XPath-3 CSV generation
XPath-3 CSV generation

Time:08-04

I'm trying to convert the following XML to CSV using XPath 3.0 (xidel --xpath):

<?xml version="1.0" encoding="utf-8" ?>
<csv>
    <record>
        <field1>A</field1>
        <field2>B</field2>
        <field3>C</field3>
    </record>
    <record>
        <field2> </field2>
        <field3></field3>
    </record>
    <record>
        <field1>,,</field1>
        <field2>""</field2>
        <field3>..</field3>
        <field3>.
                 .</field3>
    </record>
</csv>

My expected output would be:

field1,field2,field3
A,B,C
, ,""
",,","""""",".
                 ."

I've got a few problems (the first one isn't specific to xidel):

  1. I get the fields names with distinct-values(/csv/record/*/name()); how can I use that sequence for extracting the data in the records?

  2. I would like to differentiate between a missing and an empty field but the text() selector of xidel doesn't seem to care about that; is it a XPath feature or a xidel bug?

  3. I can't make return work; does xidel use a different syntax?

Update

I solved #1 myself and #2 was resolved by @ConalTuohy in his comment.

Here's what I got now:

#!/bin/bash
IFS='' read -r -d '' xpath <<'EOF'

let $csv-escape-string := function($str as xs:string) as xs:string {
    if ( matches( $str, ',|"|\n' ) )
    then
        concat('"',replace($str,'"','""'),'"')
    else
        $str
},
$fields-names := distinct-values(/csv/record/*/name()),
$csv := (
    string-join( $fields-names, ',' ),
    /csv/record/string-join(
        (
            for $fn in $fields-names
            return $csv-escape-string(string( *[name()=$fn][last()]/text() ))
        ), ','
    )
)
return $csv

EOF
xidel --xpath "$xpath" file.xml

But the output isn't what I would like it to be:

field1,field2,field3
A,B,C
,,
",,","""""",".
                 ."

Could someone try it with an other XPath-3 processor for making sure that it is xidel that is normalizing text()?

CodePudding user response:

When I ran your code on XML Workbench I got the following result:

A,B,C
, ,
",,","""""",".
                 ."

NB I changed the last line of the query to return string-join($csv, codepoints-to-string(10)) to return the entire CSV as a single string, so XML Workbench didn't quote each string in the result sequence individually, as it otherwise would.

CodePudding user response:

But the output isn't what I would like it to be

Most likely your xidel binary is too old.

xidel --xpath "$xpath" file.xml

The reason you put the query in $xpath is because of the single- and double-quotes, I guess? It sure is a nifty way to circumvent this problem, but not needed if you use double-quotes for strings. --xpath "function('string')" is for , while --xpath 'function("string")' is for . To escape a double-quote within a string simply enter a second one:

$ xidel -s input.xml -e '
  let $field-names:=distinct-values(csv/record/*/name()) return (
    join($field-names,","),
    csv/record/join(
      for $fn in $field-names return
      string(*[name()=$fn][last()]) ! (
        if (matches(.,",|""|\n"))
        then concat("""",replace(.,"""",""""""),"""")
        else .
      ),","
    )
  )
'

Alternatively you can use the XQuery notation:

if (matches(.,",|&quot;|\n"))
then concat("&quot;",replace(.,"&quot;","&quot;&quot;"),"&quot;")
else .

You can also use xidel's own "extended string"-syntax (and use another let-variable instead of !):

$ xidel -s input.xml -e '
  let $field-names:=distinct-values(csv/record/*/name()) return (
    join($field-names,","),
    csv/record/join(
      for $fn in $field-names
      let $val:=string(*[name()=$fn][last()])
      return
      if (matches($val,",|""|\n"))
      then x"""{replace($val,"""","""""")}"""
      else $val,
      ","
    )
  )
'
  • Related