Print XML field values to CSV-CodePudding

I have a directory with loads of XML files that are not formatted in the exact same way but the ones I am after contain the 3 same required fields. I need to know the SNPID of each library as they must be unique. For this I need to list each and their ID value.

Firstly I need to only get results from files containing the value="Kontakt" in the "PoweredBy" field. Then for each I need the Name, SNPID and RegKey to print into a 3 column CSV. eg. Mystica | 547 | Best Service - Mystica

The XML files are like this:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ProductHints spec="1.0.16">

  <Product version="2">
    <UPID>70026fd5-8f6f-429e-b891-12c2f94bc566</UPID>
    <Name>Mystica</Name>
    <Type>Content</Type>
    <NKSEnabled minVersion="1.1.0.0">true</NKSEnabled>
    <Relevance maxVersion="1.0.9.0" minVersion="1.0.0.0">
      <Application minVersion="5.0.0.0" nativeContent="true">kontakt</Application>
    </Relevance>
    <Relevance minVersion="1.1.0.0">
      <Application minVersion="5.8.0.0" nativeContent="true">kontakt</Application>
      <Application minVersion="2.6.5.0">maschine</Application>
      <Application minVersion="1.8.2.0">kkontrol</Application>
    </Relevance>
    <PoweredBy>Kontakt</PoweredBy>
    <Visibility maxVersion="1.0.9.0" minVersion="1.0.0.0" target="Standalone">1</Visibility>
    <Company>Best Service</Company>
    <AuthSystem>RAS2</AuthSystem>
    <SNPID>547</SNPID>
    <RegKey>Best Service - Mystica</RegKey>
    <Icon>bestservice</Icon>
    <ProductSpecific>
      <HU>496B8CF4F8B1402C4A6650214DF2514C</HU>
      <JDX>C9A2B6D9549FD159D8A3CFF054AAE934C2AC849EC74827847288DF07577A8F22</JDX>
      <Visibility type="Number">3</Visibility>
    </ProductSpecific>
  </Product>

</ProductHints>

I've tried

Computer:~ user$ cd /Library/Application\ Support/Native\ Instruments/Service\ Center 
Computer:Service Center

        while read -r Name SNPID RegKey
do
    echo "                   "
    echo "Name:  ${Name}"
    echo "SNPID: ${SNPID}"
    echo "RegKey: ${RegKey}"
awk -F '[<>]' '{if (FNR%2==1) {printf "%s: ",$3} else {print $3}}'
done

but the results are inconsistent (I know bash isn't great for parsing XML but the requirements are pretty basic).

find -name "*.xml" | xargs cat | tr -d "\n" | sed 's/<\/Name>/\n/g' | sed 's/<\/SNPID>/\n/g' | sed 's/<\/RegKey>/: /g' | sed 's/<[^>]*>//g' | egrep "Name:|SNPID:|RegKey:" | sed 's/Name: /---\nName: /g'

also results in error: find: illegal option -- n

Any BASH wizards out there that can help me please? (If possible I would prefer to use terminal without installing other things) Thanks

CodePudding user response：

Using sed

$ sed -n '/Name\|SNPID\|RegKey/{s/[^>]*>\([^<]*\).*/\1/;H;d};G;s/\n/ | /g;/^ |/{s/^ | \ | //p}' input_file
Mystica | 547 | Best Service - Mystica

CodePudding user response：

This might work for you (GNU sed):

sed -sE '/^\s*<(Name|SNPID|PoweredBy|RegKey)>(.*)<\/\1>\s*$/{s//\1:\2/;H}
         $!d;g;/PoweredBy:Kontakt/!d
         s/Name:([^\n]*)(.*)/\2\n\1/
         s/SNPID:([^\n]*)(.*)/\2|\1/
         s/RegKey:([^\n]*)(.*)/\2|\1/
         s/.*\n//' file ...

Use the options -s to treat each input file separately and -E to simplify regexps.

Using alternation, extract the required fields into the hold buffer.

At the end of the file if the PoweredBy field does not contain Kontakt no further processing is required.

Otherwise, format the fields in the hold buffer into the required format and print the result.

CodePudding user response：

1. Locating required files using `find` and `grep`

 grep -l "PoweredBy>Kontakt" $(find . -name "*.xml")

Validate you get the correct file list.

2. Extracting data, and formating data from required files using `awk`

This awk script, process only the required files one by one.

 awk -F"[><]" '
   $2 == "Name" {ret = $3 " | "}
   $2 == "SNPID" {ret = ret $3 " | "}
   $2 == "RegKey" {print ret $3}
 ' $(grep -l "PoweredBy>Kontakt" $(find . -name "*.xml"))

An alternative using awk without grep command.

 awk -F"[><]" '
   $2 == "PoweredBy" && $3 != "Kontakt" {nextfile} 
   $2 == "Name" {ret = $3 " | "}
   $2 == "SNPID" {ret = ret $3 " | "}
   $2 == "RegKey" {print ret $3}
 ' $(find . -name "*.xml")

CodePudding user response：

Highly recommeded: Don't try to parse xml with anything other than a proper xml parser. For that you can use either xmlstarlet

xml sel -T -t -m "//ProductHints//Product" -v //SNPID/. --nl -v //Name/. --nl -v //RegKey/. --nl your_file.xml

or, even simpler, xidel:

xidel your_file.xml -e '//ProductHints//Product//(Name,SNPID,RegKey)'

Output (based on the sample xml in your question) should be:

547
Mystica
Best Service - Mystica

1. Locating required files using find and grep

2. Extracting data, and formating data from required files using awk

1. Locating required files using `find` and `grep`

2. Extracting data, and formating data from required files using `awk`