Unable to parse JSON? Or is it JavaScript text (EDIT: How to parse a custom config file)

Time:05-17

I am trying to parse what I originally suspected was a JSON config file from a server.

After some attempts, I was able to navigate and collapse the sections in Notepad when I selected JavaScript as the formatter.

However, I am stuck on how to convert/parse this data to JSON or another format; no online tools have been able to help with this.

How can I parse this text? Ideally I would like to use PowerShell, but Python would also be an option if I can figure out how to even begin the conversion.

Thank you!

For example, I am trying to parse out each of the servers, i.e. test1, test2, and test3, and get the data listed within each block.

Here is a sample of the config file format:

servername {
  store {
    servers {
      * {
        value<>
        port<>
        folder<C:\windows>
        monitor<yes>
        args<-T -H>
        xrg<store>
        wysargs<-t -g -b>
        accept_any<yes>
        pdu_length<23622>
      }
      test1 {
        name<test1>
        port<123>
        root<c:\test>
        monitor<yes>
      }
      test2 {
        name<test2>
        port<124>
        root<c:\test>
        monitor<yes>
      }
      test3 {
        name<test3>
        port<125>
        root<c:\test>
        monitor<yes>
      }
    }
    senders
    timeout<30>
  }
}

CodePudding user response:

Here's something that converts the above config file into a dict/JSON in Python. I'm just doing some regex substitutions, as @zett42 suggested.

import re
import json

lines = open('configfile', 'r').read()

# Quote the keys (next 3 lines)
lines2 = re.sub(r'([a-zA-Z\d_*]+)\s?{', r'"\1": {', lines)
# Process k<v> as key/value pairs
lines3 = re.sub(r'([a-zA-Z\d_*]+)\s?<([^<]*)>', r'"\1": "\2"', lines2)
# Process a single keyword on a line as a key/value pair with an empty value
lines4 = re.sub(r'^\s*([a-zA-Z\d_*]+)\s*$', r'"\1": ""', lines3, flags=re.MULTILINE)

# Replace \n with a comma on lines ending with "
lines5 = re.sub(r'"\n', '",', lines4)

# Remove the comma before a closing bracket
lines6 = re.sub(r',\s*}', '}', lines5)

# Remove quotes from numerical values
lines7 = re.sub(r'"(\d+)"', r'\1', lines6)

# Remove insignificant whitespace (spaces before "-" belong to values and are kept)
lines8 = re.sub(r'[ \t\r\f]+(?!-)', '', lines7)
# Add commas after closing brackets where needed
lines9 = re.sub(r'(?<=})\n(?=")', r',\n', lines8)

# Enclose in brackets and escape backslashes for JSON parsing
lines10 = '{' + lines9.replace('\\', '\\\\') + '}'

j = json.JSONDecoder().decode(lines10)
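The same substitutions can also be collected into a list and applied in order by a small helper, which makes it easy to sanity-check the pipeline on an inlined sample instead of reading from `configfile` (a sketch; the `config_to_dict` name and the trimmed sample are mine):

```python
import json
import re

# The substitutions from above, applied in sequence (order matters).
SUBS = [
    (r'([a-zA-Z\d_*]+)\s?{', r'"\1": {'),             # quote keys before {
    (r'([a-zA-Z\d_*]+)\s?<([^<]*)>', r'"\1": "\2"'),  # k<v> -> "k": "v"
    (r'(?m)^\s*([a-zA-Z\d_*]+)\s*$', r'"\1": ""'),    # bare key -> "k": ""
    (r'"\n', '",'),                                   # comma after closing quote
    (r',\s*}', '}'),                                  # strip trailing commas
    (r'"(\d+)"', r'\1'),                              # unquote integers
    (r'[ \t\r\f]+(?!-)', ''),                         # drop insignificant spaces
    (r'(?<=})\n(?=")', ',\n'),                        # comma between blocks
]

def config_to_dict(text):
    for pattern, repl in SUBS:
        text = re.sub(pattern, repl, text)
    # Wrap in braces and escape backslashes so json can parse it.
    return json.loads('{' + text.replace('\\', '\\\\') + '}')

sample = 'servers {\n  test1 {\n    port<123>\n    root<c:\\test>\n  }\n}'
d = config_to_dict(sample)
print(d['servers']['test1']['port'])  # 123
```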

Edit: Here's an alternative that may be a little cleaner

# Replace a line containing just a key with key<>
lines2 = re.sub(r'^([^{<>}]+)$', r'\1<>', lines, flags=re.MULTILINE)
# Remove spaces not within <>
lines3 = re.sub(r'\s(?!.*?>)|\s(?![^<]+>)', '', lines2, flags=re.MULTILINE)
# Quotations
lines4 = re.sub(r'([^{<>}]+)(?={)', r'"\1":', lines3)
lines5 = re.sub(r'([^:{<>}]+)<([^{<>}]*)>', r'"\1":"\2"', lines4)
# Add commas
lines6 = re.sub(r'(?<=")"(?!")', ',"', lines5)
lines7 = re.sub(r'}(?!}|$)', '},', lines6)
# Remove quotes from numbers
lines8 = re.sub(r'"(\d+)"', r'\1', lines7)
# Escape \
lines9 = '{' + re.sub(r'\\', r'\\\\', lines8) + '}'

CodePudding user response:

MYousefi already posted a helpful answer with a Python implementation.

For PowerShell, I've come up with a solution that works without a convert-to-JSON step. Instead, I've adapted and generalized the RegEx-based tokenizer code from Jack Vanlightly (also see the related blog post). A tokenizer (aka lexer) splits and categorizes the elements of the input text and outputs a flat stream of tokens (categories) and their related data. A parser can then use these as input to create a structured representation of the input text.
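To illustrate the idea before diving into the C# code, here is a deliberately simplified tokenizer sketch in Python: it matches line by line with first-match-wins ordering, rather than the precedence-based whole-input matching the C# class performs.

```python
import re

# Ordered token definitions: the most specific pattern comes first,
# mirroring the precedence idea used by the C# tokenizer below.
TOKEN_DEFS = [
    ('ObjectBegin', re.compile(r'^\s*([\w*]+)\s*{')),
    ('ObjectEnd',   re.compile(r'^\s*}\s*$')),
    ('KeyValue',    re.compile(r'^\s*(\w+)\s*<(.*)>\s*$')),
    ('KeyOnly',     re.compile(r'^\s*(\w+)\s*$')),
]

def tokenize(text):
    """Yield (token, groups) pairs -- a flat stream a parser can consume."""
    for line in text.splitlines():
        for name, regex in TOKEN_DEFS:
            match = regex.match(line)
            if match:
                yield (name, match.groups())
                break

sample = 'test1 {\n  port<123>\n}'
print(list(tokenize(sample)))
# [('ObjectBegin', ('test1',)), ('KeyValue', ('port', '123')), ('ObjectEnd', ())]
```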

The tokenizer is written in generic C# and can be used for any input that can be split using RegEx. The C# code is included in PowerShell using the Add-Type command, so no C# compiler is required.

The parser function ConvertFrom-ServerData is written in PowerShell for simplicity. You only use the parser directly, so you don't have to know anything about the C# tokenizer code. If you want to adapt the code to different input, you should only have to modify the PowerShell parser code.

Save the following file in the same directory as the PowerShell script:

"RegExTokenizer.cs":

// Generic, precedence-based RegEx tokenizer.
// This code is based on https://github.com/Vanlightly/DslParser 
// from Jack Vanlightly (https://jack-vanlightly.com).
// Modifications:
// - Interface improved for ease-of-use from PowerShell.
// - Return all groups from the RegEx match instead of just the value. This simplifies parsing of key/value pairs by requiring only a single token definition.
// - Some code simplifications, e.g. replacing "for" loops with "foreach".

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;

namespace DslTokenizer {
    public class DslToken<TokenType> {
        public TokenType Token { get; set; }
        public GroupCollection Groups { get; set; }
    }

    public class TokenMatch<TokenType> {
        public TokenType Token { get; set; }
        public GroupCollection Groups { get; set; }
        public int StartIndex { get; set; }
        public int EndIndex { get; set; }
        public int Precedence { get; set; }
    }

    public class TokenDefinition<TokenType> {
        private Regex _regex;
        private readonly TokenType _returnsToken;
        private readonly int _precedence;

        public TokenDefinition( TokenType returnsToken, string regexPattern, int precedence ) {
            _regex = new Regex( regexPattern, RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.Compiled );
            _returnsToken = returnsToken;
            _precedence = precedence;
        }

        public IEnumerable<TokenMatch<TokenType>> FindMatches( string inputString ) {

            foreach( Match match in _regex.Matches( inputString ) ) {
                yield return new TokenMatch<TokenType>() {
                    StartIndex = match.Index,
                    EndIndex   = match.Index + match.Length,
                    Token      = _returnsToken,
                    Groups     = match.Groups,
                    Precedence = _precedence
                };
            }
        }
    }

    public class PrecedenceBasedRegexTokenizer<TokenType> {

        private List<TokenDefinition<TokenType>> _tokenDefinitions = new List<TokenDefinition<TokenType>>();

        public PrecedenceBasedRegexTokenizer() {}

        public PrecedenceBasedRegexTokenizer( IEnumerable<TokenDefinition<TokenType>> tokenDefinitions ) {
            _tokenDefinitions = tokenDefinitions.ToList();
        }

        // Easy-to-use interface as alternative to constructor that takes an IEnumerable.
        public void AddTokenDef( TokenType returnsToken, string regexPattern, int precedence = 0 ) {
            _tokenDefinitions.Add( new TokenDefinition<TokenType>( returnsToken, regexPattern, precedence ) );
        }

        public IEnumerable<DslToken<TokenType>> Tokenize( string lqlText ) {

            var tokenMatches = FindTokenMatches( lqlText );

            var groupedByIndex = tokenMatches.GroupBy( x => x.StartIndex )
                .OrderBy( x => x.Key )
                .ToList();

            TokenMatch<TokenType> lastMatch = null;

            foreach( var match in groupedByIndex ) {

                var bestMatch = match.OrderBy( x => x.Precedence ).First();
                if( lastMatch != null && bestMatch.StartIndex < lastMatch.EndIndex ) {
                    continue;
                }

                yield return new DslToken<TokenType>(){ Token = bestMatch.Token, Groups = bestMatch.Groups };

                lastMatch = bestMatch;
            }
        }

        private List<TokenMatch<TokenType>> FindTokenMatches( string lqlText ) {

            var tokenMatches = new List<TokenMatch<TokenType>>();

            foreach( var tokenDefinition in _tokenDefinitions ) {
                tokenMatches.AddRange( tokenDefinition.FindMatches( lqlText ).ToList() );
            }
            return tokenMatches;
        }
    }        
}

Parser function written in PowerShell:

$ErrorActionPreference = 'Stop'

Add-Type -TypeDefinition (Get-Content $PSScriptRoot\RegExTokenizer.cs -Raw)

Function ConvertFrom-ServerData {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory, ValueFromPipeline)] [string] $InputObject
    )

    begin {
        # Define the kind of possible tokens.
        enum ServerDataTokens {
            ObjectBegin
            ObjectEnd
            ValueInt
            ValueBool
            ValueString
            KeyOnly
        }
        
        # Create an instance of the tokenizer from "RegExTokenizer.cs".
        $tokenizer = [DslTokenizer.PrecedenceBasedRegexTokenizer[ServerDataTokens]]::new()

        # Define a RegEx for each token where the 1st group matches the key and the 2nd matches the value (if any).
        # To resolve ambiguities, the most specific RegEx must come first
        # (e.g. the ValueInt definition must come before the ValueString definition).
        # Alternatively, pass a 3rd integer parameter that defines the precedence.
        $tokenizer.AddTokenDef( [ServerDataTokens]::ObjectBegin, '^\s*([\w*]+)\s*{' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ObjectEnd,   '^\s*}\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ValueInt,    '^\s*(\w+)\s*<([+-]?\d+)>\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ValueBool,   '^\s*(\w+)\s*<(yes|no)>\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::ValueString, '^\s*(\w+)\s*<(.*)>\s*$' )
        $tokenizer.AddTokenDef( [ServerDataTokens]::KeyOnly,     '^\s*(\w+)\s*$' )
    }

    process {
        # Output is an ordered hashtable
        $outputObject = [ordered] @{}

        $curObject = $outputObject

        # A stack is used to keep track of nested objects.
        $stack = [Collections.Stack]::new()
        
        # For each token produced by the tokenizer
        $tokenizer.Tokenize( $InputObject ).ForEach{
        
            # $_.Groups[0] is the full match, which we discard by assigning to $null 
            $null, $key, $value = $_.Groups.Value
            
            switch( $_.Token ) {
                ([ServerDataTokens]::ObjectBegin) {  
                    $child = [ordered] @{} 
                    $curObject[ $key ] = $child
                    $stack.Push( $curObject )
                    $curObject = $child
                    break
                }
                ([ServerDataTokens]::ObjectEnd) {
                    $curObject = $stack.Pop()
                    break
                }
                ([ServerDataTokens]::ValueInt) {
                    $intValue = 0
                    $curObject[ $key ] = if( [int]::TryParse( $value, [ref] $intValue ) ) { $intValue } else { $value }
                    break
                }
                ([ServerDataTokens]::ValueBool) {
                    $curObject[ $key ] = $value -eq 'yes'
                    break
                }
                ([ServerDataTokens]::ValueString) {
                    $curObject[ $key ] = $value
                    break
                }
                ([ServerDataTokens]::KeyOnly) {
                    $curObject[ $key ] = $null
                    break
                }
            }
        }

        $outputObject  # Implicit output
    }
}

Usage example:

$sampleData = @'
servername {
    store {
      servers {
        * {
          value<>
          port<>
          folder<C:\windows>
          monitor<yes>
          args<-T -H>
          xrg<store>
          wysargs<-t -g -b>
          accept_any<yes>
          pdu_length<23622>
        }
        test1 {
          name<test1>
          port<123>
          root<c:\test>
          monitor<yes>
        }
        test2 {
          name<test2>
          port<124>
          root<c:\test>
          monitor<yes>
        }
        test3 {
          name<test3>
          port<125>
          root<c:\test>
          monitor<yes>
        }
      }
      senders
      timeout<30>
    }
  }
'@

# Call the parser
$objects = $sampleData | ConvertFrom-ServerData

# The parser outputs nested hashtables, so we have to use GetEnumerator() to
# iterate over the key/value pairs.

$objects.servername.store.servers.GetEnumerator().ForEach{
    "[ SERVER: $($_.Key) ]"
    # Convert server values hashtable to PSCustomObject for better output formatting
    [PSCustomObject] $_.Value | Format-List
}

Output:

[ SERVER: * ]

value      : 
port       : 
folder     : C:\windows
monitor    : True      
args       : -T -H     
xrg        : store     
wysargs    : -t -g -b  
accept_any : True      
pdu_length : 23622     


[ SERVER: test1 ]      

name    : test1        
port    : 123
root    : c:\test      
monitor : True


[ SERVER: test2 ]      

name    : test2        
port    : 124
root    : c:\test      
monitor : True


[ SERVER: test3 ]

name    : test3
port    : 125
root    : c:\test
monitor : True

Notes:

  • If you pass input from Get-Content to the parser, make sure to use the -Raw parameter, e.g. $objects = Get-Content input.cfg -Raw | ConvertFrom-ServerData. Otherwise the parser would try to parse each input line on its own.
  • I've opted to convert "yes"/"no" values to bool, so they output as "True"/"False". Remove the $tokenizer.AddTokenDef( [ServerDataTokens]::ValueBool, ... ) line to parse them as strings instead and output them as-is.
  • Keys without a <value> part ("senders" in the example) are stored as keys with value $null.
  • The RegExes enforce that values are single-line only (as the sample data suggests). This allows embedded > characters without the need to escape them.