I am trying to parse what I originally suspected was a JSON config file from a server.
After some attempts I was able to navigate and collapse the sections within Notepad++ when I selected JavaScript as the formatter.
However, I am stuck on how to convert/parse this data to JSON or another format; no online tools have been able to help with this.
How can I parse this text? Ideally I would like to use PowerShell, but Python would also be an option if I can figure out how to even begin the conversion.
For example, I am trying to parse out each of the servers, i.e. test1, test2, test3, and get the data listed within each block.
Here is a sample of the config file format:
servername {
store {
servers {
* {
value<>
port<>
folder<C:\windows>
monitor<yes>
args<-T -H>
xrg<store>
wysargs<-t -g -b>
accept_any<yes>
pdu_length<23622>
}
test1 {
name<test1>
port<123>
root<c:\test>
monitor<yes>
}
test2 {
name<test2>
port<124>
root<c:\test>
monitor<yes>
}
test3 {
name<test3>
port<125>
root<c:\test>
monitor<yes>
}
}
senders
timeout<30>
}
}
Answer:
Here's something that converts the above config file into a dict/JSON in Python. I'm just doing some regex substitutions, as @zett42 suggested.
import re
import json
lines = open('configfile', 'r').read()
# Quote the keys
lines2 = re.sub(r'([a-zA-Z\d_*]+)\s?{', r'"\1": {', lines)
# Process k<v> as key/value pairs
lines3 = re.sub(r'([a-zA-Z\d_*]+)\s?<([^<]*)>', r'"\1": "\2"', lines2)
# Process a single keyword on a line as a key/value pair with an empty value
lines4 = re.sub(r'^\s*([a-zA-Z\d_*]+)\s*$', r'"\1": ""', lines3, flags=re.MULTILINE)
# Replace \n with commas in lines ending with "
lines5 = re.sub(r'"\n', '",', lines4)
# Remove the comma before the closing bracket
lines6 = re.sub(r',\s*}', '}', lines5)
# Remove quotes from numerical values
lines7 = re.sub(r'"(\d+)"', r'\1', lines6)
# Remove remaining whitespace (except before a leading '-')
lines8 = re.sub(r'[ \t\r\f]+(?!-)', '', lines7)
# Add commas after closing brackets when needed
lines9 = re.sub(r'(?<=})\n(?=")', r",\n", lines8)
# Enclose in brackets and escape backslashes for JSON parsing
lines10 = '{' + lines9.replace('\\', '\\\\') + '}'
j = json.JSONDecoder().decode(lines10)
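Once decoded, the result is just nested dicts, so the server blocks can be pulled out directly. A minimal sketch (using a hand-built subset of the parsed structure here, so the snippet stands on its own):

```python
# A hand-built subset of what the decoder produces for the sample config
# (for illustration only; in practice use the decoded result from above).
j = {
    "servername": {
        "store": {
            "servers": {
                "test1": {"name": "test1", "port": 123, "root": "c:\\test", "monitor": "yes"},
                "test2": {"name": "test2", "port": 124, "root": "c:\\test", "monitor": "yes"},
            }
        }
    }
}

# Iterate over the server blocks and read settings from each one.
for server, settings in j["servername"]["store"]["servers"].items():
    print(server, settings["port"])
```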
Edit: Here's an alternative that may be a little cleaner
# Replace a line containing just a key with key<>
lines2 = re.sub(r'^([^{<>}]+)$', r'\1<>', lines, flags=re.MULTILINE)
# Remove spaces not within <>
lines3 = re.sub(r'\s(?!.*?>)|\s(?![^<]+>)', '', lines2, flags=re.MULTILINE)
# Quotations
lines4 = re.sub(r'([^{<>}]+)(?={)', r'"\1":', lines3)
lines5 = re.sub(r'([^:{<>}]+)<([^{<>}]*)>', r'"\1":"\2"', lines4)
# Add commas
lines6 = re.sub(r'(?<=")"(?!")', ',"', lines5)
lines7 = re.sub(r'}(?!}|$)', '},', lines6)
# Remove quotes from numbers
lines8 = re.sub(r'"(\d+)"', r'\1', lines7)
# Escape \ and enclose in brackets
lines9 = '{' + re.sub(r'\\', r'\\\\', lines8) + '}'
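This alternative ends with lines9 already wrapped in braces, so the last step is just handing it to the JSON decoder (shown here with a tiny stand-in literal so the snippet runs on its own):

```python
import json

# lines9 stands in for the fully transformed, brace-wrapped text from above.
lines9 = '{"timeout":30,"monitor":"yes"}'
j = json.loads(lines9)
print(j)  # {'timeout': 30, 'monitor': 'yes'}
```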
Answer:
I've come up with an even simpler solution than my previous one which uses PowerShell code only.
Using the RegEx alternation operator |, we combine all token patterns into a single pattern and use named subexpressions to determine which one has actually matched.
The rest of the code is structurally similar to the C#/PS version.
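For readers more at home in Python: the same alternation-with-named-groups idea looks roughly like this (a toy sketch with made-up group names and a reduced pattern, not the code below):

```python
import re

# Each alternative is a named group; whichever group matched tells us
# the token kind (mirroring the PowerShell named subexpressions).
pattern = r"(?P<Key>\w+)\s*<(?P<Value>[^>]*)>|(?P<ObjectBegin>\w+)\s*{|(?P<ObjectEnd>})"

for m in re.finditer(pattern, "test1 { port<123> }"):
    # Keep only the groups that actually matched.
    matched = {k: v for k, v in m.groupdict().items() if v is not None}
    print(matched)
```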
using namespace System.Text.RegularExpressions
$ErrorActionPreference = 'Stop'
Function ConvertFrom-ServerData {
[CmdletBinding()]
param (
[Parameter(Mandatory, ValueFromPipeline)] [string] $InputObject
)
begin {
# Key can consist of anything except whitespace and < > { }
$keyPattern = '[^\s<>{}]+'
# Order of the patterns is important
$pattern = (
"(?<IntKey>$keyPattern)\s*<(?<IntValue>\d+)>",
"(?<TrueKey>$keyPattern)\s*<yes>",
"(?<FalseKey>$keyPattern)\s*<no>",
"(?<StrKey>$keyPattern)\s*<(?<StrValue>.*?)>",
"(?<ObjectBegin>$keyPattern)\s*{",
"(?<ObjectEnd>})",
"(?<KeyOnly>$keyPattern)",
"(?<Invalid>\S+)" # any non-whitespace sequence that didn't match the valid patterns
) -join '|'
}
process {
# Output is an ordered hashtable
$curObject = $outputObject = [ordered] @{}
# A stack is used to keep track of nested objects.
$stack = [Collections.Stack]::new()
# For each pattern match
foreach( $match in [RegEx]::Matches( $InputObject, $pattern, [RegexOptions]::Multiline ) ) {
# Get the RegEx groups that have actually matched.
$matchGroups = $match.Groups.Where{ $_.Success -and $_.Name.Length -gt 1 }
$key = $matchGroups[ 0 ].Value
switch( $matchGroups[ 0 ].Name ) {
'ObjectBegin' {
$child = [ordered] @{}
$curObject[ $key ] = $child
$stack.Push( $curObject )
$curObject = $child
break
}
'ObjectEnd' {
if( $stack.Count -eq 0 ) {
Write-Error -EA Stop "Parse error: Curly braces are unbalanced. There are more '}' than '{' in config data."
}
$curObject = $stack.Pop()
break
}
'IntKey' {
$value = $matchGroups[ 1 ].Value
$intValue = 0
$curObject[ $key ] = if( [int]::TryParse( $value, [ref] $intValue ) ) { $intValue } else { $value }
break
}
'TrueKey' {
$curObject[ $key ] = $true
break
}
'FalseKey' {
$curObject[ $key ] = $false
break
}
'StrKey' {
$value = $matchGroups[ 1 ].Value
$curObject[ $key ] = $value
break
}
'KeyOnly' {
$curObject[ $key ] = $null
break
}
'Invalid' {
Write-Warning "Invalid token at index $($match.Index): $key"
break
}
}
}
if( $stack.Count -gt 0 ) {
Write-Error "Parse error: Curly braces are unbalanced. There are more '{' than '}' in config data."
}
$outputObject # Implicit output
}
}
Usage example:
$sampleData = @'
test-server {
store {
servers {
* {
value<>
port<>
folder<C:\windows> monitor<yes>
args<-T -H>
xrg<store>
wysargs<-t -g -b>
accept_any<yes>
pdu_length<23622>
}
test1 {
name<test1>
port<123>
root<c:\test>
monitor<yes>
}
test2 {
name<test2>
port<124>
root<c:\test>
monitor<yes>
}
test3 {
name<test3>
port<125>
root<c:\test>
monitor<yes>
}
}
senders
timeout<30>
}
}
'@
# Call the parser
$objects = $sampleData | ConvertFrom-ServerData
# Uncomment to verify the whole result
#$objects | ConvertTo-Json -Depth 10
# The parser outputs nested hashtables, so we have to use GetEnumerator() to
# iterate over the key/value pairs.
$objects.'test-server'.store.servers.GetEnumerator().ForEach{
"[ SERVER: $($_.Key) ]"
# Convert server values hashtable to PSCustomObject for better output formatting
[PSCustomObject] $_.Value | Format-List
}
Output:
[ SERVER: * ]
value :
port :
folder : C:\windows
monitor : True
args : -T -H
xrg : store
wysargs : -t -g -b
accept_any : True
pdu_length : 23622
[ SERVER: test1 ]
name : test1
port : 123
root : c:\test
monitor : True
[ SERVER: test2 ]
name : test2
port : 124
root : c:\test
monitor : True
[ SERVER: test3 ]
name : test3
port : 125
root : c:\test
monitor : True
Notes:
- I have further relaxed the regular expressions. Keys may now consist of any character except whitespace, <, >, { and }.
- Line breaks are no longer required. This is more flexible, but you can't have strings with embedded > characters. Let me know if this is a problem.
- I have added detection of invalid tokens, which are output as warnings. Remove the "(?<Invalid>\S+)" line if you want to ignore invalid tokens instead.
- Unbalanced curly braces are detected and reported as errors.
- You can see how the RegEx works and get explanations at RegEx101.
Answer:
Edit: I've since come up with a much simpler, PowerShell-only solution, which I recommend using.
I'll keep this answer alive as it might still be useful for other scenarios. Also, there are probably differences in performance (I haven't measured).
MYousefi already posted a helpful answer with a Python implementation.
For PowerShell, I've come up with a solution that works without a convert-to-JSON step. Instead, I've adopted and generalized the RegEx-based tokenizer code from Jack Vanlightly (also see related blog post). A tokenizer (aka lexer) splits and categorizes the elements of the input text and outputs a flat stream of tokens (categories) and related data. A parser can use these as input to create a structured representation of the input text.
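The tokenize-then-parse split can be illustrated in a few lines of Python (a simplified sketch of the precedence idea with made-up token names, not the C# code below):

```python
import re

# Toy token definitions: (kind, pattern, precedence); a lower value wins
# when two definitions match at the same position.
TOKEN_DEFS = [
    ("ObjectBegin", r"(\w+)\s*{", 0),
    ("KeyValue", r"(\w+)<([^>]*)>", 0),
    ("ObjectEnd", r"}", 0),
]

def tokenize(text):
    # Collect every match of every definition, then resolve conflicts.
    matches = []
    for kind, pattern, prec in TOKEN_DEFS:
        for m in re.finditer(pattern, text):
            matches.append((m.start(), prec, m.end(), kind, m.groups()))
    # Prefer earlier, higher-precedence matches; skip overlapping ones.
    matches.sort(key=lambda t: (t[0], t[1]))
    tokens, last_end = [], 0
    for start, _prec, end, kind, groups in matches:
        if start < last_end:
            continue
        tokens.append((kind, groups))
        last_end = end
    return tokens

print(tokenize("test1 { port<123> }"))
```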
The tokenizer is written in generic C# and can be used for any input that can be split using RegEx. The C# code is included in PowerShell using the Add-Type
command, so no C# compiler is required.
The parser function ConvertFrom-ServerData
is written in PowerShell for simplicity. You only use the parser directly, so you don't have to know anything about the tokenizer C# code. If you want to adapt the code to different input, you should only have to modify the PowerShell parser code.
Save the following file in the same directory as the PowerShell script:
"RegExTokenizer.cs":
// Generic, precedence-based RegEx tokenizer.
// This code is based on https://github.com/Vanlightly/DslParser
// from Jack Vanlightly (https://jack-vanlightly.com).
// Modifications:
// - Interface improved for ease-of-use from PowerShell.
// - Return all groups from the RegEx match instead of just the value. This simplifies parsing of key/value pairs by requiring only a single token definition.
// - Some code simplifications, e. g. replacing "for" loops by "foreach".
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;
namespace DslTokenizer {
public class DslToken<TokenType> {
public TokenType Token { get; set; }
public GroupCollection Groups { get; set; }
}
public class TokenMatch<TokenType> {
public TokenType Token { get; set; }
public GroupCollection Groups { get; set; }
public int StartIndex { get; set; }
public int EndIndex { get; set; }
public int Precedence { get; set; }
}
public class TokenDefinition<TokenType> {
private Regex _regex;
private readonly TokenType _returnsToken;
private readonly int _precedence;
public TokenDefinition( TokenType returnsToken, string regexPattern, int precedence ) {
_regex = new Regex( regexPattern, RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.Compiled );
_returnsToken = returnsToken;
_precedence = precedence;
}
public IEnumerable<TokenMatch<TokenType>> FindMatches( string inputString ) {
foreach( Match match in _regex.Matches( inputString ) ) {
yield return new TokenMatch<TokenType>() {
StartIndex = match.Index,
EndIndex = match.Index + match.Length,
Token = _returnsToken,
Groups = match.Groups,
Precedence = _precedence
};
}
}
}
public class PrecedenceBasedRegexTokenizer<TokenType> {
private List<TokenDefinition<TokenType>> _tokenDefinitions = new List<TokenDefinition<TokenType>>();
public PrecedenceBasedRegexTokenizer() {}
public PrecedenceBasedRegexTokenizer( IEnumerable<TokenDefinition<TokenType>> tokenDefinitions ) {
_tokenDefinitions = tokenDefinitions.ToList();
}
// Easy-to-use interface as alternative to constructor that takes an IEnumerable.
public void AddTokenDef( TokenType returnsToken, string regexPattern, int precedence = 0 ) {
_tokenDefinitions.Add( new TokenDefinition<TokenType>( returnsToken, regexPattern, precedence ) );
}
public IEnumerable<DslToken<TokenType>> Tokenize( string lqlText ) {
var tokenMatches = FindTokenMatches( lqlText );
var groupedByIndex = tokenMatches.GroupBy( x => x.StartIndex )
.OrderBy( x => x.Key )
.ToList();
TokenMatch<TokenType> lastMatch = null;
foreach( var match in groupedByIndex ) {
var bestMatch = match.OrderBy( x => x.Precedence ).First();
if( lastMatch != null && bestMatch.StartIndex < lastMatch.EndIndex ) {
continue;
}
yield return new DslToken<TokenType>(){ Token = bestMatch.Token, Groups = bestMatch.Groups };
lastMatch = bestMatch;
}
}
private List<TokenMatch<TokenType>> FindTokenMatches( string lqlText ) {
var tokenMatches = new List<TokenMatch<TokenType>>();
foreach( var tokenDefinition in _tokenDefinitions ) {
tokenMatches.AddRange( tokenDefinition.FindMatches( lqlText ).ToList() );
}
return tokenMatches;
}
}
}
Parser function written in PowerShell:
$ErrorActionPreference = 'Stop'
Add-Type -TypeDefinition (Get-Content $PSScriptRoot\RegExTokenizer.cs -Raw)
Function ConvertFrom-ServerData {
[CmdletBinding()]
param (
[Parameter(Mandatory, ValueFromPipeline)] [string] $InputObject
)
begin {
# Define the kind of possible tokens.
enum ServerDataTokens {
ObjectBegin
ObjectEnd
ValueInt
ValueBool
ValueString
KeyOnly
}
# Create an instance of the tokenizer from "RegExTokenizer.cs".
$tokenizer = [DslTokenizer.PrecedenceBasedRegexTokenizer[ServerDataTokens]]::new()
# Define a RegEx for each token where 1st group matches key and 2nd matches value (if any).
# To resolve ambiguities, most specific RegEx must come first
# (e. g. ValueInt line must come before ValueString line).
# Alternatively pass a 3rd integer parameter that defines the precedence.
$tokenizer.AddTokenDef( [ServerDataTokens]::ObjectBegin, '^\s*([\w*]+)\s*{' )
$tokenizer.AddTokenDef( [ServerDataTokens]::ObjectEnd, '^\s*}\s*$' )
$tokenizer.AddTokenDef( [ServerDataTokens]::ValueInt, '^\s*(\w+)\s*<([+-]?\d+)>\s*$' )
$tokenizer.AddTokenDef( [ServerDataTokens]::ValueBool, '^\s*(\w+)\s*<(yes|no)>\s*$' )
$tokenizer.AddTokenDef( [ServerDataTokens]::ValueString, '^\s*(\w+)\s*<(.*)>\s*$' )
$tokenizer.AddTokenDef( [ServerDataTokens]::KeyOnly, '^\s*(\w+)\s*$' )
}
process {
# Output is an ordered hashtable
$outputObject = [ordered] @{}
$curObject = $outputObject
# A stack is used to keep track of nested objects.
$stack = [Collections.Stack]::new()
# For each token produced by the tokenizer
$tokenizer.Tokenize( $InputObject ).ForEach{
# $_.Groups[0] is the full match, which we discard by assigning to $null
$null, $key, $value = $_.Groups.Value
switch( $_.Token ) {
([ServerDataTokens]::ObjectBegin) {
$child = [ordered] @{}
$curObject[ $key ] = $child
$stack.Push( $curObject )
$curObject = $child
break
}
([ServerDataTokens]::ObjectEnd) {
$curObject = $stack.Pop()
break
}
([ServerDataTokens]::ValueInt) {
$intValue = 0
$curObject[ $key ] = if( [int]::TryParse( $value, [ref] $intValue ) ) { $intValue } else { $value }
break
}
([ServerDataTokens]::ValueBool) {
$curObject[ $key ] = $value -eq 'yes'
break
}
([ServerDataTokens]::ValueString) {
$curObject[ $key ] = $value
break
}
([ServerDataTokens]::KeyOnly) {
$curObject[ $key ] = $null
break
}
}
}
$outputObject # Implicit output
}
}
Usage example:
$sampleData = @'
servername {
store {
servers {
* {
value<>
port<>
folder<C:\windows>
monitor<yes>
args<-T -H>
xrg<store>
wysargs<-t -g -b>
accept_any<yes>
pdu_length<23622>
}
test1 {
name<test1>
port<123>
root<c:\test>
monitor<yes>
}
test2 {
name<test2>
port<124>
root<c:\test>
monitor<yes>
}
test3 {
name<test3>
port<125>
root<c:\test>
monitor<yes>
}
}
senders
timeout<30>
}
}
'@
# Call the parser
$objects = $sampleData | ConvertFrom-ServerData
# The parser outputs nested hashtables, so we have to use GetEnumerator() to
# iterate over the key/value pairs.
$objects.servername.store.servers.GetEnumerator().ForEach{
"[ SERVER: $($_.Key) ]"
# Convert server values hashtable to PSCustomObject for better output formatting
[PSCustomObject] $_.Value | Format-List
}
Output:
[ SERVER: * ]
value :
port :
folder : C:\windows
monitor : True
args : -T -H
xrg : store
wysargs : -t -g -b
accept_any : True
pdu_length : 23622
[ SERVER: test1 ]
name : test1
port : 123
root : c:\test
monitor : True
[ SERVER: test2 ]
name : test2
port : 124
root : c:\test
monitor : True
[ SERVER: test3 ]
name : test3
port : 125
root : c:\test
monitor : True
Notes:
- If you pass input from Get-Content to the parser, make sure to use parameter -Raw, e.g. $objects = Get-Content input.cfg -Raw | ConvertFrom-ServerData. Otherwise the parser would try to parse each input line on its own.
- I've opted to convert "yes"/"no" values to bool, so they output as "True"/"False". Remove the line $tokenizer.AddTokenDef( 'ValueBool', ... to parse them as string instead and output them as-is.
- Keys without a <> value (the "senders" in the example) are stored as keys with value $null.
- The RegEx's enforce that values can be single-line only (as the sample data suggests). This allows us to have embedded > characters without the need to escape them.