Split up the text in configuration files

I am trying to write a simple parser for configuration files. For example, there might be a file called INCAR:

NSW    = 1000
POTIM  = 1
TEBEG  = 300

If I'd like to extract the value of POTIM, I could use awk to extract the text between blanks, as in the following script:

#!/bin/bash

vaspT () 
{ 
    if [ -f INCAR ]; then
        local potim=$(grep POTIM INCAR | awk '{print $3}');
    else
        local potim=1;
    fi;
    echo "# Time step: ${potim}" > .vasp_md.dat;
    echo "# Step  Temperature Total_energy E_pot E_kin" >> .vasp_md.dat;
}

vaspT

But if someone doesn't follow the alignment rule and writes the configuration file like this:

NSW = 1000
POTIM=1
TEBEG=300

then I have to use another delimiter.
My question is:
Is there a simple solution or an existing library (Python or Bash is acceptable) for this kind of job?

CodePudding user response：

You can use regex in this case:

import re
myString = """
NSW = 1000
POTIM       =          1
TEBEG=300
"""
re.findall(r"POTIM(\s+)?\=(\s+)?(\d+)", myString)

Output

[('       ', '          ', '1')]

With this pattern, no matter how many spaces there are, the last element of each tuple (if there is a match) is the value you want.

Another Example

import re
myString = """
NSW = 1000
POTIM=1
TEBEG=300
"""
re.findall(r"POTIM(\s+)?\=(\s+)?(\d+)", myString)

Output

[('', '', '1')]
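
As a minimal sketch of wrapping this in a reusable lookup: the helper name find_value is hypothetical, and anchoring the key at the start of a line is my own assumption, added so that a longer key ending in POTIM is not matched by accident. Like the pattern above, it only captures digits.

```python
import re

def find_value(text, key):
    """Return the digits assigned to `key` as a string, or None if absent."""
    # ^ with re.MULTILINE anchors the key at the start of a line, so a
    # key such as "XPOTIM" is not mistaken for "POTIM".
    pattern = rf"^\s*{re.escape(key)}\s*=\s*(\d+)"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1) if match else None

sample = "NSW = 1000\nPOTIM       =          1\nTEBEG=300\n"
print(find_value(sample, "POTIM"))    # 1
print(find_value(sample, "MISSING"))  # None
```

Using a single capturing group means there is no tuple to unpack; the result is either the value or None.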

CodePudding user response：

I would use cut for this use case...

grep POTIM INCAR | cut -d "=" -f 2 | sed s/\ //g

cut -d "=" -f 2 takes the second field with respect to the = delimiter.
sed s/\ //g removes all spaces from the value.

CodePudding user response：

You could use awk and set the field separator to = between optional spaces.

If the first field is POTIM, then print the second field.

awk -F"[[:space:]]*=[[:space:]]*" '
$1=="POTIM" {print $2}
' file

Output

1

CodePudding user response：

You can create a dictionary containing the variables and their value:

import re

with open("filename", "r") as f:
    config = dict(re.findall(r"(\w+)\s*=\s*(\w+)", f.read()))
print(config)

Output:

{'NSW': '1000', 'POTIM': '1', 'TEBEG': '300'}

You can then retrieve the value of each variable easily:

print(config["POTIM"])  # 1

(\w+)\s*=\s*(\w+)

(\w+): First capturing group, matches any word character between 1 and unlimited times.
\s*: Matches any whitespace between 0 and unlimited times.
=: Matches =.
\s*: Matches any whitespace between 0 and unlimited times.
(\w+): Second capturing group, matches any word character between 1 and unlimited times.

For each match, re.findall will create a tuple containing the capturing groups. Using dict() will then convert the list to a dictionary.
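
One caveat worth illustrating: a word-character group stops at the first character that is not a letter, digit or underscore, so a value such as 0.5 would be cut short. The sketch below compares that behaviour with a variant that captures to the end of the line. The sample values (0.5, bulk Si) and the names narrow and wide are made up for illustration and are not taken from the question.

```python
import re

text = "NSW = 1000\nPOTIM = 0.5\nSYSTEM = bulk Si\n"

# \w+ stops at the first non-word character, so "0.5" is cut to "0".
narrow = dict(re.findall(r"(\w+)\s*=\s*(\w+)", text))

# Assumed alternative: capture everything up to the end of the line,
# then strip surrounding whitespace from the value.
wide = {
    key: value.strip()
    for key, value in re.findall(
        r"^\s*(\w+)\s*=\s*(.*)$", text, flags=re.MULTILINE
    )
}

print(narrow["POTIM"])  # 0
print(wide["POTIM"])    # 0.5
print(wide["SYSTEM"])   # bulk Si
```

Whether the wider capture is appropriate depends on the file format, for example on whether trailing comments are allowed, which the question does not specify.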

CodePudding user response：

Using sed

#!/bin/bash

vaspT () 
{ 
    if [ -f INCAR ]; then
        local potim
        potim=$(sed -n '/POTIM/s/.*=[[:space:]]*\(.*\)/\1/p' INCAR)
    else
        local potim
        potim=1
    fi
    echo "# Time step: ${potim}" > .vasp_md.dat
    echo "# Step  Temperature Total_energy E_pot E_kin" >> .vasp_md.dat
}

vaspT

CodePudding user response：

Is there a simple solution or an existing library(...)Python(...)for this kind of job?

There is configparser in the Python standard library, but it assumes that there is always a section header, so you would need to add one if your file has none. Consider the following example. Let the content of file.txt be

ZERO=0
LEFT =1
RIGHT= 1
BOTH = 2
MULTI  =   3

then it might be used as follows

import configparser
config = configparser.ConfigParser()
with open("file.txt","r") as f:
    config.read_string('[default]\n' + f.read())
print(config['default']['ZERO']) # 0
print(config['default']['LEFT']) # 1
print(config['default']['RIGHT']) # 1
print(config['default']['BOTH'])  # 2
print(config['default']['MULTI'])  # 3

Explanation: I prepend a [default] header line to allow configparser to work. Note that this is a workaround; you might instead require users to include a header, in which case usage becomes easier:

import configparser
config = configparser.ConfigParser()
config.read("file.txt")
...
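
If a key such as POTIM may be missing, configparser's fallback argument can supply a default, and getint converts the stored string to an integer. The sketch below keeps the sample content in a string (an assumption made so that it runs without file.txt on disk) and uses the prepended [default] header from the first example.

```python
import configparser

# Same sample content as file.txt above, held in a string for brevity.
content = "ZERO=0\nLEFT =1\nRIGHT= 1\nBOTH = 2\nMULTI  =   3\n"

config = configparser.ConfigParser()
config.read_string("[default]\n" + content)

# Option names are case-insensitive by default.
print(config["default"]["multi"])                       # 3
# getint converts the stored string to an int.
print(config.getint("default", "BOTH"))                 # 2
# fallback is returned when the option is missing.
print(config.getint("default", "POTIM", fallback=1))    # 1
```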

CodePudding user response：

You're doing too much in shell. Awk is the tool that the guys who invented shell also invented for shell to call to manipulate text, so just use awk for the whole text manipulation instead of unnecessarily adding other shell commands to feed awk one line at a time, etc.

Your question doesn't tell us what to do if the file exists but doesn't contain a POTIM= line or contains multiple POTIM= lines, or how to handle comments in your file (or what those would look like). So, ignoring the possibility of comments and guessing that if POTIM= doesn't exist you want to print 1, while if it does exist you want to print the last value seen:

$ cat tst.sh
#!/usr/bin/env bash

vaspT() {
    local infile='INCAR'
    [[ -f "$infile" ]] || infile='/dev/null'

    awk '
        {
            gsub(/^[[:space:]]+|[[:space:]]+$/,"")
            tag = val = $0
            sub(/[[:space:]]*=.*/,"",tag)
            sub(/[^=]*=[[:space:]]*/,"",val)
            tag2val[tag] = val
        }
        END {
            print "# Time step:", ("POTIM" in tag2val ? tag2val["POTIM"] : 1)
            print "# Step  Temperature Total_energy E_pot E_kin"
        }
    ' "$infile" > .vasp_md.dat
}

vaspT

$ ./tst.sh

$ cat .vasp_md.dat
# Time step: 1
# Step  Temperature Total_energy E_pot E_kin

I use this:

{
    gsub(/^[[:space:]]+|[[:space:]]+$/,"")
    tag = val = $0
    sub(/[[:space:]]*=.*/,"",tag)
    sub(/[^=]*=[[:space:]]*/,"",val)
    tag2val[tag] = val
}

instead of just:

BEGIN { FS = "[[:space:]]*=[[:space:]]*" }
{ tag2val[$1] = $2 }

so the code continues to work if there are leading or trailing spaces on the line or if the value contains an =, e.g.:

NSW    = 1000
   POTIM  = "foo=bar"  
TEBEG  = 300
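
For readers who prefer Python, roughly the same idea (trim the line, split on the first = only, let a later line override an earlier one) might be sketched as follows. The function name parse_tags is hypothetical, and skipping lines that contain no = is my own choice rather than something the awk script does.

```python
def parse_tags(lines):
    """Map each tag to its value; a later line overrides an earlier one."""
    tags = {}
    for line in lines:
        # partition splits on the first "=" only, so "=" may appear
        # inside the value.
        tag, sep, value = line.strip().partition("=")
        if sep:  # assumption: ignore lines that contain no "="
            tags[tag.strip()] = value.strip()
    return tags

sample = ['NSW    = 1000', '   POTIM  = "foo=bar"  ', 'TEBEG  = 300']
tags = parse_tags(sample)
print(tags["POTIM"])        # "foo=bar"
print(tags.get("ALGO", 1))  # 1
```

The dict.get call with a default plays the same role as the ("POTIM" in tag2val ? ... : 1) test in the awk END block.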