Parsing multiple lines from a txt output

I need to parse part of my output file, which looks like this (an image is also attached for clarity):

  ATOM      1 c           2 c           3 c           4 n           5 h   
dE/dx -0.1150239D-01  0.2259669D-01 -0.3153006D-02 -0.2718508D-01 -0.1064344D-01
dE/dy  0.4798462D-02  0.5902019D-01 -0.4060517D-01  0.3404657D-01 -0.1054522D-01
dE/dz  0.6015413D-05  0.3707704D-02 -0.2306249D-02  0.1334956D-02 -0.8032586D-03

  ATOM      6 c           7 s           8 n           9 h          10 c
dE/dx  0.3017851D-01 -0.2253417D-01 -0.3195785D-01 -0.4441489D-02  0.8337613D-02
dE/dy -0.2805275D-01  0.1196856D-01  0.1888257D-01  0.1382483D-01 -0.8171767D-01
dE/dz -0.9413310D-03  0.2069422D-03  0.3914382D-03  0.6724659D-03 -0.4316928D-02

  ATOM     11 s          12 h          13 h          14 h          15 h
dE/dx  0.3138990D-01  0.8416159D-02  0.1128067D-02  0.8941240D-02  0.4292434D-03
dE/dy  0.3893252D-01  0.3335059D-02 -0.1457401D-01  0.4869915D-02 -0.1418384D-01
dE/dz  0.2767787D-02  0.1357569D-01 -0.7834375D-03 -0.1273530D-01 -0.7764904D-03

 resulting FORCE  (fx,fy,fz) = (-.874D-10,0.110D-08,0.562D-10)
 resulting MOMENT (mx,my,mz) = (-.504D-04,-.543D-04,-.538D-03)

The expected output is:

{'1c_ddz': '-0.3845687D-01', '1c_ddy': '0.2170984D-02', and so on}

The code I have so far looks like this:

class NACParser(ParseSection):
    name = "coupling"

    nac_coupling = SimpleLineParser(r" cartesian\s+nonadiabatic\s+coupling\s+matrix\s+elements\s+\((\d+)/(\w+)\)", ["value", "unit"], types=[int, str])
    atom_index = SimpleLineParser(r"<\s*(\d+)\s*\|\s*(\w+/\w+)\s*\|\s*(\d+)\s*>", ["num1", "atom?", "num2"], types=[int, str, int])
    ddx = SimpleLineParser(r"d/dx\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)", ["1c_ddx", "2c_ddx", "3c_ddx", "4n_ddx", "5h_ddx"], types=[str]*5)
    ddy = SimpleLineParser(r"d/dy\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)", ["1c_ddy", "2c_ddy", "3c_ddy", "4n_ddy", "5h_ddy"], types=[str]*5)
    ddz = SimpleLineParser(r"d/dz\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)", ["1c_ddz", "2c_ddz", "3c_ddz", "4n_ddz", "5h_ddz"], types=[str]*5)

    parsers = [nac_coupling, atom_index, ddx, ddy, ddz]

    def __init__(self):
        ParseSection.__init__(self, r" cartesian\s+nonadiabatic\s+coupling\s+matrix\s+elements\s+\((\d+)/(\w+)\)", r"maximum component of gradient", multi=True)

This produces the following output:

('nac coupling: ', {'1c_ddz': '-0.3845687D-01', '1c_ddy': '0.2170984D-02', 'atom?': 'd/dR', 'num1': 0, 'num2': 1, '5h_ddx': '0.3118277D-03', '5h_ddy': '0.8573042D-03', '5h_ddz': '-0.1580846D-01', '1c_ddx': '0.7336802D-03', 'unit': 'bohr', '2c_ddz': '0.3120165D-02', '2c_ddx': '-0.1305555D-02', '2c_ddy': '0.8126333D-02', 'value': 1, '4n_ddy': '-0.8441980D-02', '4n_ddx': '0.1166107D-02', '4n_ddz': '0.2287865D-02', '3c_ddy': '-0.9954913D-04', '3c_ddx': '-0.2407839D-04', '3c_ddz': '0.1907032D-02'})

This is good, but there are some issues:

It only keeps the values from the very last block. I think this is because later lines with the same pattern overwrite the earlier matches.
The code only works for this specific molecule, and I want something that works for any molecule. In this example the molecule has 15 atoms: the first atom is c (carbon), the 5th is h (hydrogen), and the 11th is s (sulfur). For a different molecule, both the total number of atoms (currently 15) and the element names can be different.

So how can I write general code that works for any molecule? Any help?

CodePudding user response：

This will do literally what you asked. Maybe you can use this as a basis. I gather all the atom IDs when I find a line with "ATOM", and create the dict entries when I find a line containing "/d" (as in "d/dx" or "dE/dx"). I would show the output, but I just typed in faked data because I didn't want to retype all of that.

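import re
from pprint import pprint

# One atom entry in an ATOM header line: index, space, element symbol (e.g. "10 c").
header = r"(\d+ [a-z]{1,2})"

atoms = []
gather = {}
with open('x.txt') as f:
    for line in f:
        # Skip blank or near-empty lines.
        if len(line) < 5:
            continue
        # An ATOM line names the atoms for the rows that follow it.
        if 'ATOM' in line:
            atoms = re.findall(header, line)
            atoms = [s.replace(' ', '') for s in atoms]
            continue
        # Derivative rows such as "dE/dx" or "d/dx" contain "/d".
        if '/d' in line:
            parts = line.split()
            row = parts[0].replace('/', '')
            for at, val in zip(atoms, parts[1:]):
                gather[at + '_' + row] = val
pprint(gather)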

Here's the output from your test data. I hope you realize that the cut-and-paste data doesn't match the image. The image uses d/dx, but the cut and paste uses dE/dx. I have assumed you want the "E" in the dict tag too, but that's easy to fix if you don't.

{'10c_dEdx': '0.8337613D-02',
 '10c_dEdy': '-0.8171767D-01',
 '10c_dEdz': '-0.4316928D-02',
 '11s_dEdx': '0.3138990D-01',
 '11s_dEdy': '0.3893252D-01',
 '11s_dEdz': '0.2767787D-02',
 '12h_dEdx': '0.8416159D-02',
 '12h_dEdy': '0.3335059D-02',
 '12h_dEdz': '0.1357569D-01',
 '13h_dEdx': '0.1128067D-02',
 '13h_dEdy': '-0.1457401D-01',
 '13h_dEdz': '-0.7834375D-03',
 '14h_dEdx': '0.8941240D-02',
 '14h_dEdy': '0.4869915D-02',
 '14h_dEdz': '-0.1273530D-01',
 '15h_dEdx': '0.4292434D-03',
 '15h_dEdy': '-0.1418384D-01',
 '15h_dEdz': '-0.7764904D-03',
 '1c_dEdx': '-0.1150239D-01',
 '1c_dEdy': '0.4798462D-02',
 '1c_dEdz': '0.6015413D-05',
 '2c_dEdx': '0.2259669D-01',
 '2c_dEdy': '0.5902019D-01',
 '2c_dEdz': '0.3707704D-02',
 '3c_dEdx': '-0.3153006D-02',
 '3c_dEdy': '-0.4060517D-01',
 '3c_dEdz': '-0.2306249D-02',
 '4n_dEdx': '-0.2718508D-01',
 '4n_dEdy': '0.3404657D-01',
 '4n_dEdz': '0.1334956D-02',
 '5h_dEdx': '-0.1064344D-01',
 '5h_dEdy': '-0.1054522D-01',
 '5h_dEdz': '-0.8032586D-03',
 '6c_dEdx': '0.3017851D-01',
 '6c_dEdy': '-0.2805275D-01',
 '6c_dEdz': '-0.9413310D-03',
 '7s_dEdx': '-0.2253417D-01',
 '7s_dEdy': '0.1196856D-01',
 '7s_dEdz': '0.2069422D-03',
 '8n_dEdx': '-0.3195785D-01',
 '8n_dEdy': '0.1888257D-01',
 '8n_dEdz': '0.3914382D-03',
 '9h_dEdx': '-0.4441489D-02',
 '9h_dEdy': '0.1382483D-01',
 '9h_dEdz': '0.6724659D-03'}
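
The values above are still strings in Fortran double-precision notation (a "D" exponent), and the keys keep the "E" from "dE/dx". If numeric values or the shorter "ddx" style keys from the question are wanted, a post-processing step along these lines might help. This is only a sketch: the helper names fortran_to_float and short_key are made up for illustration, and the sample dict holds a few entries copied from the output above.

```python
def fortran_to_float(text):
    """Convert a Fortran double-precision literal such as '0.2259669D-01' to a float."""
    # Python's float() understands 'E' exponents but not 'D', so swap the marker.
    return float(text.replace('D', 'E').replace('d', 'e'))


def short_key(key):
    """Turn a key like '1c_dEdx' into '1c_ddx' by dropping the 'E' from the label."""
    atom, label = key.split('_', 1)
    return atom + '_' + label.replace('E', '', 1)


# A few entries copied from the output above (hypothetical subset, for illustration).
gather = {
    '1c_dEdx': '-0.1150239D-01',
    '1c_dEdz': '0.6015413D-05',
    '10c_dEdy': '-0.8171767D-01',
}

# Build a new dict with shortened keys and float values.
converted = {short_key(k): fortran_to_float(v) for k, v in gather.items()}
print(converted)
```

Keeping the conversion separate from the parsing loop means the raw strings stay available if the exact text from the output file is needed later.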