Parsing a string line by line into arrays

I'm quite new to Python, but I need to do some parsing for a research project. Right now it looks like the hardest obstacle between me and the actual science. One of the basic things I need is to convert a string of data into NumPy arrays. An example of the data:

CARBON
S   9
1         6.665000E+03           6.920000E-04
2         1.000000E+03           5.329000E-03
3         2.280000E+02           2.707700E-02
4         6.471000E+01           1.017180E-01
5         2.106000E+01           2.747400E-01
6         7.495000E+00           4.485640E-01
7         2.797000E+00           2.850740E-01
8         5.215000E-01           1.520400E-02
9         1.596000E-01          -3.191000E-03
S   9
1         6.665000E+03          -1.460000E-04
2         1.000000E+03          -1.154000E-03
3         2.280000E+02          -5.725000E-03
4         6.471000E+01          -2.331200E-02
5         2.106000E+01          -6.395500E-02
6         7.495000E+00          -1.499810E-01
7         2.797000E+00          -1.272620E-01
8         5.215000E-01           5.445290E-01
9         1.596000E-01           5.804960E-01
S   1
1         1.596000E-01           1.000000E+00
P   4
1         9.439000E+00           3.810900E-02
2         2.002000E+00           2.094800E-01
3         5.456000E-01           5.085570E-01
4         1.517000E-01           4.688420E-01
P   1
1         1.517000E-01           1.000000E+00
D   1
1         5.500000E-01           1.0000000

This needs to be read for N arbitrary atoms, so the parser can be called in a for loop, and the first line (the element name) may be omitted. The parser has to read the letter (S, L, P, D, F) and the number Nc to the right of it, then loop over the next Nc lines and copy the 2nd and 3rd columns into NumPy arrays that may belong to some class. Together these form a contracted Gaussian-type orbital that I will do some math with. If the letter is L, I would need a different class because a 4th column appears. If Nc == 1, there is just one line to read, and it would go into yet another class. After all N strings have been read, the data should look something like this:

 C         

   1   S    1    6665.000000    0.363803 (  0.000692) 
   1   S    2    1000.000000    0.675392 (  0.005329) 
   1   S    3     228.000000    1.132301 (  0.027077) 
   1   S    4      64.710000    1.654004 (  0.101718) 
   1   S    5      21.060000    1.924978 (  0.274740) 
   1   S    6       7.495000    1.448149 (  0.448564) 
   1   S    7       2.797000    0.439427 (  0.285074) 
   1   S    8       0.521500    0.006650 (  0.015204) 
   1   S    9       0.159600   -0.000574 ( -0.003191) 

   2   S   10    6665.000000   -0.076756 ( -0.000146) 
   2   S   11    1000.000000   -0.146257 ( -0.001154) 
   2   S   12     228.000000   -0.239407 ( -0.005725) 
   2   S   13      64.710000   -0.379069 ( -0.023312) 
   2   S   14      21.060000   -0.448104 ( -0.063955) 
   2   S   15       7.495000   -0.484201 ( -0.149981) 
   2   S   16       2.797000   -0.196168 ( -0.127262) 
   2   S   17       0.521500    0.238162 (  0.544529) 
   2   S   18       0.159600    0.104468 (  0.580496) 

   3   S   19       0.159600    0.179964 (  1.000000) 

   4   P   20       9.439000    0.898722 (  0.038109) 
   4   P   21       2.002000    0.711071 (  0.209480) 
   4   P   22       0.545600    0.339917 (  0.508557) 
   4   P   23       0.151700    0.063270 (  0.468842) 

   5   P   24       0.151700    0.134950 (  1.000000) 

   6   D   25       0.550000    0.578155 (  1.000000) 

 C         

   7   S   26    6665.000000    0.363803 (  0.000692) 
   7   S   27    1000.000000    0.675392 (  0.005329) 
   7   S   28     228.000000    1.132301 (  0.027077) 
   7   S   29      64.710000    1.654004 (  0.101718) 
   7   S   30      21.060000    1.924978 (  0.274740) 
   7   S   31       7.495000    1.448149 (  0.448564) 
   7   S   32       2.797000    0.439427 (  0.285074) 
   7   S   33       0.521500    0.006650 (  0.015204) 
   7   S   34       0.159600   -0.000574 ( -0.003191) 

   8   S   35    6665.000000   -0.076756 ( -0.000146) 
   8   S   36    1000.000000   -0.146257 ( -0.001154) 
   8   S   37     228.000000   -0.239407 ( -0.005725) 
   8   S   38      64.710000   -0.379069 ( -0.023312) 
   8   S   39      21.060000   -0.448104 ( -0.063955) 
   8   S   40       7.495000   -0.484201 ( -0.149981) 
   8   S   41       2.797000   -0.196168 ( -0.127262) 
   8   S   42       0.521500    0.238162 (  0.544529) 
   8   S   43       0.159600    0.104468 (  0.580496) 

   9   S   44       0.159600    0.179964 (  1.000000) 

  10   P   45       9.439000    0.898722 (  0.038109) 
  10   P   46       2.002000    0.711071 (  0.209480) 
  10   P   47       0.545600    0.339917 (  0.508557) 
  10   P   48       0.151700    0.063270 (  0.468842) 

  11   P   49       0.151700    0.134950 (  1.000000) 

  12   D   50       0.550000    0.578155 (  1.000000)

This is an example of a full basis set for a molecule, built from the individual atomic basis sets. The first column is the basis function number, the second is the basis function type (S, L, P, D, F, etc.), the third is the primitive basis function number, and the next two are those read by the parser. How would you recommend I do this so that I get ordered data like the above? And how exactly can strings be read into arrays line by line? Python's functionality is overwhelming. I tried using Pandas to convert the string into some kind of array and "filter" it, but I couldn't get it to work.
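The parser described in the question can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the `Shell` class and the `parse_shells` function are hypothetical names, the numbers in the `L` block of the sample are invented placeholders, and a single class is used for every shell type instead of the separate classes the question mentions.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Shell:
    """Hypothetical container for one contracted shell."""
    kind: str                  # shell letter: 'S', 'P', 'D', 'F' or 'L'
    exponents: np.ndarray      # 2nd column, shape (Nc,)
    coefficients: np.ndarray   # 3rd column (plus 4th for L), shape (Nc, ncols)


def parse_shells(text):
    """Parse shell blocks from a string into a list of Shell objects."""
    # Split into non-empty lines, then into whitespace-separated fields
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    shells = []
    i = 0
    while i < len(lines):
        parts = lines[i]
        if len(parts) == 1:
            # Element name such as CARBON: optional, so just skip it
            i += 1
            continue
        kind, nc = parts[0], int(parts[1])      # header: letter and Nc
        block = lines[i + 1:i + 1 + nc]         # the next Nc primitive lines
        # Drop the leading index column and convert the rest to floats
        data = np.array([[float(x) for x in row[1:]] for row in block])
        shells.append(Shell(kind, data[:, 0], data[:, 1:]))
        i += 1 + nc
    return shells


sample = """CARBON
S   2
1         5.215000E-01           1.520400E-02
2         1.596000E-01          -3.191000E-03
L   1
1         1.000000E+00           5.000000E-01           2.500000E-01
"""

shells = parse_shells(sample)
for s in shells:
    print(s.kind, s.exponents, s.coefficients.shape)
```

Because the coefficients are kept as a 2-D array, an L shell with its extra column fits the same container; separate classes could be layered on top if the math needs them.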

CodePudding user response：

Maybe this can give you a start. I made up the column names; you should provide correct ones. This will handle multiple molecules in a single file, but it will just concatenate them all into one dataframe. I would guess (based on no evidence) that you probably want one dataframe per molecule.

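import pandas as pd

rows = []
with open('x.txt') as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) == 1:
            # Element name line (e.g. CARBON): print it and reset both counters
            print(parts[0])
            counter1 = 0
            counter2 = 0
        elif len(parts) == 2:
            # Shell header (letter and primitive count): start a new shell
            counter1 += 1
            shell = (counter1, parts[0])
        elif len(parts) == 3:
            # Primitive line: index, exponent, coefficient
            counter2 += 1
            rows.append(shell + (counter2, float(parts[1]), float(parts[2])))

df = pd.DataFrame(rows, columns=['basisnum', 'basistype', 'primitive', 'energy', 'delta'])
print(df)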

Output:

CARBON
    basisnum basistype  primitive     energy     delta
0          1         S          1  6665.0000  0.000692
1          1         S          2  1000.0000  0.005329
2          1         S          3   228.0000  0.027077
3          1         S          4    64.7100  0.101718
4          1         S          5    21.0600  0.274740
5          1         S          6     7.4950  0.448564
6          1         S          7     2.7970  0.285074
7          1         S          8     0.5215  0.015204
8          1         S          9     0.1596 -0.003191
9          2         S         10  6665.0000 -0.000146
10         2         S         11  1000.0000 -0.001154
11         2         S         12   228.0000 -0.005725
12         2         S         13    64.7100 -0.023312
13         2         S         14    21.0600 -0.063955
14         2         S         15     7.4950 -0.149981
15         2         S         16     2.7970 -0.127262
16         2         S         17     0.5215  0.544529
17         2         S         18     0.1596  0.580496
18         3         S         19     0.1596  1.000000
19         4         P         20     9.4390  0.038109
20         4         P         21     2.0020  0.209480
21         4         P         22     0.5456  0.508557
22         4         P         23     0.1517  0.468842
23         5         P         24     0.1517  1.000000
24         6         D         25     0.5500  1.000000
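If one dataframe per atom is wanted, as guessed above, the same loop can collect rows per element line. The following is a rough sketch under a few assumptions: `frames_per_atom` is a hypothetical helper, every atom block is assumed to start with its element name line, and the counters deliberately keep running across atoms so that the numbering matches the listing in the question.

```python
import pandas as pd

COLUMNS = ['basisnum', 'basistype', 'primitive', 'energy', 'delta']


def frames_per_atom(lines):
    """Return a list of (atom name, DataFrame) pairs, one per element line."""
    atoms = []        # list of (name, rows) in file order
    shell_no = 0      # running basis function number across all atoms
    prim_no = 0       # running primitive number across all atoms
    shell = None
    for line in lines:
        parts = line.split()
        if len(parts) == 1:
            # Element name: start collecting rows for a new atom
            atoms.append((parts[0], []))
        elif len(parts) == 2:
            # Shell header: letter and primitive count
            shell_no += 1
            shell = (shell_no, parts[0])
        elif len(parts) == 3:
            # Primitive line: index, exponent, coefficient
            prim_no += 1
            atoms[-1][1].append(shell + (prim_no, float(parts[1]), float(parts[2])))
    return [(name, pd.DataFrame(rows, columns=COLUMNS)) for name, rows in atoms]


text = """CARBON
S   1
1         1.596000E-01           1.000000E+00
CARBON
P   1
1         1.517000E-01           1.000000E+00
"""

result = frames_per_atom(text.splitlines())
for name, frame in result:
    print(name)
    print(frame)
```

A list of pairs is used instead of a dict because the same element name can appear more than once in a molecule.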

CodePudding user response：

Thanks to Tim Roberts, I was able to write a piece of code for parsing basis sets. It's incomplete: another elif is needed to read SP/L basis functions, but it works.

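import basis_set_exchange as bse
import pandas as pd

# Fetch the carbon cc-pVDZ basis set as text in GAMESS (US) format
basis = bse.get_basis('cc-pVDZ', fmt='gamess_us', elements='C', header=False)
# Trim the first 5 and last 4 characters surrounding the data block
basis = basis[5:-4]

print(basis, '\n')

buf = basis.split('\n')
buf.pop(2)  # drop the line at index 2 before parsing

shellNumber = 0
shellType = ''
rows = []

for line in buf:
    parts = line.strip().split()
    if len(parts) == 2:
        # Shell header: letter and number of primitives
        shellType = parts[0]
        shellNumber += 1
    elif len(parts) == 3:
        # Primitive line: index, exponent, contraction coefficient
        rows.append((shellType, shellNumber, float(parts[1]), float(parts[2])))

df = pd.DataFrame(rows, columns=['SHELL TYPE', 'SHELL NO', 'EXPONENT', 'CONTR COEF'])
print(df)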

Output:

CARBON
S   9
1         6.665000E+03           6.920000E-04
2         1.000000E+03           5.329000E-03
3         2.280000E+02           2.707700E-02
4         6.471000E+01           1.017180E-01
5         2.106000E+01           2.747400E-01
6         7.495000E+00           4.485640E-01
7         2.797000E+00           2.850740E-01
8         5.215000E-01           1.520400E-02
9         1.596000E-01          -3.191000E-03
S   9
1         6.665000E+03          -1.460000E-04
2         1.000000E+03          -1.154000E-03
3         2.280000E+02          -5.725000E-03
4         6.471000E+01          -2.331200E-02
5         2.106000E+01          -6.395500E-02
6         7.495000E+00          -1.499810E-01
7         2.797000E+00          -1.272620E-01
8         5.215000E-01           5.445290E-01
9         1.596000E-01           5.804960E-01
S   1
1         1.596000E-01           1.000000E+00
P   4
1         9.439000E+00           3.810900E-02
2         2.002000E+00           2.094800E-01
3         5.456000E-01           5.085570E-01
4         1.517000E-01           4.688420E-01
P   1
1         1.517000E-01           1.000000E+00
D   1
1         5.500000E-01           1.0000000

 

   SHELL TYPE  SHELL NO   EXPONENT  CONTR COEF
0           S         1  6665.0000    0.000692
1           S         1  1000.0000    0.005329
2           S         1   228.0000    0.027077
3           S         1    64.7100    0.101718
4           S         1    21.0600    0.274740
5           S         1     7.4950    0.448564
6           S         1     2.7970    0.285074
7           S         1     0.5215    0.015204
8           S         1     0.1596   -0.003191
9           S         2  6665.0000   -0.000146
10          S         2  1000.0000   -0.001154
11          S         2   228.0000   -0.005725
12          S         2    64.7100   -0.023312
13          S         2    21.0600   -0.063955
14          S         2     7.4950   -0.149981
15          S         2     2.7970   -0.127262
16          S         2     0.5215    0.544529
17          S         2     0.1596    0.580496
18          S         3     0.1596    1.000000
19          P         4     9.4390    0.038109
20          P         4     2.0020    0.209480
21          P         4     0.5456    0.508557
22          P         4     0.1517    0.468842
23          P         5     0.1517    1.000000
24          D         6     0.5500    1.000000
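The missing branch for SP/L shells could look something like the sketch below. It assumes that an L line carries four fields (index, exponent, S coefficient, P coefficient). The extra column name 'CONTR COEF P' is a hypothetical choice, and the numbers in the L block are invented placeholders, not a real basis set.

```python
import pandas as pd

# Hard-coded sample so the sketch runs without fetching a basis set
text = """L   2
1         3.000000E+00          -1.000000E-01           2.000000E-01
2         7.000000E-01           1.100000E+00           9.000000E-01
S   1
1         1.596000E-01           1.000000E+00
"""

shell_number = 0
shell_type = ''
rows = []

for line in text.splitlines():
    parts = line.split()
    if len(parts) == 2:
        # Shell header: letter and number of primitives
        shell_type = parts[0]
        shell_number += 1
    elif len(parts) == 3:
        # Ordinary shell: no second coefficient, so store NaN in that column
        rows.append((shell_type, shell_number,
                     float(parts[1]), float(parts[2]), float('nan')))
    elif len(parts) == 4:
        # L (SP) shell: exponent, S coefficient, P coefficient
        rows.append((shell_type, shell_number,
                     float(parts[1]), float(parts[2]), float(parts[3])))

df = pd.DataFrame(rows, columns=['SHELL TYPE', 'SHELL NO', 'EXPONENT',
                                 'CONTR COEF', 'CONTR COEF P'])
print(df)
```

Keeping one dataframe with a NaN-filled extra column is only one option; L shells could equally be collected into a separate list of rows.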

After reading the entire atomic basis set, the dataframe will have to be transformed into an actual basis set. After that, it can be used in a calculation.
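One possible way to do that transformation is to group the rows by shell number and turn each group into NumPy arrays. The sketch below also scales each coefficient by a primitive Gaussian normalization factor. That scaling is an inference, not something stated in the thread: the unparenthesized coefficient column in the question's listing appears to be consistent with it for the S, P and D rows checked. The helper names and the `L_OF` mapping are hypothetical, and the sample rows are copied from the tables above with the shells renumbered.

```python
import math

import numpy as np
import pandas as pd

L_OF = {'S': 0, 'P': 1, 'D': 2, 'F': 3}   # angular momentum per shell letter


def double_factorial(n):
    """Return n!! with the convention that (-1)!! = 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def primitive_norm(alpha, l):
    """Normalization of x^l * exp(-alpha * r^2) (assumed convention)."""
    return ((2.0 * alpha / np.pi) ** 0.75
            * (4.0 * alpha) ** (l / 2.0)
            / math.sqrt(double_factorial(2 * l - 1)))


df = pd.DataFrame(
    [('S', 1, 6665.0, 0.000692), ('S', 1, 1000.0, 0.005329),
     ('P', 2, 0.1517, 1.0), ('D', 3, 0.55, 1.0)],
    columns=['SHELL TYPE', 'SHELL NO', 'EXPONENT', 'CONTR COEF'])

shells = []
for shell_no, group in df.groupby('SHELL NO'):
    kind = group['SHELL TYPE'].iloc[0]
    exps = group['EXPONENT'].to_numpy()
    coefs = group['CONTR COEF'].to_numpy()
    shells.append({
        'type': kind,
        'exponents': exps,
        'coefficients': coefs,
        # Coefficient times primitive normalization, element by element
        'scaled': coefs * primitive_norm(exps, L_OF[kind]),
    })

for s in shells:
    print(s['type'], s['exponents'], s['scaled'])
```

L shells are not covered by `L_OF` here; they would need the S and P coefficients scaled separately.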