How to align a text file with awk in Python?

I have this array:

dihedrals=['na-2e-na-cd   4    1.200       180.000           2.000', 'Pd-2e-na-cd   4    1.200       180.000           2.000', 'Pd-2e-na-ca  4    1.200       180.000           2.000', 'Pd-4n-na-hn   4    4.800         0.000           2.000', 'na-4n-cc-cc   4    4.200       180.000           2.000', 'na-2e-na-ca   4    1.200       180.000           2.000', 'Pd-2e-na-ca   4    1.200       180.000           2.000', 'cc-4n-na-hn   4    4.800         0.000           2.000', 'Pd-4n-na-cd   4    4.800         0.000           2.000', 'Pd-2e-na-cc   4    1.200       180.000           2.000', 'X -4n-na-X   2    3.400       180.000           2.000', 'Pd-4n-cc-h4   4    4.200       180.000           2.000', 'Pd-4n-cc-cc   4    4.200       180.000           2.000', 'na-2e-na-cd  4    1.200       180.000           2.000', 'na-2e-na-cc  4    1.200       180.000           2.000', 'cc-4n-na-cd   4    4.800         0.000           2.000', 'na-2e-na-ca  4    1.200       180.000           2.000', 'Pd-2e-na-cc  4    1.200       180.000           2.000', 'na-2e-na-cc   4    1.200       180.000           2.000', 'Pd-2e-na-cd  4    1.200       180.000           2.000', 'na-4n-cc-h4   4    4.200       180.000           2.000']

and I want to write it in a file like this:

na-2e-na-cd   4    1.200       180.000           2.000
Pd-2e-na-cd   4    1.200       180.000           2.000
Pd-4n-na-hn   4    4.800         0.000           2.000
na-4n-cc-cc   4    4.200       180.000           2.000
na-2e-na-ca   4    1.200       180.000           2.000
cc-4n-na-hn   4    4.800         0.000           2.000
Pd-4n-na-cd   4    4.800         0.000           2.000
Pd-2e-na-cc   4    1.200       180.000           2.000
X -4n-na-X    2    3.400       180.000           2.000
Pd-4n-cc-h4   4    4.200       180.000           2.000
Pd-4n-cc-cc   4    4.200       180.000           2.000
na-2e-na-cc   4    1.200       180.000           2.000
cc-4n-na-cd   4    4.800         0.000           2.000
na-4n-cc-h4   4    4.200       180.000           2.000

I tried:

!awk '{print $1"   "$2"    "$3"       "$4"           "$5}' a.txt

But awk sees an extra field in this row: "X -4n-na-X ", because there is a space after the X. I tried changing the field separator to two spaces with -F="[[:space:]][[:space:]] ":

import os
for x in range(len(dihedrals)):
    dihedrals[x]=os.popen('echo "{}" |awk -F="[[:space:]][[:space:]] "  \'{{  printf "%0s %0s %0s %0s %0s",$1,$2,$3,$4,$5,$6}}\'  '.format(dihedrals[x])).read()
    print(dihedrals[x])

But nothing changed. I also tried printf %s:

import os
for x in range(len(dihedrals)):
    dihedrals[x]=os.popen('echo "{}"|awk \'{{printf "%0s %3s %8s s s",$1,$2,$3,$4,$5}}\'  '.format(dihedrals[x])).read()

But again it didn't work. How can I write my variable into a file as I explained above?

I also tried Python formatting, regex, etc., but I couldn't get it to work.

NOTE: I also tried column -t a.txt, but I run into the same problem with the row that has a space after X (X -4n-na-X). Here is the result:

na-2e-na-cd  4         1.200  180.000  2.000
Pd-2e-na-cd  4         1.200  180.000  2.000
Pd-2e-na-ca  4         1.200  180.000  2.000
Pd-4n-na-hn  4         4.800  0.000    2.000
na-4n-cc-cc  4         4.200  180.000  2.000
na-2e-na-ca  4         1.200  180.000  2.000
Pd-2e-na-ca  4         1.200  180.000  2.000
cc-4n-na-hn  4         4.800  0.000    2.000
Pd-4n-na-cd  4         4.800  0.000    2.000
Pd-2e-na-cc  4         1.200  180.000  2.000
X            -4n-na-X  2      3.400    180.000  2.000
Pd-4n-cc-h4  4         4.200  180.000  2.000
Pd-4n-cc-cc  4         4.200  180.000  2.000
na-2e-na-cd  4         1.200  180.000  2.000
na-2e-na-cc  4         1.200  180.000  2.000
cc-4n-na-cd  4         4.800  0.000    2.000
na-2e-na-ca  4         1.200  180.000  2.000
Pd-2e-na-cc  4         1.200  180.000  2.000
na-2e-na-cc  4         1.200  180.000  2.000
Pd-2e-na-cd  4         1.200  180.000  2.000

CodePudding user response：

You can use formatted output in Python for this array. We just need to split each line on runs of two or more spaces to get the individual fields.

import re

dihedrals=['na-2e-na-cd   4    1.200       180.000           2.000', 'Pd-2e-na-cd   4    1.200       180.000           2.000', 'Pd-2e-na-ca  4    1.200       180.000           2.000', 'Pd-4n-na-hn   4    4.800         0.000           2.000', 'na-4n-cc-cc   4    4.200       180.000           2.000', 'na-2e-na-ca   4    1.200       180.000           2.000', 'Pd-2e-na-ca   4    1.200       180.000           2.000', 'cc-4n-na-hn   4    4.800         0.000           2.000', 'Pd-4n-na-cd   4    4.800         0.000           2.000', 'Pd-2e-na-cc   4    1.200       180.000           2.000', 'X -4n-na-X   2    3.400       180.000           2.000', 'Pd-4n-cc-h4   4    4.200       180.000           2.000', 'Pd-4n-cc-cc   4    4.200       180.000           2.000', 'na-2e-na-cd  4    1.200       180.000           2.000', 'na-2e-na-cc  4    1.200       180.000           2.000', 'cc-4n-na-cd   4    4.800         0.000           2.000', 'na-2e-na-ca  4    1.200       180.000           2.000', 'Pd-2e-na-cc  4    1.200       180.000           2.000', 'na-2e-na-cc   4    1.200       180.000           2.000', 'Pd-2e-na-cd  4    1.200       180.000           2.000', 'na-4n-cc-h4   4    4.200       180.000           2.000']
for i in dihedrals:
    a = re.split(' {2,}', i)
    print("%-11s  %2s   %8s   %12s  %12s" % (a[0], a[1], a[2], a[3], a[4]))

Output:

na-2e-na-cd   4      1.200        180.000         2.000
Pd-2e-na-cd   4      1.200        180.000         2.000
Pd-2e-na-ca   4      1.200        180.000         2.000
Pd-4n-na-hn   4      4.800          0.000         2.000
na-4n-cc-cc   4      4.200        180.000         2.000
na-2e-na-ca   4      1.200        180.000         2.000
Pd-2e-na-ca   4      1.200        180.000         2.000
cc-4n-na-hn   4      4.800          0.000         2.000
Pd-4n-na-cd   4      4.800          0.000         2.000
Pd-2e-na-cc   4      1.200        180.000         2.000
X -4n-na-X    2      3.400        180.000         2.000
Pd-4n-cc-h4   4      4.200        180.000         2.000
Pd-4n-cc-cc   4      4.200        180.000         2.000
na-2e-na-cd   4      1.200        180.000         2.000
na-2e-na-cc   4      1.200        180.000         2.000
cc-4n-na-cd   4      4.800          0.000         2.000
na-2e-na-ca   4      1.200        180.000         2.000
Pd-2e-na-cc   4      1.200        180.000         2.000
na-2e-na-cc   4      1.200        180.000         2.000
Pd-2e-na-cd   4      1.200        180.000         2.000
na-4n-cc-h4   4      4.200        180.000         2.000
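As a quick check of why the separator matters, here is a minimal sketch comparing plain whitespace splitting (the default behaviour of awk and column -t) with splitting on runs of two or more spaces, using the "X -4n-na-X" row from the question. The variable names are illustrative only.

```python
import re

row = 'X -4n-na-X   2    3.400       180.000           2.000'

# Default whitespace splitting breaks the first column in two.
by_whitespace = row.split()

# Splitting on runs of two or more spaces keeps "X -4n-na-X" intact.
by_double_space = re.split(r' {2,}', row)

print(len(by_whitespace), by_whitespace[:2])
print(len(by_double_space), by_double_space[0])
```

The first split yields six fields and the second yields five, which is why the single space inside the first column has to be excluded from the separator.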

A gnu-awk solution would be:

... |
awk -F ' {2,}' -v RS=', *|\\]' '
gsub(/dihedrals=\[|\047/, "") {
   printf( "%-11s  %2s   %8s   %12s  %12s\n", $1, $2, $3, $4, $5)
}'

CodePudding user response：

Assumptions:

1st column always consists of 11 characters

I don't use python so I'll simulate the behavior (python making repeated calls out to awk) with a bash array and a bash/for loop that calls awk:

Setup:

declare -a dihedrals=([0]="na-2e-na-cd   4    1.200       180.000           2.000" [1]="Pd-2e-na-cd   4    1.200       180.000           2.000" [2]="Pd-2e-na-ca  4    1.200       180.000           2.000" [3]="Pd-4n-na-hn   4    4.800         0.000           2.000" [4]="na-4n-cc-cc   4    4.200       180.000           2.000" [5]="na-2e-na-ca   4    1.200       180.000           2.000" [6]="Pd-2e-na-ca   4    1.200       180.000           2.000" [7]="cc-4n-na-hn   4    4.800         0.000           2.000" [8]="Pd-4n-na-cd   4    4.800         0.000           2.000" [9]="Pd-2e-na-cc   4    1.200       180.000           2.000" [10]="X -4n-na-X   2    3.400       180.000           2.000" [11]="Pd-4n-cc-h4   4    4.200       180.000           2.000" [12]="Pd-4n-cc-cc   4    4.200       180.000           2.000" [13]="na-2e-na-cd  4    1.200       180.000           2.000" [14]="na-2e-na-cc  4    1.200       180.000           2.000" [15]="cc-4n-na-cd   4    4.800         0.000           2.000" [16]="na-2e-na-ca  4    1.200       180.000           2.000" [17]="Pd-2e-na-cc  4    1.200       180.000           2.000" [18]="na-2e-na-cc   4    1.200       180.000           2.000" [19]="Pd-2e-na-cd  4    1.200       180.000           2.000" [20]="na-4n-cc-h4   4    4.200       180.000           2.000")

Proposed code:

for x in "${dihedrals[@]}"
do
    awk '{ f1=substr($0,1,11)
           split(substr($0,12),a)
           printf "%-11s %2s %7s %12s %13s\n",f1,a[1],a[2],a[3],a[4]}' <<< "${x}"
done

This generates:

na-2e-na-cd  4   1.200      180.000         2.000
Pd-2e-na-cd  4   1.200      180.000         2.000
Pd-2e-na-ca  4   1.200      180.000         2.000
Pd-4n-na-hn  4   4.800        0.000         2.000
na-4n-cc-cc  4   4.200      180.000         2.000
na-2e-na-ca  4   1.200      180.000         2.000
Pd-2e-na-ca  4   1.200      180.000         2.000
cc-4n-na-hn  4   4.800        0.000         2.000
Pd-4n-na-cd  4   4.800        0.000         2.000
Pd-2e-na-cc  4   1.200      180.000         2.000
X -4n-na-X   2   3.400      180.000         2.000
Pd-4n-cc-h4  4   4.200      180.000         2.000
Pd-4n-cc-cc  4   4.200      180.000         2.000
na-2e-na-cd  4   1.200      180.000         2.000
na-2e-na-cc  4   1.200      180.000         2.000
cc-4n-na-cd  4   4.800        0.000         2.000
na-2e-na-ca  4   1.200      180.000         2.000
Pd-2e-na-cc  4   1.200      180.000         2.000
na-2e-na-cc  4   1.200      180.000         2.000
Pd-2e-na-cd  4   1.200      180.000         2.000
na-4n-cc-h4  4   4.200      180.000         2.000

From a performance perspective, I'd expect the same (awk) logic to be doable within Python itself, which would eliminate the repeated calls out to awk.
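For what it's worth, a rough Python equivalent of the substr()/split() idea might look like the sketch below. It relies on the same assumption as the awk code (the first column occupies exactly 11 characters) and expects four whitespace-separated fields after it. The function name align_fixed and the column widths are made up for illustration.

```python
# Sample rows copied from the question; the first column is assumed to be
# exactly 11 characters wide, as in the awk answer above.
rows = [
    'na-2e-na-cd   4    1.200       180.000           2.000',
    'Pd-2e-na-ca  4    1.200       180.000           2.000',
    'X -4n-na-X   2    3.400       180.000           2.000',
    'Pd-4n-na-hn   4    4.800         0.000           2.000',
]

def align_fixed(line, first_width=11):
    """Mirror the awk logic: slice out column 1, whitespace-split the rest."""
    first = line[:first_width]
    rest = line[first_width:].split()
    # Assumes exactly four remaining fields; the widths are illustrative.
    return "%-*s %2s %7s %12s %13s" % (first_width, first, *rest)

for row in rows:
    print(align_fixed(row))
```

Because everything happens in one process, no shell or awk call is made per row.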

CodePudding user response：

Do you have to use awk? It seems like something along these lines would accomplish the same goal in plain Python:

with open('a.txt', 'w') as fp:
  fp.write('\n'.join(dihedrals))
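The snippet above writes the rows unchanged. If the columns also need to line up, one possible sketch is to split each row on runs of two or more spaces and pad the fields before writing. The helper names format_row and write_aligned are hypothetical, and the widths are assumptions picked to resemble the layout shown in the question.

```python
import re

# A few rows from the question; the full list would work the same way.
dihedrals = [
    'na-2e-na-cd   4    1.200       180.000           2.000',
    'Pd-2e-na-ca  4    1.200       180.000           2.000',
    'Pd-4n-na-hn   4    4.800         0.000           2.000',
    'X -4n-na-X   2    3.400       180.000           2.000',
]

def format_row(line):
    """Split on runs of 2+ spaces, then pad each field to a fixed width."""
    name, count, *values = re.split(r' {2,}', line.strip())
    # Assumes three numeric fields follow; widths are illustrative.
    return "%-11s   %s %8s %13s %15s" % (name, count, *values)

def write_aligned(rows, path):
    """Write one aligned row per line to the given output file."""
    with open(path, 'w') as fp:
        fp.write('\n'.join(format_row(r) for r in rows) + '\n')

write_aligned(dihedrals, 'a.txt')
```

Rows that differ only in their original spacing come out identical here, so any further clean-up (such as dropping repeats) could be done on the formatted strings.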