How to align text file with awk in python?

Time:05-04

I have this array:

dihedrals=['na-2e-na-cd   4    1.200       180.000           2.000', 'Pd-2e-na-cd   4    1.200       180.000           2.000', 'Pd-2e-na-ca  4    1.200       180.000           2.000', 'Pd-4n-na-hn   4    4.800         0.000           2.000', 'na-4n-cc-cc   4    4.200       180.000           2.000', 'na-2e-na-ca   4    1.200       180.000           2.000', 'Pd-2e-na-ca   4    1.200       180.000           2.000', 'cc-4n-na-hn   4    4.800         0.000           2.000', 'Pd-4n-na-cd   4    4.800         0.000           2.000', 'Pd-2e-na-cc   4    1.200       180.000           2.000', 'X -4n-na-X   2    3.400       180.000           2.000', 'Pd-4n-cc-h4   4    4.200       180.000           2.000', 'Pd-4n-cc-cc   4    4.200       180.000           2.000', 'na-2e-na-cd  4    1.200       180.000           2.000', 'na-2e-na-cc  4    1.200       180.000           2.000', 'cc-4n-na-cd   4    4.800         0.000           2.000', 'na-2e-na-ca  4    1.200       180.000           2.000', 'Pd-2e-na-cc  4    1.200       180.000           2.000', 'na-2e-na-cc   4    1.200       180.000           2.000', 'Pd-2e-na-cd  4    1.200       180.000           2.000', 'na-4n-cc-h4   4    4.200       180.000           2.000']

and I want to write it in a file like this:

na-2e-na-cd   4    1.200       180.000           2.000
Pd-2e-na-cd   4    1.200       180.000           2.000
Pd-4n-na-hn   4    4.800         0.000           2.000
na-4n-cc-cc   4    4.200       180.000           2.000
na-2e-na-ca   4    1.200       180.000           2.000
cc-4n-na-hn   4    4.800         0.000           2.000
Pd-4n-na-cd   4    4.800         0.000           2.000
Pd-2e-na-cc   4    1.200       180.000           2.000
X -4n-na-X    2    3.400       180.000           2.000
Pd-4n-cc-h4   4    4.200       180.000           2.000
Pd-4n-cc-cc   4    4.200       180.000           2.000
na-2e-na-cc   4    1.200       180.000           2.000
cc-4n-na-cd   4    4.800         0.000           2.000
na-4n-cc-h4   4    4.200       180.000           2.000

I tried:

!awk '{print $1"   "$2"    "$3"       "$4"           "$5}' a.txt

But awk sees an extra field in this row: "X -4n-na-X", because there is a space after the first X. I tried changing the field separator to two spaces with -F="[[:space:]][[:space:]] ":

import os
for x in range(len(dihedrals)):
    dihedrals[x]=os.popen('echo "{}" |awk -F="[[:space:]][[:space:]] "  \'{{  printf "%0s %0s %0s %0s %0s",$1,$2,$3,$4,$5,$6}}\'  '.format(dihedrals[x])).read()
    print(dihedrals[x])

But nothing changed. I also tried printf %s:

import os
for x in range(len(dihedrals)):
    dihedrals[x]=os.popen('echo "{}"|awk \'{{printf "%0s %3s %8s s s",$1,$2,$3,$4,$5}}\'  '.format(dihedrals[x])).read()

But again it didn't work. How can I write my variable into a file as I explained above?

I also tried Python formatting, regex, etc., but I couldn't get it to work.

NOTE: I also tried column -t a.txt, but I run into the same problem with the row that contains a space (X -4n-na-X). Here is the result:

na-2e-na-cd  4         1.200  180.000  2.000
Pd-2e-na-cd  4         1.200  180.000  2.000
Pd-2e-na-ca  4         1.200  180.000  2.000
Pd-4n-na-hn  4         4.800  0.000    2.000
na-4n-cc-cc  4         4.200  180.000  2.000
na-2e-na-ca  4         1.200  180.000  2.000
Pd-2e-na-ca  4         1.200  180.000  2.000
cc-4n-na-hn  4         4.800  0.000    2.000
Pd-4n-na-cd  4         4.800  0.000    2.000
Pd-2e-na-cc  4         1.200  180.000  2.000
X            -4n-na-X  2      3.400    180.000  2.000
Pd-4n-cc-h4  4         4.200  180.000  2.000
Pd-4n-cc-cc  4         4.200  180.000  2.000
na-2e-na-cd  4         1.200  180.000  2.000
na-2e-na-cc  4         1.200  180.000  2.000
cc-4n-na-cd  4         4.800  0.000    2.000
na-2e-na-ca  4         1.200  180.000  2.000
Pd-2e-na-cc  4         1.200  180.000  2.000
na-2e-na-cc  4         1.200  180.000  2.000
Pd-2e-na-cd  4         1.200  180.000  2.000

CodePudding user response:

You can use formatted output in Python for this array. We just need to split each line on runs of two or more spaces to get the individual fields, so the single space inside "X -4n-na-X" is left alone.

import re

dihedrals=['na-2e-na-cd   4    1.200       180.000           2.000', 'Pd-2e-na-cd   4    1.200       180.000           2.000', 'Pd-2e-na-ca  4    1.200       180.000           2.000', 'Pd-4n-na-hn   4    4.800         0.000           2.000', 'na-4n-cc-cc   4    4.200       180.000           2.000', 'na-2e-na-ca   4    1.200       180.000           2.000', 'Pd-2e-na-ca   4    1.200       180.000           2.000', 'cc-4n-na-hn   4    4.800         0.000           2.000', 'Pd-4n-na-cd   4    4.800         0.000           2.000', 'Pd-2e-na-cc   4    1.200       180.000           2.000', 'X -4n-na-X   2    3.400       180.000           2.000', 'Pd-4n-cc-h4   4    4.200       180.000           2.000', 'Pd-4n-cc-cc   4    4.200       180.000           2.000', 'na-2e-na-cd  4    1.200       180.000           2.000', 'na-2e-na-cc  4    1.200       180.000           2.000', 'cc-4n-na-cd   4    4.800         0.000           2.000', 'na-2e-na-ca  4    1.200       180.000           2.000', 'Pd-2e-na-cc  4    1.200       180.000           2.000', 'na-2e-na-cc   4    1.200       180.000           2.000', 'Pd-2e-na-cd  4    1.200       180.000           2.000', 'na-4n-cc-h4   4    4.200       180.000           2.000']
for i in dihedrals:
    # split on two or more spaces so "X -4n-na-X" stays one field
    a = re.split(' {2,}', i)
    print( "%-11s  %2s   %8s   %12s  %12s" % (a[0], a[1], a[2], a[3], a[4]) )

Output:

na-2e-na-cd   4      1.200        180.000         2.000
Pd-2e-na-cd   4      1.200        180.000         2.000
Pd-2e-na-ca   4      1.200        180.000         2.000
Pd-4n-na-hn   4      4.800          0.000         2.000
na-4n-cc-cc   4      4.200        180.000         2.000
na-2e-na-ca   4      1.200        180.000         2.000
Pd-2e-na-ca   4      1.200        180.000         2.000
cc-4n-na-hn   4      4.800          0.000         2.000
Pd-4n-na-cd   4      4.800          0.000         2.000
Pd-2e-na-cc   4      1.200        180.000         2.000
X -4n-na-X    2      3.400        180.000         2.000
Pd-4n-cc-h4   4      4.200        180.000         2.000
Pd-4n-cc-cc   4      4.200        180.000         2.000
na-2e-na-cd   4      1.200        180.000         2.000
na-2e-na-cc   4      1.200        180.000         2.000
cc-4n-na-cd   4      4.800          0.000         2.000
na-2e-na-ca   4      1.200        180.000         2.000
Pd-2e-na-cc   4      1.200        180.000         2.000
na-2e-na-cc   4      1.200        180.000         2.000
Pd-2e-na-cd   4      1.200        180.000         2.000
na-4n-cc-h4   4      4.200        180.000         2.000
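To see why the two-or-more-spaces split matters, here is a minimal check on the problem row. It is only an illustration; the variable names are made up for this sketch. Plain whitespace splitting (which is also awk's default behaviour) cuts the name in two, while the regex split keeps it whole.

```python
import re

row = "X -4n-na-X   2    3.400       180.000           2.000"

# Splitting on any whitespace breaks the first column into "X" and "-4n-na-X".
by_whitespace = row.split()

# Splitting only on runs of two or more spaces keeps "X -4n-na-X" as one field.
by_double_space = re.split(r" {2,}", row)

print(len(by_whitespace), by_whitespace)
print(len(by_double_space), by_double_space)
```

The first split yields six fields and the second yields five, which is the difference that trips up awk and column -t in the question.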

A gnu-awk solution would be:

... |
awk -F ' {2,}' -v RS=', *|\\]' '
gsub(/dihedrals=\[|\047/, "") {
   printf( "%-11s  %2s   %8s   %12s  %12s\n", $1, $2, $3, $4, $5)
}'

CodePudding user response:

Assumptions:

  • 1st column always consists of 11 characters

I don't use python so I'll simulate the behavior (python making repeated calls out to awk) with a bash array and a bash/for loop that calls awk:

Setup:

declare -a dihedrals=([0]="na-2e-na-cd   4    1.200       180.000           2.000" [1]="Pd-2e-na-cd   4    1.200       180.000           2.000" [2]="Pd-2e-na-ca  4    1.200       180.000           2.000" [3]="Pd-4n-na-hn   4    4.800         0.000           2.000" [4]="na-4n-cc-cc   4    4.200       180.000           2.000" [5]="na-2e-na-ca   4    1.200       180.000           2.000" [6]="Pd-2e-na-ca   4    1.200       180.000           2.000" [7]="cc-4n-na-hn   4    4.800         0.000           2.000" [8]="Pd-4n-na-cd   4    4.800         0.000           2.000" [9]="Pd-2e-na-cc   4    1.200       180.000           2.000" [10]="X -4n-na-X   2    3.400       180.000           2.000" [11]="Pd-4n-cc-h4   4    4.200       180.000           2.000" [12]="Pd-4n-cc-cc   4    4.200       180.000           2.000" [13]="na-2e-na-cd  4    1.200       180.000           2.000" [14]="na-2e-na-cc  4    1.200       180.000           2.000" [15]="cc-4n-na-cd   4    4.800         0.000           2.000" [16]="na-2e-na-ca  4    1.200       180.000           2.000" [17]="Pd-2e-na-cc  4    1.200       180.000           2.000" [18]="na-2e-na-cc   4    1.200       180.000           2.000" [19]="Pd-2e-na-cd  4    1.200       180.000           2.000" [20]="na-4n-cc-h4   4    4.200       180.000           2.000")

Proposed code:

for x in "${dihedrals[@]}"
do
    awk '{ f1=substr($0,1,11)
           split(substr($0,12),a)
           printf "%-11s %2s %7s %12s %13s\n",f1,a[1],a[2],a[3],a[4]}' <<< "${x}"
done

This generates:

na-2e-na-cd  4   1.200      180.000         2.000
Pd-2e-na-cd  4   1.200      180.000         2.000
Pd-2e-na-ca  4   1.200      180.000         2.000
Pd-4n-na-hn  4   4.800        0.000         2.000
na-4n-cc-cc  4   4.200      180.000         2.000
na-2e-na-ca  4   1.200      180.000         2.000
Pd-2e-na-ca  4   1.200      180.000         2.000
cc-4n-na-hn  4   4.800        0.000         2.000
Pd-4n-na-cd  4   4.800        0.000         2.000
Pd-2e-na-cc  4   1.200      180.000         2.000
X -4n-na-X   2   3.400      180.000         2.000
Pd-4n-cc-h4  4   4.200      180.000         2.000
Pd-4n-cc-cc  4   4.200      180.000         2.000
na-2e-na-cd  4   1.200      180.000         2.000
na-2e-na-cc  4   1.200      180.000         2.000
cc-4n-na-cd  4   4.800        0.000         2.000
na-2e-na-ca  4   1.200      180.000         2.000
Pd-2e-na-cc  4   1.200      180.000         2.000
na-2e-na-cc  4   1.200      180.000         2.000
Pd-2e-na-cd  4   1.200      180.000         2.000
na-4n-cc-h4  4   4.200      180.000         2.000

From a performance perspective, I'd expect the same logic (take the first 11 characters as the name, then split the rest on whitespace) to be doable directly in Python, which would eliminate the repeated calls out to awk.
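A rough Python sketch of the same fixed-width idea follows. It relies on the same assumption as the awk version (the first column occupies the first 11 characters), and the function name align_fixed and the sample list are hypothetical, not part of the original answer.

```python
def align_fixed(line, name_width=11):
    """Slice off a fixed-width name, then whitespace-split the remaining columns."""
    name = line[:name_width]          # like substr($0, 1, 11) in awk
    rest = line[name_width:].split()  # like split(substr($0, 12), a)
    # Assumes exactly four numeric columns follow the name.
    return "{:<11} {:>2} {:>7} {:>12} {:>13}".format(name, *rest)


# Small sample taken from the question's list.
sample = [
    "na-2e-na-cd   4    1.200       180.000           2.000",
    "Pd-2e-na-ca  4    1.200       180.000           2.000",
    "X -4n-na-X   2    3.400       180.000           2.000",
]

for line in sample:
    print(align_fixed(line))
```

Because everything happens in one process, no shell or awk is started per line. A row with a name longer than 11 characters would break the slice, just as it would in the awk version.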

CodePudding user response:

Do you have to use awk? It seems like something along these lines would accomplish the same goal in plain Python:

with open('a.txt', 'w') as fp:
  fp.write('\n'.join(dihedrals))
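Note that the snippet above writes the strings unchanged, so rows with inconsistent spacing stay inconsistent. A possible variant that normalizes each row before writing is sketched below. The helper names format_row and write_aligned are hypothetical, and the column widths are an assumption chosen to reproduce the aligned output shown in the first answer.

```python
import re


def format_row(line):
    """Split on runs of 2+ spaces and re-emit the five fields with fixed widths."""
    fields = re.split(r" {2,}", line.strip())
    # Raises TypeError if the row does not have exactly five fields.
    return "%-11s  %2s   %8s   %12s  %12s" % tuple(fields)


def write_aligned(rows, path):
    """Write one aligned row per line to the file at path."""
    with open(path, "w") as fp:
        for row in rows:
            fp.write(format_row(row) + "\n")


if __name__ == "__main__":
    # Small sample taken from the question's list; a.txt is the question's file name.
    sample = [
        "Pd-2e-na-ca  4    1.200       180.000           2.000",
        "X -4n-na-X   2    3.400       180.000           2.000",
    ]
    write_aligned(sample, "a.txt")
```

This stays in plain Python, as the answer suggests, and only adds the splitting and formatting step before the write.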