More pythonic way to write DataFrame to a file


I have a large DataFrame of coordinates, and I have to write its contents to an input file for another program. At the moment I use a nested, fairly complex for-loop with many .write() calls, and writing the file takes a long time. I can't figure out a smarter way to do it.

DataFrame

type    Uiso       x         y         z
H       0.0320     0.257510  0.254363  0.021930
H       0.0330     0.255228  0.163941  0.038431
H       0.0330     0.255228  0.163941  0.038431
C       0.0278     0.122879  0.207314  0.027545
H       0.0320     0.534974  0.254363  0.021930

Input

  • First line: atom type.
  • Second line: the number of atoms of that type in the structure, the atomic number (H: 1, C: 6, etc.), the occupancy (always 1.000), and Uiso.
  • Remaining lines: the x y z coordinates, one atom per line.

Repeat for each atom type in DataFrame.

  H
    4 1  1.000 0.0320
      0.257510  0.254363  0.021930
      0.255228  0.163941  0.038431
      0.255228  0.163941  0.038431
      0.534974  0.254363  0.021930
  C
    1 6  1.000 0.0278
      0.122879  0.207314  0.027545
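(The count at the start of each second line, 4 for H and 1 for C, is just the number of rows of that type in the DataFrame; a quick sketch with the sample data above:)

```python
import pandas as pd

# Sample data from the table above (coordinate columns omitted).
df = pd.DataFrame({'type': ['H', 'H', 'H', 'C', 'H']})

# Count of each atom type -- the first number on the second line.
counts = df['type'].value_counts()
print(counts['H'])  # 4
print(counts['C'])  # 1
```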

Code:

f = open(file_name + ".inp", "w")

for type in structure["type"].unique():
    for Uiso in structure[structure["type"] == type]["Uiso"].unique():
        f.write("  " + type + "\n")
        f.write("    " + str(len(list(structure[(structure["type"] == type) & (structure["Uiso"] == Uiso)][["x", "y", "z"]].iterrows()))) + " ")
        f.write(str(atomic_numbers[type]))
        f.write("  1.000 ")
        f.write(str(Uiso) + "\n")
        for coord in structure[(structure["type"] == type) & (structure["Uiso"] == Uiso)].itertuples():
            f.write("      " +
                    str(coord.x) + " " +
                    str(coord.y) + " " +
                    str(coord.z) + "\n")
f.close()

Question

Is there a way to reduce the number of loops? Maybe write whole sorted portions of the DataFrame at once?

CodePudding user response:

There are lots of ways to approach this, e.g. making a template string then filling it with format(), but your case might be compact enough to do it on the fly, so to speak.
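(A minimal sketch of that template route, filling a format string with per-group values:)

```python
# Build the two header lines once as a template, then fill in
# the values for each group with str.format().
template = "{atom}\n  {count} {z}  1.000 {uiso:.4f}\n"

header = template.format(atom='H', count=4, z=1, uiso=0.0320)
print(header, end='')
# H
#   4 1  1.000 0.0320
```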

For example, this does more or less what you want for a DataFrame df:

import numpy as np

with open('data.inp', 'wt') as f:
    for atom, group in df.groupby('type'):
        
        # Write the header info using an f-string.
        f.write(f"{atom}\n  {len(group)} 1  1.000 {group['Uiso'].iloc[0]:.4f}\n")
        
        # Write the data table using savetxt for fixed-width columns.
        data = group.loc[:, ['x', 'y', 'z']]
        np.savetxt(f, data, fmt='%10.6f', newline='\n')

This produces:

C
  1 1  1.000 0.0278
  0.122879   0.207314   0.027545
H
  4 1  1.000 0.0320
  0.257510   0.254363   0.021930
  0.255228   0.163941   0.038431
  0.255228   0.163941   0.038431
  0.534974   0.254363   0.021930

Some notes:

  • Always use a context manager for file reading and writing (the with block); it guarantees the file is closed even if an error occurs.
  • I tried using DataFrame.to_csv(), but I don't think it can produce fixed-width columns, whereas np.savetxt() can.
  • I sort of guessed/made up the info on the second line of each record; you might need to change some of that.
  • You could change the sort order by throwing sorted() around the groupby using some appropriate key (e.g. a table to look up atomic number or something).
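Putting the last bullet into code, the sorted-groupby idea might look like this (a sketch; the atomic_numbers lookup table is hypothetical, taken from the variable name in the question):

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table mapping atom type to atomic number,
# mirroring the atomic_numbers dict used in the question.
atomic_numbers = {'H': 1, 'C': 6}

# Sample data from the question.
df = pd.DataFrame({
    'type': ['H', 'H', 'H', 'C', 'H'],
    'Uiso': [0.0320, 0.0330, 0.0330, 0.0278, 0.0320],
    'x':    [0.257510, 0.255228, 0.255228, 0.122879, 0.534974],
    'y':    [0.254363, 0.163941, 0.163941, 0.207314, 0.254363],
    'z':    [0.021930, 0.038431, 0.038431, 0.027545, 0.021930],
})

with open('data.inp', 'wt') as f:
    # Sort the groups by atomic number rather than alphabetically,
    # so H (Z=1) is written before C (Z=6).
    for atom, group in sorted(df.groupby('type'),
                              key=lambda item: atomic_numbers[item[0]]):
        f.write(f"{atom}\n  {len(group)} {atomic_numbers[atom]}"
                f"  1.000 {group['Uiso'].iloc[0]:.4f}\n")
        np.savetxt(f, group[['x', 'y', 'z']], fmt='%10.6f')
```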