Pandas magic with ugly CSV format-CodePudding

An ancient atomic simulation software is producing really ugly CSV file, which I want to import to pandas dataframe. The format looks like this:

ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
491
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e 00 2.8000000000000000e 01
0.0000000000000000e 00 2.8000000000000000e 01
0.0000000000000000e 00 2.8000000000000000e 01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0.0714286 0.142857
3 1 0.0714286 0 0.214286
4 1 0 0.0714286 0.214286
...
491 1 2 2.3 0.4
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
491
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e 00 2.8000000000000000e 01
0.0000000000000000e 00 2.8000000000000000e 01
0.0000000000000000e 00 2.8000000000000000e 01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0.0714286 0.142857
3 1 0.0714286 0 0.214286
4 1 0 0.0714286 0.214286
...
491 215 0.4 12.4 2.4
...
...
ITEM: TIMESTEP
1002
...

Basically it is a repeating header with the information of the iteration number. To me it seems that the easiest way to turn this into pandas would be df with the variables ['id', 'type', 'xs', 'ys', 'zs'] and add to it a new column "TIMESTEP", so it would be a nice 2D df. Alternatively could be a multi-index array Timestep 1 -> internal_df['id', 'type', 'xs', 'ys', 'zs']

The information lines (1-9) can be deleted.

The end result would ideally look like this:

    Index   a   b           c           d           TIMESTEP
    1       1   0           0           0.142857    0
    2       1   0.0714286   0.0714286   0.142857    0
    3       1   0.0714286   0           0.214286    0
    4       1   0           0.0714286   0.214286    0
    5       1   0.142857    0           0.142857    0
    ...
    474     1   0.78636     0.788005    0.425791    100002

Would you suggest a string-formatting script (example would be appreciated), or maybe Pandas read_csv with a smart set of settings could do it out of the box?

EDITED: Added true INFORMATION bit of the header, which is to be discarded (bit from "Number of Entries" to "Variables" line)

Cheers

CodePudding user response：

With pandas, here is one way on how you can approach that (to give you just the general logic).

#pip install pandas
import pandas as pd 
import numpy as np

df = pd.read_csv(StringIO(s), sep="/", header=None)

m1 = df[0].str.contains("TIMESTEP")
m2 = df[0].str.contains("Information").fillna(False)
m3 = df[0].str.contains("ITEM|Variables|\ ", regex=True).fillna(False)

conds, vals = [m1|m1.shift(1).fillna(False), m2|m3], ["DATA", "TO_SKIP"]

out = (
        df
          .assign(flag= pd.Series(np.select(conds, vals, None)).bfill().ffill())
          .pivot(columns="flag", values=0)
          .loc[:, "DATA"].dropna()
          .str.split(expand=True)
          .assign(col= lambda x: x[0].shift(-1).where(x[1].str.contains("TIMESTEP")).ffill())
          .set_axis(["Index", "a", "b", "c", "d", "TIMESTEP"], axis=1)
          .dropna(how="any")
          .reset_index(drop=True)
        )

Output :

print(out)

  Index    a          b          c         d TIMESTEP
0     1    1          0          0  0.142857        0
1     2    1  0.0714286  0.0714286  0.142857        0
2     3    1  0.0714286          0  0.214286        0
3     4    1          0  0.0714286  0.214286        0
4   491    1          2        2.3       0.4        0
5     1    1          0          0  0.142857        0
6     2    1  0.0714286  0.0714286  0.142857        0
7     3    1  0.0714286          0  0.214286        0
8     4    1          0  0.0714286  0.214286        0
9   491  215        0.4       12.4       2.4        0

Disclaimer: This approach may not be efficient for large files.