Pandas magic with ugly CSV format

Time:02-02

An ancient atomic simulation program produces a really ugly CSV file, which I want to import into a pandas DataFrame. The format looks like this:

ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
491
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0.0714286 0.142857
3 1 0.0714286 0 0.214286
4 1 0 0.0714286 0.214286
...
491 1 2 2.3 0.4
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
491
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0.0714286 0.142857
3 1 0.0714286 0 0.214286
4 1 0 0.0714286 0.214286
...
491 215 0.4 12.4 2.4
...
...
ITEM: TIMESTEP
1002
...

Basically, it is a repeating header that carries the iteration number, followed by the data for that iteration. To me, the easiest way to turn this into pandas seems to be a df with the variables ['id', 'type', 'xs', 'ys', 'zs'] plus a new column "TIMESTEP", so it would be a nice 2D df. Alternatively, it could be a multi-indexed structure: Timestep 1 -> internal_df['id', 'type', 'xs', 'ys', 'zs']

The information lines (1-9) can be deleted.

The end result would ideally look like this:

    Index   a   b           c           d           TIMESTEP
    1       1   0           0           0.142857    0
    2       1   0.0714286   0.0714286   0.142857    0
    3       1   0.0714286   0           0.214286    0
    4       1   0           0.0714286   0.214286    0
    5       1   0.142857    0           0.142857    0
    ...
    474     1   0.78636     0.788005    0.425791    100002

Would you suggest a string-formatting script (example would be appreciated), or maybe Pandas read_csv with a smart set of settings could do it out of the box?

EDITED: Added true INFORMATION bit of the header, which is to be discarded (bit from "Number of Entries" to "Variables" line)

Cheers

CodePudding user response:

With pandas, here is one way to approach this (just to give you the general logic).

# pip install pandas
from io import StringIO

import numpy as np
import pandas as pd

# `s` is assumed to hold the raw text of the file; no line contains "/",
# so every line ends up as one string in column 0
df = pd.read_csv(StringIO(s), sep="/", header=None)

# m1: TIMESTEP headers; m2/m3: header and box-bounds lines to discard
m1 = df[0].str.contains("TIMESTEP")
m2 = df[0].str.contains("Information").fillna(False)
m3 = df[0].str.contains(r"ITEM|Variables|\+", regex=True).fillna(False)

# A TIMESTEP header and the value on the next line are kept as "DATA"
conds, vals = [m1|m1.shift(1).fillna(False), m2|m3], ["DATA", "TO_SKIP"]

out = (
        df
          .assign(flag= pd.Series(np.select(conds, vals, None)).bfill().ffill())
          .pivot(columns="flag", values=0)
          .loc[:, "DATA"].dropna()
          .str.split(expand=True)
          .assign(col= lambda x: x[0].shift(-1).where(x[1].str.contains("TIMESTEP")).ffill())
          .set_axis(["Index", "a", "b", "c", "d", "TIMESTEP"], axis=1)
          .dropna(how="any")
          .reset_index(drop=True)
        )

Output :

print(out)

  Index    a          b          c         d TIMESTEP
0     1    1          0          0  0.142857        0
1     2    1  0.0714286  0.0714286  0.142857        0
2     3    1  0.0714286          0  0.214286        0
3     4    1          0  0.0714286  0.214286        0
4   491    1          2        2.3       0.4        0
5     1    1          0          0  0.142857        0
6     2    1  0.0714286  0.0714286  0.142857        0
7     3    1  0.0714286          0  0.214286        0
8     4    1          0  0.0714286  0.214286        0
9   491  215        0.4       12.4       2.4        0

Disclaimer: This approach may not be efficient for large files.
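
If the file is large, a plain line-by-line pass may be a simpler alternative to the mask-and-pivot logic above. The following is only a minimal sketch, not a tested solution for your real data: the function name `read_dump` and the shortened `sample` text are made up for illustration, and it assumes every block follows the `ITEM: TIMESTEP` / `ITEM: ATOMS ...` layout from the question, with column names taken from the `ITEM: ATOMS` line.

```python
from io import StringIO

import pandas as pd


def read_dump(handle):
    """Collect the ATOMS rows of every block and tag them with their timestep.

    `read_dump` is a hypothetical helper name, not part of pandas.
    """
    rows, columns = [], None
    timestep, section = None, None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("ITEM:"):
            # Remember which section the following lines belong to.
            section = line[len("ITEM:"):].strip()
            if section.startswith("ATOMS"):
                # Column names follow the word ATOMS, e.g. id type xs ys zs.
                columns = section.split()[1:]
            continue
        if section == "TIMESTEP":
            timestep = int(line)
        elif section is not None and section.startswith("ATOMS"):
            fields = line.split()
            # Ignore lines that do not have one value per column.
            if len(fields) == len(columns):
                rows.append(fields + [timestep])
        # Lines in any other section (atom count, box bounds) are skipped.
    if columns is None:
        return pd.DataFrame()
    df = pd.DataFrame(rows, columns=columns + ["TIMESTEP"])
    # The values were read as strings, so convert every column to numbers.
    return df.apply(pd.to_numeric)


# Shortened sample built from rows in the question (two atoms per block).
sample = """ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
2
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0.0714286 0.142857
ITEM: TIMESTEP
1002
ITEM: NUMBER OF ATOMS
2
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
0.0000000000000000e+00 2.8000000000000000e+01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0 0.214286
"""

df = read_dump(StringIO(sample))
print(df)
```

For a real file, you would pass an open file object instead of the `StringIO` wrapper. If you prefer the multi-index layout mentioned in the question, `df.set_index(["TIMESTEP", "id"])` on the result should give one level per timestep.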
