An ancient atomic simulation software is producing really ugly CSV file, which I want to import to pandas dataframe. The format looks like this:
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
491
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e 00 2.8000000000000000e 01
0.0000000000000000e 00 2.8000000000000000e 01
0.0000000000000000e 00 2.8000000000000000e 01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0.0714286 0.142857
3 1 0.0714286 0 0.214286
4 1 0 0.0714286 0.214286
...
491 1 2 2.3 0.4
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
491
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e 00 2.8000000000000000e 01
0.0000000000000000e 00 2.8000000000000000e 01
0.0000000000000000e 00 2.8000000000000000e 01
ITEM: ATOMS id type xs ys zs
1 1 0 0 0.142857
2 1 0.0714286 0.0714286 0.142857
3 1 0.0714286 0 0.214286
4 1 0 0.0714286 0.214286
...
491 215 0.4 12.4 2.4
...
...
ITEM: TIMESTEP
1002
...
Basically it is a repeating header with the information of the iteration number. To me it seems that the easiest way to turn this into pandas would be df with the variables ['id', 'type', 'xs', 'ys', 'zs'] and add to it a new column "TIMESTEP", so it would be a nice 2D df. Alternatively could be a multi-index array Timestep 1 -> internal_df['id', 'type', 'xs', 'ys', 'zs']
The information lines (1-9) can be deleted.
The end result would ideally look like this:
Index a b c d TIMESTEP
1 1 0 0 0.142857 0
2 1 0.0714286 0.0714286 0.142857 0
3 1 0.0714286 0 0.214286 0
4 1 0 0.0714286 0.214286 0
5 1 0.142857 0 0.142857 0
...
474 1 0.78636 0.788005 0.425791 100002
Would you suggest a string-formatting script (example would be appreciated), or maybe Pandas read_csv with a smart set of settings could do it out of the box?
EDITED: Added true INFORMATION bit of the header, which is to be discarded (bit from "Number of Entries" to "Variables" line)
Cheers
CodePudding user response:
With pandas, here is one way on how you can approach that (to give you just the general logic).
#pip install pandas
import pandas as pd
import numpy as np
df = pd.read_csv(StringIO(s), sep="/", header=None)
m1 = df[0].str.contains("TIMESTEP")
m2 = df[0].str.contains("Information").fillna(False)
m3 = df[0].str.contains("ITEM|Variables|\ ", regex=True).fillna(False)
conds, vals = [m1|m1.shift(1).fillna(False), m2|m3], ["DATA", "TO_SKIP"]
out = (
df
.assign(flag= pd.Series(np.select(conds, vals, None)).bfill().ffill())
.pivot(columns="flag", values=0)
.loc[:, "DATA"].dropna()
.str.split(expand=True)
.assign(col= lambda x: x[0].shift(-1).where(x[1].str.contains("TIMESTEP")).ffill())
.set_axis(["Index", "a", "b", "c", "d", "TIMESTEP"], axis=1)
.dropna(how="any")
.reset_index(drop=True)
)
Output :
print(out)
Index a b c d TIMESTEP
0 1 1 0 0 0.142857 0
1 2 1 0.0714286 0.0714286 0.142857 0
2 3 1 0.0714286 0 0.214286 0
3 4 1 0 0.0714286 0.214286 0
4 491 1 2 2.3 0.4 0
5 1 1 0 0 0.142857 0
6 2 1 0.0714286 0.0714286 0.142857 0
7 3 1 0.0714286 0 0.214286 0
8 4 1 0 0.0714286 0.214286 0
9 491 215 0.4 12.4 2.4 0
Disclaimer: This approach may not be efficient for large files.