Python dataframe from text file-CodePudding

I have a text file of the following format:

//DATASET

..... unnecessary lines.....

TIMEUNITS SECONDS

    TS 0  1.98849600e 08
        3.30000000e-03    1.25400000e-02    5.88000000e-03    0.00000000e 00    0.00000000e 00
        5.88000000e-03    3.33000000e-03    2.16000000e-03    0.00000000e 00    0.00000000e 00
    TS 0  1.98853209e 08
        0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00
        1.25400000e-02    5.88000000e-03    3.33000000e-03    0.00000000e 00    0.00000000e 00
    TS 0  1.98860419e 08
        3.33000000e-03    2.16000000e-03    1.08000000e-03    0.00000000e 00    0.00000000e 00
        0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00
    TS 0  1.98864081e 08
        1.08000000e-03    8.70000000e-04    7.20000000e-04    0.00000000e 00    0.00000000e 00
        0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00
    TS 0  1.98867619e 08
        0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00
        3.33000000e-03    2.16000000e-03    1.08000000e-03    0.00000000e 00    0.00000000e 00

I am also attaching the sample text file named "D50.bc" in this link: https://drive.google.com/file/d/1P5aFC0JsRLhwuUo7JENLg03DbDJ696lk/view?usp=sharing.

There are no column names but it is fine to add column names i.e., V1, V2 etc. In the real text file, there are 14 columns and 1000s of lines/rows after each TS. Each row corresponds to a node and the columns correspond to certain values(i.e., velocity/shear stress, etc.) at that node

I want to extract all the data/tables under the lines "TS 0 XXX" into individual dfs based on their timestamps(TS) so that I can do column operations for each TS. The TS value XXXX which is in second can be added as a separate column in the dfs. Following is the pattern of output I want in a pandas dataframe:

#          V1      V2      V3 V4 V5 grp node        TS                      TS0
# 1.1 0.00330 0.01254 0.00588  0  0   1    1 198849600     TS 0  1.98849600e 08
# 1.2 0.00588 0.00333 0.00216  0  0   1    2 198849600     TS 0  1.98849600e 08
# 2.1 0.00000 0.00000 0.00000  0  0   2    1 198853209     TS 0  1.98853209e 08
# 2.2 0.01254 0.00588 0.00333  0  0   2    2 198853209     TS 0  1.98853209e 08
# 3.1 0.00333 0.00216 0.00108  0  0   3    1 198860419     TS 0  1.98860419e 08
# 3.2 0.00000 0.00000 0.00000  0  0   3    2 198860419     TS 0  1.98860419e 08
# 4.1 0.00108 0.00087 0.00072  0  0   4    1 198864081     TS 0  1.98864081e 08
# 4.2 0.00000 0.00000 0.00000  0  0   4    2 198864081     TS 0  1.98864081e 08
# 5.1 0.00000 0.00000 0.00000  0  0   5    1 198867619     TS 0  1.98867619e 08
# 5.2 0.00333 0.00216 0.00108  0  0   5    2 198867619     TS 0  1.98867619e 08

Here is the R code that solved the problem but I want a python code doing the same thing.

spltxt <- split(txt, cumsum(grepl("^\\s*TS 0 ", txt)))[-1]
alldat <- Map(function(S, grp) {
  out <- read.table(text = S[-1], header = FALSE)
  out$grp <- grp
  out$node <- seq_len(nrow(out))
  TS <- trimws(strsplit(S[1], "\\s ")[[1]])
  out$TS <- as.numeric(TS[length(TS)])
  out$TS0 <- S[1]
  out
}, spltxt, seq_along(spltxt))
out <- do.call(rbind, alldat)

CodePudding user response：

I'll give it a shot:

import pandas as pd

#open and read file
with open('D50.bc') as f:
    text = f.read()

#create list of TS groups
data_string = [i.splitlines() for i in text.split('TIMEUNITS SECONDS\n')[1].split('TS ')[1:]]

#create nested dictionary of TS values
data = [{'TS0': f'TS {i[0]}', 'TS': int(float((i[0].split()[-1]))), 'grp': n1, 'data':[{'data':{f'V{n3}': float(z) for n3, z in enumerate(x.split('    ')[1:])}, 'node':n2} for n2, x in enumerate(i[1:])]} for n1, i in enumerate(data_string)]

#load to dataframe and flatten nested dict
df = pd.DataFrame(d).explode('data')
df = df.join(pd.DataFrame(df.pop('data').values.tolist()))
df = df.join(pd.DataFrame(df.pop('data').values.tolist()))

Output:

TS0	TS	V11
TS 0 1.98849600e 08	198849600	1
TS 0 1.98849600e 08	198849600	1
TS 0 1.98849600e 08	198849600	1
TS 0 1.98849600e 08	198849600	1
TS 0 1.98849600e 08	198849600	1

CodePudding user response：

I wrote this with an eye towards speed and efficiency, with the goal of avoiding unnecessary file copies, and to avoid holding the entire file in memory at once. For the same reason, it uses the Pandas C engine to read the file. The logic ends up being more complex than the other answer.

#!/usr/bin/env python3
import io
import pandas as pd


def parse_chunk(current_ts_chunk, index, current_ts):
    buf = io.StringIO(''.join(current_ts_chunk))
    chunk = pd.read_csv(buf, delim_whitespace=True, header=None)
    chunk['group'] = index
    chunk['node'] = chunk.index   1
    chunk['TS'] = current_ts.split()[2]
    chunk['TS0'] = current_ts
    return chunk


with open("D50.bc", "rt") as f:
    # Advance through file until you get to TIMEUNITS SECONDS
    for line in f:
        if line.startswith("TIMEUNITS SECONDS"):
            break
    chunk_list = []
    current_ts = None
    current_ts_chunk = []
    index = 1
    for line in f:
        if line.startswith("TS "):
            if current_ts is None:
                # First TS in file
                current_ts = line.strip()
            else:
                # Write out previous TS
                chunk = parse_chunk(current_ts_chunk, index, current_ts)
                chunk_list.append(chunk)
                # Get ready to accept current TS
                current_ts = line.strip()
                current_ts_chunk = []
                index  = 1
        else:
            # This is a data line
            current_ts_chunk.append(line)
    if current_ts is not None:
        chunk = parse_chunk(current_ts_chunk, index, current_ts)
        chunk_list.append(chunk)

    df = pd.concat(chunk_list)

print(df)

which gives this result:

         0       1        2  ...  node              TS                   TS0
0  0.00000  0.0000  0.00000  ...     1  1.98849600e 08  TS 0  1.98849600e 08
1  0.00000  0.0000  0.00000  ...     2  1.98849600e 08  TS 0  1.98849600e 08
2  0.00000  0.0000  0.00000  ...     3  1.98849600e 08  TS 0  1.98849600e 08
3  0.00000  0.0000  0.00000  ...     4  1.98849600e 08  TS 0  1.98849600e 08
4  0.00000  0.0000  0.00000  ...     5  1.98849600e 08  TS 0  1.98849600e 08
5  0.00000  0.0000  0.00000  ...     6  1.98849600e 08  TS 0  1.98849600e 08
6  0.00000  0.0000  0.00000  ...     7  1.98849600e 08  TS 0  1.98849600e 08
7  0.00000  0.0000  0.00000  ...     8  1.98849600e 08  TS 0  1.98849600e 08