I have a text file of the following format:
//DATASET
..... unnecessary lines.....
TIMEUNITS SECONDS
TS 0 1.98849600e 08
3.30000000e-03 1.25400000e-02 5.88000000e-03 0.00000000e 00 0.00000000e 00
5.88000000e-03 3.33000000e-03 2.16000000e-03 0.00000000e 00 0.00000000e 00
TS 0 1.98853209e 08
0.00000000e 00 0.00000000e 00 0.00000000e 00 0.00000000e 00 0.00000000e 00
1.25400000e-02 5.88000000e-03 3.33000000e-03 0.00000000e 00 0.00000000e 00
TS 0 1.98860419e 08
3.33000000e-03 2.16000000e-03 1.08000000e-03 0.00000000e 00 0.00000000e 00
0.00000000e 00 0.00000000e 00 0.00000000e 00 0.00000000e 00 0.00000000e 00
TS 0 1.98864081e 08
1.08000000e-03 8.70000000e-04 7.20000000e-04 0.00000000e 00 0.00000000e 00
0.00000000e 00 0.00000000e 00 0.00000000e 00 0.00000000e 00 0.00000000e 00
TS 0 1.98867619e 08
0.00000000e 00 0.00000000e 00 0.00000000e 00 0.00000000e 00 0.00000000e 00
3.33000000e-03 2.16000000e-03 1.08000000e-03 0.00000000e 00 0.00000000e 00
I am also attaching the sample text file named "D50.bc" in this link: https://drive.google.com/file/d/1P5aFC0JsRLhwuUo7JENLg03DbDJ696lk/view?usp=sharing.
There are no column names but it is fine to add column names i.e., V1, V2 etc. In the real text file, there are 14 columns and 1000s of lines/rows after each TS. Each row corresponds to a node and the columns correspond to certain values(i.e., velocity/shear stress, etc.) at that node
I want to extract all the data/tables under the lines "TS 0 XXX" into individual dfs based on their timestamps(TS) so that I can do column operations for each TS. The TS value XXXX which is in second can be added as a separate column in the dfs. Following is the pattern of output I want in a pandas dataframe:
# V1 V2 V3 V4 V5 grp node TS TS0
# 1.1 0.00330 0.01254 0.00588 0 0 1 1 198849600 TS 0 1.98849600e 08
# 1.2 0.00588 0.00333 0.00216 0 0 1 2 198849600 TS 0 1.98849600e 08
# 2.1 0.00000 0.00000 0.00000 0 0 2 1 198853209 TS 0 1.98853209e 08
# 2.2 0.01254 0.00588 0.00333 0 0 2 2 198853209 TS 0 1.98853209e 08
# 3.1 0.00333 0.00216 0.00108 0 0 3 1 198860419 TS 0 1.98860419e 08
# 3.2 0.00000 0.00000 0.00000 0 0 3 2 198860419 TS 0 1.98860419e 08
# 4.1 0.00108 0.00087 0.00072 0 0 4 1 198864081 TS 0 1.98864081e 08
# 4.2 0.00000 0.00000 0.00000 0 0 4 2 198864081 TS 0 1.98864081e 08
# 5.1 0.00000 0.00000 0.00000 0 0 5 1 198867619 TS 0 1.98867619e 08
# 5.2 0.00333 0.00216 0.00108 0 0 5 2 198867619 TS 0 1.98867619e 08
Here is the R code that solved the problem but I want a python code doing the same thing.
spltxt <- split(txt, cumsum(grepl("^\\s*TS 0 ", txt)))[-1]
alldat <- Map(function(S, grp) {
out <- read.table(text = S[-1], header = FALSE)
out$grp <- grp
out$node <- seq_len(nrow(out))
TS <- trimws(strsplit(S[1], "\\s ")[[1]])
out$TS <- as.numeric(TS[length(TS)])
out$TS0 <- S[1]
out
}, spltxt, seq_along(spltxt))
out <- do.call(rbind, alldat)
CodePudding user response:
I'll give it a shot:
import pandas as pd
#open and read file
with open('D50.bc') as f:
text = f.read()
#create list of TS groups
data_string = [i.splitlines() for i in text.split('TIMEUNITS SECONDS\n')[1].split('TS ')[1:]]
#create nested dictionary of TS values
data = [{'TS0': f'TS {i[0]}', 'TS': int(float((i[0].split()[-1]))), 'grp': n1, 'data':[{'data':{f'V{n3}': float(z) for n3, z in enumerate(x.split(' ')[1:])}, 'node':n2} for n2, x in enumerate(i[1:])]} for n1, i in enumerate(data_string)]
#load to dataframe and flatten nested dict
df = pd.DataFrame(d).explode('data')
df = df.join(pd.DataFrame(df.pop('data').values.tolist()))
df = df.join(pd.DataFrame(df.pop('data').values.tolist()))
Output:
TS0 | TS | grp | node | V0 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TS 0 1.98849600e 08 | 198849600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | TS 0 1.98849600e 08 | 198849600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | TS 0 1.98849600e 08 | 198849600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | TS 0 1.98849600e 08 | 198849600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | TS 0 1.98849600e 08 | 198849600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
CodePudding user response:
I wrote this with an eye towards speed and efficiency, with the goal of avoiding unnecessary file copies, and to avoid holding the entire file in memory at once. For the same reason, it uses the Pandas C engine to read the file. The logic ends up being more complex than the other answer.
#!/usr/bin/env python3
import io
import pandas as pd
def parse_chunk(current_ts_chunk, index, current_ts):
buf = io.StringIO(''.join(current_ts_chunk))
chunk = pd.read_csv(buf, delim_whitespace=True, header=None)
chunk['group'] = index
chunk['node'] = chunk.index 1
chunk['TS'] = current_ts.split()[2]
chunk['TS0'] = current_ts
return chunk
with open("D50.bc", "rt") as f:
# Advance through file until you get to TIMEUNITS SECONDS
for line in f:
if line.startswith("TIMEUNITS SECONDS"):
break
chunk_list = []
current_ts = None
current_ts_chunk = []
index = 1
for line in f:
if line.startswith("TS "):
if current_ts is None:
# First TS in file
current_ts = line.strip()
else:
# Write out previous TS
chunk = parse_chunk(current_ts_chunk, index, current_ts)
chunk_list.append(chunk)
# Get ready to accept current TS
current_ts = line.strip()
current_ts_chunk = []
index = 1
else:
# This is a data line
current_ts_chunk.append(line)
if current_ts is not None:
chunk = parse_chunk(current_ts_chunk, index, current_ts)
chunk_list.append(chunk)
df = pd.concat(chunk_list)
print(df)
which gives this result:
0 1 2 ... node TS TS0
0 0.00000 0.0000 0.00000 ... 1 1.98849600e 08 TS 0 1.98849600e 08
1 0.00000 0.0000 0.00000 ... 2 1.98849600e 08 TS 0 1.98849600e 08
2 0.00000 0.0000 0.00000 ... 3 1.98849600e 08 TS 0 1.98849600e 08
3 0.00000 0.0000 0.00000 ... 4 1.98849600e 08 TS 0 1.98849600e 08
4 0.00000 0.0000 0.00000 ... 5 1.98849600e 08 TS 0 1.98849600e 08
5 0.00000 0.0000 0.00000 ... 6 1.98849600e 08 TS 0 1.98849600e 08
6 0.00000 0.0000 0.00000 ... 7 1.98849600e 08 TS 0 1.98849600e 08
7 0.00000 0.0000 0.00000 ... 8 1.98849600e 08 TS 0 1.98849600e 08