Home > database >  R or python regex for data extraction from text file
R or python regex for data extraction from text file

Time:11-20

I have a text file of the following format: //

DATASET

..... unnecessary lines.....

TIMEUNITS SECONDS

    TS 0  1.98849600e 08
        3.30000000e-03    1.25400000e-02    5.88000000e-03    0.00000000e 00    0.00000000e 00
        5.88000000e-03    3.33000000e-03    2.16000000e-03    0.00000000e 00    0.00000000e 00
    TS 0  1.98853209e 08
        0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00
        1.25400000e-02    5.88000000e-03    3.33000000e-03    0.00000000e 00    0.00000000e 00
    TS 0  1.98860419e 08
        3.33000000e-03    2.16000000e-03    1.08000000e-03    0.00000000e 00    0.00000000e 00
        0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00
    TS 0  1.98864081e 08
        1.08000000e-03    8.70000000e-04    7.20000000e-04    0.00000000e 00    0.00000000e 00
        0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00
    TS 0  1.98867619e 08
        0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00    0.00000000e 00
        3.33000000e-03    2.16000000e-03    1.08000000e-03    0.00000000e 00    0.00000000e 00

I am also attaching the sample text file named "D50.bc" in this link: https://drive.google.com/file/d/1P5aFC0JsRLhwuUo7JENLg03DbDJ696lk/view?usp=sharing.

There is no column names but it is fine to add column names i.e., V1, V2 etc. In the real text file there are 14 columns and 1000s of lines/rows after each TS. Each row corresponds to a node and the columns corresponds to certain values(i.e., velocity/shear stress etc.) at that node

I want to extract all the data/tables under the lines "TS 0 XXX" into individual dfs based on their timestamps(TS) so that I can do column operations for each TS. The TS value XXXX which is in second can be added as a separate column in the dfs. Below is the regular expression I have tried in R but it returns nothing. Any help is highly appreciated.

  library(dplyr)
    rm(list=ls(all=TRUE))
    a <- paste0(readLines("D:/D50 python/D50.bc"), collapse = "\n")
    b <- regmatches(a, gregexpr("(?s)^TS[^\n] 0.*?\n\\K.*?\n\b", a,, perl = T))[[1]]

CodePudding user response:

spltxt <- split(txt, cumsum(grepl("^\\s*TS 0 ", txt)))[-1]
alldat <- Map(function(S, grp) {
  out <- read.table(text = S[-1], header = FALSE)
  out$grp <- grp
  out$node <- seq_len(nrow(out))
  TS <- trimws(strsplit(S[1], "\\s ")[[1]])
  out$TS <- as.numeric(TS[length(TS)])
  out$TS0 <- S[1]
  out
}, spltxt, seq_along(spltxt))
out <- do.call(rbind, alldat)
out
#          V1      V2      V3 V4 V5 grp node        TS                      TS0
# 1.1 0.00330 0.01254 0.00588  0  0   1    1 198849600     TS 0  1.98849600e 08
# 1.2 0.00588 0.00333 0.00216  0  0   1    2 198849600     TS 0  1.98849600e 08
# 2.1 0.00000 0.00000 0.00000  0  0   2    1 198853209     TS 0  1.98853209e 08
# 2.2 0.01254 0.00588 0.00333  0  0   2    2 198853209     TS 0  1.98853209e 08
# 3.1 0.00333 0.00216 0.00108  0  0   3    1 198860419     TS 0  1.98860419e 08
# 3.2 0.00000 0.00000 0.00000  0  0   3    2 198860419     TS 0  1.98860419e 08
# 4.1 0.00108 0.00087 0.00072  0  0   4    1 198864081     TS 0  1.98864081e 08
# 4.2 0.00000 0.00000 0.00000  0  0   4    2 198864081     TS 0  1.98864081e 08
# 5.1 0.00000 0.00000 0.00000  0  0   5    1 198867619     TS 0  1.98867619e 08
# 5.2 0.00333 0.00216 0.00108  0  0   5    2 198867619     TS 0  1.98867619e 08

Walk-through:

  • cumsum(grepl(...)) groups the data by leading TS 0, so that we can split them up into individual matrices:

    cumsum(grepl("^\\s*TS 0 ", txt))
    #  [1] 0 0 0 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
    
  • We split the text up by these numbers, so 0 contains all of the preamble; 1 includes the first TS 0 group, etc. In this case, the 0s correspond to all text before the first TS 0, so we can discard the first segment with [-1].

  • Map(... iterates our function(S, grp) over each of the numbered groups as well as counting along the groups, called five times in this case.

  • read.table(...) makes short work of converting blank-space delimited numbers into a data.frame without pre-defined column names. The normal frame $-assignment takes care of the group and node values, as well as parsing out the timestamp numeric in the TS 0 line.

  • do.call(rbind, alldat) is one way to combine all frames in a list into a single frame; alternatives include dplyr::bind_rows(alldat) and data.table::rbindlist(alldat). (All of these are significantly better/faster than iteratively building a frame in a for loop.)

Assumption:

  • all matrix groups have the same number of columns
  • Related