Home > Software engineering >  Reading a text file with multuple header lines and multiple sections
Reading a text file with multuple header lines and multiple sections

Time:09-22

I have a file that has multiple "header lines" (identical) for each section (ID) and has multiple IDs. see below;

IDS:    df-1    df-1    df-1
Date/Time   L   A   Z
yyy-mm-dd H:M:S in  ft/s2   ft
2022-04-28 13:30:00 NA  0   NA
2022-04-28 13:45:00 NA  2.968   NA
2022-04-28 14:00:00 2.427   3.124   22.818
IDS:    dat-2   dat-2   dat-2
Date/Time   L   A   Z
yyy-mm-dd H:M:S in  ft/s2   ft
2022-04-29 10:15:00 1.352   2.137   5.552
2022-04-29 10:30:00 1.365   2.303   6.065

This is a tab-separated file, and I tried using tsconvert package to no avail.

My first idea is to remove the header lines and only keep the ID lines, split the data at those lines and assign the ID to each section, and then bind the data into a single dataframe with an ID column and four other columns for Datetime, L, A, and Z.

However, this seems to involve a lot of preprocessing. I was hoping for a better (read direct) solution.

CodePudding user response:

An option is to read with readLines and then do the changes

i1 <- grepl('IDS', lines)
ids <- gsub("IDS:\\s (\\S )\\s .*", "\\1", lines[i1])
df1 <- do.call(rbind,  Map(cbind, id = ids, lapply(split(lines, 
   cumsum(i1)), function(x) 
    read.table(text = sub("^(\\d{4}.*:\\d{2})\\s ", "'\\1' ", 
    x[-(1:3)]), header = FALSE, col.names = strsplit(x[2], "\\s ")[[1]],
    check.names = FALSE))))
row.names(df1) <- NULL

-output

> df1
     id           Date/Time     L     A      Z
1  df-1 2022-04-28 13:30:00    NA 0.000     NA
2  df-1 2022-04-28 13:45:00    NA 2.968     NA
3  df-1 2022-04-28 14:00:00 2.427 3.124 22.818
4 dat-2 2022-04-29 10:15:00 1.352 2.137  5.552
5 dat-2 2022-04-29 10:30:00 1.365 2.303  6.065

data

lines <- readLines('file.txt')
  • Related