Extracting the nth and mth lines every time a match is found, without reading the whole file

I have the following sample text file:

DATASET
OBJTYPE "mesh2d"
BEGALD
ND 58673
NC 116294
TIMEUNITS SECONDS
TS 0  1.98849600e+08
    0.000000000e+00
    0.56000000e+00
    0.200000000e+00
    0.00000000e+00
    0.100000000e+00
    0.00000000e+00
    0.00000000e+00
    0.73400000e+00
TS 0  1.98853209e+08
    0.00000000e+00
    1.00500000e+00
    4.00000000e+00
    6.00000000e-05
    9.00000000e+00
    0.00000000e+00
    0.00000000e+00
TS 0  1.98856959e+08
    0.00000000e+00
    1.38000000e+00
    4.00000000e+00
    3.00000000e-05
    8.10000000e+00
    2.45000000e+00
    0.00000000e+00
    0.00000000e+00
TS 0  1.98860419e+08
    0.00000000e+00
    1.40000000e+00
    7.00000000e+00
    3.00000000e-05
    9.00000000e+00
    0.00000000e+00
    0.00000000e+00
    0.00000000e+00
TS 0  1.98864081e+08
    0.00000000e+00
    0.00000000e+00
    0.00000000e+00
    3.00000000e-05
    0.00000000e+00
    0.00000000e+00
    0.00000000e+00
    0.00000000e+00
TS 0  1.98867619e+08
    0.00000000e+00
    0.00000000e+00
    8.00000000e+00
    3.50000000e-05
    10.00000000e+00
    0.00000000e+00
    5.50000000e+00
    0.00000000e+00
ENDDS

I want to extract the time stamp from each line starting with 'TS 0 ', along with the 2nd, 5th and 8th lines after every such match. The file is huge (more than 10 GB), so I don't want to read all of it into memory.

This is what I could come up with:

with open(r"file") as f:
    for line in f:
        if line.startswith("TIMEUNITS SECONDS"):
            break  # the file iterator will resume from the next line
    time = []   # list for storing time stamps
    line2 = []  # or lines = [2, 5, 8]
    line5 = []
    line8 = []
    for line in f:
        if line.startswith("TS"):
            print(line.strip())  # extract all TS
            ts = float(line.split()[2])
            time.append(ts)

This only extracts the time stamps. How can I also extract the 2nd, 5th and 8th lines, using a loop or any faster method, without reading the whole file?

CodePudding user response：

A file object in Python is its own iterator, so it retains its position from one loop to the next; that is what let you skip the initial section with one loop and continue with another. Keep using the same technique to find the lines you need:

with open(r"file") as f:
    for line in f:
       if line.startswith("TIMEUNITS SECONDS"):
           break
    time = [] 
    line2 = []
    line5 = []
    line8 = []
    for line in f:
        if line.startswith("TS"):
            ts = float(line.strip().split()[2])
            time.append(ts)
            for _ in range(2):      # advance to the 2nd line after TS
                line = next(f)
            line2.append(float(line.strip()))
            for _ in range(3):      # 3 more lines: the 5th line after TS
                line = next(f)
            line5.append(float(line.strip()))
            for _ in range(3):      # 3 more lines: the 8th line after TS
                line = next(f)
            line8.append(float(line.strip()))
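
The behaviour this relies on, that a later loop or next() call resumes where the previous one stopped, can be checked on a small in-memory file. This is only a sketch: io.StringIO stands in for the real file, and its contents are made up for illustration.

```python
import io

# In-memory stand-in for the real file; the contents are invented for this demo.
f = io.StringIO("HEADER\nTIMEUNITS SECONDS\nTS 0 1.0\n10\n20\n30\n")

for line in f:
    if line.startswith("TIMEUNITS"):
        break  # stop here; the iterator now sits just after this line

ts_line = next(f)    # the line right after the header
first = next(f)      # 1st line after the TS line
second = next(f)     # 2nd line after the TS line
remaining = list(f)  # whatever has not been consumed yet
```

Nothing before the current position is read again, and nothing after it is loaded until it is asked for, which is why the approach works on files that do not fit in memory.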

Now that you have the basic structure down, you can factor out the repeated code into a function and add some error checking:

def find(file, s):
    for line in file:
        if line.startswith(s):
            return line
    return None

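def skip(file, n):
    # Advance n lines and return the last one, or None if the file ends first.
    i, line = -1, None  # defaults so an empty file or n == 0 does not raise
    for i, line in zip(range(n), file):
        pass
    return line if i == n - 1 else None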

def load(filename):
    with open(filename) as f:
        if not find(f, "TIMEUNITS SECONDS"):
            return None

        time = [] 
        line2 = []
        line5 = []
        line8 = []
        while True:
            if not (line := find(f, "TS")):
                break
            time.append(float(line.strip().split()[2]))
            if not (line := skip(f, 2)):
                break
            line2.append(float(line.strip()))
            if not (line := skip(f, 3)):
                break
            line5.append(float(line.strip()))
            if not (line := skip(f, 3)):
                break
            line8.append(float(line.strip()))
    return time, line2, line5, line8
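
The question's comment hints at keeping the wanted positions in a list (lines = [2, 5, 8]). A possible generalization along those lines is sketched below. The name extract and its offsets and marker parameters are hypothetical, not part of the code above, and the sketch assumes every block has at least max(offsets) value lines after its TS line. It uses itertools.islice to pull just that many lines per block, so only a handful of lines are held in memory at any time.

```python
import io
from itertools import islice

def extract(f, offsets=(2, 5, 8), marker="TS"):
    """Yield (timestamp, values) for each marker line in the open file f.

    values holds the lines found at the given 1-based offsets after the
    marker line. Assumes each block has at least max(offsets) value lines.
    """
    depth = max(offsets)
    for line in f:
        if not line.startswith(marker):
            continue
        ts = float(line.split()[2])
        block = list(islice(f, depth))  # reads at most `depth` lines
        if len(block) < depth:
            return  # file ended in the middle of a block
        yield ts, [float(block[o - 1]) for o in offsets]

# Small made-up sample in the same layout as the file in the question.
sample = io.StringIO(
    "TIMEUNITS SECONDS\n"
    "TS 0  1.5e+02\n"
    + "".join(f"    {i}.0e+00\n" for i in range(1, 9))
    + "TS 0  2.5e+02\n"
    + "".join(f"    {i}.5e+00\n" for i in range(1, 9))
    + "ENDDS\n"
)
results = list(extract(sample))
```

Because extract is a generator, a real 10 GB file can be processed one block at a time, for example with `for ts, values in extract(open_file): ...`, instead of collecting everything into lists first.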