Using regex for extracting values from a text file

Time:05-24

I have a text file from which I want to extract the values that appear at a fixed position after a particular string, each time that string occurs. I'm completely new to this and learned that pattern-matching problems like this can be solved with regular expressions.

<BEGIN> AUTO,CHANSTATE
<CH> Time: 2002-07-04 
<CH> Chan   Doppler       Code     Track        CdDoppler       CodeRange
<CH>    0   1449.32  2914.6679      0.00        833359.36        -154.093
<CH>    1   1450.35  2414.8292      0.00        833951.94        -154.093
<CH>    2   1450.35  6387.2597      0.00        833951.94        -154.093
<END>
<BEGIN> AUTO,CHSTAT
(it goes on)---------------------

The above structure is repeated multiple times in the file. Is there a way to extract the Doppler values (1449.32, 1450.35, 1450.35) and store them in a Python list? Since each block starts with "AUTO,CHANSTATE", can that be used as a reference point to get the values? Or any other way that I haven't thought of? Any help would be much appreciated.

CodePudding user response:

A better approach is to parse the file line by line. Split each line on whitespace and take the Doppler value at list index 2 (index 0 is the <CH> tag and index 1 is the channel number). The advantage of this approach is that you can also access the other column values if you need them later. Try this:

with open("sample.txt") as file:  # file refers to the open file object
    for line in file:  # read the file line by line
        data = line.split()  # split the line on whitespace
        try:
            float(data[2])  # raises ValueError if the third field is not a number
        except (IndexError, ValueError):
            continue  # skip short lines and non-numeric fields (header, timestamp)
        print("Doppler = ", data[2])

Output:

Doppler =  1449.32
Doppler =  1450.35
Doppler =  1450.35

Check this for demo: https://www.online-python.com/mgE32OXJW8
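
The question asks for the values in a list rather than printed output. A minimal sketch of the same split-based idea, assuming the layout shown in the question; the function name `doppler_values` and the in-memory `sample` string are made up for illustration:

```python
def doppler_values(lines):
    """Collect the third whitespace-separated field of every line where it parses as a float."""
    values = []
    for line in lines:
        fields = line.split()
        try:
            values.append(float(fields[2]))
        except (IndexError, ValueError):
            continue  # short line, or third field is not numeric
    return values


# Sample copied from the question (the "(it goes on)" placeholder is left out).
sample = """<BEGIN> AUTO,CHANSTATE
<CH> Time: 2002-07-04
<CH> Chan   Doppler       Code     Track        CdDoppler       CodeRange
<CH>    0   1449.32  2914.6679      0.00        833359.36        -154.093
<CH>    1   1450.35  2414.8292      0.00        833951.94        -154.093
<CH>    2   1450.35  6387.2597      0.00        833951.94        -154.093
<END>
<BEGIN> AUTO,CHSTAT
"""

print(doppler_values(sample.splitlines()))  # [1449.32, 1450.35, 1450.35]
```

To read from a file instead, pass the open file object, since iterating over it yields lines: `with open("sample.txt") as f: values = doppler_values(f)`. Note that this does not look at the block name, so a numeric third field in some other block would be collected as well.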

CodePudding user response:

If you really want/need to use regex you could do this.

Code:

import re

text = '''<BEGIN> AUTO,CHANSTATE
<CH> Time: 2002-07-04 
<CH> Chan   Doppler       Code     Track        CdDoppler       CodeRange
<CH>    0   1449.32  2914.6679      0.00        833359.36        -154.093
<CH>    1   1450.35  2414.8292      0.00        833951.94        -154.093
<CH>    2   1450.35  6387.2597      0.00        833951.94        -154.093
<END>
<BEGIN> AUTO,CHSTAT
(it goes on)---------------------'''

find_this = re.findall(r'<CH>.*?[0-9].*?\s.*?([0-9].*?)\s', text)

print(find_this)

Output:

['1449.32', '1450.35', '1450.35']

There are, however, other ways to do this without re, as others have pointed out.
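
The pattern above matches any <CH> row with two numeric fields, whichever block it sits in. If only the AUTO,CHANSTATE blocks should count, one possible variant is a two-step match: first capture each block body, then pull the second column from its data rows. This is a sketch that assumes every block is closed by <END> and that the Doppler column is a plain decimal number; the names `block_re` and `row_re` are made up for illustration:

```python
import re

# Sample copied from the question (the "(it goes on)" placeholder is left out).
text = """<BEGIN> AUTO,CHANSTATE
<CH> Time: 2002-07-04
<CH> Chan   Doppler       Code     Track        CdDoppler       CodeRange
<CH>    0   1449.32  2914.6679      0.00        833359.36        -154.093
<CH>    1   1450.35  2414.8292      0.00        833951.94        -154.093
<CH>    2   1450.35  6387.2597      0.00        833951.94        -154.093
<END>
<BEGIN> AUTO,CHSTAT
"""

# Body of each CHANSTATE block: everything up to the first following <END>.
block_re = re.compile(r"<BEGIN> AUTO,CHANSTATE[ \t\r]*\n(.*?)<END>", re.DOTALL)
# A data row: <CH>, an integer channel number, then the Doppler value (captured).
row_re = re.compile(r"^<CH>\s+\d+\s+(-?\d+(?:\.\d+)?)", re.MULTILINE)

dopplers = [
    float(value)
    for block in block_re.findall(text)
    for value in row_re.findall(block)
]
print(dopplers)  # [1449.32, 1450.35, 1450.35]
```

The Time and header rows are skipped automatically because their second field is not an integer.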

CodePudding user response:

Or any other way...

No regex, just string functions

  • iterate over the lines in the file
  • check whether the line starts with, contains, or equals '<BEGIN> AUTO,CHANSTATE'
    • when it does, skip the next two lines (the Time line and the column header)
  • keep iterating, and for each line that starts with '<CH>',
    • split the line on whitespace and save the third item of the result (result[2])
  • continue until a line starts with, contains, or equals '<END>'
  • then start over, looking for the next '<BEGIN> AUTO,CHANSTATE'
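
The steps above can be sketched roughly as follows. This assumes the block layout shown in the question; the function name `chanstate_dopplers` is made up for illustration, and the CHSTAT data row in the sample is invented only to show that rows outside CHANSTATE blocks are ignored:

```python
def chanstate_dopplers(lines):
    """Collect result[2] from the <CH> data rows inside AUTO,CHANSTATE blocks."""
    values = []
    in_block = False
    to_skip = 0
    for raw in lines:
        line = raw.strip()
        if line.startswith("<BEGIN> AUTO,CHANSTATE"):
            in_block = True
            to_skip = 2  # the Time line and the column-header line
        elif line.startswith("<END>"):
            in_block = False
        elif in_block and line.startswith("<CH>"):
            if to_skip:
                to_skip -= 1
            else:
                values.append(float(line.split()[2]))
    return values


# Sample from the question, followed by an INVENTED CHSTAT row for illustration.
sample = """<BEGIN> AUTO,CHANSTATE
<CH> Time: 2002-07-04
<CH> Chan   Doppler       Code     Track        CdDoppler       CodeRange
<CH>    0   1449.32  2914.6679      0.00        833359.36        -154.093
<CH>    1   1450.35  2414.8292      0.00        833951.94        -154.093
<CH>    2   1450.35  6387.2597      0.00        833951.94        -154.093
<END>
<BEGIN> AUTO,CHSTAT
<CH>    0   9999.99
<END>
"""

print(chanstate_dopplers(sample.splitlines()))  # [1449.32, 1450.35, 1450.35]
```

A malformed data row (fewer than three fields, or a non-numeric third field) would raise an exception here; wrap the `float(...)` call in a try/except if the file may contain such rows.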