How can I repeatedly parse text in a text file between two strings?

Time:09-29

I have a text file that contains a table like the following:

---
Title of my file
Subtitle of my file
---

 ------ ------------------- ------ 
|  a   |        aa         | aaa  |
|  b   |        bb         | bbb  |
|  c   |        cc         | ccc  |
|  d   |        dd         | ddd  |      # Section 1
|  e   |        ee         | eee  |
|  f   |        ff         | fff  |
 ====== =================== ====== 
|  g   |        gg         | ggg  |
|  h   |        hh         | hhh  |
|  i   |        ii         | iii  |      # Section 2
|  j   |        jj         | jjj  |
|  k   |        kk         | kkk  |
|  l   |        ll         | lll  |
 ------ ------------------- ------ 

I'm trying to parse it with Python to capture each section in a separate list, section_1_list and section_2_list, with each list containing the lines in that section. For example, section_1_list would be:

section_1_list = [
    "|  a   |        aa         | aaa  |",
    "|  b   |        bb         | bbb  |",
    "|  c   |        cc         | ccc  |",
    "|  d   |        dd         | ddd  |",
    "|  e   |        ee         | eee  |",
    "|  f   |        ff         | fff  |"
]

Notice that this is without the dividing lines.

So my question is: how can I write my loop so that it ignores the dividing lines and gathers the other lines into their own lists?

What I have tried:

Extract Values between two strings in a text file using python

Python read specific lines of text between two strings

What I currently have:

with open(txt_file_path) as f:
    lines = f.readlines()

row_start = False

for line in lines:
    if "-----" in line or "=====" in line:
        block_text = []
        row_start = not row_start

    while row_start == True:
        block_text.append(line)

Edit: I say repeatedly in the title because I have around 16 of these blocks in the text file.

CodePudding user response:

Try the following approach.

  1. Read the contents of the file.
  2. Remove the first and last lines of the table, i.e. the dashed borders (using re)
  3. Split the data based on the line separators in the table (using re)

See the following code:

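import re

with open(txt_file_path, "r") as f:
    data = f.read()

# Remove the dashed border lines (indented lines made up of '-' and spaces).
data = re.sub(r"(?m)^ +-[- ]*\n?", "", data)
# Split on the '=' separator lines; each element is the text of one section.
# Note: any text before the table (e.g. the title) stays in the first element.
block_text = re.split(r"(?m)^ +=[= ]*\n", data)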
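
A regex split like this yields one string per section, not the per-section lists of lines the question asks for. Below is a minimal sketch of that last step. It assumes every border or separator is an indented line made up only of '-', '=' and spaces, and that table rows start with '|'. The helper name split_sections and the shortened sample text are made up for illustration.

```python
import re


def split_sections(text):
    """Return one list of row lines per table section (hypothetical helper)."""
    # Split on any border/separator line: leading spaces, then only '-', '=' and spaces.
    blocks = re.split(r"(?m)^ +[-=][-= ]*$", text)
    sections = []
    for block in blocks:
        # Keep only table rows (lines starting with '|'); this drops titles and blank lines.
        rows = [line for line in block.splitlines() if line.startswith("|")]
        if rows:
            sections.append(rows)
    return sections


# Shortened stand-in for the real file contents (assumption, not the asker's data).
sample = (
    "---\nTitle\n---\n\n"
    " ------ ------ \n"
    "|  a   | aa   |\n"
    "|  b   | bb   |\n"
    " ====== ====== \n"
    "|  c   | cc   |\n"
    " ------ ------ \n"
)
section_1_list, section_2_list = split_sections(sample)
print(section_1_list)
print(section_2_list)
```

With around 16 blocks it is probably more convenient to keep the returned list of lists and index into it than to unpack into separately named variables.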

CodePudding user response:

Here's how I would do it:

from pprint import pprint

file_contents = """\
---
Title of my file
Subtitle of my file
---

 ------ ------------------- ------ 
|  a   |        aa         | aaa  |
|  b   |        bb         | bbb  |
|  c   |        cc         | ccc  |
|  d   |        dd         | ddd  |      # Section 1
|  e   |        ee         | eee  |
|  f   |        ff         | fff  |
 ====== =================== ====== 
|  g   |        gg         | ggg  |
|  h   |        hh         | hhh  |
|  i   |        ii         | iii  |      # Section 2
|  j   |        jj         | jjj  |
|  k   |        kk         | kkk  |
|  l   |        ll         | lll  |
 ------ ------------------- ------ \
"""
lines = file_contents.split('\n')

# TODO update as needed
start_end_line_prefixes = (' ---', ' ===')

sections = []
curr_section = None

for line in lines:
    if any(line.startswith(prefix) for prefix in start_end_line_prefixes):
        curr_section = []
        sections.append(curr_section)
    elif curr_section is not None:
        curr_section.append(line)

# Remove the empty list that the closing border line leaves at the end (if any)
if sections and not sections[-1]:
    sections.pop()

pprint(sections)

Output:

[['|  a   |        aa         | aaa  |',
  '|  b   |        bb         | bbb  |',
  '|  c   |        cc         | ccc  |',
  '|  d   |        dd         | ddd  |      # Section 1',
  '|  e   |        ee         | eee  |',
  '|  f   |        ff         | fff  |'],
 ['|  g   |        gg         | ggg  |',
  '|  h   |        hh         | hhh  |',
  '|  i   |        ii         | iii  |      # Section 2',
  '|  j   |        jj         | jjj  |',
  '|  k   |        kk         | kkk  |',
  '|  l   |        ll         | lll  |']]
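
For a file with many blocks, the same idea can also be expressed with itertools.groupby. This is a different technique from the loop above: it groups consecutive lines by whether they are divider lines and keeps only the non-divider groups. The sketch below assumes dividers start with ' ---' or ' ===' as in the sample, and that only lines starting with '|' are table rows. The names is_divider and iter_sections are hypothetical.

```python
from itertools import groupby

DIVIDER_PREFIXES = (" ---", " ===")  # assumed from the sample table


def is_divider(line):
    """True for table border/separator lines (hypothetical helper)."""
    return line.startswith(DIVIDER_PREFIXES)


def iter_sections(lines):
    """Yield one list of table rows per run of lines between dividers."""
    for divider, group in groupby(lines, key=is_divider):
        if divider:
            continue  # skip the divider lines themselves
        # Keep only table rows; this also drops the title block and blank lines.
        rows = [line.rstrip("\n") for line in group if line.startswith("|")]
        if rows:
            yield rows


# Shortened stand-in for the lines of the real file (assumption).
lines = [
    "---\n", "Title\n", "---\n", "\n",
    " ------ ------ \n",
    "|  a   | aa   |\n",
    " ====== ====== \n",
    "|  b   | bb   |\n",
    "|  c   | cc   |\n",
    " ------ ------ \n",
]
sections = list(iter_sections(lines))
print(sections)
```

Because iter_sections only iterates over its argument, an open file object can be passed in directly, e.g. `with open(txt_file_path) as f: sections = list(iter_sections(f))`, which avoids holding the whole file in memory as one string.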