I've got a list of lines of text made up of alternating section headers and section content. I want to parse it line by line and identify the sections and their associated content (to eventually put together into a dictionary).
The trouble I am having is in figuring out how to parse the lines into pairs based only on iterating through the list and looking for the headers. Every time I try, I get very close, but somehow my sections end up misaligned.
I think my algorithm should be as follows:
(0) Assume no header has been identified at the beginning of the search; hence, any content seen will be ignored until a section header is encountered.
(1) When "in" a section (i.e. a section header has been encountered), accumulate all following section content and append it together until a new section header is seen.
(2) Upon encountering the new section header, any following lines should be considered as part of the new section.
(3) Some sections may have only a header, and hence have blank content. Others may span one or more lines.
In other words, given this:
garbage
Section-A-Header
section A content line 1
section A content line 2
section A content line 3
Section-B-Header
section B content line 1
section B content line 2
Section-C-Header
Section-D-Header
section D content line 1
section D content line 2
section D content line 3
...I would like to be able to construct:
{Section-A-Header: section A content line 1 section A content line 2 section A content line 3}
{Section-B-Header: section B content line 1 section B content line 2}
{Section-C-Header: None}
{Section-D-Header: section D content line 1 section D content line 2 section D content line 3}
Could anyone help me figure out a solid implementation?
CodePudding user response:
I am not sure exactly what issue you are facing with this.
Here is a rough sketch for you to take inspiration from:
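def is_section_header(line):
    # Assumption: headers start with "Section-"; replace with your own logic
    return line.startswith("Section-")

last_header = None
output = {}
with open("sections.txt") as file:
    for line in file:
        line = line.strip()
        if is_section_header(line):
            last_header = line
            output[last_header] = ""
        elif last_header is not None:
            # Lines before the first header are skipped; the rest are
            # appended to the current section, separated by a space
            output[last_header] = (output[last_header] + " " + line).strip()
print(output)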
CodePudding user response:
This would be my approach:
result = dict()
with open('foo.txt') as foo:
    section = None
    for line in map(str.strip, foo):
        # identify start of section
        if line.startswith('Section-'):
            section = line
            result[section] = None
        else:
            if section:
                if result[section]:
                    result[section].append(line)
                else:
                    result[section] = [line]
Result:
{
    "Section-A-Header": [
        "section A content line 1",
        "section A content line 2",
        "section A content line 3"
    ],
    "Section-B-Header": [
        "section B content line 1",
        "section B content line 2"
    ],
    "Section-C-Header": None,
    "Section-D-Header": [
        "section D content line 1",
        "section D content line 2",
        "section D content line 3"
    ]
}
Note:
Written like this only because OP wants None for empty sections
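If the space-joined strings from the question are wanted instead of lists, something along these lines might work. It is only a sketch: the name parse_sections and the is_header parameter are made up for illustration, the "Section-" prefix test is an assumption taken from the sample data, and skipping blank lines is an extra assumption the question does not state.

```python
def parse_sections(lines, is_header=lambda s: s.startswith("Section-")):
    """Group lines under the most recent header.

    Assumption: a header is any line starting with "Section-".
    Lines seen before the first header are ignored.
    """
    sections = {}
    current = None  # no header seen yet
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no content
        if is_header(line):
            current = line
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    # Join each section's lines with spaces; header-only sections map to None
    return {h: " ".join(c) if c else None for h, c in sections.items()}


sample = """garbage
Section-A-Header
section A content line 1
section A content line 2
section A content line 3
Section-B-Header
section B content line 1
section B content line 2
Section-C-Header
Section-D-Header
section D content line 1
section D content line 2
section D content line 3
""".splitlines()

result = parse_sections(sample)
for header, content in result.items():
    print(header, "->", content)
```

Collecting into lists first and joining at the end keeps the loop simple and avoids special-casing the first content line. Note that a header appearing twice would overwrite its earlier content in this sketch.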