I have a text in the following format:
>>name of section a keyword
#a
some text
some text
some text
>>END_SECTION
>>name section b keyword
#a
some text
some text
some text
>>END_SECTION
continues...
The 'keyword' can be either pass, fail or warn.
I want to write code that produces the following output:
name of section keyword
The problem is that I am very new to Python and don't know how to extract a string from a text file when the end marker (the keyword) can be any of three different words.
I tried using the # sign as the end marker for the string, but it does not work. The code I tried is:
class get_word(object):
    def get_sentences(self, name_section):
        with open(filename) as file_content:
            file_content.read().splitlines()
            for line in file_content:
                if name_section in line:
                    start_line = file_content.index(line)
                    end_line = file_content[start_line:].index('#')
                    data = file_content[start_line:start_line + end_line]
        return data
I imported the code into a different script and wrote this:
import get_word
for data in f.get_sentences('name_section_a'):
    print(data)
But it gives the following error:
ValueError: False is not in list
Is there a better way to do this? Could it be possible to use re.match() for example?
Any help would be greatly appreciated!
CodePudding user response:
I don't understand what you're doing with the code that searches for #. It has nothing to do with returning the "name of section keyword" line.
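class get_word(object):
    def get_sentences(self, name_section):
        # Section header lines start with ">>" followed by the section name.
        prefix = ">>" + name_section
        # filename is assumed to be defined elsewhere, as in the question.
        with open(filename) as file_content:
            for line in file_content:
                if line.startswith(prefix):
                    return line.strip()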
This doesn't return a list, it just returns a single line, so there's no need for a loop in the caller.
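If you also want the keyword split off from the section name, here is a minimal sketch of the same startswith idea. It assumes the keyword is always the last whitespace-separated word on the header line and is one of pass, fail or warn, as the question says. The function name get_section_header, the KEYWORDS set and the sample text are made up for illustration, and StringIO stands in for the real file.

```python
from io import StringIO

# The three keywords named in the question.
KEYWORDS = {"pass", "fail", "warn"}

def get_section_header(file_obj, name_section):
    """Return (section_name, keyword) for the first matching header, else None."""
    prefix = ">>" + name_section
    for line in file_obj:
        if line.startswith(prefix):
            # Drop the leading ">>" and split once from the right,
            # so the last word becomes the keyword.
            parts = line[2:].strip().rsplit(None, 1)
            if len(parts) == 2 and parts[1] in KEYWORDS:
                return parts[0], parts[1]
    return None

# Hypothetical sample data with real keywords in place of "keyword".
sample = StringIO(
    ">>name of section a pass\n"
    "#a\n"
    "some text\n"
    ">>END_SECTION\n"
    ">>name section b warn\n"
    "#a\n"
    "some text\n"
    ">>END_SECTION\n"
)

result = get_section_header(sample, "name of section a")
print(result)
```

Splitting from the right means it does not matter how many spaces separate the name from the keyword, and a line such as >>END_SECTION is skipped because it has no keyword at the end.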
CodePudding user response:
The exact output you expect is unclear, but you can use a regex to extract the field:
import re
re.findall('(?<=^>>)(?!END_SECTION)(.*)', text, re.M)
Output:
['name of section a keyword', 'name section b keyword']
If you want to separate the keyword and drop the section identifier (the a or b before it):
re.findall(r'(?<=^>>)(?!END_SECTION)(.*)\s+\w+\s+(\w+)', text, re.M)
Output:
[('name of section', 'keyword'), ('name section', 'keyword')]
Input:
text = '''>>name of section a keyword
#a
some text
some text
some text
>>END_SECTION
>>name section b keyword
#a
some text
some text
some text
>>END_SECTION'''
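Since the question says the keyword can only be pass, fail or warn, another option is to spell those three words out as an alternation at the end of the line. This is only a sketch: the sample text below replaces the placeholder "keyword" with real values, which is an assumption about what the actual file looks like, and the names header and sections are arbitrary.

```python
import re

# Hypothetical sample with real keywords in place of "keyword".
text = """>>name of section a pass
#a
some text
>>END_SECTION
>>name section b warn
#a
some text
>>END_SECTION
"""

# Header line: ">>", a lazily matched section name, at least one space or tab,
# then one of the three allowed keywords at the end of the line.
header = re.compile(r"^>>(.+?)[ \t]+(pass|fail|warn)[ \t]*$", flags=re.MULTILINE)

sections = header.findall(text)
print(sections)
```

With this pattern there is no need for a lookahead to exclude >>END_SECTION, because that line does not end in one of the three keywords.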
CodePudding user response:
If you want to use regex for this, it should certainly be possible, though you'll need to use something like a negative lookahead (?!...) to skip lines like >>END_SECTION, for example.
The following regex should capture both the section names and keywords in the section start lines:
^>>(?!END_SECTION)(.*)[ ]{5}(.+)$
You can try it out on Regex Demo as well. Note that the first captured group is the section name, and the second is the keyword matched at the end of the line.
If you want, here's a Python example that you can test with. Note that I'm using StringIO, which represents a file-like object.
import re
from io import StringIO
# The section name and the keyword are separated by five spaces,
# which is what [ ]{5} in the pattern below relies on.
file_contents = StringIO("""
>>name of section a     keyword
#a
some text
some text
some text
>>END_SECTION
>>name section b     keyword
#a
some text
some text
some text
>>END_SECTION
continues...
""")
string = file_contents.read()
pattern = re.compile(r'^>>(?!END_SECTION)(.*)[ ]{5}(.+)$', flags=re.MULTILINE)
section_names = pattern.finditer(string)
for section in section_names:
    # section is a Match object; we can access attributes like the matched
    # groups from the object.
    print(section.groups())
Output:
('name of section a', 'keyword')
('name section b', 'keyword')
If you go with a non-regex solution, the following approach should also work:
file_contents.seek(0)  # rewind, in case the object was already read above
string = file_contents.read()
lines = string.strip().split('\n')
# separator between section name and keyword (five spaces)
sep = '     '
section_lines = [line.lstrip('>').split(sep, 1) for line in lines
                 if line.startswith('>>') and line[2:5] != 'END']
print(section_lines)
Prints:
[['name of section a', 'keyword'], ['name section b', 'keyword']]