Python: Extract string from text

Time:10-30

I have a text in the following format:

>>name of section a     keyword
#a  
some text
some text
some text
>>END_SECTION
>>name  section b     keyword
#a
some text
some text
some text
>>END_SECTION
continues...

The 'keyword' can be either pass, fail or warn.

I want to write code that produces the following output:

name of section  keyword

The problem is that I am very new to Python and don't know how to extract a string from a text file when the end marker (the keyword) can be any of three different words.

I tried using the # sign as the end marker for the string, but it does not work. The code I tried is:

class get_word(object):
   
  def get_sentences(self, name_section):
      with open(filename) as file_content:
        file_content.read().splitlines()
        for line in file_content:
            if name_section in line:
               start_line = file_content.index(line)
               end_line = file_content[start_line:].index('#')
               data = file_content[start_line:start_line + end_line]
               return data 

I imported this code into a different script and wrote the following:

import get_word

for data in f.get_sentences('name_section_a'):
    print(data)

But it gives the following error:

ValueError: False is not in list

Is there a better way to do this? Could it be possible to use re.match() for example?

Any help would be greatly appreciated!

CodePudding user response:

I don't understand why your code searches for #. That has nothing to do with returning the ">>name of section  keyword" line. To return that line, match on the ">>" prefix instead:

class get_word(object):
   
    def get_sentences(self, name_section):
        prefix = ">>" + name_section
        with open(filename) as file_content:
            for line in file_content:
                if line.startswith(prefix):
                    return line.strip()

This doesn't return a list, it just returns a single line, so there's no need for a loop in the caller.
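
A self-contained variant of the same idea, as a sketch only: the function name `find_section_header` and the explicit `path` parameter are illustrative and not from the question. Unlike the method above, it strips the leading ">>" and returns None when no section matches.

```python
import os
import tempfile


def find_section_header(path, name_section):
    """Return the first '>>' header that starts with name_section, or None."""
    prefix = ">>" + name_section
    with open(path) as file_content:
        for line in file_content:
            if line.startswith(prefix):
                # Drop the leading '>>' and the trailing newline.
                return line[2:].strip()
    return None


# Demo on a temporary file holding a shortened copy of the sample text.
sample = """>>name of section a     keyword
#a
some text
>>END_SECTION
>>name  section b     keyword
#a
some text
>>END_SECTION
"""
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

header = find_section_header(tmp.name, "name of section a")
missing = find_section_header(tmp.name, "no such section")
os.remove(tmp.name)

print(header)
print(missing)
```

Note that the argument has to match the text in the file exactly, so 'name_section_a' (with underscores) would not match the sample header.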

CodePudding user response:

The exact output you expect is unclear, but you can use a regex to extract the field:

import re

re.findall('(?<=^>>)(?!END_SECTION)(.*)', text, re.M)

Output:

['name of section a     keyword', 'name  section b     keyword']

If you want to separate the keyword and drop the trailing section letter:

re.findall(r'(?<=^>>)(?!END_SECTION)(.*)\s+\w+\s+(\w+)', text, re.M)

Output:

[('name of section', 'keyword'), ('name  section', 'keyword')]

Input:

text = '''>>name of section a     keyword
#a  
some text
some text
some text
>>END_SECTION
>>name  section b     keyword
#a
some text
some text
some text
>>END_SECTION'''
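
Since the question says the keyword is always pass, fail or warn, the pattern can also name those three words directly. The following is a minimal sketch under that assumption; the sample text swaps the placeholder 'keyword' for concrete values, and the third section is made up for illustration.

```python
import re

# Sample text with concrete keywords (assumed values, not from the question).
text = """>>name of section a     pass
#a
some text
>>END_SECTION
>>name  section b     warn
#a
some text
>>END_SECTION
>>name of section c     fail
#a
some text
>>END_SECTION"""

# Group 1: section name (lazy, so the separating blanks are left to [ \t]+).
# Group 2: one of the three allowed keywords at the end of the line.
pattern = re.compile(r"^>>(.+?)[ \t]+(pass|fail|warn)[ \t]*$", re.MULTILINE)

results = pattern.findall(text)
print(results)
# [('name of section a', 'pass'), ('name  section b', 'warn'), ('name of section c', 'fail')]
```

With this pattern no lookahead is needed, because >>END_SECTION does not end in one of the keywords and is skipped automatically.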

CodePudding user response:

If you want to use regex for this, it is certainly possible, though you'll need something like a negative lookahead (?!...) to skip lines such as >>END_SECTION.

The following regex should capture both the section names and keywords in the section start lines:

^>>(?!END_SECTION)(.*)[ ]{5}(.+)$

You can try it out on Regex Demo as well. Note that the first captured group is the section name, and the second is the keyword matched at the end of the line.

Here's a Python example you can test with. Note that I'm using StringIO, which provides a file-like object.

import re
from io import StringIO

file_contents = StringIO("""
>>name of section a     keyword
#a
some text
some text
some text
>>END_SECTION
>>name  section b     keyword
#a
some text
some text
some text
>>END_SECTION
continues...
""")

string = file_contents.read()

pattern = re.compile(r'^>>(?!END_SECTION)(.*)[ ]{5}(.+)$', flags=re.MULTILINE)

section_names = pattern.finditer(string)
for section in section_names:
    # section is a Match object, we can access attributes like the matched
    # groups from the object.
    print(section.groups())

Output:

('name of section a', 'keyword')
('name  section b', 'keyword')

If you go with a non-regex solution, the following approach should also work:

file_contents.seek(0)  # rewind, in case the buffer was already read
string = file_contents.read()
lines = string.strip().split('\n')

# separator between section name and keyword
sep = '     '

section_lines = [line.lstrip('>').split(sep, 1) for line in lines
                 if line.startswith('>>') and line[2:5] != 'END']

print(section_lines)

Prints:

[['name of section a', 'keyword'], ['name  section b', 'keyword']]
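
If the number of spaces before the keyword is not always exactly five, splitting on the last run of whitespace may be more forgiving. This is a rough sketch, not a drop-in replacement: `parse_headers` and `KEYWORDS` are hypothetical names, and the sample lines use assumed keyword values.

```python
KEYWORDS = {"pass", "fail", "warn"}


def parse_headers(lines):
    """Collect (section name, keyword) pairs from '>>' header lines."""
    results = []
    for line in lines:
        # Only section start lines are of interest.
        if not line.startswith(">>") or line.startswith(">>END_SECTION"):
            continue
        # Split once from the right on any run of whitespace.
        parts = line[2:].rsplit(None, 1)
        if len(parts) == 2 and parts[1] in KEYWORDS:
            results.append((parts[0], parts[1]))
    return results


sample_lines = [
    ">>name of section a     pass",
    "#a",
    "some text",
    ">>END_SECTION",
    ">>name  section b  fail",
    "#a",
    "some text",
    ">>END_SECTION",
]

parsed = parse_headers(sample_lines)
print(parsed)
# [('name of section a', 'pass'), ('name  section b', 'fail')]
```

Because an open file yields its lines one at a time, the same function should also accept a file object directly, without reading the whole file into memory first.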