How to slice text preceding a list of re.findall results?

Time:07-31

Text:

some text some text Jack is the CEO. some text some text John DOE is the CEO. 

Function to find every occurrence of 'is the CEO' in the text:

def get_ceo(text):
   results = re.findall(r"is the CEO", text)
   for i in results:
       range = text[i-15:i]
       print(range)

With get_ceo, I want to extract each findall match together with the 15 characters of text that precede it. The number 15 is arbitrary; I'll then run NLP entity extraction on the range returned for each result.

Desired output: ['some text Jack is the CEO',' text John DOE is the CEO']

Here is the error I'm getting with the function:

  line 62, in <module>
    print(get_ceo(text))
  line 50, in get_ceo
    range = text[i-15:i]
TypeError: unsupported operand type(s) for -: 'str' and 'int'

Do I need to convert the result of the findall function into a different type or change the approach completely?

CodePudding user response:

Use finditer instead of findall. findall returns the matched strings themselves, which is why text[i-15:i] raises a TypeError: i is a str, not an index. finditer yields match objects, whose start() and end() methods give the position of each match in the text.

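import re

def get_ceo(text):
    # finditer yields match objects, which expose the position of each match
    for match in re.finditer(r"is the CEO", text):
        # clamp at 0 so a match near the start does not produce a negative index
        start = max(0, match.start() - 15)
        print(text[start:match.end()])

text = "some text some text Jack is the CEO. some text some text John DOE is the CEO."
get_ceo(text)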

output:

some text Jack is the CEO
 text John DOE is the CEO
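
The desired output in the question is a list rather than printed lines, so here is a minimal sketch of the same finditer idea that collects the slices and returns them. The function name ceo_contexts and the width parameter are made up for illustration; they are not part of the original answer.

```python
import re

def ceo_contexts(text, width=15):
    """Return each 'is the CEO' match with up to `width` preceding characters."""
    contexts = []
    for match in re.finditer(r"is the CEO", text):
        # Clamp at 0: a negative start would be read as an offset from the end.
        start = max(0, match.start() - width)
        contexts.append(text[start:match.end()])
    return contexts

text = "some text some text Jack is the CEO. some text some text John DOE is the CEO."
print(ceo_contexts(text))               # ['some text Jack is the CEO', ' text John DOE is the CEO']
print(ceo_contexts("She is the CEO."))  # ['She is the CEO']
```

The second call shows why the clamp matters: the match starts at index 4, so 4 - 15 is negative, and an unclamped slice would drop the leading "She ".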

CodePudding user response:

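Instead of matching 'is the CEO' itself, use it as a lookahead and match up to 15 characters before it. The lookahead is not part of the match, so the results contain only the preceding text, not the phrase.

import re

def get_ceo(text):
    results = re.findall(r'.{1,15}(?=is the CEO)', text)
    print(results)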

prints:

['some text Jack ', ' text John DOE ']
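
If the phrase should be part of each result, as in the desired output, one option is to re-attach it to every prefix. The sketch below does that and also shows an edge case of the {1,15} quantifier. The names PHRASE, prefixes and with_phrase are illustrative only.

```python
import re

PHRASE = "is the CEO"
text = "some text some text Jack is the CEO. some text some text John DOE is the CEO."

# The lookahead is zero-width, so findall returns only the preceding characters.
prefixes = re.findall(r".{1,15}(?=is the CEO)", text)

# Re-attach the phrase to get the strings listed in the desired output.
with_phrase = [prefix + PHRASE for prefix in prefixes]
print(with_phrase)  # ['some text Jack is the CEO', ' text John DOE is the CEO']

# {1,15} needs at least one preceding character, so a phrase that
# opens the text is not reported at all.
print(re.findall(r".{1,15}(?=is the CEO)", "is the CEO there?"))  # []
```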

CodePudding user response:

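Try a lazy quantifier in front of the phrase, so that each match contains up to 15 preceding characters plus the phrase itself: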

import re

def get_ceo(text):
    results = re.findall(r'.{0,15}?is the CEO', text)
    print(results)

l = ['some text some text Jack is the CEO. some text some text John DOE is the CEO.', 
     'a is the CEO b is the CEO', 
     'is the CEO there?']

for el in l:
    get_ceo(el)

Prints:

['some text Jack is the CEO', ' text John DOE is the CEO']
['a is the CEO', ' b is the CEO']
['is the CEO']
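
Since the question calls the 15-character window arbitrary, the same lazy-quantifier pattern can be built from parameters. This is only a sketch: context_before, phrase and width are hypothetical names, and re.escape is used so that the phrase is always treated as literal text.

```python
import re

def context_before(text, phrase, width=15):
    """Return each occurrence of `phrase` with up to `width` characters before it."""
    # re.escape keeps any regex metacharacters in the phrase literal.
    pattern = re.compile(r".{0,%d}?%s" % (width, re.escape(phrase)))
    return pattern.findall(text)

text = "some text some text Jack is the CEO. some text some text John DOE is the CEO."
print(context_before(text, "is the CEO"))           # ['some text Jack is the CEO', ' text John DOE is the CEO']
print(context_before(text, "is the CEO", width=5))  # ['Jack is the CEO', ' DOE is the CEO']
```

Note that . does not match a newline by default, so for multi-line text the window stops at the start of the line unless re.DOTALL is passed to re.compile.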