How to find a string and the N words before and after it in a list of texts

I have a list containing the text of several documents. I want to search each document for a specific word and extract the 20 words before and after it, then record each match in a data frame. I know I should use a regex, but I don't know how to count the words before and after the match, or how to make the code keep searching through the rest of the text.

CodePudding user response：

You can use the find method and then slice the text. It would give something like this:

to_extract = ""
pos = txt.find(TO_FIND)  # index of the first occurrence, or -1 if absent
if pos != -1:
    # Clamp the start at 0; slicing already clamps the end at len(txt).
    start = max(0, pos - 20)
    to_extract = txt[start:pos + 20]

Note: I didn't test this, but this is the way to go. It only works for the first occurrence of the word, and the window is 20 characters, not 20 words.
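The snippet above slices characters and stops at the first hit. A minimal sketch of a word-based variant that reports every occurrence could look like the following. The function name `word_windows`, the parameter `n`, and the punctuation set being stripped are assumptions made for illustration, not part of the original answer.

```python
def word_windows(text, target, n=20):
    """Return a window of up to n words on each side of every `target`."""
    words = text.split()
    windows = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation so "in," still counts as "in".
        if word.strip('.,;:!?"\'()').lower() == target.lower():
            start = max(0, i - n)  # never slice from a negative index
            windows.append(" ".join(words[start:i + n + 1]))
    return windows


sample = "one two three four five six seven three eight"
print(word_windows(sample, "three", n=2))
```

Splitting on whitespace keeps the counting simple, at the cost of collapsing runs of spaces or newlines into single spaces in the extracted window.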

CodePudding user response：

Here is a solution using a regex. I am matching the word "in" and the 3 words before/after it (only 3 for clarity):

re.findall(r'(?:\b\S+\s*){,3}\bin\b(?:\s*\S+\s*?){,3}', text)

You can test the regex here (I added a capturing group to highlight the searched word)

input text:

text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed non risus. Suspendisse lectus tortor, dignissim sit amet, adipiscing nec, ultricies sed, dolor. Cras elementum ultrices diam. Maecenas ligula massa, varius a, semper congue, euismod non, mi. Proin porttitor, orci nec nonummy molestie, enim est eleifend mi, non fermentum diam nisl sit amet erat. Duis semper. Duis arcu massa, scelerisque vitae, consequat in, pretium a, enim. Pellentesque congue. Ut in risus volutpat libero pharetra tempor. Cras vestibulum bibendum augue. Praesent egestas leo in pede. Praesent blandit odio eu enim. Pellentesque sed dui ut augue blandit sodales. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Aliquam nibh. Mauris ac mauris sed pede pellentesque fermentum. Maecenas adipiscing ante non diam sodales hendrerit.'

output:

['scelerisque vitae, consequat in, pretium a,',
 'Pellentesque congue. Ut in risus volutpat libero',
 'Praesent egestas leo in pede. Praesent blandit',
 'ante ipsum primis in faucibus orci luctus']
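To reuse the same idea for any word and any window size, the pattern can be built from parameters. This is only a sketch: `context_pattern` and its arguments are hypothetical names, and `re.escape` is added on the assumption that the search word may contain regex metacharacters.

```python
import re


def context_pattern(word, n=3):
    """Build the context regex for an arbitrary word and window size."""
    # re.escape keeps characters such as "." or "+" in `word` literal.
    return re.compile(
        r'(?:\b\S+\s*){,%d}\b%s\b(?:\s*\S+\s*?){,%d}' % (n, re.escape(word), n)
    )


sample = "alpha beta gamma delta in epsilon zeta eta theta"
for match in context_pattern("in", n=2).finditer(sample):
    # match.start() gives the offset where the context window begins
    print(match.start(), match.group())
```

Using `finditer` instead of `findall` gives access to the match position as well as the text, which helps if the location of each hit also needs to be recorded.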

matching several words:

Example with sed/orci:

re.findall(r'(?:\b\S+\s*){,3}\b(?:sed|orci)\b(?:\s*\S+\s*?){,3}', text)

output:

['adipiscing nec, ultricies sed, dolor. Cras',
 'mi. Proin porttitor, orci nec nonummy molestie,',
 'eu enim. Pellentesque sed dui ut augue',
 'primis in faucibus orci luctus et ultrices',
 'Mauris ac mauris sed pede pellentesque fermentum.']
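Since the question starts from a list of documents and wants one record per finding, the same pattern can be run over each text in turn. The sketch below assumes that layout; `collect_matches`, `doc_id`, and `context` are hypothetical names chosen for illustration.

```python
import re

# Same sed/orci pattern as above, compiled once and reused for every document.
PATTERN = re.compile(r'(?:\b\S+\s*){,3}\b(?:sed|orci)\b(?:\s*\S+\s*?){,3}')


def collect_matches(documents):
    """Return one row (a dict) per match across a list of document texts."""
    rows = []
    for doc_id, doc in enumerate(documents):
        for match in PATTERN.finditer(doc):
            rows.append({"doc_id": doc_id, "context": match.group()})
    return rows


docs = ["a b c d sed e f g h", "no match here", "x orci y"]
rows = collect_matches(docs)
print(rows)
```

If pandas is installed, passing the list to `pandas.DataFrame(rows)` should give a data frame with `doc_id` and `context` columns.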
