Python regex get multiple lines after specific word

The string is stored in a variable text. When I do print(text) I get the output:

SHIP TO
Flensburg House, MMDA Colony,
Arumbakkam,Chennai, Tamil Nadu,

I need to get the text:

Flensburg House, MMDA Colony,
Arumbakkam,Chennai, Tamil Nadu,

Here's what I have tried:

shipto = []
shipto_re = re.compile(r"SHIP TO((?:.*\n){1,3})")
for line in text.split():
    if shipto_re.match(line):
        shipto.append(line)

However, this isn't giving me a match. I know the regex works, so the problem must lie in how I iterate through the text variable.

CodePudding user response：

You are using a regex that matches across lines, but you split the string on whitespace and test each resulting "token" against the regex separately, so no token ever contains the whole multi-line span.
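To make the failure concrete, here is a minimal sketch that reuses the question's text and pattern (the variable names tokens, matched_tokens and whole_match are only illustrative). It shows what text.split() actually produces and that none of those pieces can match, while searching the whole string does:

```python
import re

text = "SHIP TO\nFlensburg House, MMDA Colony,\nArumbakkam,Chennai, Tamil Nadu,\n"
shipto_re = re.compile(r"SHIP TO((?:.*\n){1,3})")

# str.split() with no arguments splits on every run of whitespace,
# so even "SHIP TO" is broken into two separate tokens.
tokens = text.split()
print(tokens[:4])
# ['SHIP', 'TO', 'Flensburg', 'House,']

# No single token contains "SHIP TO" followed by a newline, so nothing matches.
matched_tokens = [tok for tok in tokens if shipto_re.match(tok)]
print(matched_tokens)
# []

# Applying the pattern to the whole string does find the block.
whole_match = shipto_re.search(text)
print(whole_match is not None)
# True
```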

You need to use

import re
text = r'''SHIP TO
Flensburg House, MMDA Colony,
Arumbakkam,Chennai, Tamil Nadu,
'''
shipto_re=re.compile(r"SHIP TO((?:.*\n){1,3})")
shipto = [x.strip() for x in shipto_re.findall(text)]
print(shipto)
# => ['Flensburg House, MMDA Colony,\nArumbakkam,Chennai, Tamil Nadu,']

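Here, Pattern.findall is used to extract the Group 1 value from each match (when a pattern has exactly one capturing group, findall returns that group's text rather than the whole match), and each value is stripped of any leading and trailing whitespace with str.strip().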
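As a small sketch of what findall returns before stripping, and of one possible way to get the address as separate lines (address_lines is a name introduced here for illustration, not part of the original answer):

```python
import re

text = "SHIP TO\nFlensburg House, MMDA Colony,\nArumbakkam,Chennai, Tamil Nadu,\n"
shipto_re = re.compile(r"SHIP TO((?:.*\n){1,3})")

# With one capturing group, findall returns a list of Group 1 strings.
# The first repetition of (?:.*\n) consumes the empty rest of the "SHIP TO" line,
# which is why the captured text starts with a newline.
raw = shipto_re.findall(text)
print(raw)
# ['\nFlensburg House, MMDA Colony,\nArumbakkam,Chennai, Tamil Nadu,\n']

# Strip the surrounding whitespace, then split into individual lines.
address_lines = raw[0].strip().splitlines()
print(address_lines)
# ['Flensburg House, MMDA Colony,', 'Arumbakkam,Chennai, Tamil Nadu,']
```

Because the first repetition is spent on the remainder of the SHIP TO line, {1,3} yields at most two address lines when the address starts on the following line.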

More considerations

If you plan to match a line even if it is at the end of a string, you need to replace the regex with

shipto_re=re.compile(r"SHIP TO(.*(?:\n.*){0,2})")

The SHIP TO(.*(?:\n.*){0,2}) pattern matches SHIP TO and then captures into Group 1 any text up to the end of the current line, followed by zero, one or two sequences of a newline (LF) char and the rest of that line. Since no trailing newline is required, the last line is captured even when it ends the string.
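The difference only shows up when the input has no trailing newline. A minimal comparison, assuming the same address text with the final newline removed (strict and lenient are illustrative names):

```python
import re

# Same text as in the question, but without a newline after the last line.
text_no_nl = "SHIP TO\nFlensburg House, MMDA Colony,\nArumbakkam,Chennai, Tamil Nadu,"

strict = re.compile(r"SHIP TO((?:.*\n){1,3})")      # every captured line must end in \n
lenient = re.compile(r"SHIP TO(.*(?:\n.*){0,2})")   # the newline precedes each extra line

strict_result = [x.strip() for x in strict.findall(text_no_nl)]
lenient_result = [x.strip() for x in lenient.findall(text_no_nl)]

print(strict_result)
# ['Flensburg House, MMDA Colony,']
print(lenient_result)
# ['Flensburg House, MMDA Colony,\nArumbakkam,Chennai, Tamil Nadu,']
```

The first pattern silently drops the last address line because that line is not followed by a newline; the second keeps it.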

CodePudding user response：

Here you go... sample code ->

import re
regex = r"SHIP TO(.*)"
test_str = ("SHIP TO\n"
    "Flensburg House, MMDA Colony,\n"
    "Arumbakkam,Chennai, Tamil Nadu,")
matches = re.finditer(regex, test_str, re.DOTALL)
for matchNum, match in enumerate(matches, start=1):
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        lines = match.group(groupNum).strip().split("\n")
        print(lines)

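The key point is that you have to use the re.DOTALL flag, so that . also matches newline characters. Note that (.*) then captures everything from SHIP TO to the end of the string, not just the next few lines.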
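One caveat worth illustrating: if more text follows the address, re.DOTALL with (.*) will capture that too. The sketch below assumes a hypothetical input where a BILL TO section follows (that section is invented for the example and does not appear in the question), and contrasts it with a pattern that limits the number of lines:

```python
import re

# Hypothetical input: the "BILL TO" block is made up to show the effect.
text = ("SHIP TO\n"
        "Flensburg House, MMDA Colony,\n"
        "Arumbakkam,Chennai, Tamil Nadu,\n"
        "BILL TO\n"
        "Some Other Address\n")

# With DOTALL, .* runs to the end of the string.
greedy = re.search(r"SHIP TO(.*)", text, re.DOTALL).group(1).strip().split("\n")
print(greedy)
# ['Flensburg House, MMDA Colony,', 'Arumbakkam,Chennai, Tamil Nadu,', 'BILL TO', 'Some Other Address']

# Without DOTALL, an explicit repetition count bounds how many lines are taken.
bounded = re.search(r"SHIP TO(.*(?:\n.*){0,2})", text).group(1).strip().split("\n")
print(bounded)
# ['Flensburg House, MMDA Colony,', 'Arumbakkam,Chennai, Tamil Nadu,']
```

If the address length varies, the DOTALL approach may still be the simpler choice, provided the input ends where the address ends.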

CodePudding user response：

Your regex is correct.

I believe your issue is the use of text.split(), which by default splits on any whitespace, meaning it tries to match word by word.

Instead, simply use findall on the whole string.

import re

text="""SHIP TO
Flensburg House, MMDA Colony,
Arumbakkam,Chennai, Tamil Nadu,
"""
shipto_re = re.compile(r"SHIP TO((?:.*\n){1,3})")
shipto = shipto_re.findall(text)

print(shipto)

fiddle: https://www.mycompiler.io/view/CgUoXrxqKxK