Python extract text between regex matches

Time:06-30

I have a text file that I want to extract data from using Python. An example of the file is as follows:

123
text I want to extract is here. 
456
I also need this part
789
and this 

Now, if I use a regular expression as follows:

re.match(r'^(\d){3}$', text)

I can get the numbers. However, I want the text between the numbers. I know I can use re.split, but if I call re.split with the same expression, i.e.

re.split(r'^(\d){3}$', text)

it will split it as follows:

['123', 'text I want to extract is here. 456 I also need this part 789 and this']

The outcome I want instead is this:

['text I want to extract is here.','I also need this part', 'and this']

Any advice on how to achieve this?

Thanks!

CodePudding user response:

Here is one way to do it:

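import re

txt = """123
text I want to extract is here. 
456
I also need this part
789
and this"""

# \D* grabs each run of non-digit characters (newlines included).
# findall also returns empty matches, which are filtered out below.
# Note: this assumes the wanted text itself contains no digits.
s = re.findall(r'(\D*)', txt)
result = [i.strip() for i in s if i != ""]
print(result)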

Result

['text I want to extract is here.', 'I also need this part', 'and this']
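
The question mentions re.split, so here is a minimal sketch of how that route might work. It assumes every separator line holds exactly three digits and nothing else. The capture group is dropped so the numbers are not returned, and re.M is passed so that ^ and $ match at line boundaries. Unlike the \D* approach, this should also tolerate digits inside the text lines. The names parts and chunks are only illustrative.

```python
import re

text = """123
text I want to extract is here. 
456
I also need this part
789
and this 
"""

# Split on lines that consist of exactly three digits.
# No capture group, so the separators are not included in the result.
# re.M lets ^ and $ match at the start and end of each line.
parts = re.split(r'^\d{3}$', text, flags=re.M)

# Drop empty pieces (e.g. the one before the first number) and trim whitespace.
chunks = [p.strip() for p in parts if p.strip()]
print(chunks)
```

If the text between two numbers spans several lines, each chunk keeps those lines together, newlines included.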

CodePudding user response:

If you want the line after the 3 digits, and that line should not start with a digit, you can use a capture group:

^\d{3}\n([^\d\n].*)

The pattern matches:

  • ^ Start of the line (with the re.M flag used in the example; otherwise start of the string)
  • \d{3}\n Match 3 digits and a newline
  • ( Capture group 1
    • [^\d\n].* Match a single char other than a digit or newline, then the rest of the line
  • ) Close group 1


Example

import re

pattern = r"^\d{3}\n([^\d\n].*)"

s = ("123\n"
            "text I want to extract is here. \n"
            "456\n"
            "I also need this part\n"
            "789\n"
            "and this ")

result = [x.strip() for x in re.findall(pattern, s, re.M)]
print(result)

Output

['text I want to extract is here.', 'I also need this part', 'and this']

If you just need the next line:

^\d{3}\n(.+)
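
A short sketch of this simpler variant, assuming the capture group is meant to be (.+), i.e. one or more characters up to the end of the line. The sample string is made up for illustration; its second text line starts with a digit, which a pattern that excludes a leading digit would skip.

```python
import re

# Hypothetical sample: the second text line starts with a digit.
sample = "123\nfirst line\n456\n2nd line starts with a digit\n"

# (.+) captures the whole line after each three-digit line,
# regardless of its first character. re.M makes ^ match at every line start.
next_lines = re.findall(r"^\d{3}\n(.+)", sample, re.M)
print(next_lines)
```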
