I have a text file that I want to extract data from using Python. An example of the file is as follows:
123
text I want to extract is here.
456
I also need this part
789
and this
Now, if I use a regular expression as follows:
re.match(r'^(\d){3}$', text)
I can get the numbers. However, I want the text between the numbers. I know I can use re.split, but if I call re.split with the same expression, i.e.
re.split(r'^(\d){3}$', text)
it will split it as follows:
['123', 'text I want to extract is here. 456 I also need this part 789 and this']
The outcome I want instead is this:
['text I want to extract is here.','I also need this part', 'and this']
Any advice on how to achieve this?
Thanks!
CodePudding user response:
Here is one way to do it:
import re
txt="""123
text I want to extract is here.
456
I also need this part
789
and this"""
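# \D* matches each run of non-digit characters (newlines included);
# it also yields empty matches at every digit, which are filtered out below
s=re.findall(r'(\D*)', txt)
print([i.strip() for i in s if i != ""])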
Result
['text I want to extract is here.', 'I also need this part', 'and this']
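Since the question mentions re.split, here is a rough sketch of that route as well. It assumes two changes to the pattern from the question: the re.M flag, so that ^ and $ anchor at every line rather than only at the start and end of the whole string, and no capture group, so the digits themselves are not returned. The names parts and result are only illustrative.

```python
import re

txt = """123
text I want to extract is here.
456
I also need this part
789
and this"""

# With re.M, ^ and $ match at the start and end of each line,
# so every line made of exactly three digits acts as a separator.
parts = re.split(r'^\d{3}$', txt, flags=re.M)

# Drop the empty piece before the first separator and trim the newlines.
result = [p.strip() for p in parts if p.strip()]
print(result)
```

Unlike the \D* approach, this should keep text lines intact even if they contain digits, as long as no text line consists of exactly three digits.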
CodePudding user response:
If you want the line after the 3 digits, and that line should not start with a digit, you can use a capture group:
^\d{3}\n([^\d\n].*)
The pattern matches:
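^
Start of string (or start of each line when re.M is used, as in the example below)
\d{3}\n
Match 3 digits and a newline
(
Capture group 1
[^\d\n].*
Match a char other than a newline or digit, followed by the rest of the line
)
Close group 1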
Example
import re
pattern = r"^\d{3}\n([^\d\n].*)"
s = ("123\n"
"text I want to extract is here. \n"
"456\n"
"I also need this part\n"
"789\n"
"and this ")
result = [x.strip() for x in re.findall(pattern, s, re.M)]
print(result)
Output
['text I want to extract is here.', 'I also need this part', 'and this']
If you just need the next line:
^\d{3}\n(.+)
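A minimal sketch of this simpler variant, written with (.+) as the capture group and reusing the sample string from the example above. It assumes every 3-digit line is followed by a non-empty line; the variable name next_lines is only illustrative.

```python
import re

s = ("123\n"
     "text I want to extract is here. \n"
     "456\n"
     "I also need this part\n"
     "789\n"
     "and this ")

# (.+) captures the whole line that follows each 3-digit line,
# whatever character that line starts with.
next_lines = [x.strip() for x in re.findall(r"^\d{3}\n(.+)", s, re.M)]
print(next_lines)
```

The difference from the first pattern is that a following line starting with a digit would also be captured here.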