I want to extract all the text between bullet points numbered 1.1, 1.2, 1.3, etc. Sometimes the bullet numbers contain spaces, like 1. 1, 1. 2, 1 .3, 1 . 4.
Sample text:
text = "some text before pattern 1.1 text_1_here 1.2 text_2_here 1 . 3 text_3_here 1. 4 text_4_here 1 .5 text_5_here 1.10 last_text_here 1.23 text after pattern"
For the text above, the output should be [' text_1_here ', ' text_2_here ', ' text_3_here ', ' text_4_here ', ' text_5_here ', ' last_text_here ']
I tried re.findall, but I am not getting the required output. It extracts the text between 1.1 & 1.2 and between 1.3 & 1.4, but it skips the text between 1.2 & 1.3.
import re
re.findall(r'[0-9].\s?[0-9] (.*?)[0-9].\s?[0-9] ', text)
CodePudding user response:
I'm not sure why you'd want to exclude the last bit of text, but based on your comments it seems we could simply split the entire text on the bullets and drop the first and last elements of the resulting list:
re.split(r'\s+\d(?:\s*\.\s*\d+)+\s+', text)[1:-1]
Which would output:
['text_1_here', 'text_2_here', 'text_3_here', 'text_4_here', 'text_5_here', 'last_text_here']
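For completeness, here is a self-contained sketch of the split approach, plus a findall variant. The original findall attempt most likely skips every other segment because the pattern consumes the closing bullet, and matches cannot overlap, so that bullet is no longer available to open the next match. A lookahead avoids this. The names BULLET, split_on_bullets and find_between_bullets are made up for illustration, and the pattern assumes every bullet starts with a single digit and is surrounded by whitespace, as in the sample text.

```python
import re

text = ("some text before pattern 1.1 text_1_here 1.2 text_2_here 1 . 3 text_3_here "
        "1. 4 text_4_here 1 .5 text_5_here 1.10 last_text_here 1.23 text after pattern")

# One digit followed by one or more ".digits" groups,
# allowing optional whitespace on either side of each dot.
BULLET = r'\d(?:\s*\.\s*\d+)+'


def split_on_bullets(s):
    # Split on whitespace-delimited bullets, then drop the text before
    # the first bullet and the text after the last bullet.
    return re.split(rf'\s+{BULLET}\s+', s)[1:-1]


def find_between_bullets(s):
    # The lookahead checks for the closing bullet without consuming it,
    # so the same bullet can open the next match.
    return re.findall(rf'{BULLET}\s+(.*?)(?=\s+{BULLET}\s)', s)


print(split_on_bullets(text))
print(find_between_bullets(text))
```

Both calls should give the same list as above for the sample text. Note that the text after the final bullet (1.23) is dropped in both cases: the slice removes it in the first function, and the lookahead finds no closing bullet for it in the second.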