How to properly extract blocks of data from a file using my RegEx string?

Introduction

I am trying to use a RegEx to parse information that is structured like this:

1. Data
  A. Data sub 1
  B. Data sub 2
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4  
3. Data
  A. Data sub 1

Each piece of information is on its own line. I could go line by line, but I believe a single RegEx should be enough to solve this.

Intention

I would like to extract it block by block, where a block would be:

1. Data
  A. Data sub 1
  B. Data sub 2

My attempt

I noticed that there is a "pattern" in this data and thought that I could extract it using the following RegEx string:

(?s)(?=1.)(.*?)(?=(2. ))

This successfully extracts a block, but if the block itself contains a number that matches the end of the expression, the extracted block is incomplete and corrupts the output file.
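To make the failure concrete, here is a minimal sketch in Python. The sample line "A. Item 25 units" is a hypothetical value, not part of the real data; it is chosen only because it contains a "2" followed by one character and a space. Because the dot is unescaped, `2. ` in the lookahead matches "25 " just as well as the literal "2. ", so the lazy match stops early:

```python
import re

# Hypothetical input: the sub-item "25 units" contains text that the
# unescaped lookahead (?=(2. )) also accepts as the end of the block.
text = "1. Data\n  A. Item 25 units\n  B. Data sub 2\n2. Data"

# The pattern from the question, unchanged.
attempt = r"(?s)(?=1.)(.*?)(?=(2. ))"

truncated = re.search(attempt, text).group(1)
print(repr(truncated))  # '1. Data\n  A. Item '
```

The block is cut off in the middle of the first sub-item, which is the kind of incomplete output described above.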

What I expect

I would like to extract the data blocks without being interrupted by a string or char found between the defined start and end.

CodePudding user response:

I would use re.split here, splitting on each newline that is followed by a number and a period (\d+\.):

text = '''1. Data
  A. Data sub 1
  B. Data sub 2
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4  
3. Data
  A. Data sub 1'''

import re

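# Split on a newline (and any whitespace just before it) that is followed by
# digits and a period; the lookahead keeps the number inside the next block.
blocks = re.split(r'\s*\n(?=\d+\.)', text)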

output:

['1. Data\n  A. Data sub 1\n  B. Data sub 2',
 '2. Data\n  A. Data sub 1\n  B. Data sub 2\n  C. Data sub 3\n  D. Data sub 4',
 '3. Data\n  A. Data sub 1']
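As a rough illustration of why the lookahead matters, the sketch below compares a split that consumes the number with one that only looks ahead at it. The short sample string and the variable names are assumptions made for this example:

```python
import re

sample = "1. A\n  x\n2. B\n  y"

# Without a lookahead the "2." is part of the separator and is thrown away.
consumed = re.split(r"\n\d+\.", sample)

# With a lookahead only the newline is consumed, so "2." stays in its block.
kept = re.split(r"\n(?=\d+\.)", sample)

print(consumed)  # ['1. A\n  x', ' B\n  y']
print(kept)      # ['1. A\n  x', '2. B\n  y']
```

Since the split only happens where a newline is immediately followed by digits and a period, numbers appearing in the middle of a line, or on indented sub-item lines, do not end a block.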

In a loop:

for block in re.split(r'\s*\n(?=\d+\.)', text):
    print('--- NEW BLOCK ---')
    print(block)

output:

--- NEW BLOCK ---
1. Data
  A. Data sub 1
  B. Data sub 2
--- NEW BLOCK ---
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4
--- NEW BLOCK ---
3. Data
  A. Data sub 1
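
Since the question is about data stored in a file, the same idea could be wrapped in a small helper. This is only a sketch: the name read_blocks, the UTF-8 encoding, and the temporary data.txt file with its contents are assumptions made for the demonstration, and it presumes that top-level entries always start at the beginning of a line with digits followed by a period:

```python
import re
import tempfile
from pathlib import Path

# Assumed block boundary: a newline (plus any whitespace just before it)
# that is followed by digits and a period at the start of the next line.
BLOCK_BOUNDARY = re.compile(r"\s*\n(?=\d+\.)")


def read_blocks(path):
    """Hypothetical helper: read a whole file and return its numbered blocks."""
    text = Path(path).read_text(encoding="utf-8").strip()
    return BLOCK_BOUNDARY.split(text) if text else []


# Demonstration with a throwaway file; the contents are made up and include
# a sub-item ("25 units") that tripped up the pattern from the question.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "data.txt"
    sample_path.write_text(
        "1. Data\n  A. Item 25 units\n  B. Data sub 2\n2. Data\n  A. Data sub 1\n",
        encoding="utf-8",
    )
    file_blocks = read_blocks(sample_path)

for block in file_blocks:
    print("--- NEW BLOCK ---")
    print(block)
```

Reading the whole file at once keeps the code short; for very large files, a line-by-line loop that starts a new block whenever a line begins with digits and a period would avoid holding everything in memory.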