I have a text file whose contents follow a set of rules. Here is a snippet of the file:
<class 'NXOpen.Features.FeatureCollection'>
Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(0)
Parents:
Children:
Name: , JournalIdentifier: SKETCH(1:1B)
Expressions:
Entities:
Name: , JournalIdentifier: HANDLE R-849
Name: , JournalIdentifier: HANDLE R-850
Name: , JournalIdentifier: DATUM_CSYS(0) YZ plane
Name: , JournalIdentifier: DATUM_CSYS(0) XZ plane
Name: , JournalIdentifier: DATUM_CSYS(0) XY plane
Name: , JournalIdentifier: DATUM_CSYS(0) X axis
Name: , JournalIdentifier: DATUM_CSYS(0) Y axis
Name: , JournalIdentifier: DATUM_CSYS(0) Z axis
Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(1)inf
Parents:
Name: , JournalIdentifier: DATUM_CSYS(0)
Children:
Name: , JournalIdentifier: SKETCH(1)
Expressions:
Entities:
Name: , JournalIdentifier: HANDLE R-4283
Name: , JournalIdentifier: HANDLE R-4284
Name: , JournalIdentifier: SKETCH(1:1B) YZ plane
Name: , JournalIdentifier: SKETCH(1:1B) XZ plane
Name: , JournalIdentifier: SKETCH(1:1B) XY plane
Name: , JournalIdentifier: SKETCH(1:1B) X axis
Name: , JournalIdentifier: SKETCH(1:1B) Y axis
Name: , JournalIdentifier: SKETCH(1:1B) Z axis
I want to use re to extract all the text between two Type: tags. For example, I want to extract this:
Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(0)
Parents:
Children:
Name: , JournalIdentifier: SKETCH(1:1B)
Expressions:
Entities:
Name: , JournalIdentifier: HANDLE R-849
Name: , JournalIdentifier: HANDLE R-850
Name: , JournalIdentifier: DATUM_CSYS(0) YZ plane
Name: , JournalIdentifier: DATUM_CSYS(0) XZ plane
Name: , JournalIdentifier: DATUM_CSYS(0) XY plane
Name: , JournalIdentifier: DATUM_CSYS(0) X axis
Name: , JournalIdentifier: DATUM_CSYS(0) Y axis
Name: , JournalIdentifier: DATUM_CSYS(0) Z axis
and this:
Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(1)inf
Parents:
Name: , JournalIdentifier: DATUM_CSYS(0)
Children:
Name: , JournalIdentifier: SKETCH(1)
Expressions:
Entities:
Name: , JournalIdentifier: HANDLE R-4283
Name: , JournalIdentifier: HANDLE R-4284
Name: , JournalIdentifier: SKETCH(1:1B) YZ plane
Name: , JournalIdentifier: SKETCH(1:1B) XZ plane
Name: , JournalIdentifier: SKETCH(1:1B) XY plane
Name: , JournalIdentifier: SKETCH(1:1B) X axis
Name: , JournalIdentifier: SKETCH(1:1B) Y axis
Name: , JournalIdentifier: SKETCH(1:1B) Z axis
using regular expressions. I have been trying this in Python:
re.findall('Type: [\w\s] )Type:', string)
but this gives me an empty list. What is the correct regular expression to achieve this?
Thank you.
CodePudding user response:
In your pattern Type: [\w\s] )Type: there is an unmatched closing parenthesis ).
The problem with matching Type: twice in the pattern is that the second Type: is consumed by the match, so the next match can no longer start at that same Type:.
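To make the consumption problem concrete, here is a minimal sketch. The three-line string and both patterns are illustrative assumptions, not the asker's exact code: the first pattern consumes the closing Type:, the second only peeks at it with a lookahead.

```python
import re

# Tiny stand-in for the file: three blocks, each starting with "Type: " (assumed data).
text = "Type: a\nType: b\nType: c\n"

# Consuming the closing "Type:" makes it unavailable as the start of the next match.
consuming = re.findall(r"Type: (.*?)Type:", text, re.S)

# A lookahead checks for the next "Type:" (or the end of the string) without consuming it.
lookahead = re.findall(r"Type: (.*?)(?=Type:|\Z)", text, re.S)

print(consuming)  # ['a\n']
print(lookahead)  # ['a\n', 'b\n', 'c\n']
```

The consuming version finds only the first block, because the search resumes after the second Type: and never finds a closing one for the third.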
You can use a pattern to match Type: followed by all lines that do not start with it.
^Type: .*(?:\n(?!Type: ).*)*
With re.findall, the call would be:
re.findall(r"^Type: .*(?:\n(?!Type: ).*)*", string, re.M)
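As a rough, self-contained check of that call, the sketch below runs it on a shortened copy of the snippet from the question. The variable names sample, BLOCK_RE and blocks are made up for illustration.

```python
import re

# Shortened sample in the same layout as the file in the question (assumed data).
sample = """<class 'NXOpen.Features.FeatureCollection'>
Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(0)
Parents:
Children:
Name: , JournalIdentifier: SKETCH(1:1B)
Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(1)inf
Parents:
Name: , JournalIdentifier: DATUM_CSYS(0)
Children:
Name: , JournalIdentifier: SKETCH(1)
"""

# ^Type: .*             a line that starts with "Type: "
# (?:\n(?!Type: ).*)*   every following line that does not start with "Type: "
BLOCK_RE = re.compile(r"^Type: .*(?:\n(?!Type: ).*)*", re.M)

blocks = BLOCK_RE.findall(sample)
for i, block in enumerate(blocks):
    print(f"--- block {i} ---")
    print(block.rstrip("\n"))
```

The header line before the first Type: is skipped, and each list element is one complete block. If the text ends with a newline, the last block keeps it, which is why the loop strips trailing newlines before printing.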
CodePudding user response:
Try
"(?ms)^Type:(?:(?!^Type:).)*"
Here (?ms) turns on the re.MULTILINE and re.DOTALL modes, so ^ matches at the start of each line and . matches any character, including newlines.
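A small sketch, assuming a made-up two-block string, that runs this pattern next to the line-based one from the other answer. On this input the two agree once trailing newlines are stripped. The names sample, tempered and line_based are illustrative only.

```python
import re

# Made-up input: one header line and two "Type:" blocks (assumed data).
sample = (
    "<class 'NXOpen.Features.FeatureCollection'>\n"
    "Type: first FeatureName: A\n"
    "Parents:\n"
    "Children:\n"
    "Type: second FeatureName: B\n"
    "Parents:\n"
    "Name: , JournalIdentifier: DATUM_CSYS(0)\n"
)

# Character-by-character scan that stops just before the next line-initial "Type:".
tempered = re.findall(r"(?ms)^Type:(?:(?!^Type:).)*", sample)

# Line-by-line pattern from the other answer, for comparison.
line_based = re.findall(r"^Type: .*(?:\n(?!Type: ).*)*", sample, re.M)

# This pattern keeps the newline that precedes the next "Type:", so strip before comparing.
same = [b.rstrip("\n") for b in tempered] == [b.rstrip("\n") for b in line_based]
print(len(tempered), same)
```

Because the lookahead runs before every single character, this form may be slower on large files than the line-based pattern, though for small inputs the difference is unlikely to matter.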