Re matching all the text between two occurrences of a word

Time:10-31

I have a text file whose contents follow a set of rules. Here is a snippet of the file:

<class 'NXOpen.Features.FeatureCollection'>
Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(0)
    Parents:
    Children:
        Name:  , JournalIdentifier: SKETCH(1:1B)
    Expressions:
    Entities:
        Name:  , JournalIdentifier: HANDLE R-849
        Name:  , JournalIdentifier: HANDLE R-850
        Name:  , JournalIdentifier: DATUM_CSYS(0) YZ plane
        Name:  , JournalIdentifier: DATUM_CSYS(0) XZ plane
        Name:  , JournalIdentifier: DATUM_CSYS(0) XY plane
        Name:  , JournalIdentifier: DATUM_CSYS(0) X axis
        Name:  , JournalIdentifier: DATUM_CSYS(0) Y axis
        Name:  , JournalIdentifier: DATUM_CSYS(0) Z axis
Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(1)inf
    Parents:
        Name:  , JournalIdentifier: DATUM_CSYS(0)
    Children:
        Name:  , JournalIdentifier: SKETCH(1)
    Expressions:
    Entities:
        Name:  , JournalIdentifier: HANDLE R-4283
        Name:  , JournalIdentifier: HANDLE R-4284
        Name:  , JournalIdentifier: SKETCH(1:1B) YZ plane
        Name:  , JournalIdentifier: SKETCH(1:1B) XZ plane
        Name:  , JournalIdentifier: SKETCH(1:1B) XY plane
        Name:  , JournalIdentifier: SKETCH(1:1B) X axis
        Name:  , JournalIdentifier: SKETCH(1:1B) Y axis
        Name:  , JournalIdentifier: SKETCH(1:1B) Z axis

I want to use re to extract all the text between two Type: tags. For example, I want to extract this:

Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(0)
    Parents:
    Children:
        Name:  , JournalIdentifier: SKETCH(1:1B)
    Expressions:
    Entities:
        Name:  , JournalIdentifier: HANDLE R-849
        Name:  , JournalIdentifier: HANDLE R-850
        Name:  , JournalIdentifier: DATUM_CSYS(0) YZ plane
        Name:  , JournalIdentifier: DATUM_CSYS(0) XZ plane
        Name:  , JournalIdentifier: DATUM_CSYS(0) XY plane
        Name:  , JournalIdentifier: DATUM_CSYS(0) X axis
        Name:  , JournalIdentifier: DATUM_CSYS(0) Y axis
        Name:  , JournalIdentifier: DATUM_CSYS(0) Z axis

and this:

Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(1)inf
    Parents:
        Name:  , JournalIdentifier: DATUM_CSYS(0)
    Children:
        Name:  , JournalIdentifier: SKETCH(1)
    Expressions:
    Entities:
        Name:  , JournalIdentifier: HANDLE R-4283
        Name:  , JournalIdentifier: HANDLE R-4284
        Name:  , JournalIdentifier: SKETCH(1:1B) YZ plane
        Name:  , JournalIdentifier: SKETCH(1:1B) XZ plane
        Name:  , JournalIdentifier: SKETCH(1:1B) XY plane
        Name:  , JournalIdentifier: SKETCH(1:1B) X axis
        Name:  , JournalIdentifier: SKETCH(1:1B) Y axis
        Name:  , JournalIdentifier: SKETCH(1:1B) Z axis

using regular expressions. I have been trying this:

re.findall('Type: [\w\s] )Type:', string)

but this gives me an empty list. What is the correct regular expression to achieve this?

Thank you.

CodePudding user response:

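In your pattern Type: [\w\s] )Type: the closing ) has no matching opening (. Also, the character class [\w\s] has no quantifier, so it matches only a single character.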

The problem with matching Type: twice in the pattern is that the second Type: is consumed as part of the match, so the next match cannot start at that same Type:.
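To make the consumption issue concrete, here is a small sketch. The three-block text and the variable names are made up for illustration and are not taken from the question's file. It compares a pattern that consumes the closing Type: with one that only looks ahead for it.

```python
import re

# Hypothetical, shortened input with three "Type:" blocks.
text = "Type: a\n  x\nType: b\n  y\nType: c\n  z\n"

# The closing "Type:" is consumed by the match, so the block that
# starts there can no longer be found by the next search.
consumed = re.findall(r"Type: (.*?)Type:", text, re.S)

# A lookahead only checks that "Type:" (or the end of the string)
# follows, without consuming it, so every block is found.
lookahead = re.findall(r"Type: (.*?)(?=Type:|\Z)", text, re.S)

print(len(consumed), len(lookahead))
```

The first call finds a single block, while the lookahead version finds all three.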


You can use a pattern to match Type: followed by all lines that do not start with it.

^Type: .*(?:\n(?!Type: ).*)*

Using re.findall the line would be

re.findall(r"^Type: .*(?:\n(?!Type: ).*)*", string, re.M)
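As a quick check, here is a self-contained sketch that runs this pattern on a shortened copy of the snippet from the question. The names sample, pattern and blocks are just illustrative.

```python
import re

# Shortened copy of the file snippet from the question.
sample = """<class 'NXOpen.Features.FeatureCollection'>
Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(0)
    Parents:
    Children:
        Name:  , JournalIdentifier: SKETCH(1:1B)
Type: <class 'NXOpen.Features.DatumCsys'> FeatureName: Datum Coordinate System(1)inf
    Parents:
        Name:  , JournalIdentifier: DATUM_CSYS(0)
"""

# Match a line starting with "Type: ", then every following line
# that does not itself start with "Type: ".
pattern = re.compile(r"^Type: .*(?:\n(?!Type: ).*)*", re.M)
blocks = pattern.findall(sample)

for block in blocks:
    print(block.rstrip())
    print("-----")
```

Note that the last block can end with a trailing newline when the input does, hence the rstrip() before printing.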


CodePudding user response:

Try

"(?ms)^Type:(?:(?!^Type:).)*"

Here (?ms) turns on re.MULTILINE and re.DOTALL modes, so ^ matches at the start of each line and . matches any character, including newlines.
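A minimal sketch of this pattern in use, on a made-up two-block input, also showing that the inline (?ms) flags should behave the same as passing re.M | re.S explicitly:

```python
import re

# Hypothetical, shortened input: one header line and two "Type:" blocks.
text = "header\nType: a\n    x\nType: b\n    y\n"

# Inline flags: (?m) lets ^ match at each line start, (?s) lets . match "\n".
inline = re.findall(r"(?ms)^Type:(?:(?!^Type:).)*", text)

# The same pattern with the flags passed as an argument instead.
explicit = re.findall(r"^Type:(?:(?!^Type:).)*", text, re.M | re.S)

# Each block keeps the newline before the next "Type:", so strip if needed.
blocks = [block.rstrip("\n") for block in inline]
print(blocks)
```

Because . also matches newlines here, each match runs up to, but not including, the next line that starts with Type:.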
