I'm trying to use regex to extract each label (the uppercase words) together with the text that follows it, up to the next label. Here is a sample text:
Iron DEFINITION it is a chemical element with symbol Fe (from Latin: ferrum) and atomic number 26. CATEGORY It is a metal that belongs to the first transition series and group 8 of the periodic table. APPEARANCE A shiny, greyish metal that rusts in damp air, Its USE AND PROPERTIES Most is used to manufacture steel, used in civil engineering (reinforced concrete, girders etc) and in manufacturing. HISTORY Iron objects have been found in Egypt dating from around 3500 BC. They contain about 7.5% nickel, which indicates that they were of meteoric origin. MISC. It is, by mass, the most common element on Earth, right in front of oxygen (32.1% and 30.1%, respectively), forming much of Earth's outer and inner core. It is the fourth most common element in the Earth's crust.
This is what I expect:
Label: DEFINITION
Text: it is a chemical element with symbol Fe (from Latin: ferrum) and atomic number 26.
Label: CATEGORY
Text: It is a metal that belongs to the first transition series and group 8 of the periodic table.
And so on.
I created this regex, but it doesn't work:
(DEFINITION|CATEGORY|USE AND PROPERTIES|MISC.|HISTORY)(.*)((?=DEFINITION)|(?=CATEGORY)|(?=USE AND PROPERTIES)|(?=MISC.)|(?=HISTORY))
How can I achieve this using regex?
CodePudding user response:
You can use
\b(DEFINITION|CATEGORY|USE AND PROPERTIES|MISC\.|HISTORY)\s*(\S.*?)(?=DEFINITION|CATEGORY|USE AND PROPERTIES|MISC\.|HISTORY|$)
Details:
\b(DEFINITION|CATEGORY|USE AND PROPERTIES|MISC\.|HISTORY) - a word boundary and then Group 1 capturing either the DEFINITION, CATEGORY, USE AND PROPERTIES, MISC. or HISTORY char sequence
\s* - zero or more whitespaces (so as to exclude them from Group 2)
(\S.*?) - Group 2: a non-whitespace char and then any zero or more chars other than line break chars, as few as possible
(?=DEFINITION|CATEGORY|USE AND PROPERTIES|MISC\.|HISTORY|$) - a positive lookahead that requires (immediately to the right of the current location) either a DEFINITION, or CATEGORY, or USE AND PROPERTIES, or MISC., or HISTORY char sequence, or the end of the string.
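The question does not say which language is in use, so here is a minimal Python sketch of how the pattern could be applied. The `LABELS` list name and the shortened sample text are assumptions made for illustration; the pattern is built from the list so that the capture group and the lookahead stay in sync.

```python
import re

# Hypothetical label list; re.escape turns "MISC." into "MISC\." so the dot is literal.
LABELS = ["DEFINITION", "CATEGORY", "USE AND PROPERTIES", "MISC.", "HISTORY"]
alternation = "|".join(re.escape(label) for label in LABELS)

# Same shape as the pattern above: label, optional whitespace, lazy text,
# then a lookahead for the next label or the end of the string.
pattern = re.compile(rf"\b({alternation})\s*(\S.*?)(?={alternation}|$)")

# Shortened sample text (an assumption, not the full text from the question).
text = (
    "Iron DEFINITION it is a chemical element. "
    "CATEGORY It is a metal. "
    "HISTORY Iron objects have been found in Egypt. "
    "MISC. It is the most common element on Earth."
)

# strip() drops the whitespace left between a text and the next label.
pairs = [(m.group(1), m.group(2).strip()) for m in pattern.finditer(text)]

for label, body in pairs:
    print(f"Label: {label}")
    print(f"Text: {body}")
```

Since `.` does not match line breaks by default, passing `re.DOTALL` to `re.compile` may be needed if the real text spans several lines.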