Extract labels and the text in between using regex in Python

Time:10-25

I'm trying to extract the labels, the uppercase words, and the text in between until the next label using regex. Here is a sample text:

Iron DEFINITION it is a chemical element with symbol Fe (from Latin: ferrum) and atomic number 26. CATEGORY It is a metal that belongs to the first transition series and group 8 of the periodic table. APPEARANCE A shiny, greyish metal that rusts in damp air, Its USE AND PROPERTIES Most is used to manufacture steel, used in civil engineering (reinforced concrete, girders etc) and in manufacturing. HISTORY Iron objects have been found in Egypt dating from around 3500 BC. They contain about 7.5% nickel, which indicates that they were of meteoric origin. MISC. It is, by mass, the most common element on Earth, right in front of oxygen (32.1% and 30.1%, respectively), forming much of Earth's outer and inner core. It is the fourth most common element in the Earth's crust.

This is what I expect:

Label: DEFINITION

Text: it is a chemical element with symbol Fe (from Latin: ferrum) and atomic number 26.

Label: CATEGORY

Text: It is a metal that belongs to the first transition series and group 8 of the periodic table.

And so on.

I created this regex, but it doesn't work:

(DEFINITION|CATEGORY|USE AND PROPERTIES|MISC.|HISTORY)(.*)((?=DEFINITION)|(?=CATEGORY)|(?=USE AND PROPERTIES)|(?=MISC.)|(?=HISTORY))

How can I achieve this using regex?

Answer:

You can use

\b(DEFINITION|CATEGORY|USE AND PROPERTIES|MISC\.|HISTORY)\s*(\S.*?)(?=DEFINITION|CATEGORY|USE AND PROPERTIES|MISC\.|HISTORY|$)


Details:

  • \b(DEFINITION|CATEGORY|USE AND PROPERTIES|MISC\.|HISTORY) - a word boundary and then Group 1 capturing either DEFINITION, CATEGORY, USE AND PROPERTIES, MISC. or HISTORY char sequences
  • \s* - zero or more whitespaces (so as to exclude them from Group 2)
  • (\S.*?) - Group 2: a non-whitespace char and then zero or more chars other than line break chars, as few as possible
  • (?=DEFINITION|CATEGORY|USE AND PROPERTIES|MISC\.|HISTORY|$) - a positive lookahead that requires, immediately to the right of the current location, one of the DEFINITION, CATEGORY, USE AND PROPERTIES, MISC. or HISTORY char sequences, or the end of the string.
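As a minimal sketch of how the pattern above could be applied in Python, the snippet below builds the same alternation from a list of labels and collects (label, text) pairs with re.findall. The names LABELS, alternation, pattern and sections are illustrative choices, not part of the original answer, and the sample text is shortened from the question.

```python
import re

# Shortened version of the sample text from the question.
text = (
    "Iron DEFINITION it is a chemical element with symbol Fe (from Latin: ferrum) "
    "and atomic number 26. CATEGORY It is a metal that belongs to the first "
    "transition series and group 8 of the periodic table. HISTORY Iron objects "
    "have been found in Egypt dating from around 3500 BC. MISC. It is, by mass, "
    "the most common element on Earth."
)

# Hypothetical helper names: the labels are escaped so that the dot in
# "MISC." is matched literally, as in the hand-written pattern.
LABELS = ["DEFINITION", "CATEGORY", "USE AND PROPERTIES", "MISC.", "HISTORY"]
alternation = "|".join(re.escape(label) for label in LABELS)

# Group 1: the label. Group 2: the text up to the next label or end of string.
pattern = re.compile(rf"\b({alternation})\s*(\S.*?)(?={alternation}|$)")

# findall returns one (group 1, group 2) tuple per match; strip() drops the
# trailing space left before the next label.
sections = [(label, body.strip()) for label, body in pattern.findall(text)]

for label, body in sections:
    print(f"Label: {label}")
    print(f"Text: {body}")
```

Note that . does not match line breaks by default, so if the input spans several lines, passing re.DOTALL to re.compile would probably be needed for the text groups to run across them.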