Regex - distinguishing between very similar substrings using the substring itself

Time:05-05

I know the title is rather confusing, but this is a hard problem to formulate simply. Hopefully it has a solution, and my difficulty is just the result of my being quite new to the world of regex.

I am trying to parse some text from a chemistry book and transform it into JSON, but I'm having trouble dividing the text by its main identifier. All of this is being done in a Python 3.10 environment.

Consider the following string:

0047 Heptasilver nitrate octaoxide
[12258-22-9] Ag NO
7 11
(Ag O ) .AgNO
3 4 2 3
Alone, or Sulfides, or Nonmetals
The crystalline product produced by electrolytic oxidation
of silver nitrate (and possibly as formulated) detonates
feebly at 110°C. Mixtures with phosphorus and sulfur
explode on impact, hydrogen sulfide ignites on contact,
and antimony trisulfide ignites when ground with the salt.
Mellor, 1941, Vol. 3, 483–485
See other SILVER COMPOUNDS
See related METAL NITRATES
0048 Aluminium
[7429-90-5] Al
Al
HCS 1980, 135 (powder)
Finely divided aluminium powder or dust forms highly
explosive dispersions in air [1], and all aspects of pre-
vention of aluminium dust explosions are covered in 2
US National Fire Codes [2]. The effects on the ignition
properties of impurities introduced by recycled metal used
to prepare dust were studied [3]. Pyrophoricity is elimi-
nated by surface coating aluminium powder with poly-
styrene [4]. Explosion hazards involved in arc and flame
spraying of the powder were analyzed and discussed [5],
and the effect of surface oxide layers on flammability
was studied [6]. The causes of a severe explosion in
1983 in a plant producing fine aluminium powder were
analyzed, and improvements in safety practices discussed
[7]. A number of fires and explosions involving aluminiumdust arising from grinding, polishing, and buffing opera-
tions were discussed, and precautions detailed [8] [12]
[13]. Atomized and flake aluminium powders attain
See other METALS
See other REDUCANTS
0049 Aluminium-cobalt alloy (Raney cobalt alloy)
[37271-59-3] 50:50; [12043-56-0] Al Co; Al—Co
5
[73730-53-7] Al Co
2
Al Co
The finely powdered Raney cobalt alloy is a significant
dust explosion hazard.
See DUST EXPLOSION INCIDENTS (reference 22)
0050 Aluminium–copper–zinc alloy
(Devarda’s alloy)
[8049-11-4] Al—Cu—Zn
Al Cu Zn
Silver nitrate: Ammonia, etc.
See DEVARDA’S ALLOY
See other ALLOYS0051 Aluminium amalgam (Aluminium–
mercury alloy)
[12003-69-9] (1:1) Al—Hg
Al Hg
The amalgamated aluminium wool remaining from prepa-
ration of triphenylaluminium will rapidly oxidize and
become hot upon exposure to air. Careful disposal is nec-
essary [1]. Amalgamated aluminium foil may be pyro-
phoric and should be kept moist and used immediately [2].
1. Neely, T. A. et al., Org. Synth., 1965, 45, 109
2. Calder, A. et al., Org. Synth., 1975, 52, 78
See other ALLOYS

This string contains information on 5 distinct compounds. Each is identified by a 4-digit number at the start of the entry, followed by the name and then, on another line, the CAS unique identifier in square brackets.

I'm trying to divide this into a separate substring for each entry by finding the 4-digit number, which is always followed by the other identifiers, and splitting the text at that point.

I'm currently using this regular expression, which correctly finds the 4-digit identifiers:

\n(\d{4})\s(?:[\s\S]*?)(?:\[\d*?-\d*?-\d*?\]|\[ *?\] [a-zA-Z]*?)

However, it also matches other 4-digit numbers that are not identifiers, such as dates in the body text, for example the year "1983" in the text of the Aluminium (0048) entry.

I have tried using negative lookaheads with the same expression I'm using to isolate the 4-digit identifier, but none of the approaches I tried worked. Now I'm unsure whether this is even possible, or whether I'm overcomplicating it.

Another option would be to split on the CAS number (in the square brackets), but that would be worse, as some entries have multiple CAS numbers or even an empty one.
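
To make the false positive concrete, here is a minimal sketch that runs the pattern above on an abbreviated excerpt of the sample. The excerpt and the variable names are illustrative assumptions, not the full text. Because `[\s\S]*?` may span any number of lines, a 4-digit number at the start of a line qualifies as long as some CAS bracket appears anywhere later:

```python
import re

# Abbreviated excerpt of the sample text (assumption: shortened for illustration).
text = (
    "\n0048 Aluminium\n"
    "[7429-90-5] Al\n"
    "The causes of a severe explosion in\n"
    "1983 in a plant producing fine aluminium powder were\n"
    "analyzed [7].\n"
    "0050 Aluminium–copper–zinc alloy\n"
    "(Devarda’s alloy)\n"
    "[8049-11-4] Al—Cu—Zn\n"
)

# The pattern from the question, unchanged.
original = re.compile(
    r"\n(\d{4})\s(?:[\s\S]*?)(?:\[\d*?-\d*?-\d*?\]|\[ *?\] [a-zA-Z]*?)"
)

ids = [m.group(1) for m in original.finditer(text)]
print(ids)  # ['0048', '1983']
```

In this excerpt the match that starts at "1983" runs on until the CAS number of entry 0050, so the real identifier 0050 is swallowed instead of being reported.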

Any advice would be greatly appreciated!

CodePudding user response:

A few notes about your pattern:

  • You can omit [a-zA-Z]*? at the end of the pattern: it is the last part and non-greedy, so it will never match any characters
  • Parts like \d*?-\d*?-\d*? and \[ *?\] don't have to be non-greedy, as the repeated character cannot match the character that follows it (a digit cannot consume the hyphen, and a space cannot consume the ])
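
The first note can be checked with a small sketch. The sample string "[ ] Al" is a made-up stand-in for an entry with an empty CAS, and the variable names are illustrative:

```python
import re

# Hypothetical line with an empty CAS followed by a formula.
sample = "[ ] Al"

# Second alternative of the original pattern, with the trailing lazy class.
with_tail = re.search(r"\[ *?\] [a-zA-Z]*?", sample)
# Same alternative without the trailing class and without lazy quantifiers.
without_tail = re.search(r"\[ *\] ", sample)

# Both matches stop right after the space that follows "]".
print(repr(with_tail.group()))
print(repr(without_tail.group()))
```

Both searches return the same text, which suggests the simplified form is a drop-in replacement for that alternative.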

If the match should always start with a newline:

\n(\d{4}).*(?:\n\(.*)*\n\[(?: *|\d+-\d+-\d+)]

Explanation

  • \n Match a newline
  • (\d{4}) Capture 4 digits in group 1
  • .* Match the rest of the line
  • (?:\n\(.*)* Optionally repeat matching a newline and ( followed by the rest of the line
  • \n Match a newline
  • \[(?: *|\d+-\d+-\d+)] Match [...] where the brackets contain either only spaces or digits with hyphens in between
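
As a rough check of the pattern explained above, the following sketch applies it to an abbreviated excerpt of the question's sample. It assumes the text starts with a newline and that every entry header begins on its own line; the excerpt and the names `header` and `ids` are illustrative:

```python
import re

# Abbreviated excerpt (assumption: leading newline, one header per line).
text = (
    "\n0048 Aluminium\n"
    "[7429-90-5] Al\n"
    "The causes of a severe explosion in\n"
    "1983 in a plant producing fine aluminium powder were\n"
    "analyzed [7].\n"
    "0050 Aluminium–copper–zinc alloy\n"
    "(Devarda’s alloy)\n"
    "[8049-11-4] Al—Cu—Zn\n"
    "See other ALLOYS\n"
)

# Newline, 4 digits, rest of line, optional "(...)" continuation lines,
# then a line starting with an empty or hyphenated CAS bracket.
header = re.compile(r"\n(\d{4}).*(?:\n\(.*)*\n\[(?: *|\d+-\d+-\d+)]")

ids = [m.group(1) for m in header.finditer(text)]
print(ids)  # ['0048', '0050']
```

The year 1983 is skipped here because the line after it does not start with a square bracket, while 0050 still matches through its "(Devarda’s alloy)" continuation line.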

If the square brackets should be directly on the next line:

^(\d{4}).*\n\[(?: *|\d+-\d+-\d+)]
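
In Python, `^` only matches at the start of every line when the `re.MULTILINE` flag is set. The following is a minimal sketch, not a definitive implementation, of how this pattern could be used to cut the text into one substring per entry. The helper `split_entries` and the shortened sample are hypothetical:

```python
import re

# Shortened sample (assumption: the CAS bracket is on the line right after the header).
text = (
    "0047 Heptasilver nitrate octaoxide\n"
    "[12258-22-9] Ag NO\n"
    "Mellor, 1941, Vol. 3, 483–485\n"
    "0048 Aluminium\n"
    "[7429-90-5] Al\n"
    "1983 in a plant producing fine aluminium powder were\n"
    "See other METALS\n"
)

header = re.compile(r"^(\d{4}).*\n\[(?: *|\d+-\d+-\d+)]", re.MULTILINE)

def split_entries(text):
    """Map each 4-digit identifier to the text of its entry."""
    matches = list(header.finditer(text))
    entries = {}
    # Each entry runs from its header to the start of the next header.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        entries[current.group(1)] = text[current.start():end].strip()
    return entries

entries = split_entries(text)
print(list(entries))  # ['0047', '0048']
```

Slicing between consecutive header positions keeps the body text, including the line starting with 1983, inside the 0048 entry rather than treating it as a new identifier.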
