How to find and combine a whole chemical compound string in LaTeX using regex

For the string below:

Which one of following pairs of gases is the major cause of greenhouse effect?
A. \( C O_{2} \) and \( O_{3} \)
в. \( C O_{2} \) and \( C O \)
c. \( C F C \) and \( S O_{2} \)
D. \( C O_{2} \) and \( N_{2} O \)

I want something like:

Which one of following pairs of gases is the major cause of greenhouse effect?
A. \( CO2 \) and \( O3 \)
в. \( CO2 \) and \( CO \)
c. \( CFC \) and \( SO2 \)
D. \( CO2 \) and \( N2O \)

I used re.sub('[A-Z]_{[0-9]}', '<CHEM>', text) as an experiment to match an element together with its subscript. How can I combine the whole formula? The elements are separated by spaces, and each element may or may not be capitalized and consists of one or more letters. It could be something like:

\( Na Cl_{2} \) and \( Fe k_{3} cl \) -> \( NaCl2 \) and \( Fek3cl \)

CodePudding user response：

You can use capture groups with re.sub:

re.sub(r'([A-Z][a-z]?)(_{([0-9]+)})? *', r'\1\3', text)

If you want to preserve the whitespace after the last element, you can use

re.sub(r'([A-Z][a-z]?)(_{([0-9]+)})?( *(?=[A-Z]))?', r'\1\3', text)

Explanation:

([A-Z][a-z]?)(_{([0-9]+)})? *
([A-Z][a-z]?)                             # Matches an element symbol (a capital letter, optionally followed by one lowercase letter). Captures it in group 1.
             (_{([0-9]+)})?               # Matches an optional subscript. Captures the number in group 3.
                            *             # Matches trailing spaces, so that they are removed.
                           ( *(?=[A-Z]))? # Alternatively, match the spaces only if they are followed by a capital letter, so they are removed only when another element follows.
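
To make the difference between the two patterns concrete, here is a small sketch that runs both on one line of the question's input. The names STRIP_ALL and KEEP_LAST are only labels chosen for this example, and it assumes Python 3.5 or later, where an unmatched group in the replacement (group 3 when there is no subscript) becomes an empty string.

```python
import re

# Both patterns from this answer; the constant names are illustrative only.
STRIP_ALL = r'([A-Z][a-z]?)(_{([0-9]+)})? *'
KEEP_LAST = r'([A-Z][a-z]?)(_{([0-9]+)})?( *(?=[A-Z]))?'

text = r"A. \( C O_{2} \) and \( N_{2} O \)"

# First pattern: every space after an element is removed,
# including the one before the closing \).
stripped = re.sub(STRIP_ALL, r'\1\3', text)
print(stripped)  # A. \( CO2\) and \( N2O\)

# Second pattern: a space is removed only when a capital letter follows,
# so the space before \) survives.
kept = re.sub(KEEP_LAST, r'\1\3', text)
print(kept)      # A. \( CO2 \) and \( N2O \)
```

Note that both patterns expect every element to start with a capital letter, so the lowercase symbols in the question's second example (k_{3}, cl) are left as they are.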

CodePudding user response：

You may write

rgx = r'(?<!\\\()[ _{}](?=[ A-Z\d _{}]* \\\))'

re.sub(rgx, '', text)

The regular expression can be broken down as follows.

(?<!            # begin a negative lookbehind
  \\\(          # match '\('          
)               # end negative lookbehind
[ _{}]          # match a character in the char class
(?=             # begin a positive lookahead
  [ A-Z\d _{}]* # match zero or more characters in the char class
  [ ]\\\)       # match ' \)'
)               # end positive lookahead

I've put the space character in a character class ([ ]) merely to make it visible.

CodePudding user response：

You can use

import re
text = r"\( Na Cl_{2} \) and \( Fe k_{3} cl  \)"
print( re.sub(r'\\\(\s*([^()]*?)\s*\\\)', lambda x: f'\\( {"".join(c if c.isalnum() else "" for c in x.group(1))} \\)', text) )

Details:

\\ - a \ char
\( - a ( char
\s* - zero or more whitespaces
([^()]*?) - Group 1: zero or more chars other than ( and ), as few as possible
\s*\\\) - zero or more whitespaces, and then a \) string.

The lambda x: f'\\( {"".join(c if c.isalnum() else "" for c in x.group(1))} \\)' replacement rebuilds each match as \( and a space, then Group 1 with all non-alphanumeric chars removed, then a space and \).
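
If the one-line lambda is hard to read, the same idea can be written with a named replacement function. This is only a sketch of the approach above; the function name join_formula is made up for the example.

```python
import re

def join_formula(match):
    # Keep only letters and digits from the text between \( and \).
    inner = "".join(c for c in match.group(1) if c.isalnum())
    # Rebuild the delimiters with a single space on each side.
    return f"\\( {inner} \\)"

text = r"\( Na Cl_{2} \) and \( Fe k_{3} cl  \)"
result = re.sub(r'\\\(\s*([^()]*?)\s*\\\)', join_formula, text)
print(result)  # \( NaCl2 \) and \( Fek3cl \)
```

Since the replacement is a function, its return value is inserted literally, so the backslashes in the result need no extra escaping beyond the usual string literal rules. Text outside \( ... \) is not touched, which is why lowercase elements work here without affecting the surrounding sentence.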