For the string below:
Which one of following pairs of gases is the major cause of greenhouse effect?
A. \( C O_{2} \) and \( O_{3} \)
в. \( C O_{2} \) and \( C O \)
c. \( C F C \) and \( S O_{2} \)
D. \( C O_{2} \) and \( N_{2} O \)
I want something like:
Which one of following pairs of gases is the major cause of greenhouse effect?
A. \( CO2 \) and \( O3 \)
в. \( CO2 \) and \( CO \)
c. \( CFC \) and \( SO2 \)
D. \( CO2 \) and \( N2O \)
I used re.sub('[A-Z]_{[0-9]}', '<CHEM>', text)
as an experiment so that I could combine the two. How could I combine the whole formula together? The elements are separated by spaces, and each element may be in capital or lowercase letters and may consist of one or more letters. It could be something like:
\( Na Cl_{2} \) and \( Fe k_{3} cl \)
-> \( NaCl2 \) and \( Fek3cl \)
CodePudding user response:
You can use capture groups with re.sub:
re.sub(r'([A-Z][a-z]?)(_{([0-9]+)})? *', r'\1\3', text)
If you want to preserve the whitespace after the last element, you can use
re.sub(r'([A-Z][a-z]?)(_{([0-9]+)})?( *(?=[A-Z]))?', r'\1\3', text)
Explanation:
([A-Z][a-z]?)(_{([0-9]+)})? *
([A-Z][a-z]?) # Matches an element symbol: one capital letter, optionally followed by one lowercase letter. Captures the symbol in group 1.
(_{([0-9]+)})? # Matches an optional subscript. Captures the number in group 3.
 * # Matches trailing whitespace, which causes it to be removed.
( *(?=[A-Z]))? # Alternatively, match the whitespace only if it is followed by a capital letter, so that it is removed only when another element follows.
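To see the difference between the two patterns, here is a small sketch that runs both on one option from the question. The sample string and the variable names strip_all and keep_last are only illustrative, not part of the answer above.

```python
import re

# One option from the question, used here as an illustrative input.
text = r"A. \( C O_{2} \) and \( N_{2} O \)"

# Pattern 1: removes every run of spaces that follows an element.
strip_all = re.sub(r'([A-Z][a-z]?)(_{([0-9]+)})? *', r'\1\3', text)

# Pattern 2: removes the spaces only when another capital letter follows.
keep_last = re.sub(r'([A-Z][a-z]?)(_{([0-9]+)})?( *(?=[A-Z]))?', r'\1\3', text)

print(strip_all)  # A. \( CO2\) and \( N2O\)
print(keep_last)  # A. \( CO2 \) and \( N2O \)
```

Both patterns run over the whole string, not only the parts inside \( ... \), so with the first pattern a stand-alone capital in ordinary prose (for example "I want") would also lose its trailing space. Neither pattern handles elements written in lowercase, such as k_{3} in the second example of the question.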
CodePudding user response:
You may write
rgx = r'(?<!\\\()[ _{}](?=[ A-Z\d _{}]* \\\))'
re.sub(rgx, '', text)
The regular expression can be broken down as follows.
(?<! # begin a negative lookbehind
\\\( # match '\('
) # end negative lookbehind
[ _{}] # match a character in the char class
(?= # begin a positive lookahead
[ A-Z\d _{}]* # match zero or more characters in the char class
[ ]\\\) # match ' \)'
) # end positive lookahead
I've put the space character in a character class ([ ]) merely to make it visible.
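As a quick check, the sketch below applies this pattern to one option from the question. Because the lookahead's character class only allows capital letters, the pattern as written does not cover the Na Cl_{2} example; the second pattern, rgx_any_case, is a hypothetical variant (my assumption, not part of the answer) that adds a-z to the class.

```python
import re

# Pattern from the answer: capital letters, digits, spaces, "_", "{" and "}".
rgx = r'(?<!\\\()[ _{}](?=[ A-Z\d _{}]*[ ]\\\))'

# Hypothetical variant: the lookahead also allows lowercase letters.
rgx_any_case = r'(?<!\\\()[ _{}](?=[ A-Za-z\d_{}]*[ ]\\\))'

upper_only = r"D. \( C O_{2} \) and \( N_{2} O \)"
mixed_case = r"\( Na Cl_{2} \) and \( Fe k_{3} cl \)"

result_upper = re.sub(rgx, '', upper_only)
result_mixed = re.sub(rgx_any_case, '', mixed_case)

print(result_upper)  # D. \( CO2 \) and \( N2O \)
print(result_mixed)  # \( NaCl2 \) and \( Fek3cl \)
```

The space right after \( survives because of the negative lookbehind, and the space right before \) survives because the lookahead requires another space between the current position and \).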
CodePudding user response:
You can use
import re
text = r"\( Na Cl_{2} \) and \( Fe k_{3} cl \)"
print(re.sub(r'\\\(\s*([^()]*?)\s*\\\)', lambda x: f'\\( {"".join(c if c.isalnum() else "" for c in x.group(1))} \\)', text))
Details:
\\ - a \ char
\( - a ( char
\s* - zero or more whitespaces
([^()]*?) - Group 1: any zero or more chars other than ) and (
\s*\\\) - zero or more whitespaces, and then a \) string.
The lambda x: f'\\( {"".join(c if c.isalnum() else "" for c in x.group(1))} \\)' replacement replaces the match with \( , Group 1 with all non-alphanumeric chars removed, and \) .
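If the one-line lambda is hard to read, the same idea can be sketched with a named callback. The function name squash_formula and the sample string are only illustrative; the regex is the one used above.

```python
import re

def squash_formula(match):
    # Keep only letters and digits from the captured formula body (group 1).
    body = "".join(ch for ch in match.group(1) if ch.isalnum())
    # Re-wrap the cleaned body in the original \( ... \) delimiters.
    return f"\\( {body} \\)"

question = r"D. \( C O_{2} \) and \( N_{2} O \)"
cleaned = re.sub(r'\\\(\s*([^()]*?)\s*\\\)', squash_formula, question)
print(cleaned)  # D. \( CO2 \) and \( N2O \)
```

Because the callback only sees text between \( and \), the prose outside the formulas is left untouched, and lowercase elements need no special handling.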