I am trying to use regex to replace whitespace with \\; inside LaTeX that is embedded in a Markdown document.
In the Markdown package I'm using, all LaTeX is wrapped in either $ or $$.
I would like to change the following from
"dont edit this $result= \frac{1}{4}$ dont edit this $$some result=123$$"
to this
"dont edit this $result=\\;\frac{1}{4}$ dont edit this $$some\\;result=123$$"
I have managed to do it using the messy function below, but I would like to use regex for a cleaner approach. Any help would be appreciated.
import re

vals = r"dont edit this $result= \frac{1}{4}$ dont edit this $$some result=123$$"

def cleanlatex(vals):
    vals = vals.replace(" ", " ")
    char1 = r"\$\$"
    char2 = r"\$"
    # positions of every $$ delimiter, then of every single $ (with $$ masked out)
    indices = [i.start() for i in re.finditer(char1, vals)]
    indices += [i.start() for i in re.finditer(char2, vals.replace("$$", "~~"))]
    indices.sort()
    print(indices)
    # check that the number of $ or $$ delimiters is even
    if len(indices) % 2 == 0:
        while indices:
            # work from the end so earlier indices stay valid as the string grows
            finish = indices.pop()
            start = indices.pop()
            vals = vals[:start] + vals[start:finish].replace(' ', r'\;') + vals[finish:]
    vals = vals.replace(" ", " ")
    return vals

print(cleanlatex(vals))
Output:
[18, 39, 60, 78]
dont edit this $result=\\;\frac{1}{4}$ dont edit this $$some\\;result=123$$
CodePudding user response:
With regex I would still do it in two steps:
- Identify the parts between dollars (or double dollars) using regex
- Within those parts, replace spaces with a simple replace call
def cleanlatex(vals):
    return re.sub(r"(\$\$?)(.*?)\1", lambda m: m[0].replace(" ", r"\;"), vals)
If the dollars don't match up, this will still make replacements until no further pair of matching dollars is found. This differs from your code, which replaces nothing when the dollars don't match up.
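To make the unmatched-dollar behaviour concrete, here is a minimal sketch. The input string is made up for illustration: it has one complete $...$ pair followed by a stray opening dollar.

```python
import re

def cleanlatex(vals):
    # Same one-liner as in the answer above.
    return re.sub(r"(\$\$?)(.*?)\1", lambda m: m[0].replace(" ", r"\;"), vals)

# Hypothetical input: the last dollar has no closing partner.
text = r"keep this $a b$ keep this $c d"
result = cleanlatex(text)
print(result)
```

The complete pair is rewritten, while the text after the stray dollar is left alone because no closing delimiter is ever found for it.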
When dollars are "nested", like in "$$nested $ here$$", then the inner dollar will not be regarded as a delimiter in this solution. Or if a double dollar happens to follow a single dollar, the double one will be interpreted as two single dollars that just happen to follow each other. So "$part one$$part two$" will identify two parts, each delimited with a single dollar.
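The two cases described above can be checked with a small sketch. The helper name spans is hypothetical and only lists what the pattern treats as delimiter and body; it does no replacing.

```python
import re

PATTERN = re.compile(r"(\$\$?)(.*?)\1")

def spans(text):
    # Return (delimiter, body) for every part the pattern identifies.
    return [(m[1], m[2]) for m in PATTERN.finditer(text)]

# Inner single dollar is not treated as a delimiter.
nested = spans("$$nested $ here$$")
# The double dollar in the middle is read as two single dollars.
adjacent = spans("$part one$$part two$")

print(nested)
print(adjacent)
```

One more boundary condition worth keeping in mind: without the re.DOTALL flag, . does not match a newline, so a formula that spans several lines would not be matched by this pattern as written.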
Your question didn't give any such boundary conditions (there are quite a few of them), so the solution may need some adaptations.
CodePudding user response:
I never thought of lambda! Thank you @trincot, your answer covers things I didn't even know were possible with regex. I am trying to decipher the pattern and would love some clarification, if you can. I'd really appreciate it, as I've had a look at the re docs but am still confused by the following:
- is there a reason to use (\$\$?) over (\$+)?
- \1 -> is this just a way to keep the pattern tidy, and if I used \2 would it replicate the second capture group?
- does the ? in (.*?) make it find the shortest string that matches the pattern?
- Why m[0], i.e. why index at 0?
Thanks again for the reply
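For anyone with the same questions, here is a minimal sketch (not part of the original answer) that probes each point with Python's re module. The sample strings are made up for illustration.

```python
import re

# (\$\$?) matches exactly one or two dollars; (\$+) would also accept three or more.
one_or_two = re.match(r"\$\$?", "$$$")[0]
one_or_more = re.match(r"\$+", "$$$")[0]

# \1 is a backreference: it must match the same text that group 1 captured,
# so an opening $ is closed by $ and an opening $$ by $$.
# \2 would likewise refer to whatever group 2 (the body) captured.
m = re.search(r"(\$\$?)(.*?)\1", "$a b$ and $$c d$$")
whole, delim, body = m[0], m[1], m[2]  # m[0] is the whole match, delimiters included

# .*? is lazy (stops at the first possible closing dollar); .* is greedy.
lazy = re.findall(r"\$(.*?)\$", "$a$ x $b$")
greedy = re.findall(r"\$(.*)\$", "$a$ x $b$")

print(one_or_two, one_or_more, whole, delim, body, lazy, greedy)
```

Indexing the match at 0 returns the full matched text, which is why the lambda can replace spaces in it and hand it back with the delimiters still attached.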