Replace spaces between specific characters only using regex

Time:08-19

I am trying to replace whitespace inside LaTeX that is embedded in a Markdown document with \\; using regex.
In the Markdown package I'm using, all LaTeX is wrapped in either $ or $$.

I would like to change the following from

"dont edit this $result= \frac{1}{4}$ dont edit this $$some result=123$$"

to this

"dont edit this $result=\\;\frac{1}{4}$ dont edit this $$some\\;result=123$$"

I have managed to do it using the messy function below, but would like to use regex for a cleaner approach. Any help would be appreciated.

import re

vals = r"dont edit this $result= \frac{1}{4}$ dont edit this $$some result=123$$"

def cleanlatex(vals):
    # temporarily double every space; this is undone at the end
    vals = vals.replace(" ", "  ")
    char1 = r"\$\$"
    char2 = r"\$"
    # positions of the $$ delimiters ...
    indices = [i.start() for i in re.finditer(char1, vals)]
    # ... plus positions of the single $ delimiters ($$ is masked first)
    indices += [i.start() for i in re.finditer(char2, vals.replace("$$", "~~"))]

    indices.sort()
    print(indices)
    # check that no of $ or $$ are even
    if len(indices) % 2 == 0:
        # work from the last pair backwards so the earlier indices stay valid
        while indices:
            finish = indices.pop()
            start = indices.pop()
            vals = vals[:start] + vals[start:finish].replace("  ", r"\\;") + vals[finish:]

    vals = vals.replace("  ", " ")
    return vals

print(cleanlatex(vals))

Output:

[18, 39, 60, 78]   
dont edit this $result=\\;\frac{1}{4}$ dont edit this $$some\\;result=123$$

CodePudding user response:

With regex I would still do it in two steps:

  • Identify the parts between dollars (or double dollars) using regex
  • Within those parts, replace spaces with a simple replace call
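import re

def cleanlatex(vals):
    # m[0] is the whole match, delimiters included; r"\\;" yields the
    # two-backslash form shown in the question's expected output
    return re.sub(r"(\$\$?)(.*?)\1", lambda m: m[0].replace(" ", r"\\;"), vals)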

If the dollars don't match up, this will still make replacements for as long as a pair of matching dollars can be found. This differs from your code, where nothing is replaced when the dollars don't match.
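
To make that difference concrete, here is a minimal sketch. It assumes the replacement text is the two-backslash form from the question's expected output; `cleanlatex_strict` is a hypothetical helper (not part of the answer) that approximates the all-or-nothing behaviour by checking whether any dollar sign is left over once the matched pairs are removed.

```python
import re

PATTERN = re.compile(r"(\$\$?)(.*?)\1")

def cleanlatex(vals):
    # Replace spaces inside every matched $...$ or $$...$$ span.
    # Assumption: r"\\;" (two backslashes) is the wanted replacement.
    return PATTERN.sub(lambda m: m[0].replace(" ", r"\\;"), vals)

def cleanlatex_strict(vals):
    # Hypothetical all-or-nothing variant: if a dollar sign remains
    # after removing all matched pairs, return the input unchanged.
    leftover = PATTERN.sub("", vals)
    if "$" in leftover:
        return vals
    return cleanlatex(vals)

balanced = r"dont edit this $result= \frac{1}{4}$ dont edit this $$some result=123$$"
unbalanced = r"keep $a b$ and $c d"

print(cleanlatex(balanced))
print(cleanlatex(unbalanced))         # first pair is still rewritten
print(cleanlatex_strict(unbalanced))  # input comes back untouched
```

The strict check is not identical to counting delimiters as the question's function does, so treat it as a starting point rather than a drop-in equivalent.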

When dollars are "nested", like in "$$nested $ here$$", then the inner dollar will not be regarded as a delimiter in this solution. Or if a double dollar happens to follow a single dollar, the double one will be interpreted as two single dollars that just happen to follow each other. So "$part one$$part two$" will identify two parts, each delimited with a single dollar.
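
Both edge cases can be checked with a short experiment. The sketch below uses a hypothetical helper, `latex_parts`, that lists the delimiter and content of each span the pattern recognises.

```python
import re

PATTERN = re.compile(r"(\$\$?)(.*?)\1")

def latex_parts(text):
    # Return (delimiter, content) for every span the pattern matches.
    return [(m[1], m[2]) for m in PATTERN.finditer(text)]

# Inner single dollar is treated as ordinary content
print(latex_parts("$$nested $ here$$"))
# [('$$', 'nested $ here')]

# The "$$" in the middle is read as a closing "$" followed by an opening "$"
print(latex_parts("$part one$$part two$"))
# [('$', 'part one'), ('$', 'part two')]
```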

Your question didn't give any such boundary conditions (there are quite a few of them), so the solution may need some adaptations.

CodePudding user response:

I never thought of lambda! Thank you @trincot, your answer covers things I didn't even know were possible with regex. I am trying to decipher the pattern and would love some clarification if you have time. I've had a look at the re docs but am still confused by the following:

  1. Is there a reason to use (\$\$?) over (\$+)?
  2. \1 -> is this just a way to keep the pattern tidy, and if I used \2 would it replicate the second capture group?
  3. Does the ? in (.*?) make it find the shortest string that matches the pattern?
  4. Why m[0], i.e. why index at 0?

Thanks again for the reply
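
For anyone with the same four questions, a short experiment with Python's `re` module helps. This is only a sketch; the sample strings are made up for illustration.

```python
import re

# 1. \$\$? matches one or two dollars; \$+ would also swallow three or more.
print(re.match(r"\$\$?", "$$$x")[0])  # $$
print(re.match(r"\$+", "$$$x")[0])    # $$$

# 2. \1 is a backreference: it must match the same text that group 1
#    matched, so an opening "$$" can only be closed by "$$".
#    (\2 would likewise repeat whatever group 2 matched.)
m = re.search(r"(\$\$?)(.*?)\1", "$$a$ b$$")
print(m[1], "|", m[2])  # $$ | a$ b

# 3. .*? is lazy: it stops at the first closing delimiter that works.
print(re.findall(r"\$(.*?)\$", "$a$ and $b$"))  # ['a', 'b']
print(re.findall(r"\$(.*)\$", "$a$ and $b$"))   # ['a$ and $b']

# 4. m[0] is the whole match, m[1] and m[2] are the capture groups.
m = re.search(r"(\$\$?)(.*?)\1", "a $x y$ b")
print(m[0], "|", m[1], "|", m[2])  # $x y$ | $ | x y
```

Since the delimiters themselves contain no spaces, calling `replace` on `m[0]` gives the same result as rebuilding the string from `m[1]`, the edited `m[2]`, and `m[1]` again; using `m[0]` is simply shorter.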
