Complicated Search and Replace using RegEx

Time:10-21

I'm trying to convert a bunch of custom "recipes" from an old proprietary format to something that is ultimately compatible with C#. And I think that the easiest way to do this would be to use regular expressions. But I'm having trouble figuring out the expression. The piece that I need to convert with this RegEx is the IF statements. Here are a few examples of the original recipes...

  • IF(A = B,C,D)
  • IF(AA = BB,IF(E=F,G,H),DD)
  • IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)

The first one is straightforward... If A = B then C else D.
The second one is similar, except that the IF statements are nested.
And the third one includes additional ROUND function calls in the results.

I've stumbled across regex101.com and have managed to put together the following pattern which is getting close. It works for the first example, but not for the other two: (.*?)IF[^\S\r\n]*\((.*?),(.*?),(.*?)\)

Ultimately, what I want to do is use a regular expression to turn the three examples above into:

  • if (A == B) { C } else { D }
  • if (AA == BB) { if (E == F) { G } else { H } } else { DD }
  • if (S1 <> R1) { ROUND(ROUND(S2/S1,R2)*S3,R3) } else { R4 }

Note that the whitespace in the results is not particularly important. I just formatted it for readability. Also, the "ROUND" functions will be replaced separately with C# Math.Round() calls. No need to worry about those, here. (All I should need to do to them is add, "Math." and fix the capitalization.)

I'll keep plugging away at this, but if anyone out there has the RegEx experience to figure this out, I would appreciate it.

EDIT: With some additional effort, I've expanded my first expression into the following:

(.*?)IF[^\S\r\n]*\((.*?),(([^\(]*)|(.*?\(.*?\))),(([^\(]*)|(.*?\(.*?\)))\)

And with the following replace expression:

$1if($2) {$3} else {$6}

I'm almost there. It's just the nested IF statements that are left. (Although I'd prefer to do this in a single pass, if a recursive expression is not going to work, I could rig something up to run the results through the expression a second time to deal with the nested IF statements. It's not ideal, but if it's the best I have, I could live with it.)

CodePudding user response:

The problem with using regex to parse an arbitrary recursive grammar is that regular expressions are not particularly well suited to recursion. Some regex implementations offer limited support for recursion, but it's tricky to make it work for anything more complicated than simple balanced parentheses.

That being said, in your particular case, although at first sight it looks like a recursive grammar, it might be possible to cheat.

In IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)

if it is guaranteed that neither S1<>R1 nor R4 contains a comma, then you can use the following regex:

IF\(([^,]*),(.*),([^,]+)\)

Try it here: https://regexr.com/67r56

How it works: after the literal IF(, the first group matches everything up to the first comma. The second group then greedily matches everything to the end of the string and starts backtracking until the very last comma of the string is "released" from it. After that, the third group matches the "released tail" of the string, up to the closing parenthesis.
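
As a rough illustration of the trick, here is a minimal sketch in Python (chosen only for brevity; the pattern itself should carry over to .NET unchanged). The function name `convert_flat_if` and the replacement template are assumptions made for this example, and the sketch does not touch operators such as `=` or `<>`.

```python
import re

# First and last arguments must not contain commas; the middle one may.
IF_PATTERN = re.compile(r"IF\(([^,]*),(.*),([^,]+)\)")


def convert_flat_if(recipe: str) -> str:
    """Rewrite a single, non-nested IF(cond,then,else) call (hypothetical helper)."""
    return IF_PATTERN.sub(r"if (\1) { \2 } else { \3 }", recipe)


print(convert_flat_if("IF(A = B,C,D)"))
print(convert_flat_if("IF(S1<>R1,ROUND(ROUND(S2/S1,R2)*S3,R3),R4)"))
```

Because the substitution does not rescan its own output, an IF nested inside the second argument is copied through untouched, so this only covers the first and third recipes from the question.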


However, as I mentioned in the comments, if S1, R1 or R4 are expressions themselves, this regex trick won't work, and you'd need to use a proper recursive parser. Fortunately, there are plenty of parser/combinator libraries for user-defined grammars (or you might even find one that already works for your grammar). Once your expression is parsed into an AST, it's fairly easy to transform it into the desired form.

Alternatively, you can look into writing your own simple parser. It should be fairly straightforward, as you only care about nested parentheses and commas.
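
A hand-written parser along those lines could look roughly like the sketch below, again in Python for brevity. It assumes that parentheses are balanced, that every IF takes exactly three arguments, and that the recipes contain no string literals with commas or parentheses in them. All function names are made up for this example, and operators such as `=` and `<>` are left as they are.

```python
import re
from typing import List

# "IF" as a whole word, followed by an opening parenthesis.
IF_START = re.compile(r"\bIF\s*\(")


def find_matching_paren(text: str, open_idx: int) -> int:
    """Return the index of the ')' that closes the '(' at open_idx."""
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced parentheses")


def split_top_level(text: str) -> List[str]:
    """Split on commas that are not inside any parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def convert(text: str) -> str:
    """Rewrite every IF(cond,then,else) in text, including nested ones."""
    out, i = [], 0
    while i < len(text):
        match = IF_START.search(text, i)
        if match is None:
            out.append(text[i:])
            break
        out.append(text[i:match.start()])  # text before the IF is kept as-is
        open_idx = match.end() - 1
        close_idx = find_matching_paren(text, open_idx)
        args = split_top_level(text[open_idx + 1:close_idx])
        if len(args) != 3:
            raise ValueError("IF expects exactly three arguments")
        # Recurse so that IFs nested inside any argument are converted too.
        cond, then, otherwise = (convert(arg.strip()) for arg in args)
        out.append(f"if ({cond}) {{ {then} }} else {{ {otherwise} }}")
        i = close_idx + 1
    return "".join(out)


print(convert("IF(AA = BB,IF(E=F,G,H),DD)"))
```

Counting parenthesis depth is what lets the splitter ignore the commas inside ROUND(...) or an inner IF(...), which is the part a plain regex struggles with.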
