I have produced a data set with codes separated by pipe symbols, and I realized that each row contains many duplicates. Here are four example rows (the regex is applied to each row individually in KNIME):
0612|0613|061|0612|0612
0211|0612|021|0212|0211|0211
0111|0111
0511|0512|0511|0511|0521|0512|0511
I am trying to build a regex that removes the duplicate code numbers from each row.
I tested \b(\d+)\b.*\b\1\b
from a different thread here, but that expression does not keep the other codes. The desired outputs for the example rows above would be:
0612|0613|061
0211|0612|021|0212
0111
0511|0512|0521|0512
I appreciate your help.
CodePudding user response:
I have no idea which regex engine KNIME uses.
You probably need one that supports variable-length lookbehind to do it in one pass, e.g. .NET:
\|(\d+)\b(?<=\b\1\b.*?\1)
See this demo at Regexstorm (check [•] "replace matches with" and click on "context"). Replacing the matches with an empty string gives:
0612|0613|061
0211|0612|021|0212
0111
0511|0512|0521
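If the engine at hand lacks variable-length lookbehind, the same keep-first result can also be had without a regex. Below is a minimal Python sketch, only as an illustration: the function name `dedupe_keep_first` is made up here, and it assumes each row is a plain string of pipe-separated codes.

```python
def dedupe_keep_first(row: str) -> str:
    """Drop repeated codes, keeping the first occurrence of each."""
    codes = row.split("|")
    # dict keys preserve insertion order (Python 3.7+), so this keeps
    # the first occurrence of every code and drops later repeats.
    return "|".join(dict.fromkeys(codes))


rows = [
    "0612|0613|061|0612|0612",
    "0211|0612|021|0212|0211|0211",
    "0111|0111",
    "0511|0512|0511|0511|0521|0512|0511",
]
for row in rows:
    print(dedupe_keep_first(row))
```

Because the codes are compared as whole strings after splitting, 061 and 0612 are treated as different codes, just as the \b anchors do in the regex.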
With a lookahead you can get unique codes per row too, but the other way around: the last occurrence of each code is kept instead of the first (not like your desired results):
\b(\d+)\|(?=.*?\b\1\b)
0613|061|0612
0612|021|0212|0211
0111
0521|0512|0511
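The lookahead variant needs no lookbehind, so it should also work in engines such as Python's `re`. Here is a small sketch for trying it outside KNIME; the names `KEEP_LAST` and `dedupe_keep_last` are made up for this example.

```python
import re

# Match a code plus its trailing pipe whenever the same whole code
# shows up again later in the row; those matches are removed.
KEEP_LAST = re.compile(r"\b(\d+)\|(?=.*?\b\1\b)")


def dedupe_keep_last(row: str) -> str:
    """Remove every occurrence of a code except its last one."""
    return KEEP_LAST.sub("", row)


rows = [
    "0612|0613|061|0612|0612",
    "0211|0612|021|0212|0211|0211",
    "0111|0111",
    "0511|0512|0511|0511|0521|0512|0511",
]
for row in rows:
    print(dedupe_keep_last(row))
```

The \b after \1 in the lookahead is what stops 061 from being treated as a duplicate of 0612.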
CodePudding user response:
Based on the expected output shown, you can use this regex:
(\|\d+)\1(?:((?:\|\d+)*)\1)?(?=\||$)|^(\d+)\|(?=\3\b)
Replacement string is: $2