Regex to remove duplicate numbers from a string

Time:06-19

I have produced a data set with codes separated by pipe symbols and realized that each row contains many duplicates. Here are four example rows (the regex is applied to each row individually in KNIME):

0612|0613|061|0612|0612
0211|0612|021|0212|0211|0211
0111|0111
0511|0512|0511|0511|0521|0512|0511

I am trying to build a regex that removes the duplicate codes from each row. I tested \b(\d+)\b.*\b\1\b from a different thread here, but that expression does not keep the other codes. The desired outputs for the example rows above would be:

0612|0613|061
0211|0612|021|0212
0111
0511|0512|0521|0512

Appreciate your help

CodePudding user response:

I don't know which regex engine KNIME uses.

To do it in one pass you probably need an engine that supports variable-length lookbehind, e.g. .NET:

\|(\d+)\b(?<=\b\1\b.*?\1)

See this demo at Regexstorm (tick "replace matches with", then click "context"). Replacing the matches with an empty string keeps the first occurrence of each code:

0612|0613|061
0211|0612|021|0212
0111
0511|0512|0521
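For engines without variable-length lookbehind (Python's built-in re module is one of them), the same first-occurrence result can be sketched without a regex. This is a swapped-in technique, not the pattern above, and the function name is made up for illustration:

```python
def dedupe_keep_first(row: str) -> str:
    """Drop repeated codes from a pipe-separated row, keeping first occurrences."""
    # dict.fromkeys keeps insertion order (Python 3.7+), so it works as an ordered set.
    return "|".join(dict.fromkeys(row.split("|")))


rows = [
    "0612|0613|061|0612|0612",
    "0211|0612|021|0212|0211|0211",
    "0111|0111",
    "0511|0512|0511|0511|0521|0512|0511",
]

for row in rows:
    print(dedupe_keep_first(row))
```

Whether something like this can be used in KNIME depends on the nodes available, so treat it as a cross-check for the regex output rather than a drop-in solution.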


With a lookahead you can also get unique codes per row, but the other way round: the last occurrence of each code is kept instead of the first, so the order differs from your desired results.

\b(\d+)\|(?=.*?\b\1\b)

Another demo on regex101

0613|061|0612
0612|021|0212|0211
0111
0521|0512|0511
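The lookahead pattern needs no lookbehind, so it can be tried in most engines. A minimal sketch in Python, assuming each row is processed as its own string (the constant and function names are illustrative):

```python
import re

# Matches a code plus its trailing pipe when the same whole code appears again later.
KEEP_LAST = re.compile(r"\b(\d+)\|(?=.*?\b\1\b)")


def dedupe_keep_last(row: str) -> str:
    """Remove every occurrence of a code except the last one in the row."""
    return KEEP_LAST.sub("", row)


rows = [
    "0612|0613|061|0612|0612",
    "0211|0612|021|0212|0211|0211",
    "0111|0111",
    "0511|0512|0511|0511|0521|0512|0511",
]

for row in rows:
    print(dedupe_keep_last(row))
```

The \b anchors around \1 are what stop 061 from being treated as a duplicate of 0612.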

CodePudding user response:

Based on the expected output shown, you can use this regex:

(\|\d+)\1(?:((?:\|\d+)*)\1)?(?=\||$)|^(\d+)\|(?=\3\b)

Replacement string is: $2

RegEx Demo
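To check this pattern against the sample rows, here is a small sketch in Python. The names are illustrative, and a replacement function stands in for $2 so that an unmatched group 2 becomes an empty string:

```python
import re

PATTERN = re.compile(r"(\|\d+)\1(?:((?:\|\d+)*)\1)?(?=\||$)|^(\d+)\|(?=\3\b)")


def apply_pattern(row: str) -> str:
    """Replace each match with capture group 2 (empty if that group did not take part)."""
    return PATTERN.sub(lambda m: m.group(2) or "", row)


rows = [
    "0612|0613|061|0612|0612",
    "0211|0612|021|0212|0211|0211",
    "0111|0111",
    "0511|0512|0511|0511|0521|0512|0511",
]

for row in rows:
    print(apply_pattern(row))
```

Note that this pattern is fitted to the expected output as posted: the last row comes out as 0511|0512|0521|0512, with 0512 still appearing twice, so it is not a general de-duplication.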
