regex matching duplicates in a comma separated list

Time:07-28

I'm trying to match duplicate words (alphanumeric, possibly containing dashes) in a comma-separated list in some YAML, using a PCRE tool.

I have found a regex [1] that matches consecutive duplicates:

(?<=,|^)([^,]*)(,\1)+(?=,|$)

it will catch

hello-world,hello-world,goodbye-world,goodbye-world

but not the "hello-world"s in

hello-world,goodbye-world,goodbye-world,hello-world

Could someone help me try to build a regex pattern for the second case (or both cases)?

[1] - https://www.regular-expressions.info/duplicatelines.html
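To make the limitation concrete, here is a minimal sketch that runs the consecutive-duplicate pattern against both sample lines. It uses Python's `re` module rather than a PCRE tool, which is an assumption of this sketch: Python requires fixed-width lookbehind, so `(?<=,|^)` is rewritten as the equivalent `(?:^|(?<=,))`. The function name `consecutive_duplicates` is hypothetical.

```python
import re

# Python's `re` rejects (?<=,|^) because the two alternatives differ in width,
# so the same assertion is written as (?:^|(?<=,)). PCRE accepts the original form.
CONSECUTIVE = re.compile(r"(?:^|(?<=,))([^,]*)(,\1)+(?=,|$)")


def consecutive_duplicates(line):
    """Return each word that is immediately repeated in a comma-separated line."""
    return [m.group(1) for m in CONSECUTIVE.finditer(line)]


print(consecutive_duplicates("hello-world,hello-world,goodbye-world,goodbye-world"))
# ['hello-world', 'goodbye-world']
print(consecutive_duplicates("hello-world,goodbye-world,goodbye-world,hello-world"))
# ['goodbye-world']
```

In the second line only `goodbye-world` is reported, because the two `hello-world` entries are not adjacent.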

CodePudding user response:

Put an optional ,.* between the capture group and the back-reference.

(?<=,|^)([^,]*)(?:,.*)?(,\1)(?=,|$)
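As a rough check of this pattern, the sketch below runs it in Python's `re` module instead of a PCRE tool (an assumption of this sketch). Because Python only allows fixed-width lookbehind, `(?<=,|^)` is written as `(?:^|(?<=,))`; the rest of the pattern is unchanged. The function name `duplicates_with_gap` is hypothetical.

```python
import re

# Same pattern as above, with the lookbehind rewritten for Python's fixed-width rule.
WITH_GAP = re.compile(r"(?:^|(?<=,))([^,]*)(?:,.*)?(,\1)(?=,|$)")


def duplicates_with_gap(line):
    """Return (word, matched_text) for each match of the optional-gap pattern."""
    return [(m.group(1), m.group(0)) for m in WITH_GAP.finditer(line)]


line = "hello-world,goodbye-world,goodbye-world,hello-world"
for word, text in duplicates_with_gap(line):
    print(word, "->", text)
# hello-world -> hello-world,goodbye-world,goodbye-world,hello-world
```

Note that the match consumes everything between the two `hello-world` entries, so in this run the `goodbye-world` pair inside that span is not reported as a separate match.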


CodePudding user response:

You may use this regex:

(?<=^|,)([^,]+)(?=(?>,[^,]*)*,\1(?>,|$))(?=,|$)


RegEx Details:

  • (?<=^|,): Assert that we have , or start position before current position
  • ([^,]+): Match one or more non-comma characters and capture them in group #1
  • (?=(?>,[^,]*)*,\1(?>,|$)): Lookahead to assert presence of same value we captured in group #1 ahead of us
  • (?=,|$): Assert that we have , or end position ahead
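The breakdown above can be tried out with a short sketch. It uses Python's `re` module rather than a PCRE tool, so two adaptations are assumed here: the lookbehind `(?<=^|,)` becomes `(?:^|(?<=,))` (Python needs fixed-width lookbehind), and the atomic groups `(?>...)` become plain non-capturing groups `(?:...)`, since atomic groups are only available from Python 3.11. For these inputs the results should be the same. The function name `find_duplicates` is hypothetical.

```python
import re

# Lookahead-based pattern: the match itself covers only the word, and the
# lookahead checks that the same word appears again later in the list.
LOOKAHEAD = re.compile(
    r"(?:^|(?<=,))"                   # start of line or just after a comma
    r"([^,]+)"                        # one or more non-comma characters, group 1
    r"(?=(?:,[^,]*)*,\1(?:,|$))"      # same word occurs later as a whole item
    r"(?=,|$)"                        # group 1 ends at a comma or end of line
)


def find_duplicates(line):
    """Return words that occur again later in a comma-separated line."""
    return [m.group(1) for m in LOOKAHEAD.finditer(line)]


print(find_duplicates("hello-world,hello-world,goodbye-world,goodbye-world"))
# ['hello-world', 'goodbye-world']
print(find_duplicates("hello-world,goodbye-world,goodbye-world,hello-world"))
# ['hello-world', 'goodbye-world']
```

Because nothing beyond the word is consumed, both the adjacent and the non-adjacent case are reported. A word that appears three or more times is reported once per occurrence except the last, so wrapping the result in `set(...)` may be useful.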