JavaScript regex /(\b\S+\b)(?=.*\b\1\b)/ for removing duplicated words sometimes fails on -, / and ,

Time:12-08

I checked several posts about removing duplicated words from a string in JavaScript (in my case, a word means a substring delimited by spaces). The regex /(\b\S+\b)(?=.*\b\1\b)/g is one of those I found online that handles almost all cases, but it produces some mismatches that I cannot explain. For example, it removes characters such as , / and - even when they are part of a word (that is, before the next space is reached). I guess it has to do with the word-boundary metacharacter \b, but I have not been able to find a solution.

For example, I have the following string samples:

123-1 123-2 test-1 test-1 w/e 10/04/20
Company w/e 09/06/20 083020-090620
a/b 01/01
test_1 test_2
a/b a/b
Inv 50049 50049 Inv 50195 PrjPAN02
Inv 51360-1, 51366-7; 51372 Inv 51360-1, 51366-7; 372 PrjPAN02
Inv 51360-1, 51366-7; 51372 51372 Inv 513601, 51366-7; 372 PrjPAN02
55009, 55017, 55022 55001, 55022, 55025
55254, 61 55246,66,69
55733, 41, 44 55727, 45,48
57269, 71,74,75, 57354 57266, 73
57437, 38, 41, 43 57434, 40
w/e 09/20/20 091320-092020

and it generates the following output. You can test it here: Regex101

1232  test-1 we 1004/20
Company we 0906/20 083020-090620
ab /01
test_1 test_2
 a/b
  50049 Inv 50195 PrjPAN02
 , ; 51372 Inv 513601, 51366-7; 372 PrjPAN02
 513601, ;  51372 Inv 513601, 51366-7; 372 PrjPAN02
55009, 55017,  55001, 55022, 55025
55254, 61 5524666,69
55733, 41, 44 55727, 45,48
57269, 7174,75, 57354 57266, 73
57437, 38, 41, 43 57434, 40
we 09/20 091320-092020

I would expect the following output:

123-1 123-2 test-1 w/e 10/04/20
Company w/e 09/06/20 083020-090620
a/b 01/01
test_1 test_2
a/b
50049 Inv 50195 PrjPAN02
51372 Inv 51360-1, 51366-7; 372 PrjPAN02
51360-1, 51372 Inv 513601, 51366-7; 372 PrjPAN02
55009, 55017, 55022 55001, 55022, 55025
55254, 61 55246,66,69
55733, 41, 44 55727, 45,48
57269, 71,74,75, 57354 57266, 73
57437, 38, 41, 43 57434, 40
w/e 09/20/20 091320-092020

I would expect every repeated string delimited by spaces to be removed, but the regex removes the slash (/), hyphen (-), and comma (,) in some cases inside strings that are delimited by spaces.

I checked the following similar questions to try to find a regular expression that would match all the cases:

  1. Javascript RegExp Word boundaries unicode characters
  2. Remove duplicate words in a string using Regex JS [duplicate]
  3. Regular expression to find and remove duplicate words

CodePudding user response:

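Word boundaries do not work here: \b also matches between a word character and a non-word character such as -, / or a comma, so the pattern can match a fragment inside a space-delimited token. Use whitespace lookarounds instead:

/(?<!\S)(\S+)(?!\S)(?=.*(?<!\S)\1(?!\S))/g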
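As a rough illustration of the difference, the sketch below runs both patterns on one sample from the question. The variable names are only illustrative, and it assumes an engine with lookbehind support (ES2018 or later).

```javascript
// Pattern from the question: \b lets the capture start or stop next to "/" or "-".
const withWordBoundaries = /(\b\S+\b)(?=.*\b\1\b)/g;

// Pattern from this answer: a token must be delimited by whitespace or string edges.
const withWhitespaceLookarounds = /(?<!\S)(\S+)(?!\S)(?=.*(?<!\S)\1(?!\S))/g;

const sample = 'a/b 01/01';

// With \b, the lone "/" and the leading "01" each find a later "duplicate" inside "01/01".
const brokenResult = sample.replace(withWordBoundaries, '');

// With whitespace lookarounds, only whole tokens are compared, so nothing is removed.
const fixedResult = sample.replace(withWhitespaceLookarounds, '');

console.log(JSON.stringify(brokenResult)); // "ab /01"
console.log(JSON.stringify(fixedResult));  // "a/b 01/01"
```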

EXPLANATION

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
      \S                       non-whitespace (all but \n, \r, \t,
                               \f, and " ")
--------------------------------------------------------------------------------
    )                        end of look-behind
--------------------------------------------------------------------------------
    \1                       what was matched by capture \1
--------------------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
      \S                       non-whitespace (all but \n, \r, \t,
                               \f, and " ")
--------------------------------------------------------------------------------
    )                        end of look-ahead
--------------------------------------------------------------------------------
  )                        end of look-ahead
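
Replacing matches with an empty string leaves the spaces around a removed token in place, so the result usually needs a whitespace cleanup to reach the expected output from the question. The following is a minimal sketch, not a definitive implementation: the function names are hypothetical, each input is assumed to be a single line (the `.` in the pattern does not cross line breaks), and the second function is a regex-free alternative that was not part of the original answer, intended for engines without lookbehind support.

```javascript
// Pattern from the answer: a whitespace-delimited token that appears again later on the line.
const DUPLICATE_TOKEN = /(?<!\S)(\S+)(?!\S)(?=.*(?<!\S)\1(?!\S))/g;

// Removes every earlier copy of a repeated token, then collapses the leftover spaces.
function removeDuplicateTokens(line) {
  return line
    .replace(DUPLICATE_TOKEN, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Alternative without lookbehind: split on whitespace and keep only the
// last occurrence of each token, which mirrors what the regex keeps.
function removeDuplicateTokensBySplit(line) {
  const tokens = line.split(/\s+/).filter(Boolean);
  return tokens
    .filter((token, index) => tokens.lastIndexOf(token) === index)
    .join(' ');
}

console.log(removeDuplicateTokens('123-1 123-2 test-1 test-1 w/e 10/04/20'));
// 123-1 123-2 test-1 w/e 10/04/20
console.log(removeDuplicateTokensBySplit('Inv 50049 50049 Inv 50195 PrjPAN02'));
// 50049 Inv 50195 PrjPAN02
```

Both functions keep the last occurrence of each repeated token, which matches the expected output listed in the question.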