I checked several posts about removing duplicated words from a string in JavaScript (in my case, a word means a substring delimited by spaces). The following regex is among the ones I found on the internet, and it handles almost all cases:
/(\b\S+\b)(?=.*\b\1\b)/g
However, it produces some mismatches that I cannot explain. For example, it removes characters such as , / and - when they are part of a word (that is, before a space has been reached). I guess it has to do with the word-boundary metacharacter \b, but I have not been able to find a solution.
For example, I have the following string samples:
123-1 123-2 test-1 test-1 w/e 10/04/20
Company w/e 09/06/20 083020-090620
a/b 01/01
test_1 test_2
a/b a/b
Inv 50049 50049 Inv 50195 PrjPAN02
Inv 51360-1, 51366-7; 51372 Inv 51360-1, 51366-7; 372 PrjPAN02
Inv 51360-1, 51366-7; 51372 51372 Inv 513601, 51366-7; 372 PrjPAN02
55009, 55017, 55022 55001, 55022, 55025
55254, 61 55246,66,69
55733, 41, 44 55727, 45,48
57269, 71,74,75, 57354 57266, 73
57437, 38, 41, 43 57434, 40
w/e 09/20/20 091320-092020
The regex generates the following output (you can test it on Regex101):
1232 test-1 we 1004/20
Company we 0906/20 083020-090620
ab /01
test_1 test_2
a/b
50049 Inv 50195 PrjPAN02
, ; 51372 Inv 513601, 51366-7; 372 PrjPAN02
513601, ; 51372 Inv 513601, 51366-7; 372 PrjPAN02
55009, 55017, 55001, 55022, 55025
55254, 61 5524666,69
55733, 41, 44 55727, 45,48
57269, 7174,75, 57354 57266, 73
57437, 38, 41, 43 57434, 40
we 09/20 091320-092020
I would expect the following output:
123-1 123-2 test-1 w/e 10/04/20
Company w/e 09/06/20 083020-090620
a/b 01/01
test_1 test_2
a/b
50049 Inv 50195 PrjPAN02
51372 Inv 51360-1, 51366-7; 372 PrjPAN02
51360-1, 51372 Inv 513601, 51366-7; 372 PrjPAN02
55009, 55017, 55022 55001, 55022, 55025
55254, 61 55246,66,69
55733, 41, 44 55727, 45,48
57269, 71,74,75, 57354 57266, 73
57437, 38, 41, 43 57434, 40
w/e 09/20/20 091320-092020
I would expect every repeated string delimited by spaces to be removed, but the regex also removes the slash (/), hyphen (-), and comma (,) in some cases inside strings that are delimited by spaces.
I checked the following similar questions to try to find a regular expression that would match all the cases:
- Javascript RegExp Word boundaries unicode characters
- Remove duplicate words in a string using Regex JS [duplicate]
- Regular expression to find and remove duplicate words
CodePudding user response:
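Word boundaries do not work here: \b matches at any transition between a word character (letter, digit, or underscore) and a non-word character, so characters such as /, - and , create boundaries inside a space-delimited token, and the pattern can match and remove fragments of a token. Use whitespace boundaries instead:
/(?<!\S)(\S+)(?!\S)(?=.*(?<!\S)\1(?!\S))/g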
EXPLANATION
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\S+ non-whitespace (all but \n, \r, \t, \f,
and " ") (1 or more times (matching the
most amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t, \f,
and " ")
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t,
\f, and " ")
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
\1 what was matched by capture \1
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
\S non-whitespace (all but \n, \r, \t,
\f, and " ")
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
) end of look-ahead
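As a rough usage sketch, the pattern above could be applied to one line at a time as shown below. The helper name removeDuplicateTokens is made up for illustration, and collapsing the leftover spaces and trimming the result are assumptions based on the expected output in the question; the regex alone only deletes the earlier occurrences and leaves their surrounding spaces in place. Lookbehind needs an engine that supports ES2018 regular expressions.

```javascript
// Hypothetical helper (not from the original answer): removes every
// space-delimited token that occurs again later on the same line,
// so the LAST occurrence of each repeated token is the one kept.
function removeDuplicateTokens(line) {
  // (?<!\S) and (?!\S) act as whitespace boundaries in place of \b.
  const duplicateToken = /(?<!\S)(\S+)(?!\S)(?=.*(?<!\S)\1(?!\S))/g;
  return line
    .replace(duplicateToken, "") // drop the earlier occurrences
    .replace(/ {2,}/g, " ")      // assumption: collapse the gaps left behind
    .trim();                     // assumption: drop leading/trailing spaces
}

console.log(removeDuplicateTokens("123-1 123-2 test-1 test-1 w/e 10/04/20"));
// 123-1 123-2 test-1 w/e 10/04/20
console.log(removeDuplicateTokens("a/b a/b"));
// a/b
console.log(removeDuplicateTokens("Inv 50049 50049 Inv 50195 PrjPAN02"));
// 50049 Inv 50195 PrjPAN02
console.log(removeDuplicateTokens("a/b 01/01"));
// a/b 01/01
```

Because . does not match line breaks without the s flag, the lookahead only searches the rest of the current line, so duplicates are detected per line rather than across a whole multi-line string.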