Remove duplicates keeping the last occurrence

Time:11-08

Given this haystack and PCRE2 regex (PHP >= 7.3):

#1       #2      #3
green          [foo] [foo1]
red             [foo]
blue                  [foo] [foo1] [foo2]
yellow             [foo2]
green          [foo]
green          [foo] [foo1]
red             [foo]
pink                  [foo3]

Where:

#1 is always a string that can contain numbers but no spaces.

#2 is always a random amount of space between #1 and #3.

#3 is the same as #1 but inside brackets [ ], and there can be multiple bracketed items.

I'm trying to remove all lines containing dupes on #1 but keeping the last dupe line found.

It would look like:

blue                  [foo] [foo1] [foo2]
yellow             [foo2]
green          [foo] [foo1]
red             [foo]
pink                  [foo3]

All lines that share the same string in #1 are removed, keeping only the last one. Lines with no dupes in #1, for example pink [foo3], are kept as they are.

I tried to explain it in as much detail as possible; let me know if it is still unclear or if it's not possible with regex.

CodePudding user response:

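You can convert matches of the following regular expression (with flags g, m and i) to empty strings:

^([a-z\d]+).*\n(?=[\s\S]*\b^\1\b)

The flag g prevents returning after the first match, m (multiline) causes ^ and $ to match the beginning and end of lines rather than the beginning and end of the string, and i makes matches case insensitive. (PHP has no g flag; preg_replace replaces every match by default.)

Demo

The elements of the expression are as follows:

^             # match beginning of line
([a-z\d]+)    # match one or more letters or digits and save to capture group 1
.*            # match zero or more characters other than newlines
\n            # match linefeed
(?=           # begin positive lookahead
  [\s\S]*     # match zero or more characters including line terminators
  \b^\1\b     # match content of group 1 at the start of a later line, with word breaks before and after
)             # end positive lookahead

The lookahead must be positive: a line is matched, and therefore removed, only when its first column appears again at the start of a later line. The last line can never match, since no line follows it, so it does not need to end with a line feed.

Note that . matches carriage returns \r, so lines ending in \r\n are removed whole. If one key can be a prefix of another (green and greenish, say), add \b directly after the capture group so that group 1 cannot backtrack to a shorter prefix.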
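As a rough illustration, the replacement could be run from PHP along these lines. This is only a sketch: it assumes a positive lookahead, so that a line is matched only when its first column shows up again later, and the variable names ($lines, $haystack, $pattern, $result) are made up for the example.

```php
<?php
// Sample input from the question; a trailing line feed is appended so that
// every line, including the last, ends with \n.
$lines = [
    'green          [foo] [foo1]',
    'red             [foo]',
    'blue                  [foo] [foo1] [foo2]',
    'yellow             [foo2]',
    'green          [foo]',
    'green          [foo] [foo1]',
    'red             [foo]',
    'pink                  [foo3]',
];
$haystack = implode("\n", $lines) . "\n";

// m: ^ matches at the start of every line; i: case-insensitive.
// preg_replace() already replaces all matches, so no "g" flag is needed.
// A line is matched only if its first column starts some later line.
$pattern = '/^([a-z\d]+).*\n(?=[\s\S]*\b^\1\b)/mi';

// Replace each matched (earlier duplicate) line with an empty string.
$result = preg_replace($pattern, '', $haystack);

echo $result;
```

For the sample input this keeps the blue and yellow lines, the last green line, the last red line and the pink line, in their original order.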


If you wish to identify any strings that do not possess the required format you can use the following regular expression to match incorrectly-formatted lines:

^(?![a-z\d]*(?: *\[[^[\]\r\n]*\])+\r?\n).*

Demo

Hover your cursor over each element of the expression at the link to obtain an explanation of the function of that element.
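A small PHP sketch of that check might look like the following. It assumes the expression carries a + after the bracket group, so that at least one bracketed token is required, and it uses made-up sample lines and variable names ($lines, $haystack, $pattern, $malformed) purely for illustration.

```php
<?php
// Hypothetical sample: two well-formed lines and two malformed ones.
$lines = [
    'green          [foo] [foo1]',   // well-formed
    'red',                           // malformed: no bracketed token
    'blue                  foo]',    // malformed: missing opening bracket
    'pink                  [foo3]',  // well-formed
];
// The lookahead expects a line feed after each line, so one is appended.
$haystack = implode("\n", $lines) . "\n";

// The negative lookahead rejects well-formed lines; anything else is
// matched in full by .* and reported as malformed.
$pattern = '/^(?![a-z\d]*(?: *\[[^[\]\r\n]*\])+\r?\n).*/mi';

preg_match_all($pattern, $haystack, $matches);
$malformed = $matches[0];

print_r($malformed);
```

Only the red and blue lines are reported here, since neither is followed by a complete bracketed token.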

CodePudding user response:

You could use

^(\S+)\h+\[\S*\](?!\S).*$(?![\s\S]*^\1)
  • ^ Start of line (with the m flag)
  • (\S+) Capture group 1, match one or more non-whitespace characters
  • \h+ Match one or more horizontal whitespace characters
  • \[\S*\](?!\S) Match from an opening [ till closing ] and assert a whitespace boundary to the right to not match [foo]a
  • .*$ Match the rest of the line
  • (?![\s\S]*^\1) Negative lookahead, assert that capture group 1 does not occur at the start of any later line

See a regex demo | PHP demo.

For example

$re = '/^(\S+)\h+\[\S*\](?!\S).*$(?![\s\S]*^\1)/m';
$str = 'green          [foo] [foo1]
red             [foo]
blue                  [foo] [foo1] [foo2]
yellow             [foo2]
green          [foo]
green          [foo] [foo1]
red             [foo]
pink                  [foo3]';

preg_match_all($re, $str, $matches);
print_r($matches[0]);

Output

Array
(
    [0] => blue                  [foo] [foo1] [foo2]
    [1] => yellow             [foo2]
    [2] => green          [foo] [foo1]
    [3] => red             [foo]
    [4] => pink                  [foo3]
)