Regex match repeated or similar lines in file

Time:10-18

I'm trying to remove duplicated or similar lines. All duplicated or similar lines should be selected except the last match, which should stay unselected.

This is the text I want to clean (ignore the line numbers; they are only there to show which line I'm referring to):

l1:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris
l2:
l3:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information". The painter and critic Maurice Denis shared a sense of bewilderment about Cézanne's revoluti
l4:
l5:Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".
l6:
l7:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".
l8:
l9:He overturned centuries of theories about how the eye works by depicting a world constantly in motion, affected by the passing of time and infused with the artist's own memories and emotions.
l10:
l11:In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".

In this example, I want only the last occurrence, on line 11, to stay unselected. It is the last line with this text:

In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".

Lines 1, 3, 5 and 7 contain the same or similar text and should be matched by the regex (selected). The text on a line could be anything up to the newline, and the regex should also detect other cases like this in the file.

I'm using this regex, but it isn't working properly: it selects only l1 and l7, when it should also select l3 and l5. Here is the example (https://regex101.com/r/gd0Z3V/1):

(?sm)(^[^\r\n]*)[\r\n](?=.*^\1)

CodePudding user response:

The main problem here is that regex doesn't understand human logic. "It looks the same" does not exist in regex. So the first requirement is to translate human logic to regex logic.

We can do that by specifying how many characters we want to be exactly the same to consider it a match.

Here I choose 100 characters. (You can of course change that, but it works with your example text).

Now we can build a regex that matches the whole line if 100 characters in that line are repeated further down the text:

/^.*(.{100}).*$(?=[\s\S]+\1)/gm

Explanation:

^.* - match from start of line zero or more characters

(.{100}) - create group 1, matching 100 characters

.*$ - match the rest of the line

(?=[\s\S]+\1) - look ahead for one or more of ANY character (including newline) followed by the text matched in group 1.

The result is that the whole line is matched, if 100 characters are repeated further down.
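As a rough check of that behavior, here is a minimal JavaScript sketch that runs the pattern over a sample modeled on the question's text. The helper name `findRepeatedLines` and its `minShared` parameter are made up for illustration, and the sample lines are rebuilt from one sentence rather than copied verbatim.

```javascript
// Hypothetical helper (not from the original answer): returns every line that
// shares at least `minShared` consecutive characters with text further down.
function findRepeatedLines(text, minShared = 100) {
  // Same pattern as above, with the window size made configurable.
  const re = new RegExp(`^.*(.{${minShared}}).*$(?=[\\s\\S]+\\1)`, 'gm');
  return text.match(re) || [];
}

// Sample modeled on the question's text (lines l1, l3, l5, l7, l9, l11).
const full = `In 1881, Paul Gauguin joked about how to extract Cézanne's mysterious methods, instructing Camille Pissarro to "ply him with one of those mysterious homeopathic drugs and come straight to Paris to share the information".`;
const lines = [
  full.slice(0, full.indexOf(' to share')),            // l1: truncated copy
  `${full} The painter and critic Maurice Denis shared a sense of bewilderment`, // l3: longer copy
  full.slice(full.indexOf('Cézanne')),                 // l5: copy missing its start
  full,                                                // l7: exact copy
  `He overturned centuries of theories about how the eye works by depicting a world constantly in motion, affected by the passing of time and infused with the artist's own memories and emotions.`, // l9: unrelated
  full,                                                // l11: last occurrence
];
const text = lines.join('\n\n'); // blank lines between entries, as in the question

const selected = findRepeatedLines(text);
console.log(selected.length); // 4: l1, l3, l5 and l7 are selected; l9 and l11 are not
```

With the same pattern, `text.replace(re, '')` would blank out the selected lines instead of listing them. Because the lookahead rescans the rest of the text for every candidate window, this approach is likely to get slow on large files.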

I have created a test case for you here: JSRegExpBuilder (it uses JavaScript, but it should work in most flavors).
