So I have some medium-length string - somewhere between a few words and a few sentences. Sometimes, a substring in the text is repeated twice in a row. I need to write automatic code to identify the repeated part. Or at least flag it with a high probability.
What I know:
- The repeated substring is a series of a few whole words (and punctuation marks). A repeat will not happen in the middle of a word.
- The repeat is of a variable length. It can be a few words to a few sentences itself. But it's always at least a few words long. I would like to avoid flagging single word repetitions if possible.
- When a repeat happens, it's always repeated exactly once, and right after the previous appearence. right after the previous appearence. (<- example)
- I need to run this check on about a million different strings, so the code has to be somewhat efficient at least (not the brute force check-every-option approach).
I've been struggling with this for a while now. Would really appreciate your help.
CodePudding user response:
Since the repetition of one word is a subclass of a multiple-word repetition, it's already helpful to match single words or word-like sequences. Here is the regular expression I tried on your question in an editor with regex search:
(\<\w.{3,16}\w\>).{2,}\1
This is the first repetition found
The repeat is of a variable length. It can be a few words to a few sentences itself. But it's always at least a few words long. I would like to avoid flagging single word repetitions if possible.
But it next finds repeat
in repeating
. So we have to tune the limits.
The part (\<\w.{3,16}\w\>)
means
- from word start (including a character)
- 3 to 16 arbitrary characters
- before word end (including a character)
In other words, one or more word with a total character count of 5 to 18.
The part .{2,}\1
means
- at least two characters
- no upper limit
- captured match
Here, the lower limit can be higher. An upper limit should be tried, especially on longer text.
I'd think that starting with finding short character sequences which repeat, then refine by looking for longer sequences that repeat in the result of the first step (plus additional characters at the end).
It's also a matter of preprocessing. I'd guess that repeating multiple-word sequences should be missed if line breaks (instead of space occur) on different places.
To automate this further, you may switch to Python's re
module.