Regex to find repeated sentences from more to less-CodePudding

i have string like

$string = "hello this is a string and hello but this is a string hello but this is a string and ";

in it there is repeated words and repeated sentences but i only want the sentences so i expect

hello but this is a string to be captured

i tried using this regex (.{10,}).*?\1 but it got me this is a string and

but i want to get hello but this is a string because it is the most letters from 10 without making it {25,} to match more only

but it is also very very slow

Cary Swoveland: my plan is to capture longest string repeated and remove it from the string and leaving only one so in my example it would be

hello this is a string and hello but this is a string and

CodePudding user response：

Collect all substrings that start with a word char at a word boundary position and get the longest one using an extra step (as it is impossible to do with plain regex):

$string = "hello this is a string and hello but this is a string hello but this is a string and ";
if (preg_match_all('~(?=(\b(\w.{9,})(?=.*?\b\2)))~u', $string, $m)) {
    echo array_reduce($m[1], function ($a, $b) { return strlen($a ?? '') > strlen($b ?? '') ? $a : $b; });
}
// => hello but this is a string

See the PHP demo. See the regex demo.

Note: if you plan to limit the length of the matches to 25 chars, use '~(?=(\b(\w.{9,24})(?=.*?\b\2)))~u'.

Details:

(?= - start of a positive lookahead:
- ( - Group 1:
  - \b - word boundary -(\w.{9,}) - a word char and then nine or more chars other than line break chars
  - (?=.*?\b\2) - a positive lookahead that requires any zero or more chars other than line break chars as few as possible and then the same string as captured in Group 2 preceded with a word boundary
- ) - end of Group 1
) - end of lookahead.

We only get the longest string from the $m[1] array using array_reduce($m[1], function ($a, $b) { return strlen($a ?? '') > strlen($b ?? '') ? $a : $b; }).

CodePudding user response：

I wish to suggest an algorithm to compute longest repeated substring of a given string that is comprised of words and is not preceded nor followed by a word character.

The approach is straightforward: condition on the word in the string that is the first word of the substring. I initially remove non-word characters at the beginning of the string, then find the longest repeating string that begins at the start of that string.

Next, regardless of whether a repeating string was found, a new string is formed by removing the first word of the modified string as well as following non-word characters. The process is repeated, at each step where a repeating string is found the length of that string being compared to the length of the previously longest-known repeating string.

This algorithm can be implemented in Ruby as follows. I realize that a PHP solution is desired, of course, but I do not know PHP. However, the Ruby code reads much like pseudo-code. Perhaps an inspired reader would be interested in converting it to PHP.

RGX = /\A(.*\w)\b(?=.*\b\1\b)/

def longest_repeating(str)
  longest = { str: '', len: 0 } # best solution known so far
  loop do                       # loop until breaking out by returning
    i = (str =~ /\w/)           # index of first word char if present
    return longest if i.nil?    # no more words to examine
    str = str[i..-1]            # remove first i characters  
    s = str[RGX]                # obtain string matched by RGX
    if s                        # match found
      n = s.length              # update longest if new longest found
      longest = { str: s, len: n } if n > longest[:len]
    end
    str = str[/ .*/]            # remove leading spaces from str
  end
end

longest_repeating "hello this is a string and hello but this is a string hello but this is a string and "
  #=> {:str=>"hello but this is a string", :len=>26}

longest_repeating "aaa bbb ccc ddd bbb ccc ddd eee fff fff ggg hhh iii jjj kkk lll eee fff fff ggg hhh iii jjj"           
  #=> {:str=>"eee fff fff ggg hhh iii jjj", :len=>27}

The regular expression held by RGX can be broken down as follows.

\A        # match beginning of string
(.*\w)    # match zero or more characters followed by a word
          # characters, save to capture group 1
\b        # match a word boundary
(?=       # begin a positive lookahead
  .*      # match zero or more characters
  \b\1\b  # match the content of capture group 1 with word boundaries
)

Note that the code ensures that the first character matched by (.*\w) is a word character.