i have string like
$string = "hello this is a string and hello but this is a string hello but this is a string and ";
in it there is repeated words and repeated sentences but i only want the sentences so i expect
hello but this is a string
to be captured
i tried using this regex (.{10,}).*?\1
but it got me this is a string and
but i want to get hello but this is a string
because it is the most letters from 10
without making it {25,}
to match more only
but it is also very very slow
- Cary Swoveland: my plan is to capture longest string repeated and remove it from the string and leaving only one so in my example it would be
hello this is a string and hello but this is a string and
CodePudding user response:
Collect all substrings that start with a word char at a word boundary position and get the longest one using an extra step (as it is impossible to do with plain regex):
$string = "hello this is a string and hello but this is a string hello but this is a string and ";
if (preg_match_all('~(?=(\b(\w.{9,})(?=.*?\b\2)))~u', $string, $m)) {
echo array_reduce($m[1], function ($a, $b) { return strlen($a ?? '') > strlen($b ?? '') ? $a : $b; });
}
// => hello but this is a string
See the PHP demo. See the regex demo.
Note: if you plan to limit the length of the matches to 25 chars, use '~(?=(\b(\w.{9,24})(?=.*?\b\2)))~u'
.
Details:
(?=
- start of a positive lookahead:(
- Group 1:\b
- word boundary -(\w.{9,})
- a word char and then nine or more chars other than line break chars(?=.*?\b\2)
- a positive lookahead that requires any zero or more chars other than line break chars as few as possible and then the same string as captured in Group 2 preceded with a word boundary
)
- end of Group 1
)
- end of lookahead.
We only get the longest string from the $m[1]
array using array_reduce($m[1], function ($a, $b) { return strlen($a ?? '') > strlen($b ?? '') ? $a : $b; })
.
CodePudding user response:
I wish to suggest an algorithm to compute longest repeated substring of a given string that is comprised of words and is not preceded nor followed by a word character.
The approach is straightforward: condition on the word in the string that is the first word of the substring. I initially remove non-word characters at the beginning of the string, then find the longest repeating string that begins at the start of that string.
Next, regardless of whether a repeating string was found, a new string is formed by removing the first word of the modified string as well as following non-word characters. The process is repeated, at each step where a repeating string is found the length of that string being compared to the length of the previously longest-known repeating string.
This algorithm can be implemented in Ruby as follows. I realize that a PHP solution is desired, of course, but I do not know PHP. However, the Ruby code reads much like pseudo-code. Perhaps an inspired reader would be interested in converting it to PHP.
RGX = /\A(.*\w)\b(?=.*\b\1\b)/
def longest_repeating(str)
longest = { str: '', len: 0 } # best solution known so far
loop do # loop until breaking out by returning
i = (str =~ /\w/) # index of first word char if present
return longest if i.nil? # no more words to examine
str = str[i..-1] # remove first i characters
s = str[RGX] # obtain string matched by RGX
if s # match found
n = s.length # update longest if new longest found
longest = { str: s, len: n } if n > longest[:len]
end
str = str[/ .*/] # remove leading spaces from str
end
end
longest_repeating "hello this is a string and hello but this is a string hello but this is a string and "
#=> {:str=>"hello but this is a string", :len=>26}
longest_repeating "aaa bbb ccc ddd bbb ccc ddd eee fff fff ggg hhh iii jjj kkk lll eee fff fff ggg hhh iii jjj"
#=> {:str=>"eee fff fff ggg hhh iii jjj", :len=>27}
The regular expression held by RGX
can be broken down as follows.
\A # match beginning of string
(.*\w) # match zero or more characters followed by a word
# characters, save to capture group 1
\b # match a word boundary
(?= # begin a positive lookahead
.* # match zero or more characters
\b\1\b # match the content of capture group 1 with word boundaries
)
Note that the code ensures that the first character matched by (.*\w)
is a word character.