Find identical phrases in strings-CodePudding

I have sets of texts. The texts are rather short (up to 500 characters). The sets consist of up to 10 texts. I try to find phrases as long as possible which occur in most of the texts. In other words I am looking for identical substrings. The longer the better and the more texts they occur in also the better.

Example:

"The red brown fox jumps over the lazy dog"
"The red haired girl smokes brown cigars"
"Where the yellow fox jumps over the haywire"
"All the boys like a red brown fox"
"A girl like a fox jumps over the dead boys grave"

Phrases (one word phrases ommitted):

"red brown fox", length 12, occurence 2
"fox jumps over the", length 18, occurrence 3
"The red", length 7, occurrence 2

Phrases like "brown fox" or "fox jumps" are omitted, because they are subphrases of longer phrases.

I am looking for an algorithm to find those phrases.

CodePudding user response：

Finding the longuest commun substring is a commun DP algorithm explained pretty well here. https://www.geeksforgeeks.org/longest-common-substring-dp-29/ . After to find the occurence of the strings in the set of text you can simply use a code loke this.

substring = "red brown fox"
n = 0
for text in texts:
    if substring in text:
       n = n   1
print(substring, n)

CodePudding user response：

As every substring is a prefix of some suffix, traversing a generalised suffix tree ought to let us compare paths both by how many leaves from different texts they share, indicating quantity of texts sharing a substring, as well as how long the distances to the leaves, indicating shared substring lengths.