I have sets of texts. The texts are rather short (up to 500 characters). The sets consist of up to 10 texts. I try to find phrases as long as possible which occur in most of the texts. In other words I am looking for identical substrings. The longer the better and the more texts they occur in also the better.
Example:
- "The red brown fox jumps over the lazy dog"
- "The red haired girl smokes brown cigars"
- "Where the yellow fox jumps over the haywire"
- "All the boys like a red brown fox"
- "A girl like a fox jumps over the dead boys grave"
Phrases (one word phrases ommitted):
- "red brown fox", length 12, occurence 2
- "fox jumps over the", length 18, occurrence 3
- "The red", length 7, occurrence 2
Phrases like "brown fox" or "fox jumps" are omitted, because they are subphrases of longer phrases.
I am looking for an algorithm to find those phrases.
CodePudding user response:
Finding the longuest commun substring is a commun DP algorithm explained pretty well here. https://www.geeksforgeeks.org/longest-common-substring-dp-29/ . After to find the occurence of the strings in the set of text you can simply use a code loke this.
substring = "red brown fox"
n = 0
for text in texts:
if substring in text:
n = n 1
print(substring, n)
CodePudding user response:
As every substring is a prefix of some suffix, traversing a generalised suffix tree ought to let us compare paths both by how many leaves from different texts they share, indicating quantity of texts sharing a substring, as well as how long the distances to the leaves, indicating shared substring lengths.