I have a text, for example:
a = '''
- Hello, this is a noname podcast and joining us today is Joe.
- Who's Joe? you would ask...
...
- Anyway, thank you for listening to the noname podcast. Special thanks to Joe, who was joining us today.
'''
Now I would like to find the repeating parts of this text that are of sufficient length. For example, let us set the limit to 4 (so we are only looking for strings longer than 4 characters), which leaves the name Joe out. So we should get:
['noname podcast', 'joining us today']
I had an idea of using difflib for this, but it only works by comparing two texts, so I tried feeding it the same text twice and picking sequences that appear more than twice with difflib.SequenceMatcher, but it just returns one sequence, which is the whole text (not very surprising, really).
What would be the correct way to approach this?
CodePudding user response:
Here is a proposal. It operates at the word level using a simple split, but you can tune this to sentences or any other split depending on your interests. Also, importantly, you'd need some preprocessing (I just cleaned out dots, but in a larger text more should be done).
a = '''
- Hello, this is a noname podcast and joining us today is Joe.
- Who's Joe? you would ask...
- Anyway, thank you for listening to the noname podcast. Special thanks to Joe, who was joining us today.
'''
from collections import Counter

mylist = a.split()
# Some preprocessing: strip dots
mylist = [w.replace('.', '') for w in mylist]

my_dict = Counter(mylist)
# Words longer than 4 characters that occur more than once
print([w for w, c in my_dict.items() if len(w) > 4 and c > 1])
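The Counter above only finds single repeated words, not phrases. It can be extended to repeated phrases by counting word n-grams instead of individual words — a rough sketch (plain Python; the punctuation cleanup and the 2–4 word n-gram range are my own assumptions):

```python
from collections import Counter

a = '''
- Hello, this is a noname podcast and joining us today is Joe.
- Who's Joe? you would ask...
- Anyway, thank you for listening to the noname podcast. Special thanks to Joe, who was joining us today.
'''

# Clean punctuation off token edges and drop empty tokens
words = [w.strip('.,?-') for w in a.split()]
words = [w for w in words if w]

# Count every phrase of 2 to 4 consecutive words
phrases = Counter()
for n in range(2, 5):
    for i in range(len(words) - n + 1):
        phrases[' '.join(words[i:i + n])] += 1

# Keep phrases longer than 4 characters that occur more than once
repeated = [p for p, c in phrases.items() if c > 1 and len(p) > 4]

# Drop phrases that are contained in a longer repeated phrase
maximal = [p for p in repeated if not any(p != q and p in q for q in repeated)]
print(maximal)  # ['noname podcast', 'joining us today']
```

The final filter keeps only maximal phrases, so sub-phrases like 'joining us' and 'us today' are suppressed in favour of 'joining us today'.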
CodePudding user response:
Using nltk:
from collections import Counter
import nltk  # word_tokenize may require: nltk.download('punkt')
a = '''
- Hello, this is a noname podcast and joining us today is Joe.
- Who's Joe? you would ask...
...
- Anyway, thank you for listening to the noname podcast. Special thanks to Joe, who was joining us today.
'''
# function to replace words
def word_remover(string, noise_list):
    for v in noise_list:
        string = string.replace(v, "")
    return string
data_split = nltk.word_tokenize(a)
repeat_words = [k for k, v in Counter(data_split).items() if v >= 2 and len(k) > 1]
get_index = [data_split.index(i) for i in repeat_words]
find_seq = []
for i in get_index:
    if i == get_index[-1]:
        break
    if i + 1 in get_index:
        find_seq.append(data_split[i])
        find_seq.append(data_split[i + 1])
duplicates = [i for i in repeat_words if i in find_seq or len(i) > 3]
print(duplicates)
print(word_remover(a, duplicates))
>>>>['noname', 'podcast', 'joining', 'us', 'today']
>>>>- Hello, this is a and is Joe.
>>>>- Who's Joe? you would ask...
>>>>...
>>>>- Anyway, thank you for listening to the . Special thanks to Joe, who was .
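The SequenceMatcher idea from the question can also be made to work: instead of comparing the text with itself, compare each pair of lines and collect matching blocks longer than the limit. A sketch of that approach (the line splitting and the length filters are my own assumptions):

```python
from difflib import SequenceMatcher
from itertools import combinations

a = '''
- Hello, this is a noname podcast and joining us today is Joe.
- Who's Joe? you would ask...
...
- Anyway, thank you for listening to the noname podcast. Special thanks to Joe, who was joining us today.
'''

# One string per non-empty line, with the leading "- " removed
lines = [s.strip('- ') for s in a.splitlines() if s.strip('- .')]

found = set()
for s1, s2 in combinations(lines, 2):
    sm = SequenceMatcher(None, s1, s2, autojunk=False)
    # Matching blocks are common substrings of the two lines
    for block in sm.get_matching_blocks():
        frag = s1[block.a:block.a + block.size].strip()
        if block.size > 4 and len(frag) > 4:
            found.add(frag)

print(sorted(found))
```

This recovers 'noname podcast' and 'joining us today', though short character-level overlaps (e.g. fragments around 'Joe') can still slip through and would need some postprocessing.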