I have a text, for example:
a = '''
- Hello, this is a noname podcast and joining us today is Joe.
- Who's Joe? you would ask...
...
- Anyway, thank you for listening to the noname podcast. Special thanks to Joe, who was joining us today.
'''
Now I would like to find the repeating parts of this text that are of sufficient length. For example, let us set the limit to 4 (so we are only looking for strings longer than 4 characters), which leaves the name Joe out. So we should get:
['noname podcast', 'joining us today']
I had an idea of using difflib for this, but it only works by comparing two texts, so I tried feeding it the same text twice and picking sequences that appear more than twice with difflib.SequenceMatcher, but it just returns one sequence, which is the whole text (not very surprising, really).
What would be the correct way to approach this?
CodePudding user response:
Here is a proposal. It operates at the word level using a simple split, but you can tune this to sentences or any other split depending on your interests. Also, importantly, you'd need some preprocessing (I just cleaned out dots, but in a larger text more should be done).
a = '''
- Hello, this is a noname podcast and joining us today is Joe.
- Who's Joe? you would ask...
- Anyway, thank you for listening to the noname podcast. Special thanks to Joe, who was joining us today.
'''
from collections import Counter

mylist = a.split()
# Some preprocessing: strip dots
mylist = [w.replace('.', '') for w in mylist]

my_dict = Counter(mylist)
# Words longer than 4 characters that occur more than once
print([w for w, c in my_dict.items() if len(w) > 4 and c > 1])
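The Counter above only finds single repeated words, not phrases. It can be extended to repeated phrases by counting word n-grams instead of individual words — a rough sketch (plain Python; the punctuation cleanup and the 2–4 word n-gram range are my own assumptions):

```python
from collections import Counter

a = '''
- Hello, this is a noname podcast and joining us today is Joe.
- Who's Joe? you would ask...
- Anyway, thank you for listening to the noname podcast. Special thanks to Joe, who was joining us today.
'''

# Clean punctuation off token edges and drop empty tokens
words = [w.strip('.,?-') for w in a.split()]
words = [w for w in words if w]

# Count every phrase of 2 to 4 consecutive words
phrases = Counter()
for n in range(2, 5):
    for i in range(len(words) - n + 1):
        phrases[' '.join(words[i:i + n])] += 1

# Keep phrases longer than 4 characters that occur more than once
repeated = [p for p, c in phrases.items() if c > 1 and len(p) > 4]

# Drop phrases that are contained in a longer repeated phrase
maximal = [p for p in repeated if not any(p != q and p in q for q in repeated)]
print(maximal)  # ['noname podcast', 'joining us today']
```

The final filter keeps only maximal phrases, so sub-phrases like 'joining us' and 'us today' are suppressed in favour of 'joining us today'.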
CodePudding user response:
Using nltk:
from collections import Counter
import nltk  # word_tokenize may require: nltk.download('punkt')
a = '''
- Hello, this is a noname podcast and joining us today is Joe.
- Who's Joe? you would ask...
...
- Anyway, thank you for listening to the noname podcast. Special thanks to Joe, who was joining us today.
'''
# function to replace words
def word_remover(string, noise_list):
    for v in noise_list:
        string = string.replace(v, "")
    return string
data_split = nltk.word_tokenize(a)
repeat_words = [k for k, v in Counter(data_split).items() if v >= 2 and len(k) > 1]
get_index = [data_split.index(i) for i in repeat_words]
find_seq = []
for i in get_index:
    if i == get_index[-1]:
        break
    if i + 1 in get_index:
        find_seq.append(data_split[i])
        find_seq.append(data_split[i + 1])
duplicates = [i for i in repeat_words if i in find_seq or len(i) > 3]
print(duplicates)
print(word_remover(a, duplicates))
>>>>['noname', 'podcast', 'joining', 'us', 'today']
>>>>- Hello, this is a and is Joe.
>>>>- Who's Joe? you would ask...
>>>>...
>>>>- Anyway, thank you for listening to the . Special thanks to Joe, who was .
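The SequenceMatcher idea from the question can also be made to work: instead of comparing the text with itself, compare each pair of lines and collect matching blocks longer than the limit. A sketch of that approach (the line splitting and the length filters are my own assumptions):

```python
from difflib import SequenceMatcher
from itertools import combinations

a = '''
- Hello, this is a noname podcast and joining us today is Joe.
- Who's Joe? you would ask...
...
- Anyway, thank you for listening to the noname podcast. Special thanks to Joe, who was joining us today.
'''

# One string per non-empty line, with the leading "- " removed
lines = [s.strip('- ') for s in a.splitlines() if s.strip('- .')]

found = set()
for s1, s2 in combinations(lines, 2):
    sm = SequenceMatcher(None, s1, s2, autojunk=False)
    # Matching blocks are common substrings of the two lines
    for block in sm.get_matching_blocks():
        frag = s1[block.a:block.a + block.size].strip()
        if block.size > 4 and len(frag) > 4:
            found.add(frag)

print(sorted(found))
```

This recovers 'noname podcast' and 'joining us today', though short character-level overlaps (e.g. fragments around 'Joe') can still slip through and would need some postprocessing.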