I'm trying to use regex to get all combinations of two words from raw HTML.
import re
possibleNames = []
data = '<p>or lucy clark and jim hudson sandler for temenos,<br>'
ndata = re.findall("([A-Za-z]+ [A-Za-z]+)", data)
for item in ndata:
    possibleNames.append(item)
print(possibleNames)
I have been trying to find a way to get it to append all possible combinations, but the output looks like this:
['or lucy', 'clark and', 'jim hudson', 'sandler for']
How do I get it to output all two-word phrases from the input, including the overlapping ones (['or lucy', 'lucy clark', 'clark and', 'and jim', ...])?
CodePudding user response:
A regular expression isn't necessary: you just need to use .split() and iterate over the result from that function.
text = 'or lucy clark and jim hudson sandler for temenos'
words = text.split(' ')
bigrams = []
for i in range(len(words) - 1):
    # join each word with the one that follows it
    bigrams.append(' '.join([words[i], words[i + 1]]))
print(bigrams)
CodePudding user response:
One possible solution without regex:
words = 'or lucy clark and jim hudson sandler for temenos'.split()
print([' '.join((a, b)) for a, b in zip(words, words[1:])])
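If the starting point is the raw HTML string from the question rather than an already cleaned sentence, the same zip idea should still work once the tags and punctuation are out of the way. A minimal sketch, assuming the tags can be stripped with a simple regex (fine for this snippet, but not a general HTML parser); the function name bigrams_from_html is made up for illustration:

```python
import re

def bigrams_from_html(html):
    # Hypothetical helper: replace anything that looks like a tag with a space.
    text = re.sub(r'<[^>]+>', ' ', html)
    # Keep runs of letters only, which also drops the trailing comma.
    words = re.findall(r'[A-Za-z]+', text)
    # Pair every word with its successor.
    return [' '.join(pair) for pair in zip(words, words[1:])]

data = '<p>or lucy clark and jim hudson sandler for temenos,<br>'
print(bigrams_from_html(data))
```

For real-world pages a proper HTML parser is probably the safer choice for the tag-stripping step.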
CodePudding user response:
One approach, using re.findall to find complete word pairs between the two HTML tags:
import re
data = '<p>or lucy clark and jim hudson sandler for temenos,<br>'
inp = re.findall(r'<p>(.*?)<br>', data)[0]
pairs = re.findall(r'\w+ \w+', inp)
print(pairs)  # ['or lucy', 'clark and', 'jim hudson', 'sandler for']
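Note that a pattern like this consumes both words of each match, so the next search starts after the second word and the pairs never overlap. One way to get every adjacent pair with a regex is to put the pair inside a lookahead, which matches without consuming any text. A sketch, assuming words are runs of \w characters separated by a single space:

```python
import re

data = '<p>or lucy clark and jim hudson sandler for temenos,<br>'
inp = re.findall(r'<p>(.*?)<br>', data)[0]
# \b anchors the match at a word boundary; the lookahead captures the
# pair without consuming it, so the next match can start at the second word.
pairs = re.findall(r'\b(?=(\w+ \w+))', inp)
print(pairs)
```

The trailing "temenos," has no following word, so it only appears as the second half of the last pair.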