How can I get all possible combinations using regex

I'm trying to use regex to get all combinations of two words from raw HTML.

from re import findall

possibleNames = []

data = '<p>or lucy clark and jim hudson sandler for temenos,<br>'

ndata = findall(r"([A-Za-z]+ [A-Za-z]+)", data)
for item in ndata:
    possibleNames.append(item)

print(possibleNames)

I have been trying to find a way to get it to append all possible combinations, but the output looks like this:

['or lucy', 'clark and', 'jim hudson', 'sandler for']

How do I get it to output every two-word phrase from the input, including the overlapping ones (['or lucy', 'lucy clark', 'clark and', 'and jim', ...])?

CodePudding user response：

A regular expression isn't necessary: you just need to use .split() and iterate over the result from that function.

text = 'or lucy clark and jim hudson sandler for temenos'
words = text.split(' ')
bigrams = []
for i in range(len(words) - 1):
    bigrams.append(' '.join([words[i], words[i + 1]]))

print(bigrams)

CodePudding user response：

One possible solution without regex:

words = 'or lucy clark and jim hudson sandler for temenos'.split()
print([' '.join((a, b)) for a, b in zip(words, words[1:])])
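
The non-regex snippets start from an already-cleaned string, whereas the question's input is raw HTML with a trailing comma. Here is a minimal sketch of one way to bridge that gap, assuming it is acceptable to strip tags with a simple regex (a real HTML parser would be more robust). The helper name `bigrams_from_html` is hypothetical and not from the answers above.

```python
import re

def bigrams_from_html(html):
    # Replace anything that looks like a tag with a space
    # (naive: assumes no '>' appears inside attribute values).
    text = re.sub(r'<[^>]+>', ' ', html)
    # Keep only runs of ASCII letters; this also drops punctuation
    # such as the trailing comma after "temenos".
    words = re.findall(r'[A-Za-z]+', text)
    # Pair each word with its successor.
    return [' '.join(pair) for pair in zip(words, words[1:])]

data = '<p>or lucy clark and jim hudson sandler for temenos,<br>'
print(bigrams_from_html(data))
# ['or lucy', 'lucy clark', 'clark and', 'and jim', 'jim hudson',
#  'hudson sandler', 'sandler for', 'for temenos']
```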

CodePudding user response：

One approach, using re.findall to find all complete word pairs between the two HTML tags:

import re

data = '<p>or lucy clark and jim hudson sandler for temenos,<br>'
inp = re.findall(r'<p>(.*?)<br>', data)[0]
pairs = re.findall(r'\w+ \w+', inp)
print(pairs)  # ['or lucy', 'clark and', 'jim hudson', 'sandler for']
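
As the output comment shows, the characters consumed by one match cannot start the next, so these pairs do not overlap. One possible way to get overlapping pairs is to wrap the pattern in a zero-width lookahead with a capture group. This is a sketch rather than part of the answer above; the `\b` anchor is an added assumption so that matching only starts at the beginning of a word.

```python
import re

data = '<p>or lucy clark and jim hudson sandler for temenos,<br>'
inp = re.findall(r'<p>(.*?)<br>', data)[0]

# (?=...) is a lookahead: it consumes no characters, so the scan advances
# one position at a time and every word can begin a new match.
# \b rejects positions in the middle of a word.
pairs = re.findall(r'(?=\b(\w+ \w+))', inp)
print(pairs)
# ['or lucy', 'lucy clark', 'clark and', 'and jim', 'jim hudson',
#  'hudson sandler', 'sandler for', 'for temenos']
```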