Here's the problem...
I have a list of strings:
strings = ['one two three four', 'one two four five', 'four one two', 'three four']
I'm trying to find combinations of words that co-occur in two or more strings.
And here's the output I'm trying to get...
- [one, two, four] - 3 times
- [three, four] - 2 times
- [one, two] - 3 times
- [two, four] - 3 times
The combinations could be any length of two or more words.
Here's what I've already looked at, though I haven't had much luck finding anything I can adapt to my needs:
Efficient way of extracting co-occurence values of specific word pairs from Python Counter() results
Answer:
For each string, you can compute the powerset of its words, restricted to combinations of at least 2 words, and then count how many strings each combination appears in:
from itertools import chain, combinations
from collections import Counter
# https://docs.python.org/3/library/itertools.html
def powerset(iterable, MIN=2):
    "powerset([1,2,3], MIN=2) --> (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    # combinations of every length from MIN up to the full length
    return chain.from_iterable(combinations(s, r) for r in range(MIN, len(s) + 1))

# set() ensures each combination is counted at most once per string
c = Counter(chain.from_iterable(set(powerset(s.split()))
                                for s in strings))
# keep counts of 2 or more
out = {k: v for k, v in c.items() if v >= 2}
Output:
{('three', 'four'): 2,
 ('two', 'four'): 2,
 ('one', 'two', 'four'): 2,
 ('one', 'four'): 2,
 ('one', 'two'): 3}
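Keep order
A set has no defined iteration order, so the order of the keys in the result can vary between runs. To keep the combinations in order of first appearance, use tuple instead of set. Note that a combination is then counted more than once for a string that repeats a word.
c = Counter(chain.from_iterable(tuple(powerset(s.split()))
                                for s in strings))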
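The counts above depend on word order within each string, because combinations keeps the words in their original positions: 'four one two' yields ('four', 'one', 'two') rather than ('one', 'two', 'four'), which is why that triple is counted 2 times instead of the 3 the question asks for. If order should not matter, one option is to sort the unique words of each string before building the combinations. A minimal sketch, where unordered_combos, min_len, counts and shared are illustrative names and not part of the original answer:

```python
from collections import Counter
from itertools import chain, combinations

strings = ['one two three four', 'one two four five', 'four one two', 'three four']

def unordered_combos(text, min_len=2):
    # Deduplicate and sort the words so that 'four one two' and
    # 'one two four' produce exactly the same tuples.
    words = sorted(set(text.split()))
    return chain.from_iterable(
        combinations(words, r) for r in range(min_len, len(words) + 1)
    )

# Count, for each combination, how many strings contain all of its words.
counts = Counter(chain.from_iterable(unordered_combos(s) for s in strings))

# Keep only the combinations shared by two or more strings.
shared = {combo: n for combo, n in counts.items() if n >= 2}
```

With the sample list this gives 3 for ('four', 'one', 'two'), ('four', 'one'), ('four', 'two') and ('one', 'two'), and 2 for ('four', 'three'). The words inside each tuple are in alphabetical order because of the sort; a frozenset key would work as well if no canonical word order is needed.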