Combinations of words that co-occur most often across strings

Time:07-09

Here's the problem...

I have a list of strings:

strings = ['one two three four', 'one two four five', 'four one two', 'three four']

I'm trying to find combinations of words that co-occur in two or more strings.

And here's the output I'm trying to get...

  • [one, two, four] - 3 times
  • [three, four] - 2 times
  • [one, two] - 3 times
  • [two, four] - 3 times

The combinations could be any length of two or more words.

Here's what I've already looked at, though I'm not having much luck finding anything I can bootstrap for my needs :(

CodePudding user response:

You can compute, for each string, the powerset of its words restricted to combinations of at least two words, then count how many strings each combination appears in:

from itertools import chain, combinations
from collections import Counter

# https://docs.python.org/3/library/itertools.html
def powerset(iterable, MIN=2):
    """powerset([1,2,3], MIN=2) --> (1,2) (1,3) (2,3) (1,2,3)"""
    s = list(iterable)
    # combinations of every size from MIN up to the full length
    return chain.from_iterable(combinations(s, r) for r in range(MIN, len(s) + 1))

c = Counter(chain.from_iterable(set(powerset(s.split()))
            for s in strings))

# keep counts of 2 or more
out = {k: v for k, v in c.items() if v >= 2}

Output:

{('three', 'four'): 2, 
 ('two', 'four'): 2, 
 ('one', 'two', 'four'): 2, 
 ('one', 'four'): 2, 
 ('one', 'two'): 3}
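
The counts above treat ('four', 'one', 'two') and ('one', 'two', 'four') as different keys, because combinations follows the word order of each string. If word order should not matter, as the expected output in the question suggests, one option is to sort the distinct words of each string before building the combinations. A minimal sketch, assuming that normalization is acceptable; word_sets, min_size, counts and frequent are illustrative names, not part of the answer above:

```python
from collections import Counter
from itertools import chain, combinations

strings = ['one two three four', 'one two four five', 'four one two', 'three four']

def word_sets(text, min_size=2):
    """Return every combination of at least min_size distinct words of text."""
    # Sorting the distinct words gives each combination one canonical key,
    # whatever order the words had in the original string.
    words = sorted(set(text.split()))
    return chain.from_iterable(
        combinations(words, r) for r in range(min_size, len(words) + 1)
    )

# Each string contributes each of its combinations exactly once.
counts = Counter(chain.from_iterable(word_sets(s) for s in strings))

# Keep combinations that occur in two or more strings.
frequent = {combo: n for combo, n in counts.items() if n >= 2}
```

For the sample list this gives 3 for ('four', 'one', 'two'), ('four', 'one'), ('four', 'two') and ('one', 'two'), and 2 for ('four', 'three'). Keys come out in alphabetical order rather than sentence order.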

Keeping order

Use tuple instead of set so that the combinations are counted in the order they are generated:

c = Counter(chain.from_iterable(tuple(powerset(s.split()))
            for s in strings))
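
One caveat with tuple is that it does not remove duplicates, so a string that repeats a word can contribute the same combination more than once. If both generation order and one count per string are wanted, dict.fromkeys is one possible alternative, since dictionaries preserve insertion order and drop repeated keys. A small sketch on a hypothetical input with a repeated word (the strings list, with_tuple and with_dict are made up for illustration):

```python
from collections import Counter
from itertools import chain, combinations

def powerset(iterable, MIN=2):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(MIN, len(s) + 1))

# Hypothetical input: the first string repeats the word 'a'.
strings = ['a a b', 'a b']

# tuple keeps order but also keeps duplicates within a string.
with_tuple = Counter(chain.from_iterable(tuple(powerset(s.split()))
                     for s in strings))

# dict.fromkeys keeps first-seen order and drops duplicates within a string.
with_dict = Counter(chain.from_iterable(dict.fromkeys(powerset(s.split()))
                    for s in strings))

print(with_tuple[('a', 'b')])  # 3: counted twice for 'a a b', once for 'a b'
print(with_dict[('a', 'b')])   # 2: counted once per string
```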