Python find when two words are N words apart


I have a dictionary of names (where the key is the name and the value is a list of aliases associated to that name), and a body of text. I want to find and count the number of times a pair of names are N words apart in the body of text.

For example:

found_names = {
    'jack' : ['ripper', 'drought'],
    'jim' : ['carrey', 'gaffigan'],
    'chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'james' : ['bond']
}

text = 'Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper'

N = 6

We would get some sort of result like:

[
    ('Jack', 'Jim', 1), 
    ('Jim', 'Chris', 1), 
    ('Chris', 'James', 2), 
    ('Chris', 'Jack', 1)
]

Note that a name is never paired with itself within the window, so the result doesn't include (Chris, Chris)

Is there a way to do this efficiently? I was thinking of finding the position of each name from the names list in the body of text and storing it as a dictionary, where the key is the name and the value is a list of indices corresponding to the positions of that name in the text.

I'm not sure what to do after that; can anyone help? The code I have so far:

N = 16
t = text.split(' ')
pos_dct = {}  # key is the matched name/alias, value is the list of positions
# flatten the keys and their aliases into a single lookup list
name_lst = [a for k, v in found_names.items() for a in [k] + v]
for i, w in enumerate(t):
    w = w.strip(',.').lower()  # ignore trailing punctuation and case
    if w in name_lst:
        if w in pos_dct:
            pos_dct[w].append(i)
        else:
            pos_dct[w] = [i]
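
Running this on the example text should give something like:

pos_dct  # {'jack': [1], 'jim': [5], 'evans': [8, 14], 'bond': [13], 'ripper': [19]}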

CodePudding user response:

The main idea here is to (1) include the name itself in the lookup lists in found_names, and then (2) convert your input string into a dictionary of indices; each index has just one name attached to it, provided the word at that index is a name (or an alias).

After this, (3) for each index we check whether any later index falls within range (given by N); if so, we increase the counter for the pair (current name, other name).

# 0. Variables setup
from collections import defaultdict

N = 6
s = "Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper"
found_names = {
    'jack' : ['ripper', 'drought'],
    'jim' : ['carrey', 'gaffigan'],
    'chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'james' : ['bond']
}

# 1. Include the name itself in the lookup lists
# (built as a new dict, so the lists inside found_names are not mutated)
found_names_full = {k: a + [k.lower()] for k, a in found_names.items()}
# {'jack': ['ripper', 'drought', 'jack'], 'jim': ['carrey', 'gaffigan', 'jim'], 'chris': ['hemsworth', 'evans', 'pratt', 'brown', 'chris'], 'james': ['bond', 'james']}

# 2. Check, word by word, whether it is included in the lookup lists.
# If so, store the name (key) for the current index
s2 = dict()
for i, word in enumerate(s.lower().split()):
    word = word.strip(",").strip(".")
    # print(word)
    for k,a in found_names_full.items():
        if word not in a:
            continue
        s2[i] = k
        
s2  # {1: 'jack', 5: 'jim', 8: 'chris', 13: 'james', 14: 'chris', 19: 'jack'}

# 3. Get the list of indices of matched words
s3 = defaultdict(int) 
s2k = list(s2.keys())

for i,k in enumerate(s2k):
    # 3.1 Given an index, get the sublist of all indices greater than 
    # current index
    if i < len(s2k) - 1:
        k2l = s2k[i + 1:]
    else:
        k2l = []

    # 3.2 For each index greater than current index, check if
    # is found in range
    for k2 in k2l:
        if k2-k < N:
            # 3.3 Get names found in current position (k)
            # and index greater than current but in range N (k2)
            n1 = s2[k]
            n2 = s2[k2]
            # 3.4 Get a sorted key 
            key = tuple(sorted([n1,n2]))
            # 3.5 And add 1 to counter
            s3[key] += 1

s3  # defaultdict(<class 'int'>, {('jack', 'jim'): 1, ('chris', 'jim'): 1, ('chris', 'james'): 2, ('chris', 'jack'): 1})

If you want the output to resemble yours, then you need this:

s4 = [(*k, v) for k, v in s3.items()]
s4  # [('jack', 'jim', 1), ('chris', 'jim', 1), ('chris', 'james', 2), ('chris', 'jack', 1)]
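
Since the indices in s2k are already in increasing order (they were collected while scanning the text left to right), step 3 can also stop scanning early instead of testing every later index. A possible sketch of that small optimisation, reusing s2, s2k, N and defaultdict from above (bisect is just one way to find the cut-off):

# Same counting as step 3, but only look at indices that are still in range
from bisect import bisect_left

s3 = defaultdict(int)
for i, k in enumerate(s2k):
    # first position in s2k whose index is already >= k + N, i.e. out of range
    end = bisect_left(s2k, k + N, lo=i + 1)
    for k2 in s2k[i + 1:end]:
        key = tuple(sorted([s2[k], s2[k2]]))
        s3[key] += 1

s3  # defaultdict(<class 'int'>, {('jack', 'jim'): 1, ('chris', 'jim'): 1, ('chris', 'james'): 2, ('chris', 'jack'): 1})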

CodePudding user response:

I would try this. If the dictionary keys were lower-case letters, I would also apply lower() to word.strip(string.punctuation), i.e. word.strip(string.punctuation).lower()

import string


text = 'Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper'

found_names = {
    'Jack' : ['ripper', 'drought'],
    'Jim' : ['carrey', 'gaffigan'],
    'Chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'James' : ['bond']
}


def names():
    x_position = 0
    for x in found_names.keys():
        x_position += 1
        y_position = 0
        for y in found_names.keys():
            y_position += 1
            if y_position > x_position:
                yield (x, y)

def pair_count_list():
    for names_pair in names():
        counted = sum(    
            1
            for word in text.split()
            if word.strip(string.punctuation) in names_pair
        )
        yield (names_pair + (counted,))


values = [x for x in pair_count_list()]
print(values)

Output:

[
    ('Jack', 'Jim', 2)
    , ('Jack', 'Chris', 1)
    , ('Jack', 'James', 1)
    , ('Jim', 'Chris', 1)
    , ('Jim', 'James', 1)
    , ('Chris', 'James', 0)
]
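
For reference, the lower-case variant mentioned above would only change the generator inside pair_count_list(), roughly like this (this assumes the keys of found_names are lower-case, as in the question):

        # lower-case variant: compare lower-cased words against lower-case keys
        counted = sum(
            1
            for word in text.split()
            if word.strip(string.punctuation).lower() in names_pair
        )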

CodePudding user response:

I had a similar problem in a course on Python for natural language processing recently, so thought I'd share an alternative solution to the above.

The ngrams function in the nltk library provides an elegant and concise way to solve this problem. The n-grams of a text are all its contiguous sequences of n words, so the first and last word of an n-gram of length N are exactly N - 1 positions apart. We just iterate over the n-grams, checking the first and last words of each against the names and aliases in found_names.

from itertools import combinations
from nltk import ngrams, word_tokenize
from pprint import PrettyPrinter
pp = PrettyPrinter()

found_names = {
    'jack' : ['ripper', 'drought'],
    'jim' : ['carrey', 'gaffigan'],
    'chris' : ['hemsworth', 'evans', 'pratt', 'brown'],
    'james' : ['bond']
}

# Get pairs of different names without repetitions
name_pairs = list(combinations(sorted(found_names.keys()), 2))

text = 'Hello Jack, my name is Jim. My friend Evans has seen you with Bond. Evans says they call you ripper'
# Get list of words in sentence, lowercased and ignoring punctuation
words = [word.lower() for word in word_tokenize(text) if word.isalpha()]

N = 6
# ngrams are sequences of words of length n from the text
n_grams = list(ngrams(words, N))

result = []

for name1, name2 in name_pairs:
    count = 0
    for n_gram in n_grams:
        first_word, last_word = (n_gram[0], n_gram[-1])
        # Check both orders
        if first_word == name1 or first_word in found_names[name1]:
            if last_word == name2 or last_word in found_names[name2]:
                count += 1
        elif last_word == name1 or last_word in found_names[name1]:
            if first_word == name2 or first_word in found_names[name2]:
                count += 1
    result.append((name1, name2, count))

pp.pprint(result)

Output:

[('chris', 'jack', 1),
 ('chris', 'james', 1),
 ('chris', 'jim', 0),
 ('jack', 'james', 0),
 ('jack', 'jim', 0),
 ('james', 'jim', 0)]
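
For anyone curious what these n-grams look like, here are the first two built from words above, reusing n_grams and pp from the code; note that the first and last word of each 6-gram are exactly N - 1 = 5 positions apart:

pp.pprint(n_grams[:2])
# [('hello', 'jack', 'my', 'name', 'is', 'jim'),
#  ('jack', 'my', 'name', 'is', 'jim', 'my')]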