Creating an index from a list of words in Python (to vectorise words)

Time:12-18

I have a long list of words called word_list (which contains duplicates) and a set word_set, which is word_list with the duplicates removed.

word_list = ['this','start','wonderland','amaze','this',.....]
word_set = set(word_list)

I also have a function that takes the word_list as input (raw_text):

def CBOW(raw_text, window_size=2):
   data = []
   for i in range(window_size, len(raw_text) - window_size):
       context = [raw_text[i - window_size], raw_text[i - (window_size - 1)], raw_text[i + (window_size - 1)], raw_text[i + window_size]]
       target = raw_text[i]
       data.append((context, target))

   return data

# The returned data has the form: [(['this', 'start', 'amaze', 'this'], 'wonderland'), .....]

It returns the words that fall within the window around a target word (here the target is wonderland).

Instead, I would like it to return the index of each of those words in word_set:

For example, instead of

[(['this', 'start', 'amaze', 'this'], 'wonderland'), .....]

I would want

[(['0', '1', '2', '0'], 'wonderland'), .....] (as 'this' is a duplicate value so has index 0 )

Could anyone help me with this task?

For testing I have put a more comprehensive word_list here (along with my function) https://pastebin.com/EuS20u60

CodePudding user response:

There are two things that should be mentioned :)

  1. A set() object is meant to mimic a mathematical set, so its members have no index numbers. The only relation between an element and the set is membership.
  2. When I ran the program, I saw that the function does not really use its second argument, window_size: it builds a fixed-length context of four words whatever window_size is. There is a similar function in the official PyTorch tutorial that does the same job, and I highly recommend taking a look at it.
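The window_size point can be illustrated with a small sketch. This is not the PyTorch tutorial's code; the name cbow_pairs and the use of slicing are assumptions made here for illustration. It collects window_size words on each side of the target, so the context length grows with the window:

```python
def cbow_pairs(raw_text, window_size=2):
    """Return (context, target) pairs with window_size words on each side."""
    data = []
    for i in range(window_size, len(raw_text) - window_size):
        left = raw_text[i - window_size:i]            # words before the target
        right = raw_text[i + 1:i + window_size + 1]   # words after the target
        data.append((left + right, raw_text[i]))
    return data


words = ['this', 'start', 'wonderland', 'amaze', 'this']
print(cbow_pairs(words, window_size=2))
# [(['this', 'start', 'amaze', 'this'], 'wonderland')]
print(cbow_pairs(words, window_size=1))
# [(['this', 'wonderland'], 'start'), (['start', 'amaze'], 'wonderland'), (['wonderland', 'this'], 'amaze')]
```

With window_size=2 this gives the same pairs as the original function; for other values the context has 2 * window_size words instead of always four.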

Finally, if you are going to use word indices, I suggest converting the word_set object to a list, which has an index() method. The following is my implementation:

word_list = ['this','start','wonderland','amaze','this','read', 'instal', 'instruct', 'nis', '2004', 'nav', '2004', 'prior', 'latsni', 'still', 'end', 'result', 'junk', 'rawtfos', 'whi', 'instal', 'ina', 'type', 'softwar', 'instal', 'krow', 'proper', 'norton', 'tcudorp', '3', 'latsni', 'either', 'eno', 'norton', 'product', 'neither', 'krow', 'latsni', 'mcafe', 'anti', 'virus', '8', '2', 'comput', 'owner', 'sinc', 'purchas', 'mcafe', 'anti', 'virus', '8', 'instal']
word_set = set(word_list)
list_of_word_set = list(word_set) 
def CBOW(raw_text, window_size=2):
   data = []
   for i in range(window_size, len(raw_text) - window_size):
       context = [raw_text[i - window_size], raw_text[i - (window_size - 1)], raw_text[i + (window_size - 1)], raw_text[i + window_size]]
       context = [list_of_word_set.index(item) for item in context]
       target = raw_text[i]
       data.append((context, target)) 
   return data
CBOW(word_list, 10)

output:

[([11, 27, 21, 25], 'nav'),
 ([27, 5, 25, 0], '2004'),
 ...

CodePudding user response:

Python set members don't have indices, because Python sets are, by design, unordered. But what you want may be (almost) achieved using a dictionary.

(i) Construct a word_list_indices dictionary from set(word_list) by assigning a unique number to each word.

word_list_indices = {w: i for i, w in enumerate(set(word_list))}

You can't control which number is associated with which word, but each word will have its own unique number, which is all you need for this exercise. Something similar can be done with a list or a tuple, but calling list(set(word_list)).index(w) every time you want the index of a word is expensive, because index() scans the list from the start; a dictionary lookup takes constant time on average.
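If the numbering should be predictable (for example 'this' getting 0, as in the question), one option is to number the words in order of first appearance instead of going through a set. This is a sketch of an alternative, not part of the answer's approach; the names unique_words and word_to_index are made up here, and it relies on dicts preserving insertion order (Python 3.7+):

```python
word_list = ['this', 'start', 'wonderland', 'amaze', 'this']

# dict.fromkeys keeps one key per distinct word, in order of first appearance
# (dicts preserve insertion order in Python 3.7+).
unique_words = list(dict.fromkeys(word_list))

# Map each word to its position in that de-duplicated list.
word_to_index = {w: i for i, w in enumerate(unique_words)}

print(word_to_index)
# {'this': 0, 'start': 1, 'wonderland': 2, 'amaze': 3}
```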

(ii) Then add this line:

context = [word_list_indices[w] for w in context]

into the CBOW function (between the creation of context and the appending of the tuple (context, target) to data). This list comprehension looks up the unique number associated with each word in context.

Output:

>>> print(CBOW(word_list))

[([2, 0, 1, 2], 'wonderland')]
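Put together, the two steps might look like the sketch below. The function name cbow_indexed, the word_to_index parameter, and the index_to_word reverse mapping are assumptions added for illustration; the context is built with slices, which matches the original four-word context when window_size is 2. Because the numbers coming from a set are arbitrary, the check decodes the indices back to words rather than printing the raw numbers:

```python
def cbow_indexed(raw_text, word_to_index, window_size=2):
    """Return (context_indices, target) pairs using a word -> index dict."""
    data = []
    for i in range(window_size, len(raw_text) - window_size):
        context = raw_text[i - window_size:i] + raw_text[i + 1:i + window_size + 1]
        # Replace each context word with its unique number.
        data.append(([word_to_index[w] for w in context], raw_text[i]))
    return data


word_list = ['this', 'start', 'wonderland', 'amaze', 'this']
word_list_indices = {w: i for i, w in enumerate(set(word_list))}

pairs = cbow_indexed(word_list, word_list_indices)

# Reverse mapping, used only to verify the result independently of set order.
index_to_word = {i: w for w, i in word_list_indices.items()}
context, target = pairs[0]
print([index_to_word[i] for i in context], target)
# ['this', 'start', 'amaze', 'this'] wonderland
```

The exact numbers in context can differ between runs, but a repeated word such as 'this' always maps to the same number within a run.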