This has probably been asked before, but I could not find it. I want to represent an input element as a one-hot encoded vector.
Do I need to build a dictionary of one-hot encoded items for that? How can I make sure that each new word is mapped to the correct encoding, and how can I handle unknown words?
For example,
category = set(["Sweden", "Iceland", "Germany"])
My input: Sweden
Output : 1, 0, 0
My input: Germany
Output : 0, 0, 1
My input: Poland (unknown)
Output : 0, 0, 0
Can someone please shed some light on this? Thank you in advance.
CodePudding user response:
I think the simplest way is to use torch with a dictionary. Note that this gives unknown words their own class (index 0) rather than an all-zero vector.
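import torch
import torch.nn.functional as F

my_list = ["unknown", "hello", "world", "this", "is", "a", "test"]
print(my_list)

# map each word to an integer index; index 0 is reserved for "unknown"
dictionary = {}
for i, element in enumerate(my_list):
    dictionary[element] = i
print(dictionary)

# one class per dictionary entry, including "unknown"
num_classes = len(dictionary)

# one-hot vectors for every known index (an identity matrix)
x = torch.tensor(list(dictionary.values()))
F.one_hot(x, num_classes)

# query: world (words missing from the dictionary fall back to the "unknown" index)
index = dictionary.get("world", dictionary["unknown"])
F.one_hot(torch.tensor(index), num_classes)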
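If an unknown word should instead give an all-zero vector, as in the question's Poland example, one option is to skip F.one_hot for missing keys. This is only a minimal sketch under that assumption; the names categories, index_of and encode are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Fixed category order taken from the question's example.
categories = ["Sweden", "Iceland", "Germany"]
index_of = {name: i for i, name in enumerate(categories)}


def encode(word):
    """Return a one-hot tensor, or all zeros if the word is unknown."""
    idx = index_of.get(word)  # None when the word is not a known category
    if idx is None:
        return torch.zeros(len(categories), dtype=torch.long)
    return F.one_hot(torch.tensor(idx), num_classes=len(categories))


print(encode("Sweden").tolist())   # [1, 0, 0]
print(encode("Germany").tolist())  # [0, 0, 1]
print(encode("Poland").tolist())   # [0, 0, 0]
```

Whether unknown words get their own class or an all-zero vector depends on what the downstream model expects, so treat this as one possible choice.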
CodePudding user response:
Something like this; you just need the set of categories as a list:
all_categories = list(set(["Sweden", "Iceland", "Germany"]))
print(all_categories)
# Out: ['Germany', 'Sweden', 'Iceland']  (set order is arbitrary and may differ between runs)
Given categories as a list of unique names:
def hotEncode(cat, all_categories):  # assuming all_categories is a list
    r = [0] * len(all_categories)  # array of zeros
    if cat in all_categories:
        n = all_categories.index(cat)
        r[n] = 1
    return r
hotEncode("Iceland", all_categories)
# Out: [0, 0, 1]
hotEncode("Poland", all_categories)
# Out: [0, 0, 0]
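As a variation on the function above, here is a minimal sketch that fixes the category order by sorting and precomputes a name-to-index dictionary, so the encoding is the same on every run and a lookup does not scan the list. The names make_encoder and encode are hypothetical, chosen only for illustration.

```python
def make_encoder(categories):
    # Sort once so the column order is deterministic (an assumption of this
    # sketch; any fixed order works as long as it never changes).
    ordered = sorted(set(categories))
    index_of = {name: i for i, name in enumerate(ordered)}

    def encode(cat):
        r = [0] * len(ordered)
        i = index_of.get(cat)  # None for unknown categories
        if i is not None:
            r[i] = 1
        return r

    return ordered, encode


ordered, encode = make_encoder(["Sweden", "Iceland", "Germany"])
print(ordered)            # ['Germany', 'Iceland', 'Sweden']
print(encode("Iceland"))  # [0, 1, 0]
print(encode("Poland"))   # [0, 0, 0]
```

The dictionary answers the "do I need a dictionary?" part of the question: it is not strictly required, but it keeps the mapping explicit and stable.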