I am creating a sparse tensor in TensorFlow with a shape of about 4,000,000 x 56,000,000. The 56M columns are interaction variables: the pairwise combinations of the roughly 10,600 possible values of a feature column.
TensorFlow's sparse tensor takes an indices argument, a list of lists in which each sublist [x, y] gives the row and column of a value within the sparse tensor.
I have the combinations of interaction variables:
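from itertools import combinations

combos = []
grouped_feature = df.groupby('feature')
for name, group in grouped_feature:
    # all unordered pairs of the distinct feature values in this group
    combos.append([*combinations(group.feature.unique(), 2)])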
This gives me a list of lists of tuples. The tuples in each sublist are the combinations whose entries should be 1 in my sparse tensor.
Then I ran:
indices = []
for i in range(len(combos)):
    for j in range(len(combos[i])):
        indices.append([i, hash(combos[i][j])])
This produces the required list-of-lists format, but I need to change the hash function so that each combination maps to one of the 56M column indices. How can I do this? Or is there a better way? I could not find a built-in TensorFlow method or function for populating sparse tensors.
Answer:
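You can take the hash modulo the number of values in the range that you want to map to. Python's % returns a non-negative result for a positive modulus, so every column index falls in [0, NUM_VALUES). Keep in mind that two different combinations can collide on the same column, and that the built-in hash() of strings changes between interpreter runs unless PYTHONHASHSEED is fixed.
For example: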
NUM_VALUES = 56 * 10**6
indices = []
for i in range(len(combos)):
    for j in range(len(combos[i])):
        indices.append([i, hash(combos[i][j]) % NUM_VALUES])
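If collisions are a concern, one alternative to hashing is to give every unordered pair its own column by ranking it among all possible pairs. The following is only a sketch of that idea, not something from the question: pair_index and build_indices are hypothetical helper names, and it assumes the feature values are hashable and mutually comparable (so they can be sorted) and that each tuple holds two distinct values, as itertools.combinations produces.

```python
def pair_index(a, b, n):
    """Rank of the unordered pair {a, b} (0 <= a != b < n) among all
    n*(n-1)/2 pairs, listed in lexicographic order."""
    if a > b:
        a, b = b, a
    # Pairs whose smaller element is below a come first:
    # sum over k < a of (n - 1 - k) = a * (2n - a - 1) / 2
    return a * (2 * n - a - 1) // 2 + (b - a - 1)


def build_indices(combos):
    """Turn a list of lists of value pairs into [row, column] indices.

    Hypothetical helper: assumes the values can be sorted and hashed.
    Returns the indices and the total number of columns.
    """
    # Give each distinct feature value a dense integer id.
    values = sorted({v for group in combos for pair in group for v in pair})
    value_id = {v: k for k, v in enumerate(values)}
    n = len(values)
    indices = [
        [row, pair_index(value_id[a], value_id[b], n)]
        for row, group in enumerate(combos)
        for a, b in group
    ]
    return indices, n * (n - 1) // 2


# Toy data standing in for the real combos list.
demo_combos = [[("a", "b"), ("a", "c")], [("b", "c")]]
demo_indices, demo_num_columns = build_indices(demo_combos)
# demo_indices is [[0, 0], [0, 1], [1, 2]] and demo_num_columns is 3
```

With exactly 10,600 distinct values this would give 10,600 * 10,599 / 2 = 56,174,700 columns, consistent with the roughly 56M in the question, and the column of any pair can be recomputed later without storing a lookup table of all pairs.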