Can anyone help me understand how to handle compressing/expanding the dimension of a tensor using EinsumDense?
I have a timeseries (not NLP) input tensor of the shape (batch, horizon, features), wherein the intended output is (1, H, F); H is an arbitrary horizon and F is an arbitrary feature size. I'm actually using EinsumDense as my feed-forward network in a transformer encoder module and as a final dense layer on the transformer's output. The FFN should map (1, horizon, features) to (1, H, features), and the final dense layer should map (1, H, features) to (1, H, F).
My current equation is shf,h->shf for the FFN, and shf,hfyz->syz for the dense layer; however, I'm getting a worse result than with my original setup, where there was no change in the horizon length and my equations were shf,h->shf and shf,hz->shz respectively.
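For concreteness, a minimal sketch of the setup as Keras EinsumDense layers (the sizes below are arbitrary placeholders, not the real ones):

```python
import tensorflow as tf

H_IN, F_IN = 12, 8   # input horizon / feature sizes (placeholders)
H_OUT, F_OUT = 6, 4  # target H and F (placeholders)

# FFN: equation shf,h->shf keeps the shape (batch, H_IN, F_IN)
ffn = tf.keras.layers.EinsumDense("shf,h->shf", output_shape=(H_IN, F_IN))

# Final dense layer: equation shf,hfyz->syz maps to (batch, H_OUT, F_OUT)
final = tf.keras.layers.EinsumDense("shf,hfyz->syz", output_shape=(H_OUT, F_OUT))

x = tf.random.normal((1, H_IN, F_IN))
out = final(ffn(x))
print(out.shape)  # (1, 6, 4)
```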
CodePudding user response:
My two cents,
First, an intuitive understanding of the transformer encoder: given (batch, horizon, features), the attention mechanism finds a weighted linear combination of the (projected) features, with the combining weights given by the attention scores. This step works between features of different tokens: information from one token (horizon step) is mixed with the others. The FFN layer that comes next does a linear combination of values within features.
Coming to the EinsumDense layers defined above:
shf,h->shf: This just scales each horizon step: every feature in step h is multiplied by the same scalar.
#example
import tensorflow as tf
a = tf.constant([[[1, 2, 2],
                  [2, 2, 1]]])  # shape (1, 2, 3): s=1, h=2, f=3
b = tf.constant([3, 2])         # shape (2,): one scale per horizon step
tf.einsum('shf,h->shf', a, b)
# [[[3, 6, 6],   # first horizon step is scaled by 3
#   [4, 4, 2]]]  # second horizon step is scaled by 2
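To confirm there is nothing more going on here, the same result can be obtained with a plain broadcast multiply along the horizon axis (a and b rebuilt as constants so the values match the comments):

```python
import tensorflow as tf

a = tf.constant([[[1, 2, 2],
                  [2, 2, 1]]])  # (1, 2, 3)
b = tf.constant([3, 2])         # (2,)

out = tf.einsum('shf,h->shf', a, b)
# reshape b to (1, 2, 1) and broadcast it over the feature axis
manual = a * b[None, :, None]
print(tf.reduce_all(out == manual).numpy())  # True
```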
shf,hz->shz: This does a linear combination within features.
#example
b = tf.constant([[3, 3, 3, 3, 3, 3],
                 [2, 2, 2, 3, 2, 3]])  # shape (2, 6)
tf.einsum('shf,hz->shz', a, b)
# [[[15, 15, 15, 15, 15, 15],  # every value is a linear combination of the first step [1, 2, 2] of a with b; the first value is sum([1, 2, 2] * 3)
#   [10, 10, 10, 15, 10, 15]]]
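To make that per-element comment concrete: out[s, h, z] = sum_f a[s, h, f] * b[h, z], i.e. each output value is the feature-sum of the corresponding step of a, scaled by one entry of b (a and b rebuilt as constants matching the values above):

```python
import tensorflow as tf

a = tf.constant([[[1, 2, 2],
                  [2, 2, 1]]])             # (1, 2, 3)
b = tf.constant([[3, 3, 3, 3, 3, 3],
                 [2, 2, 2, 3, 2, 3]])      # (2, 6)

out = tf.einsum('shf,hz->shz', a, b)       # (1, 2, 6)
# sum a over its feature axis, then broadcast against b: (1, 2, 1) * (1, 2, 6)
manual = tf.reduce_sum(a, axis=-1, keepdims=True) * b[None]
print(tf.reduce_all(out == manual).numpy())  # True
```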
The above two resemble the transformer encoder architecture, with a feature-scaling layer, and the output structure (batch, H, F) is preserved.
shf,hfyz->syz: This does both between-features and within-features combination.
#example
b = tf.random.uniform(minval=2, maxval=4, shape=(2, 3, 4, 5), dtype=tf.int32)
tf.einsum('shf,hfyz->syz', a, b)
# each output element (i, j) is a dot product of a and b[:, :, i, j]
# the first element is tf.reduce_sum(a * b[:, :, 0, 0])
Here, in the output (s, y, z), y doesn't correspond to the horizon and z doesn't correspond to the features; each is a combination of values across both.
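That dot-product claim can be checked directly; both h and f are summed out, which is why the output axes no longer track either one (a is the same (1, 2, 3) tensor as in the earlier examples):

```python
import tensorflow as tf

a = tf.constant([[[1, 2, 2],
                  [2, 2, 1]]])  # (1, 2, 3)
b = tf.random.uniform(minval=2, maxval=4, shape=(2, 3, 4, 5), dtype=tf.int32)

out = tf.einsum('shf,hfyz->syz', a, b)  # (1, 4, 5)
# out[0, y, z] = sum over h and f of a[0, h, f] * b[h, f, y, z]
first = tf.reduce_sum(a[0] * b[:, :, 0, 0])
print((out[0, 0, 0] == first).numpy())  # True
```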