Can anyone help me understand how to handle compressing/expanding the dimension of a tensor using EinsumDense?
I have a timeseries (not NLP) input tensor of the shape (batch, horizon, features), wherein the intended output is (1, H, F); H is an arbitrary horizon and F is an arbitrary feature size. I'm actually using EinsumDense as my feed-forward network in a transformer encoder module and as a final dense layer on the transformer's output. The FFN should map (1, horizon, features) to (1, H, features), and the final dense layer should map (1, H, features) to (1, H, F).
My current equation is shf,h->shf for the FFN, and shf,hfyz->syz for the dense layer; however, I'm getting a worse result than with my original setup, where there was no change in the horizon length and my equations were shf,h->shf and shf,hz->shz respectively.
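For concreteness, a minimal sketch of the setup as Keras EinsumDense layers (the sizes below are arbitrary placeholders, not the real ones):

```python
import tensorflow as tf

H_IN, F_IN = 12, 8   # input horizon / feature sizes (placeholders)
H_OUT, F_OUT = 6, 4  # target H and F (placeholders)

# FFN: equation shf,h->shf keeps the shape (batch, H_IN, F_IN)
ffn = tf.keras.layers.EinsumDense("shf,h->shf", output_shape=(H_IN, F_IN))

# Final dense layer: equation shf,hfyz->syz maps to (batch, H_OUT, F_OUT)
final = tf.keras.layers.EinsumDense("shf,hfyz->syz", output_shape=(H_OUT, F_OUT))

x = tf.random.normal((1, H_IN, F_IN))
out = final(ffn(x))
print(out.shape)  # (1, 6, 4)
```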
CodePudding user response:
My two cents,
First, an intuitive understanding of the transformer encoder: given (batch, horizon, features), the attention mechanism finds a weighted linear combination of the (projected) features, with the combining weights given by the attention scores. This step works between features of different tokens: information from one token (horizon step) is mixed with the others. The FFN layer that comes next does a linear combination of values within features.
Coming to the EinsumDense layers defined above:
shf,h->shf: This just scales each horizon step: every feature in step h is multiplied by the same scalar.
#example
import tensorflow as tf
a = tf.constant([[[1, 2, 2],
                  [2, 2, 1]]])  # shape (1, 2, 3): s=1, h=2, f=3
b = tf.constant([3, 2])         # shape (2,): one scale per horizon step
tf.einsum('shf,h->shf', a, b)
# [[[3, 6, 6],   # first horizon step is scaled by 3
#   [4, 4, 2]]]  # second horizon step is scaled by 2
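To confirm there is nothing more going on here, the same result can be obtained with a plain broadcast multiply along the horizon axis (a and b rebuilt as constants so the values match the comments):

```python
import tensorflow as tf

a = tf.constant([[[1, 2, 2],
                  [2, 2, 1]]])  # (1, 2, 3)
b = tf.constant([3, 2])         # (2,)

out = tf.einsum('shf,h->shf', a, b)
# reshape b to (1, 2, 1) and broadcast it over the feature axis
manual = a * b[None, :, None]
print(tf.reduce_all(out == manual).numpy())  # True
```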
shf,hz->shz: This does a linear combination within features.
#example
b = tf.constant([[3, 3, 3, 3, 3, 3],
                 [2, 2, 2, 3, 2, 3]])  # shape (2, 6)
tf.einsum('shf,hz->shz', a, b)
# [[[15, 15, 15, 15, 15, 15],  # every value is a linear combination of the first step [1, 2, 2] of a with b; the first value is sum([1, 2, 2] * 3)
#   [10, 10, 10, 15, 10, 15]]]
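To make that per-element comment concrete: out[s, h, z] = sum_f a[s, h, f] * b[h, z], i.e. each output value is the feature-sum of the corresponding step of a, scaled by one entry of b (a and b rebuilt as constants matching the values above):

```python
import tensorflow as tf

a = tf.constant([[[1, 2, 2],
                  [2, 2, 1]]])             # (1, 2, 3)
b = tf.constant([[3, 3, 3, 3, 3, 3],
                 [2, 2, 2, 3, 2, 3]])      # (2, 6)

out = tf.einsum('shf,hz->shz', a, b)       # (1, 2, 6)
# sum a over its feature axis, then broadcast against b: (1, 2, 1) * (1, 2, 6)
manual = tf.reduce_sum(a, axis=-1, keepdims=True) * b[None]
print(tf.reduce_all(out == manual).numpy())  # True
```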
The above two resemble the transformer encoder architecture, with a feature-scaling layer, and the output structure (batch, H, F) is preserved.
shf,hfyz->syz: This does both between-features and within-features combination.
#example
b = tf.random.uniform(minval=2, maxval=4, shape=(2, 3, 4, 5), dtype=tf.int32)
tf.einsum('shf,hfyz->syz', a, b)
# each output element (i, j) is a dot product of a and b[:, :, i, j]
# the first element is tf.reduce_sum(a * b[:, :, 0, 0])
Here, in the output (s, y, z), y doesn't correspond to the horizon and z doesn't correspond to the features; each is a combination of values across both.
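That dot-product claim can be checked directly; both h and f are summed out, which is why the output axes no longer track either one (a is the same (1, 2, 3) tensor as in the earlier examples):

```python
import tensorflow as tf

a = tf.constant([[[1, 2, 2],
                  [2, 2, 1]]])  # (1, 2, 3)
b = tf.random.uniform(minval=2, maxval=4, shape=(2, 3, 4, 5), dtype=tf.int32)

out = tf.einsum('shf,hfyz->syz', a, b)  # (1, 4, 5)
# out[0, y, z] = sum over h and f of a[0, h, f] * b[h, f, y, z]
first = tf.reduce_sum(a[0] * b[:, :, 0, 0])
print((out[0, 0, 0] == first).numpy())  # True
```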