
^^ See the comment above and the "Transformers are RNNs" paper (https://arxiv.org/abs/2006.16236) to convince yourself.

There are various ways of looking at how transformer architectures work.

For the past data, applying a temporal causal mask lets you compute all past features exactly as they would have been seen at each step, so all of the past computations can seemingly be done in parallel, which hides the sequential aspect.
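To make that concrete, here is a minimal NumPy sketch (a toy single-head attention with no batching, residuals, or layer norm; all names are hypothetical) showing how one masked matrix product covers every past position at once:

    import numpy as np

    def causal_attention(X, Wq, Wk, Wv):
        # Project the whole sequence in one shot.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        T, d = Q.shape
        scores = Q @ K.T / np.sqrt(d)                     # (T, T) pairwise scores
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores[mask] = -np.inf                            # position t cannot see t+1, t+2, ...
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
        return weights @ V                                # outputs for all positions "in parallel"

    rng = np.random.default_rng(0)
    T, d = 5, 8
    X = rng.normal(size=(T, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    out_parallel = causal_attention(X, Wq, Wk, Wv)

Row t of out_parallel depends only on rows 0..t of X, which is exactly the "as they would have been seen" property.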

What I've described corresponds to the simpler sub-variant of the transformer architecture in figure 1 of "Attention Is All You Need" (https://arxiv.org/abs/1706.03762), in the case where you don't feed the past into the encoder branch but rather prepend it to the output, which is what people usually do in practice (see LLaMA). If you want to use both branches of the transformer architecture, so that you can do filtering and smoothing on the past data (i.e. not using a causal mask), it creates a "bottleneck" in your architecture: you synthesize all the past into a finite-sized context vector that you then use in the right branch of the transformer.

But alternatively you can loop along the time dimension first, which makes the sequence more apparent. Due to how the (causally masked) transformer architecture is defined, this gives exactly the same computation; it's just loop reordering.
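Continuing the same toy sketch (reusing X, Wq, Wk, Wv and out_parallel from above), the loop-reordered version processes one time step at a time over a growing prefix and lands on exactly the same numbers:

    def causal_attention_sequential(X, Wq, Wk, Wv):
        outputs = []
        for t in range(X.shape[0]):
            prefix = X[: t + 1]                   # everything seen up to step t (a growing "KV cache")
            q = (X[t] @ Wq)[None, :]              # query for the current step only
            K, V = prefix @ Wk, prefix @ Wv
            scores = q @ K.T / np.sqrt(K.shape[1])
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            outputs.append((weights @ V)[0])
        return np.stack(outputs)

    out_sequential = causal_attention_sequential(X, Wq, Wk, Wv)
    assert np.allclose(out_parallel, out_sequential)   # same computation, different loop order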

For the generation part, the sequential aspect is more evident: tokens are produced one by one and fed back to the transformer for the next-token prediction.
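A sketch of that feedback loop, using greedy decoding and a dummy stand-in for the trained model (hypothetical names):

    import numpy as np

    def generate(prompt_ids, step_fn, n_new):
        ids = list(prompt_ids)
        for _ in range(n_new):
            logits = step_fn(ids)             # step_fn stands in for a causally-masked transformer
            next_id = int(np.argmax(logits))  # greedy pick; sampling works the same way
            ids.append(next_id)               # feed the new token back in for the next prediction
        return ids

    # toy "model" over a 100-token vocabulary: always predicts last token + 1
    fake_model = lambda ids: np.eye(100)[(ids[-1] + 1) % 100]
    print(generate([1, 2, 3], fake_model, n_new=4))   # [1, 2, 3, 4, 5, 6, 7]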


