By: Hamlet Medina, Ernest Chan, Uttej Mannava, Johann Abraham
In the previous blog post, we gave a very simple example of how traders can use self-attention transformers as a feature selection method: in this case, to select which previous returns of a stock to use for predictions or optimizations. To be precise, the transformer assigns weights to the different transformed features for downstream applications. In this post, we will discuss how traders can incorporate different feature series from this stock while adding a sense of time. The technique we discuss is based partly on Prof. Will Cong’s AlphaPortfolio paper.
Recall that in the simple example in a Poor Person’s Transformer, the input X is just an n-vector of previous returns X = [R(t), R(t-1), …, R(t-n+1)]ᵀ. Some of you fundamental analysts will complain “What about the fundamentals of a stock? Shouldn’t they be part of the input?” Sure they should! Let’s say, following AlphaPortfolio, we add B/M, EPS, …, all 51 fundamental variables of a company as input features. Furthermore, just as for the returns, we want the n previous snapshots of these variables. So we expand X from 1 to 52 columns (including the returns column). For concreteness, let’s say we use n=12 snapshots, captured at monthly intervals, and regard R(t) as the monthly return from t-1 to t. X is now a 12 × 52 matrix.
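To make the shapes concrete, here is a minimal numpy sketch of stacking the last 12 monthly snapshots into the input matrix. Everything here is a hypothetical stand-in: feature_history is an imaginary array holding the monthly history of the 52 features (returns plus fundamentals), filled with random numbers in place of real data.

```python
import numpy as np

# Hypothetical example: feature_history holds the monthly history of the 52
# features (1 return column + 51 fundamentals), one row per month, newest last.
T, n_features, n_lags = 120, 52, 12
feature_history = np.random.randn(T, n_features)   # stand-in for real data

# Stack the 12 most recent monthly snapshots into X, newest first, so that
# row i holds the snapshot at time t-i (i = 0, ..., 11).
X = feature_history[-n_lags:][::-1]                 # shape (12, 52)
```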
Are we ready to use them as input to our transformer? Goodness, no! As we said in our previous blog post, raw heterogeneous features, i.e. features that are from different spaces such as returns vs EPS, can’t be mixed up in a transformer without normalization. In AlphaPortfolio, the authors normalize them cross-sectionally, i.e. for each snapshot of time, compute the mean and std of a feature across the universe of stocks and thus turn every feature into a z-score. In our case, we only have 1 stock, so we have to normalize temporally, i.e. compute the mean and std of a feature in a lookback period in order to compute z-scores. For simplicity, we might as well use 12 months as the lookback period.
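Here is a minimal sketch of that temporal normalization, again with random numbers standing in for the raw 12 × 52 feature matrix; for simplicity the 12-month lookback doubles as the input window.

```python
import numpy as np

# Temporal normalization: for each of the 52 features, compute the mean and
# std over the 12-month lookback and convert the raw values into z-scores.
n_lags, n_features = 12, 52
X = np.random.randn(n_lags, n_features)   # stand-in for the raw 12 x 52 matrix
mu = X.mean(axis=0)                       # per-feature mean over the lookback
sigma = X.std(axis=0) + 1e-8              # per-feature std (guard against zero)
X_norm = (X - mu) / sigma                 # z-scored features, still 12 x 52
```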
After normalizing all these features into z-scores, are we ready to use them as input to our transformer? Goodness, no! We usually project (linearly transform) the raw features into a higher-dimensional embedding space before feeding them to the transformer. In our case, we will project the 52 features into a 64-dimensional space (the embedding dimension is usually a power of 2 and is larger than the original feature space):
$$X_{\text{embed}}(t) = X(t)\, W_{\text{embed}}$$
where Wembed is a 52 × 64 projection / embedding matrix, with values to be optimized based on the downstream objective function, and Xembed(t) is the 12 × 64 input matrix projected to the embedding space.
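As a sketch of the shapes involved: a random matrix plays the role of the learned Wembed here, whereas in a real model it would be a trainable linear layer optimized end-to-end.

```python
import numpy as np

# A minimal sketch of the embedding step. In practice W_embed is a learned
# parameter (e.g. a linear layer) fit to the downstream objective; here it is
# a random stand-in just to show the shapes.
n_lags, n_features, d_embed = 12, 52, 64
X_norm = np.random.randn(n_lags, n_features)           # normalized 12 x 52 input
W_embed = 0.02 * np.random.randn(n_features, d_embed)  # 52 x 64 projection matrix
X_embed = X_norm @ W_embed                             # embedded input, 12 x 64
```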
After this embedding, are we ready to use them as input to our transformer? Goodness, no! (Ernie has been reading too much Pete The Cat to his kids, hence the idiom.) Unlike an LSTM, our transformer does not know that feature X(t) comes after X(t-1) in time: there is no sense of time ordering. We need to apply “positional encoding”. This is done by adding a “positional encoding vector” PE to each row of the input that encodes its position in the time series:

$$X_{\text{input}}(i, j) = X_{\text{embed}}(i, j) + PE(i, j)$$

where i is the lookback from 0 to 11 months with i=0 pointing to the current time t, j is the feature index in the embedding space, and
$$PE(i, 2k) = \sin\!\left(\frac{i}{10000^{2k/d}}\right)$$

$$PE(i, 2k+1) = \cos\!\left(\frac{i}{10000^{2k/d}}\right)$$
So k runs from 0 to d/2-1, i.e. from 0 to 31. (Recall d=64 is the embedding dimension.)
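Here is a short numpy sketch of that sinusoidal positional encoding, assuming the 12 × 64 shapes above (a random stand-in again plays the role of Xembed):

```python
import numpy as np

# Sinusoidal positional encoding for 12 lags in a 64-dimensional embedding:
#   PE[i, 2k]   = sin(i / 10000**(2k/d))
#   PE[i, 2k+1] = cos(i / 10000**(2k/d))
n_lags, d_embed = 12, 64
i = np.arange(n_lags)[:, None]              # lag index 0..11 (0 = current month)
k = np.arange(d_embed // 2)[None, :]        # frequency index 0..31
angles = i / 10000 ** (2 * k / d_embed)     # 12 x 32 matrix of angles
PE = np.zeros((n_lags, d_embed))
PE[:, 0::2] = np.sin(angles)                # even embedding dimensions
PE[:, 1::2] = np.cos(angles)                # odd embedding dimensions

X_embed = np.random.randn(n_lags, d_embed)  # stand-in for the embedded features
X_input = X_embed + PE                      # positionally-encoded input, 12 x 64
```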
What’s the intuition behind the PE formula? In general, if we have a time series Y(t) sampled at N discrete intervals, we can always perform a Fourier decomposition into sine and cosine basis functions (with Fourier coefficients a_j and b_j):

$$Y(t) = \sum_{j=0}^{N-1}\left[a_j \cos\!\left(\frac{2\pi j t}{N}\right) + b_j \sin\!\left(\frac{2\pi j t}{N}\right)\right]$$
PE(i, j) looks just like each of these cosine or sine basis functions, with 1/10000^(2k/64) playing the role of the frequency 2πj/N. So by adding PE, we are adding each of these cosine and sine basis functions to the original Xembed(t), hoping that the transformer will treat Xinput = Xembed + PE as a time series. After all, as discussed in the previous blog post, the transformer is going to add up the different basis functions with some coefficients via matrix multiplications and turn them into the Q, K, V matrices:
$$Q = X_{\text{input}} W_Q, \qquad K = X_{\text{input}} W_K, \qquad V = X_{\text{input}} W_V$$
So Q, K, V are just different time series that are transformed versions of Xinput, each a projection onto a space of possibly different dimension. We find this not particularly mathematically rigorous. But hey, this is engineering, not science, and the final judge is whether it works. Also, there are alternatives to sinusoidal positional encoding. Just ask ChatGPT!
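For concreteness, here is a sketch of these projections with random stand-ins for the learned weight matrices, keeping all dimensions at 64 (i.e. a single attention head with no dimension reduction):

```python
import numpy as np

# Forming Q, K and V from X_input with projection matrices. In a real model
# W_Q, W_K, W_V are learned; random stand-ins here just to show the shapes.
n_lags, d_embed = 12, 64
X_input = np.random.randn(n_lags, d_embed)      # stand-in for X_embed + PE
W_Q = 0.02 * np.random.randn(d_embed, d_embed)
W_K = 0.02 * np.random.randn(d_embed, d_embed)
W_V = 0.02 * np.random.randn(d_embed, d_embed)
Q, K, V = X_input @ W_Q, X_input @ W_K, X_input @ W_V   # each 12 x 64
```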
Once we have these Q, K, and V, the rest is standard transformer machinery, whether the input is a time series or a sequence of words. The output is a context matrix Z of dimension 12 × 64 in our case (the same dimensions as Xinput). Each row of Z still represents a different lagged set of mixed features.
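The core of that standard machinery is scaled dot-product self-attention. A bare-bones, single-head numpy sketch (omitting the multiple heads, residual connections, layer normalization, and feed-forward layer of a full transformer block, and with random stand-ins for Q, K, V) looks like this:

```python
import numpy as np

# Scaled dot-product self-attention: Z = softmax(Q K^T / sqrt(d_k)) V.
n_lags, d_embed = 12, 64
Q = np.random.randn(n_lags, d_embed)    # stand-ins for the projections above
K = np.random.randn(n_lags, d_embed)
V = np.random.randn(n_lags, d_embed)

scores = Q @ K.T / np.sqrt(d_embed)                            # 12 x 12 attention scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                 # softmax over the 12 lags
Z = weights @ V                                                # context matrix, 12 x 64
```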
For ease of downstream processing, AlphaPortfolio flattens the Z matrix into a feature vector of dimension 768 × 1, which the authors simply call r (not to be confused with the raw returns R(t)).
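In code, this flattening is just a reshape (with a random stand-in for the context matrix Z):

```python
import numpy as np

# Flatten the 12 x 64 context matrix into a 768-dimensional feature vector
# for downstream supervised or reinforcement learning.
Z = np.random.randn(12, 64)   # stand-in for the context matrix from the attention step
r = Z.flatten()               # shape (768,)
```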
As usual, you can use these transformed features downstream for supervised or reinforcement learning as you like. In the next blog post, we will talk about what happens when we have more than one stock we want to use as input.