In our previous blog post, we introduced latent variable models, where the latent variable can be thought of as a feature vector that has been “encoded” efficiently. This encoding turns the feature vector X into a context vector z. Latent variable models sound very GenAI-zy, but they descend from models that quant traders have long been familiar with.
No doubt you have heard of PCA or SVD (see Chapter 3 of our book for a primer)? Principal components or singular vectors are ways to represent returns in terms of a small number of variables. These variables are latent, or hidden, because they are inferred from the observed returns themselves, and are not directly observable like the Fama-French factors such as HML or SMB. The benefit of applying these latent factors to model returns is that we need fewer parameters, i.e. dimensionality reduction. For example, the covariance matrix of 500 stocks’ returns has 125,250 parameters, whereas its 10-principal-component model has only 5,010 parameters. The methods to find these latent factors are diagonalization of the covariance matrix in the PCA case, or singular value decomposition of the “design” (data) matrix in the SVD case.
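To make the parameter counts and the PCA/SVD equivalence concrete, here is a minimal NumPy sketch. The counting convention (k loading vectors of length n plus k variances) is our assumption about how the 5,010 figure arises, and the returns are synthetic noise with hypothetical sizes, not market data.

```python
import numpy as np

def full_cov_params(n):
    # A symmetric n x n covariance matrix has n(n+1)/2 distinct entries.
    return n * (n + 1) // 2

def pca_model_params(n, k):
    # Assumed convention: k principal directions of length n, plus k variances.
    return n * k + k

rng = np.random.default_rng(0)
T, n, k = 1000, 50, 3                            # hypothetical sizes
returns = 0.01 * rng.standard_normal((T, n))     # synthetic returns
X = returns - returns.mean(axis=0)               # demean each column

# PCA route: diagonalize the sample covariance matrix.
cov = X.T @ X / (T - 1)
eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
pca_var = eigvals[::-1][:k]                      # top-k variances

# SVD route: decompose the demeaned data matrix directly.
U, s, Vt = np.linalg.svd(X, full_matrices=False) # singular values in descending order
svd_var = s[:k] ** 2 / (T - 1)

print(full_cov_params(500), pca_model_params(500, 10))
print(np.allclose(pca_var, svd_var))
```

Both routes give the same variances because the eigenvalues of the sample covariance matrix are the squared singular values of the demeaned data matrix divided by T-1.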
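More generally, latent variable models are used to model the probability distribution of the observed features X by summing over the possible values of the latent variable z:
$$p(X) = \sum_z p(X|z)\, p(z) \qquad (1)$$
(The sum becomes an integral when z is continuous.)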
In the simplest case, z is just a categorical variable that takes 0 or 1 as its value, with a Bernoulli (single-trial binomial) distribution, and p(X|z) is a Gaussian with parameters that depend on z. (You might think of the “context vector” z as encoding the information about X in the most compact manner possible: just 0 or 1.) Both the Bernoulli and the Gaussian distributions here have fixed, but unknown, parameters that do not depend on X. This is called a Gaussian Mixture Model (GMM), and p(X) is written as
$$p(X) = (1-\pi_1)\,\mathcal{N}(X|\mu_0, \Sigma_0) + \pi_1\,\mathcal{N}(X|\mu_1, \Sigma_1)$$
where $\pi_1$ is the probability of z=1, and $\mathcal{N}(X|\mu_z, \Sigma_z)$ is a Gaussian with different parameters for each z.
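As an illustration, a two-component mixture density in one dimension can be evaluated and sampled with a few lines of NumPy. This is only a sketch: the parameter values below are hypothetical, and `gaussian_pdf`, `gmm_pdf`, and `sample_gmm` are names we made up for this example.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Density of a 1-D Gaussian with mean mu and standard deviation sigma.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def gmm_pdf(x, pi1, mu, sigma):
    # Mixture density: weight (1 - pi1) on component z=0, pi1 on component z=1.
    return ((1 - pi1) * gaussian_pdf(x, mu[0], sigma[0])
            + pi1 * gaussian_pdf(x, mu[1], sigma[1]))

def sample_gmm(n, pi1, mu, sigma, rng):
    # Draw the hidden label z first, then X from the Gaussian selected by z.
    z = (rng.random(n) < pi1).astype(int)
    x = rng.normal(mu[z], sigma[z])
    return x, z

# Hypothetical daily-return regimes: z=0 calm, z=1 turbulent.
pi1 = 0.2
mu = np.array([0.001, -0.002])
sigma = np.array([0.01, 0.03])

rng = np.random.default_rng(0)
x, z = sample_gmm(10_000, pi1, mu, sigma, rng)

# Sanity check: the mixture density should integrate to about 1.
grid = np.linspace(-0.5, 0.5, 200_001)
dx = grid[1] - grid[0]
total_mass = gmm_pdf(grid, pi1, mu, sigma).sum() * dx
print(total_mass)
```

Note that the sampler generates z but a fitted model never sees it; only x is observed.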
In another familiar case, the values of z are no longer independently distributed: each $z_t$ at time t depends on its previous value $z_{t-1}$, governed by the transition probability $a_{ij}$:
$$a_{ij} = p(z_t = j \mid z_{t-1} = i)$$
This is the famous HMM (Hidden Markov Model). In an HMM, z still takes on 0 or 1 as values, and p(X|z) is still Gaussian.
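To see what this generative story looks like, here is a small simulation sketch of a two-state Gaussian HMM. The transition matrix and emission parameters are hypothetical values chosen for illustration, and `simulate_hmm` is our own helper name.

```python
import numpy as np

def simulate_hmm(T, A, mu, sigma, rng, z0=0):
    # A[i, j] is the probability of moving from state i to state j.
    u = rng.random(T)
    z = np.empty(T, dtype=int)
    z[0] = z0
    for t in range(1, T):
        # Jump to state 1 with probability A[previous state, 1], else state 0.
        z[t] = int(u[t] < A[z[t - 1], 1])
    # Emission: X_t is Gaussian with parameters selected by the hidden state.
    x = rng.normal(mu[z], sigma[z])
    return x, z

# Hypothetical "sticky" regimes: both states tend to persist.
A = np.array([[0.98, 0.02],
              [0.10, 0.90]])
mu = np.array([0.001, -0.002])
sigma = np.array([0.01, 0.03])

rng = np.random.default_rng(0)
x, z = simulate_hmm(20_000, A, mu, sigma, rng)

# Empirical frequency of 0 -> 1 transitions, to compare with A[0, 1].
from_state_0 = z[:-1] == 0
print((z[1:][from_state_0] == 1).mean())
```

The only change from a mixture model is that the label at time t is drawn conditionally on the label at time t-1 instead of independently.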
We don’t know the actual distribution p(z) of the hidden variable z. After all, it is hidden! So how do we estimate its probability? Unlike in PCA or SVD, where a matrix decomposition does the job, the training algorithm used to find these unknown but fixed parameters is the celebrated EM (Expectation-Maximization) algorithm.
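In the EM algorithm, as in the more general Variational Inference (VI) algorithm to be described later, the key to training the model is to introduce a proposal distribution q(z|X) (which we called the “encoder” in the VAE framework described in the previous blog post) which approximates the posterior p(z|X) rather than the prior p(z). We start by computing q(z|X) using some arbitrary initial parameters $\theta^{\text{old}}$, i.e. $q(z|X) = p(z|X, \theta^{\text{old}})$. In the EM algorithm framework, this proposal distribution is also variously called the membership probability, the responsibility, the posterior probability, the soft assignment, or the state occupancy probability. In our Gaussian mixture case, by Bayes' rule,
$$q(z=k|X) = \frac{\pi_k\,\mathcal{N}(X|\mu_k, \Sigma_k)}{\sum_{j \in \{0,1\}} \pi_j\,\mathcal{N}(X|\mu_j, \Sigma_j)}$$
where $\pi_k$ is the probability of z=k (so $\pi_0 = 1 - \pi_1$), and all parameters are evaluated at $\theta^{\text{old}}$.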
Notice the expectations computed for the Gaussians. That is why this is called the E-step. In the VAE framework, you can think of this step as obtaining the output of the encoder.
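For a one-dimensional, two-component mixture, the E-step amounts to a single application of Bayes' rule per sample. The sketch below uses our own function names and hypothetical parameter values.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Density of a 1-D Gaussian with mean mu and standard deviation sigma.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def e_step(x, pi1, mu, sigma):
    # Responsibility of component z=1 for each sample: prior times likelihood,
    # normalized over both components, using the current ("old") parameters.
    w0 = (1 - pi1) * gaussian_pdf(x, mu[0], sigma[0])
    w1 = pi1 * gaussian_pdf(x, mu[1], sigma[1])
    return w1 / (w0 + w1)

# Hypothetical current parameter guesses.
pi1_old = 0.5
mu_old = np.array([-1.0, 1.0])
sigma_old = np.array([1.0, 1.0])

x = np.array([-3.0, 0.0, 3.0])
resp = e_step(x, pi1_old, mu_old, sigma_old)
print(resp)  # soft assignment of each sample to component z=1
```

With these symmetric guesses, a sample at the midpoint x=0 is split evenly between the two components, while samples far to either side are assigned almost entirely to the nearer one.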
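Now to find a better θ in the next iteration, we are supposed to maximize the log likelihood $LL = \log p(X|\theta)$ by varying θ. But we don’t know the actual likelihood in Eqn (1) above; we only know an approximation
$$Q(\theta, \theta^{\text{old}}) = \sum_z p(z|X, \theta^{\text{old}})\, \log p(X, z|\theta) \qquad (2)$$
which is the expectation of the joint log likelihood over the proposal distribution $q(z|X) = p(z|X, \theta^{\text{old}})$.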
Heck, we will maximize Q w.r.t. θ instead. This is the M-step. Next we set $\theta^{\text{old}} = \theta$, using the θ that was just optimized, and rinse and repeat until convergence (e.g. when Q no longer increases significantly). In the VAE framework, you can think of this as a “backpropagation” step that optimizes the log likelihood function that the “decoder” $p(X|z, \theta)$ generates.
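Putting the E-step and M-step together, a bare-bones EM loop for a 1-D two-component Gaussian mixture might look like the sketch below. Two choices here are ours rather than anything prescribed above: the percentile-based initialization, and stopping on the observed-data log likelihood (which EM never decreases) instead of on Q. The data are synthetic.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    # Log density of a 1-D Gaussian.
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def fit_gmm_em(x, n_iter=500, tol=1e-8):
    # Crude initialization (an assumption of this sketch).
    pi1 = 0.5
    mu = np.percentile(x, [25, 75]).astype(float)
    sigma = np.full(2, x.std())
    history = []
    for _ in range(n_iter):
        # E-step: responsibilities under the current parameters, in log space.
        log_w = np.stack([
            np.log(1 - pi1) + log_gauss(x, mu[0], sigma[0]),
            np.log(pi1) + log_gauss(x, mu[1], sigma[1]),
        ])                                            # shape (2, N)
        log_norm = np.logaddexp(log_w[0], log_w[1])   # log p(x_n) per sample
        resp = np.exp(log_w - log_norm)
        history.append(log_norm.sum())
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # M-step: closed-form maximizers of Q for a Gaussian mixture.
        nk = resp.sum(axis=1)
        mu = resp @ x / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
        pi1 = nk[1] / x.size
    return pi1, mu, sigma, history

# Synthetic data: 70% from N(-2, 1), 30% from N(3, 0.5).
rng = np.random.default_rng(0)
labels = rng.random(5000) < 0.3
x = np.where(labels, rng.normal(3.0, 0.5, 5000), rng.normal(-2.0, 1.0, 5000))

pi1_hat, mu_hat, sigma_hat, history = fit_gmm_em(x)
print(pi1_hat, mu_hat, sigma_hat, len(history))
```

For Gaussian mixtures the M-step needs no numerical optimizer: the maximizers of Q are responsibility-weighted means, variances, and proportions.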
Now for the general deep latent variable model, p and q are still Gaussians, but their parameters are no longer constants. They are sample-specific (i.e. they are themselves functions of X). Researchers typically denote the parameters for the encoder q(z|X) as φ and those for the decoder p(X|z) as θ. These parameters are now the weights and biases of two separate DNNs (deep neural networks). As shown in the VAE diagram of the previous blog post, the parameters for q are $(\mu_q, \sigma_q) = \text{DNN}_\phi(X)$ and those for p are $(\mu_p, \sigma_p) = \text{DNN}_\theta(z)$.
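The following NumPy sketch shows only the forward pass of such an encoder-decoder pair, with untrained random weights, to make the "parameters are functions of the sample" idea tangible. The layer sizes, the tanh activation, and the log-sigma output convention are all our assumptions, not details from the previous post.

```python
import numpy as np

def init_mlp(n_in, n_hidden, n_out, rng):
    # One-hidden-layer network; these weights and biases play the role of phi or theta.
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)), "b2": np.zeros(n_out),
    }

def gaussian_head(params, inputs):
    # Map each input row to the mean and standard deviation of a diagonal Gaussian.
    h = np.tanh(inputs @ params["W1"] + params["b1"])
    out = h @ params["W2"] + params["b2"]
    mu, log_sigma = np.split(out, 2, axis=-1)
    return mu, np.exp(log_sigma)          # exp keeps sigma positive

rng = np.random.default_rng(0)
n_features, n_latent = 8, 2               # hypothetical dimensions
phi = init_mlp(n_features, 16, 2 * n_latent, rng)     # encoder parameters
theta = init_mlp(n_latent, 16, 2 * n_features, rng)   # decoder parameters

X = rng.standard_normal((5, n_features))  # five synthetic samples

# Encoder: each sample gets its own (mu_q, sigma_q).
mu_q, sigma_q = gaussian_head(phi, X)
# Draw z from q(z|X) as mean plus scaled standard-normal noise.
z = mu_q + sigma_q * rng.standard_normal(mu_q.shape)
# Decoder: each z gets its own (mu_p, sigma_p).
mu_p, sigma_p = gaussian_head(theta, z)

print(mu_q.shape, mu_p.shape)
```

Contrast this with the mixture model, where the same handful of Gaussian parameters serves every sample; here every row of X produces a different pair of distributions.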
Note the parallel with transformers. Conventional feature selection, like the parameters of a conventional latent variable model (e.g. GMM and HMM), is fixed for the entire data set. But transformer-based feature selection, like the parameters of a VAE, is sample-specific, offering much more flexibility. Of course, the price of this flexibility and specificity is that a VAE can no longer be trained by the EM algorithm. It requires a method called Variational Inference (VI) approximation. Similar to Eqn (2) above, $LL = \log p(X|\theta)$ can be written as an expectation over $q(z|X, \phi)$, but this time with an explicit error term $D_{KL}$:
$$\log p(X|\theta) = \underbrace{E_{q(z|X,\phi)}\left[\log p(X, z|\theta) - \log q(z|X, \phi)\right]}_{\mathcal{L}(\theta, \phi; X)} + D_{KL}\left(q(z|X, \phi)\,\|\,p(z|X, \theta)\right) \qquad (3)$$
Note the first term, $\mathcal{L}$, called the ELBO (Evidence Lower Bound, pronounced “elbow”), is analogous to Q in Eqn (2), except that it is an explicit function of X. The error term is the Kullback-Leibler divergence (essentially the difference) between the proposal distribution q(z|X) and the actual posterior distribution p(z|X). Because $D_{KL} \ge 0$ always, $LL \ge \mathcal{L}$ always, and by maximizing $\mathcal{L}$ we push up a lower bound on LL. At the same time, as $\mathcal{L}$ approaches LL, $D_{KL}$ goes to zero, and q(z|X) gets closer to p(z|X). In other words, the proposal gets more and more realistic. To maximize $\mathcal{L}$ (whose negative serves as the loss function of the encoder-decoder network DNN(X) and DNN(z)), we can apply SGD (stochastic gradient descent) on both φ and θ simultaneously. We will also need to assume a simple Gaussian prior $p(z) = \mathcal{N}(0, I)$. For more details on training, see Chapter 6 of our book.
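The split of the log likelihood into an ELBO plus a non-negative KL term can be checked numerically. The sketch below deliberately swaps the VAE for a toy case with a binary z and a single 1-D observation, where every term is an exact two-term sum. The function names and parameter values are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Density of a 1-D Gaussian with mean mu and standard deviation sigma.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def decompose(x, q1, pi1, mu, sigma):
    # Returns (log likelihood, ELBO, KL, true posterior of z=1) for a proposal
    # that puts probability q1 on z=1. Requires 0 < q1 < 1.
    prior = np.array([1 - pi1, pi1])
    lik = np.array([gaussian_pdf(x, mu[0], sigma[0]),
                    gaussian_pdf(x, mu[1], sigma[1])])
    joint = prior * lik                       # p(x, z) for z = 0, 1
    evidence = joint.sum()                    # p(x)
    posterior = joint / evidence              # p(z | x)
    q = np.array([1 - q1, q1])
    elbo = np.sum(q * (np.log(joint) - np.log(q)))
    kl = np.sum(q * (np.log(q) - np.log(posterior)))
    return np.log(evidence), elbo, kl, posterior[1]

# Hypothetical model parameters and one observation.
pi1, mu, sigma = 0.3, np.array([-1.0, 2.0]), np.array([1.0, 0.5])
x_obs = 0.8

for q1 in (0.1, 0.5, 0.9):
    ll, elbo, kl, post1 = decompose(x_obs, q1, pi1, mu, sigma)
    print(q1, ll, elbo + kl, kl)

# When the proposal equals the true posterior, the gap closes.
ll, elbo_best, kl_best, _ = decompose(x_obs, post1, pi1, mu, sigma)
print(ll, elbo_best, kl_best)
```

The log likelihood is the same for every proposal; only the way it is divided between the ELBO and the KL gap changes, which is why raising the ELBO with respect to the proposal shrinks the gap.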
The entire process of training a VAE (encoder+decoder), just as in training a GMM or HMM, is unsupervised – no labels are needed. As we mentioned in the previous blog post, this allows us to pre-train a VAE using a vast amount of unlabeled data with perhaps only some relevance to the labeled data at hand. Once the VAE is pre-trained, we can use z (a sample of the output of the encoder) as features to train a supervised model for classification or regression. Other ways of using, training, or fine-tuning the VAE were explained in that blog post as well.
In summary, we see how a VAE is really a generalization of the more familiar latent variable models like GMM and HMM, except here the parameters of the distributions q(z|X) and p(X|z) themselves depend on the input sample X. This allows for much more flexibility in modeling real-world data, just as the transformer allows for sample-specific feature selection. The price to pay for this flexibility is that there are many more parameters (weights and biases of the encoder and decoder) to fit, and we need much more data to fit them. But the saving grace is that we only need unlabeled training data, which is abundant in most domains including finance.
Predictnow.ai has built a latent variable model that infers the daily risk-on/off regime of financial time series. When applied to the SPX (holding cash when it is risk-off), its Sharpe ratio beats that of the index, 0.93 vs 0.69, in an out-of-sample backtest from 2005-09-01 to 2025-06-30. When applied to GLD (gold ETF), its Sharpe ratio beats that of GLD, 0.2 vs 0.1, in an out-of-sample backtest from 2011-01-03 to 2025-10-21. If you are interested, please contact info@predictnow.ai to subscribe.
