Foundation models for tabular data

My academic work focused on tabular data in one way or another, whereas I have spent most of my industry career working with computer vision foundation models. The recent tabular foundation models (TFMs) therefore sparked quite a bit of curiosity and a little astonishment on my end.

This blog post is simply a brief overview of current TFMs that are based on prior-data fitted networks (PFNs) [Müller et al., 2022; Hollmann et al., 2022]. After such a model has been pre-trained on large amounts of synthetic data, it can be used on real data through a form of in-context learning (ICL).

Prior-data fitted networks

PFNs have emerged in the context of Bayesian inference for a supervised ML problem [Müller et al., 2022; Nagler, 2023]. Rather than explicitly computing the posterior distribution of the unknown parameters, a transformer is used to approximate the corresponding predictive distribution directly.

To this end, one trains an appropriate PFN model over a prior distribution of synthetic tabular datasets. This prior embodies certain structural assumptions about data-generating mechanisms. It needs to be specified in a way that is amenable to sampling. A PFN can then be pre-trained with a large number of synthetically generated datasets from the prior.

After pre-training on synthetic data has been completed, one can compute predictions for unseen real datasets through ICL. This happens by passing the labeled train split together with the test inputs into the PFN in a single forward pass.

Posterior predictive distribution

As a brief reminder, consider a standard supervised learning task where one wants to find a model predicting certain target variables \(\boldsymbol{y}\) for provided values of the inputs \(\boldsymbol{x}\). Given an i.i.d. dataset \(D_{\mathrm{train}} = \{(\boldsymbol{x}_i, \boldsymbol{y}_i)\}_{i=1}^N\) of realized inputs and targets, one can infer the unknown parameters \(\boldsymbol{\phi}\) of a statistical data model \(p(D_{\mathrm{train}} \vert \boldsymbol{\phi})\).

Adopting a Bayesian inference approach to this task, one starts by eliciting a prior distribution \(p(\boldsymbol{\phi})\) on the unknown parameters. This yields the joint distribution \(p(D_{\mathrm{train}}, \boldsymbol{\phi}) = p(D_{\mathrm{train}} \vert \boldsymbol{\phi}) \, p(\boldsymbol{\phi})\). The posterior distribution \(p(\boldsymbol{\phi} \vert D_{\mathrm{train}})\) then summarizes the information about the model parameters after conditioning on the observed data. It is obtained via Bayes’ theorem

\[p(\boldsymbol{\phi} \vert D_{\mathrm{train}}) = \frac{p(D_{\mathrm{train}} \vert \boldsymbol{\phi}) \, p(\boldsymbol{\phi})}{p(D_{\mathrm{train}})}, \quad p(D_{\mathrm{train}}) = \int p(D_{\mathrm{train}} \vert \boldsymbol{\phi}) \, p(\boldsymbol{\phi}) \, d \boldsymbol{\phi}.\]

Following this, one usually wants to make predictions for a new input \(\boldsymbol{x}_{\mathrm{test}}\). In the Bayesian framework, this is accomplished by computing the posterior predictive distribution (PPD)

\[\begin{align*} p(\boldsymbol{y}_{\mathrm{test}} \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}}) &= \int p(\boldsymbol{y}_{\mathrm{test}} \vert \boldsymbol{x}_{\mathrm{test}}, \boldsymbol{\phi}) \, p(\boldsymbol{\phi} \vert D_{\mathrm{train}}) \, d \boldsymbol{\phi} \\ &\propto \int p(\boldsymbol{y}_{\mathrm{test}} \vert \boldsymbol{x}_{\mathrm{test}}, \boldsymbol{\phi}) \, p(D_{\mathrm{train}} \vert \boldsymbol{\phi}) \, p(\boldsymbol{\phi}) \, d \boldsymbol{\phi}. \end{align*}\]

This probability distribution represents the uncertainty of the predictions \(\boldsymbol{y}_{\mathrm{test}}\) in a fully Bayesian fashion.
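As a concrete instance of the PPD, here is a minimal sketch for Bayesian linear regression, one of the few models where the integral above has a closed form. The model itself, the function name `posterior_predictive`, and the hyperparameters `alpha` and `sigma` are assumptions made purely for illustration.

```python
import numpy as np

def posterior_predictive(X, y, x_test, alpha=1.0, sigma=0.5):
    """Closed-form PPD for a toy Bayesian linear regression.

    Assumed model (illustrative only): phi ~ N(0, I / alpha) and
    y = x @ phi + eps with eps ~ N(0, sigma**2).
    """
    d = X.shape[1]
    # Posterior p(phi | D_train) = N(mean, cov)
    precision = alpha * np.eye(d) + X.T @ X / sigma**2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / sigma**2
    # PPD p(y_test | x_test, D_train) = N(pred_mean, pred_var)
    pred_mean = x_test @ mean
    pred_var = sigma**2 + x_test @ cov @ x_test
    return pred_mean, pred_var

rng = np.random.default_rng(0)
phi_true = np.array([1.0, -2.0])
X = rng.normal(size=(50, 2))
y = X @ phi_true + 0.5 * rng.normal(size=50)
x_test = np.array([0.3, 0.7])
pred_mean, pred_var = posterior_predictive(X, y, x_test)
print(pred_mean, pred_var)
```

The predictive variance is the sum of the observation noise and the remaining parameter uncertainty, so it always exceeds `sigma**2` and shrinks toward it as the train set grows.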

Prior fitting

Now we shift the focus away from the explicit posterior for a specific train set to the prior distribution of tabular datasets. In the probabilistic setup outlined above, this prior \(p(D)\) over plausible datasets \(D\) has the form

\[p(D) = \int p(D \vert \boldsymbol{\phi}) \, p(\boldsymbol{\phi}) \, d \boldsymbol{\phi}.\]

This notation also covers the case where one splits the dataset \(D = D_{\mathrm{train}} \cup D_{\mathrm{test}}\) into a train and test part. For simplicity, below we consider a single test sample \(D_{\mathrm{test}} = \{(\boldsymbol{x}_{\mathrm{test}}, \boldsymbol{y}_{\mathrm{test}})\}\) only.

The main idea of PFNs is to find the parameters \(\boldsymbol{\theta}\) of a neural network \(q_{\boldsymbol{\theta}}(\boldsymbol{y}_{\mathrm{test}} \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}})\) such that it best approximates the PPD on average over the dataset prior. This variant of amortized Bayesian inference is referred to as prior fitting. It can be accomplished by minimizing the prior-data negative log-likelihood

\[\ell(\boldsymbol{\theta}) = \mathbb{E}_{D_{\mathrm{train}}, \{(\boldsymbol{x}_{\mathrm{test}}, \boldsymbol{y}_{\mathrm{test}})\}} \left[ - \log q_{\boldsymbol{\theta}}(\boldsymbol{y}_{\mathrm{test}} \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}}) \right].\]

As usual, one can show that minimizing such a negative log-likelihood is equivalent to minimizing the expected KL divergence of \(p(\cdot \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}})\) from \(q_{\boldsymbol{\theta}}(\cdot \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}})\):

\[\begin{align*} \hat{\boldsymbol{\theta}} &= \operatorname*{argmin}_{\boldsymbol{\theta}} \mathbb{E}_{D_{\mathrm{train}}, \boldsymbol{x}_{\mathrm{test}}} \left[ \operatorname{KL}(p(\cdot \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}}), q_{\boldsymbol{\theta}}(\cdot \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}})) \right] \\ &= \operatorname*{argmin}_{\boldsymbol{\theta}} \mathbb{E}_{D_{\mathrm{train}}, \boldsymbol{x}_{\mathrm{test}}} \left[ \operatorname{H}(p(\cdot \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}}), q_{\boldsymbol{\theta}}(\cdot \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}})) \right] \\ &= \operatorname*{argmin}_{\boldsymbol{\theta}} \mathbb{E}_{D_{\mathrm{train}}, \boldsymbol{x}_{\mathrm{test}}, \boldsymbol{y}_{\mathrm{test}}} \left[ - \log q_{\boldsymbol{\theta}}(\boldsymbol{y}_{\mathrm{test}} \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}}) \right]. \end{align*}\]
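To spell out the step from the KL divergence to the cross-entropy, fix \(\boldsymbol{x}_{\mathrm{test}}\) and \(D_{\mathrm{train}}\) and abbreviate \(p = p(\cdot \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}})\) and \(q_{\boldsymbol{\theta}} = q_{\boldsymbol{\theta}}(\cdot \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}})\). With \(\operatorname{H}(p)\) denoting the entropy of the PPD (a symbol introduced only for this remark), one has

\[\operatorname{KL}(p, q_{\boldsymbol{\theta}}) = \operatorname{H}(p, q_{\boldsymbol{\theta}}) - \operatorname{H}(p), \quad \operatorname{H}(p, q_{\boldsymbol{\theta}}) = \mathbb{E}_{\boldsymbol{y}_{\mathrm{test}} \sim p} \left[ - \log q_{\boldsymbol{\theta}}(\boldsymbol{y}_{\mathrm{test}} \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}}) \right].\]

Assuming it is finite, the entropy term does not depend on \(\boldsymbol{\theta}\) and therefore drops out of the argmin. The outer expectation over \((D_{\mathrm{train}}, \boldsymbol{x}_{\mathrm{test}})\) and the inner expectation over \(\boldsymbol{y}_{\mathrm{test}}\) then combine into a single expectation over the joint dataset prior.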

Of course, one can generalize the derivation above for a single test sample to multiple test samples \(D_{\mathrm{test}} = \{(\boldsymbol{x}_j, \boldsymbol{y}_j)\}_{j=1}^M\). The corresponding objective then reads

\[\ell(\boldsymbol{\theta}) = \mathbb{E}_{D_{\mathrm{train}}, D_{\mathrm{test}}} \left[ - \sum_{j=1}^M \log q_{\boldsymbol{\theta}}(\boldsymbol{y}_j \vert \boldsymbol{x}_j, D_{\mathrm{train}}) \right].\]
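The objective can be made tangible with a toy prior where the exact PPD is known. The sketch below is an assumption-laden illustration rather than a PFN: it drops the inputs \(\boldsymbol{x}\), uses a uniform prior on a Bernoulli parameter, and compares two hand-written predictors instead of a trained network. All function names are hypothetical. Since the exact PPD minimizes the prior-data negative log-likelihood, its Monte Carlo estimate should come out lower than that of a plug-in predictor.

```python
import numpy as np

def sample_dataset(rng, n_train=10):
    """Draw (D_train, y_test) from the toy dataset prior p(D)."""
    phi = rng.uniform()                    # phi ~ Uniform(0, 1) = Beta(1, 1)
    y = rng.random(n_train + 1) < phi      # y_i ~ Bernoulli(phi)
    return y[:-1], y[-1]

def q_exact(y_train):
    """Exact PPD p(y_test = 1 | D_train), i.e. Laplace's rule of succession."""
    return (y_train.sum() + 1) / (len(y_train) + 2)

def q_plugin(y_train, eps=0.05):
    """Plug-in maximum-likelihood estimate, clipped away from 0 and 1."""
    return np.clip(y_train.mean(), eps, 1 - eps)

def prior_data_nll(q, n_datasets=20000, seed=0):
    """Monte Carlo estimate of the prior-data negative log-likelihood."""
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(n_datasets):
        y_train, y_test = sample_dataset(rng)
        p1 = q(y_train)
        losses.append(-np.log(p1 if y_test else 1 - p1))
    return float(np.mean(losses))

# Same seed for both predictors, so they are scored on identical datasets.
nll_exact = prior_data_nll(q_exact)
nll_plugin = prior_data_nll(q_plugin)
print(nll_exact, nll_plugin)
```

A PFN replaces the hand-written `q_exact` by a transformer whose parameters are tuned by stochastic gradient descent on the same kind of estimate.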

TabPFN architecture

Note that a single forward pass of the PFN \(q_{\boldsymbol{\theta}}(\boldsymbol{y}_{\mathrm{test}} \vert \boldsymbol{x}_{\mathrm{test}}, D_{\mathrm{train}})\) for ICL operates simultaneously on the training data \(D_{\mathrm{train}} = \{(\boldsymbol{x}_i, \boldsymbol{y}_i)\}_{i=1}^N\) and test inputs \(\boldsymbol{x}_{\mathrm{test}}\). Here, we briefly discuss the prominent TabPFN model family [Hollmann et al., 2022; Hollmann et al., 2025; Prior Labs Team, 2026] that implements this structure through carefully applied attention mechanisms.

Different architectural designs have been proposed that encode either individual cells or entire rows of the train set and the test features. The corresponding tokens can then attend to each other in certain ways.

As an example, the original TabPFN-1 architecture [Hollmann et al., 2022] embeds each row and lets train set rows attend to each other. However, they are not allowed to attend to the test rows. Test rows can attend to all train rows, but they cannot attend to one another. Positional encodings are omitted for row-wise permutation invariance.
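The attention pattern described above can be written down as a boolean mask. This is a minimal sketch, assuming that rows are ordered as train rows followed by test rows; the function name `pfn_attention_mask` is hypothetical, and whether a test row may attend to itself is left out here as a simplification.

```python
import numpy as np

def pfn_attention_mask(n_train, n_test):
    """Boolean mask M[i, j]: may row i (query) attend to row j (key)?

    Rows are assumed to be ordered as [train rows, test rows].
    """
    n = n_train + n_test
    mask = np.zeros((n, n), dtype=bool)
    # Every row, train or test, may attend to all train rows.
    mask[:, :n_train] = True
    # The test-row columns stay False: no row attends to a test row.
    return mask

mask = pfn_attention_mask(n_train=3, n_test=2)
print(mask.astype(int))
```

With such a mask, the prediction for one test row does not depend on which other test rows are in the same forward pass.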

The successor TabPFN-2 [Hollmann et al., 2025] uses cell-based embeddings instead and alternates between row-wise and feature-wise attention layers. Since this scales quadratically with the number of features, the recently published TabPFN-3 [Prior Labs Team, 2026] again uses row-based representations in combination with a row and column compression scheme that is taken from [Qu et al., 2025; Qu et al., 2026].
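To illustrate the idea of alternating attention over a cell-based representation, here is a heavily simplified sketch. It is not the TabPFN-2 implementation: learned projections, multiple heads, feed-forward layers, and the train/test masking are all omitted, and the names `self_attention` and `two_way_block` are hypothetical.

```python
import numpy as np

def self_attention(x):
    """Parameter-free softmax self-attention along the sequence axis.

    x has shape (batch, seq, dim); learned projections are omitted.
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def two_way_block(cells):
    """cells has shape (n_rows, n_features, dim): one embedding per cell."""
    # Attention across the features within each row:
    # batch = rows, sequence = features.
    cells = cells + self_attention(cells)
    # Attention across the rows within each feature:
    # batch = features, sequence = rows.
    cells_t = np.swapaxes(cells, 0, 1)
    cells_t = cells_t + self_attention(cells_t)
    return np.swapaxes(cells_t, 0, 1)

rng = np.random.default_rng(0)
cells = rng.normal(size=(6, 4, 8))   # 6 rows, 4 features, 8-dim embeddings
out = two_way_block(cells)
print(out.shape)
```

The score tensors in the two steps have shapes `(n_rows, n_features, n_features)` and `(n_features, n_rows, n_rows)`, which makes the quadratic growth in the number of features and rows visible.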

Dataset prior

Since the PFN pre-training objective is defined with respect to a dataset prior, the specification of a suitable distribution is crucial. In order to ensure a good PPD approximation for a real inference task with unseen data, this prior should at least to some degree reflect the complexity and diversity of real-world tabular datasets.

For the practical optimization, a Monte Carlo estimate of the objective is employed. The prior distribution can therefore be specified implicitly as a sampling scheme to generate synthetic datasets.

TabPFN utilizes structural causal models (SCMs) that are based on a directed acyclic graph (DAG) with certain computational nodes [Hollmann et al., 2025; Prior Labs Team, 2026]. After high-level parameters such as the dataset size or the number of features are sampled, the DAG structure is randomly generated. The root nodes are randomly initialized and the computational graph is traversed while injecting noise at each node. Eventually, feature, target, and hidden variables are selected from the SCM nodes, and a final postprocessing routine is applied to the resulting classification or regression dataset.
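The sampling scheme can be sketched in a few lines. The following is a toy version under strong assumptions, not the actual TabPFN prior: the edge probability, the `tanh` node function, the noise scale, and the function name `sample_scm_dataset` are all made up for illustration, and the sampling of high-level parameters as well as the postprocessing step are omitted.

```python
import numpy as np

def sample_scm_dataset(rng, n_rows=100, n_nodes=8, n_features=4, noise=0.1):
    """Sample one synthetic regression dataset from a random toy SCM."""
    # Random DAG: a strictly upper-triangular adjacency matrix is acyclic,
    # and the node order 0, 1, ... is already a topological order.
    adjacency = np.triu(rng.random((n_nodes, n_nodes)) < 0.4, k=1)
    weights = rng.normal(size=(n_nodes, n_nodes)) * adjacency

    values = np.zeros((n_rows, n_nodes))
    for j in range(n_nodes):
        eps = rng.normal(size=n_rows)
        if not adjacency[:, j].any():
            values[:, j] = eps                         # root node: random init
        else:
            parents = values @ weights[:, j]           # weighted sum of parents
            values[:, j] = np.tanh(parents) + noise * eps

    # Pick one target node and the feature nodes; the rest stay hidden.
    chosen = rng.permutation(n_nodes)[: n_features + 1]
    X, y = values[:, chosen[1:]], values[:, chosen[0]]
    return X, y

rng = np.random.default_rng(0)
X, y = sample_scm_dataset(rng)
print(X.shape, y.shape)
```

Calling the function repeatedly with fresh random draws yields a stream of datasets, which is all the Monte Carlo estimate of the pre-training objective needs.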

The generative SCM process attempts to capture some of the main features of tabular data. This includes categorical variables, missing values, and spatial or temporal correlations.

Discussion

Prior-fitted TFMs offer a thought-provoking perspective on tabular data modeling. The training and inference phases of a classical supervised approach are replaced by a pre-training step on synthetic data and ICL with real data. This elegant inference mechanism avoids gradient-based re-training for each new dataset. Other strengths of PFNs include built-in predictive uncertainty quantification and their ability to handle small datasets.

PFNs have a number of evident weaknesses as well. The prior fitting procedure is expensive and requires a large number of synthetic datasets. Moreover, no matter how sophisticated the prior is, it is unlikely to capture all peculiarities of real-world tabular data. Since computing predictions at inference time requires the entire train set as part of the input, the approach may be less suitable for online or batched predictions in latency-constrained applications. In general, the approach does not scale well to large datasets.

In conclusion, PFNs establish a fascinating alternative to classical tree-based ensemble methods. They may be especially beneficial for small to mid-sized datasets. It will be super interesting to see how TFMs will evolve in the future.