Ask a modern neoantigen pipeline which peptides a patient's tumor will display on its surface, and it will answer with something close to authority. Ask it which of those peptides the patient's immune system will actually attack, and the confident voice cracks. This is the central, stubborn problem of the entire field: we can predict the stage on which the immune drama plays out, but not whether the audience reacts. Everything downstream of that gap — vaccine design, clinical trial enrollment, the economics of personalized therapy — inherits its uncertainty.
The reason the gap is so persistent is that "neoantigen prediction" is not one problem but three stacked on top of each other: binding, presentation, and immunogenicity. The first two are largely tractable. The third is where the field still struggles — and conflating them is the single most common source of overpromising. It is worth separating them carefully, because the trajectory is genuinely encouraging once you see where the real bottleneck sits.
Predicting whether a short peptide will bind a given MHC class I allele is, by the standards of computational biology, a success story. Tools in the NetMHCpan family — trained on large bodies of binding-affinity and eluted-ligand data — predict peptide–MHC binding with high accuracy across thousands of alleles, and independent benchmarking has repeatedly confirmed their performance. The biophysics is favorable: binding is a relatively local, well-defined molecular event, the positive data are abundant, and mass-spectrometry immunopeptidomics has supplied ground truth at scale.
There is even good evidence that a refinement of binding — the stability of the peptide–MHC complex, rather than affinity alone — correlates better with downstream T-cell immunogenicity, because the complex must persist at the cell surface long enough to be sampled by rare circulating T cells. But this is still a property of the peptide and the MHC. It is not yet about the immune system that has to respond.
The next layer up asks whether the peptide will actually be processed, transported, loaded, and presented. That question is harder than raw binding, but the tools for it have also matured. Models trained on eluted-ligand data implicitly capture proteasomal cleavage and the rest of the antigen-processing machinery, and they meaningfully outperform affinity-only predictors at identifying naturally presented ligands. When practitioners talk about tools "agreeing," this is the layer they mean: on presentation, different pipelines tend to converge on similar candidate lists.
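One way to picture the two tractable layers is as successive peptide-level filters. The sketch below is only an illustration of that structure: the `Candidate` fields, the cutoff values, and the peptide strings are all hypothetical placeholders, not the output format or recommended thresholds of any real tool.

```python
from dataclasses import dataclass

# Illustrative cutoffs only (assumptions); real pipelines tune these per tool and allele.
BINDING_RANK_MAX = 2.0        # keep peptides in the top 2% of predicted binders
PRESENTATION_SCORE_MIN = 0.5  # minimum presentation-style score on a 0..1 scale


@dataclass(frozen=True)
class Candidate:
    peptide: str
    allele: str
    binding_rank: float        # percentile rank; lower means stronger predicted binding
    presentation_score: float  # 0..1; higher means more likely to be presented


def passes_binding(c: Candidate) -> bool:
    return c.binding_rank <= BINDING_RANK_MAX


def passes_presentation(c: Candidate) -> bool:
    return c.presentation_score >= PRESENTATION_SCORE_MIN


def shortlist(candidates):
    """Apply both peptide-level filters; this makes no claim about immunogenicity."""
    return [c for c in candidates if passes_binding(c) and passes_presentation(c)]


# Synthetic placeholder peptides and scores, not real epitopes or predictions.
candidates = [
    Candidate("ACDEFGHIK", "HLA-A*02:01", 0.3, 0.82),  # passes both filters
    Candidate("LMNPQRSTV", "HLA-A*02:01", 1.5, 0.31),  # predicted binder, weak presentation
    Candidate("WYACDEFGH", "HLA-A*02:01", 6.0, 0.70),  # fails the binding filter
]

survivors = shortlist(candidates)
print([c.peptide for c in survivors])
```

Everything in this sketch depends only on the peptide and the allele. Nothing in it refers to the patient's T cells.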
Presentation is where the comfortable part of the pipeline ends. A peptide that is reliably displayed on a tumor cell surface has cleared every hurdle a molecule can clear on its own. What remains is no longer a property of the peptide at all.
Immunogenicity is a property of the encounter between a presented peptide–MHC complex and a particular patient's T-cell receptor repertoire. That single sentence contains the whole difficulty, and it explains why the problem is structurally harder than the two below it.
First, the repertoire is individual and enormous. Each patient carries a distinct, somatically generated set of TCRs, shaped by which receptors survived thymic selection. Whether any TCR in that repertoire recognizes a given neoepitope is not encoded in the peptide — it is encoded in an immune history the model usually cannot see. A peptide that is immunogenic in one patient can be invisible in another with the same HLA type.
Second, tolerance actively suppresses exactly the responses we want. A neoantigen is one of the patient's own peptides altered by a somatic mutation; the more it resembles self, the more likely the T cells that could recognize it were deleted in the thymus (central tolerance) or silenced in the periphery (anergy, regulatory T cells). A peptide that differs only subtly from its wild-type counterpart is therefore often immunologically inert, however well it is displayed. Prediction has to reason about an absence, the clones that no longer exist, which is far harder than reasoning about a molecular fit.
Third, and most practically limiting, the training data are biased. The large, clean repositories of immunogenic epitopes are dominated by viral and pathogen-derived sequences and skew heavily toward a handful of common alleles such as HLA-A*02:01. Cancer neoepitopes are scarce, the confirmed-positive sets are small, and the negative class vastly outnumbers the positive one. Worse, pathogen epitopes and non-immunogenic cancer peptides can occupy overlapping ranges of binding and stability — so a model trained on viral data learns a decision boundary that does not transfer cleanly to the tumor setting. Recent work has argued that uncontrolled training-data bias not only misleads the models but also corrupts the benchmarks used to grade them.
The clearest diagnostic of the gap is concordance. On binding and presentation, prediction tools largely agree with one another. On immunogenicity, they diverge — different tools rank different peptides, and the overlap among their top candidates is modest. Agreement is a proxy for whether a problem is well-posed; the contrast between high binding-concordance and low immunogenicity-concordance is the field telling on itself.
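Concordance of this kind can be quantified in many ways. A minimal sketch, assuming each tool emits a score per peptide, is the mean pairwise Jaccard overlap of the tools' top-k lists. The tool names and scores below are invented to show the contrast and are not measurements from any real predictor.

```python
from itertools import combinations


def top_k(scores: dict, k: int) -> set:
    """Return the k highest-scoring peptides from a {peptide: score} mapping."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(tool_scores: dict, k: int) -> float:
    """Average Jaccard overlap of top-k lists over every pair of tools."""
    if len(tool_scores) < 2:
        raise ValueError("need at least two tools to compare")
    tops = [top_k(s, k) for s in tool_scores.values()]
    pairs = list(combinations(tops, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical scores for six peptides from three presentation tools...
presentation = {
    "tool_a": {"P1": 0.90, "P2": 0.80, "P3": 0.70, "P4": 0.20, "P5": 0.10, "P6": 0.05},
    "tool_b": {"P1": 0.85, "P2": 0.60, "P3": 0.75, "P4": 0.30, "P5": 0.20, "P6": 0.10},
    "tool_c": {"P1": 0.70, "P2": 0.90, "P3": 0.60, "P4": 0.65, "P5": 0.10, "P6": 0.05},
}
# ...and from three immunogenicity tools that rank the same peptides very differently.
immunogenicity = {
    "tool_x": {"P1": 0.90, "P2": 0.10, "P3": 0.20, "P4": 0.80, "P5": 0.30, "P6": 0.70},
    "tool_y": {"P1": 0.20, "P2": 0.90, "P3": 0.10, "P4": 0.80, "P5": 0.70, "P6": 0.30},
    "tool_z": {"P1": 0.10, "P2": 0.20, "P3": 0.90, "P4": 0.30, "P5": 0.80, "P6": 0.70},
}

presentation_overlap = mean_pairwise_overlap(presentation, k=3)
immunogenicity_overlap = mean_pairwise_overlap(immunogenicity, k=3)
print(f"presentation top-3 overlap:   {presentation_overlap:.2f}")
print(f"immunogenicity top-3 overlap: {immunogenicity_overlap:.2f}")
```

Jaccard on top-k sets ignores ordering within the list. A rank correlation would be a reasonable alternative when the full ranking matters.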
The cost of this shows up as false positives. The TESLA consortium — a multi-institution ground-truthing effort that pooled tumor data and experimentally tested predicted epitopes — found that the positive predictive value of top-ranked neoantigens was low across participating groups, with only a small fraction of predicted peptides confirmed immunogenic. The encouraging half of that same study is just as important: by combining features of both presentation and T-cell recognition, the consortium built a model that filtered out roughly 98% of non-immunogenic peptides at a precision above 0.70, and ensembling across pipelines lifted predictive value substantially. The signal is real. It is just thinly distributed, and no single binding-centric tool captures it.
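The arithmetic behind those figures is worth making explicit, because precision depends as much on how rare true positives are as on the model itself. The sketch below applies Bayes' rule to compute positive predictive value. The prevalence, sensitivity, and specificity values are illustrative assumptions chosen for the example; they are not numbers reported by TESLA.

```python
def precision(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Positive predictive value from Bayes' rule.

    prevalence:  fraction of candidate peptides that are truly immunogenic
    sensitivity: fraction of immunogenic peptides the model keeps
    specificity: fraction of non-immunogenic peptides the model filters out
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    total = true_pos + false_pos
    return true_pos / total if total > 0 else 0.0


# Illustrative assumption: 5% of candidates are immunogenic, the model keeps half of them.
PREVALENCE = 0.05
SENSITIVITY = 0.5

ppv_90 = precision(PREVALENCE, SENSITIVITY, specificity=0.90)  # about 0.21
ppv_98 = precision(PREVALENCE, SENSITIVITY, specificity=0.98)  # about 0.57

print(f"filtering 90% of negatives: precision {ppv_90:.2f}")
print(f"filtering 98% of negatives: precision {ppv_98:.2f}")
```

When positives are rare, a filter that removes nine in ten negatives still leaves mostly false positives in the shortlist. Only very high specificity moves precision into a useful range, which is why a result that combines a 98% filter rate with precision above 0.70 is notable.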
The productive frontier has largely abandoned the idea that a better binding score will solve immunogenicity. Instead, the work clusters around three bets, each attacking a different part of the problem above.
The first bet is multimodal modeling: fuse sequence, predicted 3D structure, and biochemical properties rather than relying on sequence alone. ImmunoStruct, published in Nature Machine Intelligence in 2025, is a clean example — it combines a sequence encoder, a graph transformer over AlphaFold-derived peptide–MHC structure, and biochemical features, trained on tens of thousands of peptide–MHC pairs, and uses contrastive learning to separate mutant from wild-type peptides. It improves both accuracy and interpretability over prior methods across viral epitopes and cancer neoepitopes.
The second bet is to model the TCR explicitly, because immunogenicity ultimately lives in the TCR–pMHC interaction. Transformer models such as TULIP, and protein-language-model approaches such as LANTERN (which pairs ESM embeddings with chemical representations of peptides), aim to predict recognition and to generalize in zero- and few-shot settings to epitopes not seen in training. The honest caveat, repeated across recent reviews, is that TCR-specificity models still generalize poorly to genuinely unseen epitopes once dataset biases and data leakage are controlled for — the paired TCR–pMHC data are too sparse and too skewed toward a few alleles and viral epitopes.
The third bet is to import general-purpose biological knowledge from foundation models, protein language models, and structure-based scoring, on the theory that a model which already "understands" proteins needs less labeled immunogenicity data to learn the residual signal. Modular protein-language-modelling pipelines for CD8+ immunogenicity are early instances of this idea.
- ImmunoStruct (Nature Machine Intelligence, 2025) — Multimodal model fusing peptide–MHC sequence, AlphaFold structure, and biochemical features.
- TESLA consortium — Key Parameters of Tumor Epitope Immunogenicity (Cell, 2020) — Consortium ground-truthing; low PPV for top-ranked neoantigens, large gains from presentation + recognition features.
- LANTERN — TCR–peptide binding via LLM representations (2025) — ESM + molecular embeddings for zero-/few-shot prediction on unseen epitopes.
- Modular protein language modelling for immunogenicity (PLOS Comp Biol, 2024) — Foundation-model approach to CD8+ epitope immunogenicity; discusses pathogen-vs-cancer distribution overlap.
- Beyond MHC binding: immunogenicity prediction tools to refine neoantigen selection — Review of why binding-derived metrics fail and how immunogenicity-focused tools improve selection.
Real progress is not a higher AUC on a benchmark that shares alleles and epitopes with its training set. That number can rise while clinical relevance does not. The benchmarks themselves are part of the problem: as long as they inherit the viral, HLA-A*02:01-heavy biases of the underlying databases, they reward models that have learned the bias rather than the biology.
The milestones that would actually matter are concrete. The first is prospective validation: a model that picks neoantigens which prove immunogenic in patients it never saw, rather than retrospective re-scoring. The second is generalization to held-out epitopes and rare alleles, with data-leakage controls, so that performance is not an artifact of memorization. The third is TCR-aware prediction that conditions on a patient's own repertoire, closing the loop between the molecule and the immune system that has to recognize it.
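The simplest data-leakage control is to split by epitope rather than by row, so that no epitope appears on both sides of the split. The sketch below assumes records are `(epitope, allele, label)` tuples, a hypothetical layout chosen for illustration, and uses synthetic placeholder sequences.

```python
import random


def split_by_epitope(records, test_fraction=0.2, seed=0):
    """Split (epitope, allele, label) records so no epitope lands in both sets."""
    epitopes = sorted({r[0] for r in records})
    rng = random.Random(seed)
    rng.shuffle(epitopes)
    n_test = max(1, round(len(epitopes) * test_fraction))
    test_epitopes = set(epitopes[:n_test])
    train = [r for r in records if r[0] not in test_epitopes]
    test = [r for r in records if r[0] in test_epitopes]
    return train, test


def epitope_leakage(train, test) -> float:
    """Fraction of test records whose epitope also occurs in the training set."""
    if not test:
        return 0.0
    seen = {r[0] for r in train}
    return sum(r[0] in seen for r in test) / len(test)


# Synthetic placeholder data: five epitopes, each measured on two alleles.
records = [
    (epitope, allele, label)
    for epitope, label in [
        ("ACDEFGHIK", 1), ("LMNPQRSTV", 0), ("WYACDEFGH", 0),
        ("IKLMNPQRS", 1), ("TVWYACDEF", 0),
    ]
    for allele in ("HLA-A*02:01", "HLA-B*07:02")
]

train, test = split_by_epitope(records)
print(len(train), len(test), epitope_leakage(train, test))

# A naive row-level split, for contrast: the same epitope can sit on both sides.
naive_train, naive_test = records[::2], records[1::2]
print(epitope_leakage(naive_train, naive_test))
```

Exact-match grouping is only a floor. Peptides that differ by a single residue still count as different epitopes here, so a stricter control would group by sequence similarity and also hold out alleles.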
None of this argues against optimism. The trajectory is good: the field has correctly diagnosed that immunogenicity is not a binding problem, the data resources are growing, and the modeling tools (multimodal, structural, TCR-aware, foundation-model-based) are aimed at the right target for the first time. The gap between presentation and response is no longer a mystery; it is a well-characterized engineering problem with several credible attacks underway. That is exactly the state from which hard problems eventually yield. The honest position in 2026 is neither hype nor despair: presentation is largely solved, immunogenicity is not, and the distance between them is finally being measured in the right units.