For two decades, computational neoantigen design has been an exercise in scoring. You enumerate the mutated peptides a tumor might present, run each through a binding predictor trained on assay and mass-spec data, rank them, and hope the top of the list is immunogenic. The predictors got very good at the narrow task they were built for. They also inherited every limitation of the data underneath them: scarce measurements for most HLA alleles, noisy binding readouts, and a stubborn gap between 'binds the MHC' and 'provokes a T-cell response.'
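The enumerate-score-rank loop described above can be sketched in a few lines. This is a minimal illustration, assuming a single point mutation and a placeholder scorer: `mutant_peptides`, `rank_candidates`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in that has nothing to do with a trained binding predictor.

```python
from typing import Callable, List, Sequence, Tuple


def mutant_peptides(protein: str, pos: int, alt: str,
                    lengths: Sequence[int] = (8, 9, 10, 11)) -> List[str]:
    """Return every k-mer window that covers a point mutation.

    `pos` is the 0-based index of the mutated residue and `alt` the new residue.
    """
    if not 0 <= pos < len(protein):
        raise ValueError("pos must index a residue in protein")
    mutated = protein[:pos] + alt + protein[pos + 1:]
    peptides = []
    for k in lengths:
        first = max(0, pos - k + 1)        # earliest start that still covers pos
        last = min(pos, len(mutated) - k)  # latest start that fits in the protein
        for start in range(first, last + 1):
            peptides.append(mutated[start:start + k])
    return peptides


def rank_candidates(peptides: Sequence[str],
                    score: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Score each peptide and sort from highest to lowest score."""
    scored = [(p, score(p)) for p in peptides]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_score(peptide: str) -> float:
    """Placeholder scorer (an assumption): fraction of hydrophobic residues."""
    hydrophobic = set("AILMFVWY")
    return sum(aa in hydrophobic for aa in peptide) / len(peptide)
```

In a real pipeline the `score` argument would wrap a trained predictor for the patient's HLA type; everything upstream of that call stays the same, which is why the quality of the ranking depends so heavily on the predictor.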
Foundation models change the shape of the problem. Instead of learning binding from a few hundred thousand labeled examples, a protein language model arrives already fluent in the grammar of proteins, having read hundreds of millions of natural sequences. A structure model arrives knowing how a peptide settles into a binding groove. The bet, increasingly a well-supported one, is that this pretrained prior is exactly what immune-recognition prediction has been starving for, and that it unlocks a shift from scoring candidates to generating them. This sits squarely within our thesis, and it is moving fast. Here is an honest map of where it actually stands.
The center of gravity is the ESM lineage. EvolutionaryScale's ESM3, launched in June 2024, is a generative model that reasons jointly over protein sequence, structure, and function; its December 2024 sibling, ESM Cambrian (ESM C), is tuned instead for representation learning, producing embeddings that downstream predictors can build on. Both descend from the ESM-2 masked-language-model family that made protein embeddings a default building block.
A telling result for our field: ESM-2, despite large-scale pretraining on natural protein sequences, performs near chance on pMHC binding out of the box. General protein fluency does not automatically encode MHC-specific binding rules. The fix is domain-specific continued pretraining: take a foundation model and keep training it on HLA-associated peptides. A 2025 study did exactly this, starting from ESM Cambrian (300M parameters) and continuing masked-language-model pretraining on HLA peptides to produce a binding predictor (ESMCBA) reported at a median Spearman correlation of 0.62 across 25 common HLA alleles. The most interesting finding was not the headline number but where the lift concentrated: alleles with moderate data availability benefited most, which is precisely the regime where conventional predictors are weakest.
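To make the headline metric concrete: a per-allele Spearman correlation compares the rank order of predicted and measured binding values within one allele, and the summary figure is read here as the median of those per-allele values. The sketch below computes that summary in plain Python. The function names and the input layout (a dict from allele name to paired prediction and measurement lists) are assumptions for illustration, not the study's evaluation code.

```python
from statistics import median
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


def median_spearman(
    per_allele: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Median of per-allele Spearman correlations (predicted vs. measured)."""
    return median(spearman(pred, meas) for pred, meas in per_allele.values())
```

Taking the median rather than pooling all peptides keeps a few data-rich alleles from dominating the summary, which matters when the interesting behavior is in the moderate-data alleles.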
Parallel to the ESM work, BERT-style transformers attack the harder, downstream question of T-cell recognition. TABR-BERT uses BERT-based transfer learning for TCR–pMHC interaction and reports gains on unseen epitopes; tcrLM is a lightweight TCR language model pretrained on more than 100 million TCR sequences; and TULIP, published in PNAS in 2024, is a transformer that learns from incomplete, mixed-quality data and generalizes to unseen epitopes — directly confronting the negative-data bias that has quietly inflated benchmark scores across the field. The common thread: pretraining on large unlabeled corpora to escape the tyranny of small, biased labeled sets.
Sequence models tell you what is likely to bind; structure tells you why, and increasingly whether. AlphaFold2 and AlphaFold3 — the latter released in 2024 with reduced dependence on multiple-sequence alignments — are now being benchmarked head-to-head on the genuinely hard target: the TCR–pMHC complex, the three-way interface that determines whether a presented peptide is actually seen by a T cell.
A 2025 benchmark on dozens of previously unseen complexes found AlphaFold3 delivered the best overall modeling and docking quality, with AlphaFold2 remaining competitive and accelerated variants achieving large speedups without much accuracy loss. More useful for designers than the raw rankings: the model's own confidence on the CDR3 loop — the most variable, recognition-critical part of the TCR — carried real functional signal, helping rerank predictions and flag mutation-induced affinity changes. That is structure prediction beginning to double as a functional filter, not just a geometry calculator.
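One way such a confidence-based rerank could look is sketched below, assuming each predicted complex exposes a per-residue confidence score (such as pLDDT) and known CDR3 residue ranges. The `PredictedComplex` container and its fields are hypothetical, and the benchmark's own procedure may differ.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PredictedComplex:
    """Hypothetical container for one predicted TCR-pMHC model."""
    name: str
    plddt: Sequence[float]                 # per-residue confidence, 0 to 100
    cdr3_spans: Sequence[Tuple[int, int]]  # half-open index ranges of CDR3 loops


def cdr3_confidence(model: PredictedComplex) -> float:
    """Mean per-residue confidence restricted to the CDR3 residues."""
    values = [
        model.plddt[i]
        for start, end in model.cdr3_spans
        for i in range(start, end)
    ]
    if not values:
        raise ValueError(f"{model.name}: no CDR3 residues given")
    return sum(values) / len(values)


def rerank(models: Sequence[PredictedComplex]) -> List[PredictedComplex]:
    """Order models from most to least confident on the CDR3 loops."""
    return sorted(models, key=cdr3_confidence, reverse=True)
```

The design choice worth noting is that the score ignores every residue outside the CDR3 spans: a model can have high global confidence, carried by the well-behaved MHC scaffold, and still rank last if the loop that does the recognizing is poorly resolved.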
The honest caveat is that the TCR–pMHC interface remains one of the harder problems in structural biology. These tools are good and improving, but they are not yet a substitute for experimental confirmation that a designed peptide is immunogenic.
The most consequential shift is conceptual: from ranking a fixed candidate list to inventing the list. Two 2025 lines of work make this concrete. PMGen couples AlphaFold2-based peptide–MHC structure prediction to structure-guided sequence design across both MHC class I and II, using template-engineering tricks to enforce anchor constraints; the authors report sub-angstrom peptide-core RMSDs and show that fine-tuning ProteinMPNN on PMGen-modeled structures markedly improves sequence recovery. The framing matters — it is positioned as a route to neoantigen generation, not just prediction.
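Sequence recovery, the metric cited above, is the fraction of positions at which a designed sequence reproduces the native one. A minimal sketch follows; the function names are illustrative and this is not PMGen's or ProteinMPNN's own evaluation code.

```python
from typing import Iterable, Tuple


def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of aligned positions where the designed residue equals the native."""
    if len(designed) != len(native):
        raise ValueError("designed and native sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def mean_recovery(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average sequence recovery over (designed, native) pairs."""
    values = [sequence_recovery(d, n) for d, n in pairs]
    if not values:
        raise ValueError("no sequence pairs given")
    return sum(values) / len(values)
```

For a 9-mer, a design that differs from the native peptide at two positions scores 7/9, so at these lengths each residue moves the metric by more than ten percentage points.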
A second line uses diffusion models to generate pMHC-I peptide libraries conditioned on crystal-structure interaction distances, spanning 27 high-priority HLA alleles. Built independently of previously characterized peptides, the designs nonetheless reproduce canonical anchor-residue preferences and, when generated with structure-conditioned RFdiffusion, cluster near experimentally validated epitopes in latent space. Because they are conditioned on atomic contacts rather than mass-spec or assay tables, these generators sidestep the data biases baked into the older approach. Their errors should come from the structural prior instead, which is a genuinely different failure mode.
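A check of whether a generated library reproduces anchor-residue preferences can be as simple as counting residues at the anchor positions. The sketch below assumes a simplified class I motif in which position 2 and the C-terminal position act as anchors. The `expected` mapping in the usage note is a simplified HLA-A*02:01-like motif chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Set


def position_frequencies(peptides: Sequence[str], position: int) -> Dict[str, float]:
    """Residue frequencies at one position (0-based; negative counts from the C-terminus)."""
    if not peptides:
        raise ValueError("library is empty")
    counts = Counter(p[position] for p in peptides)
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def anchor_agreement(peptides: Sequence[str],
                     expected: Dict[int, Set[str]]) -> float:
    """Fraction of peptides whose residue at every anchor position is in the expected set."""
    if not peptides:
        raise ValueError("library is empty")
    ok = sum(
        all(p[pos] in allowed for pos, allowed in expected.items())
        for p in peptides
    )
    return ok / len(peptides)
```

With `expected = {1: {"L", "M"}, -1: {"V", "L"}}`, `anchor_agreement(library, expected)` reports the share of a library that satisfies both anchors at once, and `position_frequencies(library, 1)` shows the full residue distribution at position 2 for comparison against a known motif.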
This is the inflection our thesis has been pointing at: design tools that propose structurally plausible, allele-matched peptides on demand. The results are computational and in-silico-validated. The leap to validated patient neoantigens still requires the wet lab.
A quieter implication deserves more attention than it gets. Conventional predictors are trained on data heavily skewed toward European-ancestry HLA alleles; independent evaluation has shown tools like NetMHCpan degrade on non-European alleles absent from training data, with Asian, African, and Middle Eastern populations the most under-served. For a personalized therapy keyed to a patient's own HLA type, that is not an abstract fairness concern: it is differential clinical performance by ancestry.
Foundation models offer two structural reasons for optimism. First, transfer: a model that has learned general protein and peptide biology can carry a useful prior into low-resource alleles, and the continued-pretraining result above showed the largest gains exactly in moderate-data regimes. Second, structure-first generation: a diffusion or AlphaFold-coupled designer conditioned on the physical geometry of a groove does not, in principle, need a large per-allele assay dataset to propose binders; it can reason from structure outward. Multimodal models such as ImmunoStruct, which fuses sequence, structure, and biochemical features over roughly 27,000 peptide-MHCs to predict immunogenicity, point in the same direction: less reliance on any single biased data table.
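Whether transfer actually helps where data is thin is an empirical question that can be asked of any predictor pair: bucket alleles by how much training data they have, then compare the mean score lift per bucket. The sketch below shows that bookkeeping. The `AlleleResult` record, the bucket names, and the `low` and `high` thresholds are all assumptions chosen for illustration, not values from any study.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class AlleleResult(NamedTuple):
    """Hypothetical per-allele evaluation record."""
    allele: str
    n_train: int      # number of training measurements for this allele
    baseline: float   # score of the conventional predictor
    candidate: float  # score of the foundation-model-based predictor


def data_regime(n_train: int, low: int = 100, high: int = 1000) -> str:
    """Assign an allele to a data-availability bucket (thresholds are arbitrary)."""
    if n_train < low:
        return "low"
    if n_train < high:
        return "moderate"
    return "high"


def mean_lift_by_regime(results: Iterable[AlleleResult],
                        low: int = 100, high: int = 1000) -> Dict[str, float]:
    """Mean (candidate - baseline) score difference within each data regime."""
    lifts = defaultdict(list)
    for r in results:
        lifts[data_regime(r.n_train, low, high)].append(r.candidate - r.baseline)
    return {regime: sum(v) / len(v) for regime, v in lifts.items()}
```

Reporting lift per regime instead of one pooled number is what exposes whether gains land on the alleles that need them; the same grouping could be keyed on ancestry-associated allele sets to audit the disparity described above.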
The dividend is real in principle and partly demonstrated in practice. It is not yet banked for the rarest alleles, which remain data-starved by every method. But for the first time the trajectory bends toward coverage rather than away from it.
Real, today: protein-LM embeddings and continued pretraining measurably help pMHC binding prediction, especially for alleles with moderate training data; AlphaFold3 sets the bar for TCR–pMHC structural modeling and its confidence scores carry functional signal; BERT-style models generalize better to unseen epitopes than the supervised predecessors that overfit biased negatives; and generative tools can now produce structurally faithful, anchor-correct peptides in silico.
Promised, not proven: that any of this reliably predicts clinical immunogenicity — whether a designed or top-ranked neoantigen actually drives a protective T-cell response in a patient. Binding is necessary, not sufficient; presentation is necessary, not sufficient; even predicted TCR engagement is a long way from a durable response. The benchmarks are largely retrospective and in silico, the hardest alleles remain thin, and TCR–pMHC structure prediction is still error-prone at the interface that matters most. Anyone selling a foundation model as a finished neoantigen-selection oracle is ahead of the evidence.
The honest read is that foundation models have not solved neoantigen design — they have changed what kind of problem it is. The field is migrating from scoring a fixed list with biased predictors to generating allele-matched candidates from learned priors and physical structure, with a credible path to serving patients that legacy tools systematically underserved. That is a better problem to have.
What it is not yet is a clinical shortcut. The decisive experiments — do these designed and prioritized peptides provoke real, protective responses, across diverse patients — are still ahead of us. The right posture is the one the best groups in this space already hold: build the generative toolchain aggressively, validate it ruthlessly, and resist the temptation to mistake a strong in-silico benchmark for a cured patient. The trajectory is genuinely exciting. The maturity is early. Both things are true.
- ESMCBA — continued domain-specific pretraining of protein LMs for pMHC-I binding (2025) — ESM Cambrian continued-pretraining; gains concentrated in moderate-data alleles
- PMGen — from peptide-MHC structure prediction to peptide generation (2025) — AlphaFold2 + structure-guided design; code at github.com/soedinglab/PMGen
- Structure-guided pMHC-I library generation with diffusion models (2025) — Diffusion/RFdiffusion conditioned on crystal-structure contacts
- ImmunoStruct — multimodal sequence+structure+biochemical immunogenicity model — Nature Machine Intelligence; ~27,000 pMHCs
- TULIP — transformer LM for TCR–epitope binding, generalizes to unseen epitopes (PNAS 2024) — Tackles negative-data selection bias
- TABR-BERT — BERT transfer learning for TCR–pMHC interaction — Strong on unseen epitopes