Protein Structure Tokenization via Geometric Byte Pair Encoding

Protein structure is central to biological function. Enabling structure-native foundation models requires representations that discretize continuous backbone geometry while preserving global consistency and multi-scale organization. Existing protein structure tokenizers rely on fixed-size codebooks or continuous latent vectors, limiting interpretability, resolution control, and transfer across architectures.

Here we introduce GeoBPE, a geometry-grounded protein structure tokenizer inspired by byte-pair encoding. GeoBPE transforms continuous backbone conformations into discrete, hierarchical “sentences” of structural motifs. At each step, GeoBPE identifies frequent motif pairs, clusters them via k-medoids to form representative prototypes, and replaces occurrences with learned geometric primitives. To prevent geometric drift introduced by local quantization, GeoBPE performs differentiable inverse kinematics to optimize boundary glue angles under an SE(3) end-frame loss, preserving global fold integrity.

We evaluate GeoBPE on large-scale structural datasets and functional tasks. GeoBPE achieves strong compression–distortion tradeoffs, robust out-of-distribution generalization, and data-efficient training. It improves representation learning across binding site prediction, fold classification, and structural property prediction. Tokens align with CATH functional families and support expert-interpretable case studies. GeoBPE establishes a principled foundation for structure-native protein language models.

Overview of GeoBPE

Protein language models trained on sequence data capture evolutionary constraints but do not explicitly encode backbone geometry. Modeling structure directly requires discretizing continuous 3D conformations into symbolic units that preserve fold-level consistency and functional organization. The core question is how to transform noisy, multi-scale backbone geometry into discrete tokens without sacrificing global structure.

An effective protein structure tokenizer should satisfy three properties:

  • Hierarchical vocabulary: Learn reusable structural motifs that compose into higher-order fold elements.
  • Geometric fidelity: Preserve global SE(3) consistency after local quantization.
  • Multi-resolution control: Allow adjustable compression–distortion tradeoffs across tasks.

Existing approaches based on vector quantization use fixed-size codebooks and latent embeddings. These representations lack hierarchical structure and provide limited control over resolution.

GeoBPE addresses this gap by extending byte-pair encoding to continuous protein geometry. It alternates between local motif merging and global geometric correction.

  1. Motif discovery: GeoBPE identifies frequent adjacent motif pairs (Geo-Pairs), clusters their occurrences via k-medoids, and introduces representative geometric prototypes into a growing vocabulary.

  2. Hard quantization: Each occurrence is replaced with its nearest medoid prototype, producing discrete motif tokens.

  3. Glue-aware refinement: Quantization introduces geometric drift. GeoBPE corrects this by optimizing boundary glue angles using differentiable inverse kinematics under an SE(3) end-frame loss.
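The first two steps above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the released implementation: the flattened feature vectors, the plain Euclidean distance inside k-medoids, and the greedy left-to-right replacement of occurrences are all simplifying assumptions.

```python
import numpy as np

def k_medoids(X, k, n_iter=20, seed=0):
    """Plain k-medoids on a feature matrix X of shape (n, d), Euclidean distance."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        assign = np.argmin(D[:, medoids], axis=1)               # nearest medoid
        new = np.array([
            # new medoid = member minimizing total distance within its cluster
            np.flatnonzero(assign == c)[np.argmin(
                D[np.ix_(assign == c, assign == c)].sum(axis=1))]
            for c in range(k)
        ])
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, assign

def merge_step(tokens, feats, k):
    """One GeoBPE-style merge: find the most frequent adjacent pair,
    cluster its geometric occurrences, replace them with prototype tokens."""
    pairs = list(zip(tokens[:-1], tokens[1:]))
    top = max(set(pairs), key=pairs.count)                      # most frequent Geo-Pair
    idx = [i for i, p in enumerate(pairs) if p == top]
    X = np.stack([np.concatenate([feats[i], feats[i + 1]]) for i in idx])
    _, assign = k_medoids(X, min(k, len(idx)))
    out, i, j = [], 0, 0
    base = max(tokens) + 1                                      # fresh token ids
    while i < len(tokens):
        if j < len(idx) and i == idx[j]:                        # replace occurrence
            out.append(base + int(assign[j])); i += 2; j += 1
        else:
            out.append(tokens[i]); i += 1
    return out, top
```

In a full loop the feature vectors of the newly created tokens would also be rebuilt before the next merge; this sketch shows a single iteration only.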

The result is a hierarchical merge tree that segments a backbone into multi-scale structural motifs. The learned vocabulary provides an interpretable representation of protein structure.

GeoBPE: Geometry-Grounded Byte-Pair Encoding

GeoBPE builds a discrete structural alphabet while preserving fold-level consistency. Given backbone coordinates, it proceeds in iterative merge steps.

GeoBPE begins at the residue level. Residue-level backbone fragments are clustered by RMSD into representative prototypes, which form the initial vocabulary. The backbone is then rewritten using these residue-level motifs.
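Clustering by RMSD presumes a way to compare fragments after optimal rigid superposition; the standard choice is the Kabsch algorithm. A minimal sketch follows (that the released code uses exactly this distance is an assumption):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (n, 3) coordinate sets after optimal rigid
    superposition (Kabsch algorithm via SVD)."""
    P = P - P.mean(axis=0)                      # center both fragments
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt         # optimal rotation
    return float(np.sqrt(((P @ R - Q) ** 2).sum() / len(P)))
```

Two fragments that differ only by a rigid motion give an RMSD of zero, so the clustering sees purely internal geometry.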

At each iteration:

  • The most frequent Geo-Pair is identified.
  • All occurrences are gathered across the dataset.
  • K-medoids clustering produces representative prototypes.
  • Occurrences are replaced by their assigned prototypes.
  • The merge hierarchy is updated.

This process mirrors byte-pair encoding but operates in geometric space rather than symbol space. Vocabulary size and resolution are controlled through the number of medoids and merge iterations.

Replacing motif pairs with medoid prototypes introduces geometric drift. Without correction, accumulated local errors distort the global fold. GeoBPE addresses this through glue-aware refinement.

Each motif boundary is parameterized by three glue angles. After quantization, GeoBPE optimizes these angles via differentiable forward kinematics to minimize an SE(3) end-frame loss between reconstructed and original structures.

This step:

  • Preserves global backbone consistency.
  • Prevents drift accumulation across merges.
  • Enables stable multi-step hierarchical decomposition.

This glue-aware refinement is essential: removing glue-angle optimization substantially increases reconstruction RMSD and degrades fold integrity.
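The refinement loop can be sketched on a toy chain. Everything here is illustrative rather than the paper's method: motifs are reduced to planar segments with a single glue angle each, the end frame is a 4×4 homogeneous transform, and finite-difference gradient descent stands in for differentiation through the kinematics.

```python
import numpy as np

def seg(theta, length=1.0):
    """Rigid transform for one toy segment: rotate by theta about z,
    then advance one unit along the rotated x axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0, length * c],
                     [s,  c, 0.0, length * s],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def end_frame(thetas):
    """Forward kinematics: compose segment transforms into the end frame."""
    T = np.eye(4)
    for th in thetas:
        T = T @ seg(th)
    return T

def refine_glue(thetas, target, lr=0.02, steps=5000, eps=1e-5):
    """Minimize the end-frame loss ||FK(thetas) - target||_F^2 over the
    glue angles by central-difference gradient descent."""
    thetas = np.asarray(thetas, dtype=float).copy()
    loss = lambda t: ((end_frame(t) - target) ** 2).sum()
    for _ in range(steps):
        g = np.zeros_like(thetas)
        for i in range(len(thetas)):
            hi, lo = thetas.copy(), thetas.copy()
            hi[i] += eps; lo[i] -= eps
            g[i] = (loss(hi) - loss(lo)) / (2 * eps)
        thetas -= lr * g
    return thetas, loss(thetas)
```

Starting from "quantized" angles near the true ones, the loop pulls the reconstructed end frame back onto the target frame, which is the role glue-aware refinement plays after each merge.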

Multi-Resolution and Architecture-Agnostic Protein Tokenizer

GeoBPE supports adjustable resolution by construction. As the vocabulary grows, it captures a larger fraction of backbone variability and improves reconstruction fidelity; with fewer merges and a smaller vocabulary, it produces coarser motifs that favor abstraction and representation learning. The same tokenizer can therefore serve compression, downstream transfer, or structure language modeling, depending on the target use case.

GeoBPE is also architecture agnostic. It outputs a hierarchical merge tree that can coarsen residue-level embeddings from large protein language models into motif-level and protein-level representations, and it can be paired with a transformer to model structure tokens and generate backbones by language modeling.
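The coarsening described above might look as follows. The function names, mean pooling, and the span-based reading of the merge tree's leaf segmentation are assumptions for illustration; any permutation-invariant pool would serve.

```python
import numpy as np

def coarsen(residue_emb, spans):
    """Pool residue-level embeddings (n, d) into motif-level embeddings,
    given each motif's residue span (start, end) from the merge tree."""
    return np.stack([residue_emb[s:e].mean(axis=0) for s, e in spans])

def protein_embedding(residue_emb, spans):
    """One more level of the hierarchy: pool motif embeddings to a
    single protein-level vector."""
    return coarsen(residue_emb, spans).mean(axis=0)
```

Because the merge tree is just a segmentation plus a hierarchy, this coarsening can sit on top of any model that emits per-residue embeddings, which is what makes the tokenizer architecture agnostic.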

Because GeoBPE tokens are observed structural medoids, they remain interpretable rather than opaque latent vectors. The resulting motifs align with functional domain boundaries and recurrent structural patterns.

Publication

Protein Structure Tokenization via Geometric Byte Pair Encoding
Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, Marinka Zitnik
International Conference on Learning Representations, ICLR 2026

@inproceedings{sun2026protein,
  title={Protein Structure Tokenization via Geometric Byte Pair Encoding},
  author={Sun, Michael and Yuan, Weize and Liu, Gang and Matusik, Wojciech and Zitnik, Marinka},
  booktitle={International Conference on Learning Representations},
  year={2026}
}

Code and Data Availability

A PyTorch implementation of GeoBPE is available in the GitHub repository.

Zitnik Lab  ·  Artificial Intelligence in Medicine and Science  ·  Harvard  ·  Department of Biomedical Informatics