Training Genomic Foundation Models¶
Once we have a tokenizer and an architecture (here, encoder-only), we need to train the model. This happens in two stages: pretraining on massive unlabeled datasets, then fine-tuning on specific labeled tasks.
Pretraining Objectives¶
The goal of pretraining is to force the model to learn the syntax (grammar) and semantics (regulatory motifs) of DNA without human labels.
Masked Language Modeling (MLM)¶
This is the standard pretraining objective for encoder-only models (like BERT).
- Process: Randomly mask a percentage (e.g., 15%) of the k-mer tokens in the input sequence.
- Goal: Predict the original identity of the masked tokens based on the surrounding context.
- Formula: for each masked position \(i\), the model maximizes \(P(x_i \mid x_{\setminus i})\), where \(x_{\setminus i}\) is the sequence with the masked positions hidden.
- Biological Intuition: To predict a masked region, the model must learn patterns like "TATA box is usually followed by a transcription start site" or "this motif pairs with that motif."
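The masking procedure itself is simple enough to sketch without any library. This is a minimal, library-free illustration of BERT-style masking (the helper name `mlm_mask` is ours, and real collators operate on integer token IDs rather than k-mer strings); of the selected positions, 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged but still predicted:

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """BERT-style MLM masking sketch (illustrative helper, not a real collator).

    Selects ~mask_prob of positions; of those, 80% become mask_token,
    10% a random vocab token, 10% stay unchanged. Returns (masked_tokens,
    labels), where labels is None at unselected positions -- the analogue
    of the -100 'ignore' label used by Hugging Face collators.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = mask_token       # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.choice(vocab)  # 10%: replace with random token
            # else 10%: leave unchanged (still counted in the loss)
    return masked, labels

# Toy 6-mer tokens; mask_prob is raised above the usual 15% so that
# this tiny example visibly masks something.
tokens = ["ACGTAC", "GTACGT", "ACGTAC", "GTACGT", "ACGTAC"]
masked, labels = mlm_mask(tokens, vocab=["ACGTAC", "GTACGT"], mask_prob=0.5)
print(masked)
print(labels)
```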
Template Code: Applying MLM Masking¶
```python
import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

model_name = "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# The DataCollator handles random masking automatically (15% by default)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # mask 15% of k-mer tokens
)

# Example: tokenize a DNA sequence, then apply masking
sequence = "ACGTACGTACGTACGTACGT"
tokenized = tokenizer(sequence, return_tensors="pt")

# Collate into a batch and apply masking
batch = data_collator([{"input_ids": tokenized["input_ids"][0]}])
print("Original IDs:", tokenized["input_ids"])
print("Masked IDs:", batch["input_ids"])  # some tokens randomly replaced with [MASK]
print("Labels:", batch["labels"])  # -100 = not masked; ignored by the loss
```
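The labels printed above are what drive training: cross-entropy is averaged only over positions whose label is not -100, so unmasked tokens contribute nothing to the loss. A toy sketch of that reduction, using hand-written probabilities rather than real model outputs (the function name `mlm_loss` is ours):

```python
import math

def mlm_loss(probs, labels, ignore_index=-100):
    """Mean negative log-likelihood over masked positions only.

    probs[i] maps token IDs to the model's predicted probability at
    position i; labels[i] is the true token ID, or ignore_index where
    the position was not masked.
    """
    terms = [
        -math.log(probs[i][lab])
        for i, lab in enumerate(labels)
        if lab != ignore_index
    ]
    return sum(terms) / len(terms)

# Toy example: 4 positions, a 3-token vocabulary; positions 1 and 3 masked
probs = [
    {0: 0.9, 1: 0.05, 2: 0.05},
    {0: 0.1, 1: 0.8, 2: 0.1},   # masked; true token is 1
    {0: 0.3, 1: 0.3, 2: 0.4},
    {0: 0.2, 1: 0.2, 2: 0.6},   # masked; true token is 2
]
labels = [-100, 1, -100, 2]
print(round(mlm_loss(probs, labels), 4))
```

This mirrors what `CrossEntropyLoss(ignore_index=-100)` does internally in PyTorch when the model is trained on the collator's `labels`.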
