Tutorial 2: Finetuning BERT for Sequence Classification using a LoRA adapter#

When we import a pretrained transformer model from HuggingFace, we receive encoder weights that are not directly optimized for our downstream task. For sequence classification, we add a classifier head and fine-tune. In this tutorial, we compare two approaches on IMDb sentiment classification:

  1. Full Supervised Fine-Tuning (SFT)

  2. Parameter Efficient Fine-Tuning (PEFT) with LoRA

Run this tutorial#

From the repository root:

uv run python docs/source/modules/documentation/tutorials/tutorial_2_lora_finetune.py

Expected terminal output (excerpt)#

A successful run should include output similar to:

============================================================
Tutorial 2: LoRA Finetuning on BERT (IMDb)
============================================================
[1/7] Loading and tokenizing IMDb dataset...
      Dataset loaded: 25000 train / 25000 test
[2/7] Loading model and building MaseGraph...
      MaseGraph ready  ✓
[3/7] Reporting trainable parameters (full model)...
      Trainable after freezing embeddings: 413,314
[4/7] Evaluating baseline accuracy (before training)...
      [Baseline] Accuracy: 0.4923
[5/7] Running full SFT (1 epoch)...
      [SFT] Accuracy after 1 epoch: 0.8193
      SFT checkpoint saved to .../tutorial_2_sft
[6/7] Injecting LoRA adapter and training (1 epoch)...
      Trainable params with LoRA: 440,844
      [LoRA] Accuracy after training: 0.8350
[7/7] Fusing LoRA weights and exporting...
      [LoRA fused] Accuracy: 0.8350
      LoRA checkpoint saved to .../tutorial_2_lora
============================================================
Tutorial 2 complete!
============================================================

Note

During metadata initialization and training, some environments print long tensor dumps and warnings. These do not indicate failure as long as the script reaches the final Tutorial 2 complete! line.

Raw output examples from a real run#

The following snippets are copied from a real terminal run log.

Trainable-parameter report excerpt:

+-------------------------------------------------+------------------------+
| Submodule                                       |   Trainable Parameters |
+=================================================+========================+
| bert                                            |                4385920 |
+-------------------------------------------------+------------------------+
| bert.embeddings                                 |                3972864 |
+-------------------------------------------------+------------------------+
| bert.embeddings.word_embeddings                 |                3906816 |
+-------------------------------------------------+------------------------+
| bert.embeddings.token_type_embeddings           |                  ...   |
+-------------------------------------------------+------------------------+

Tensor dump excerpt (truncated):

tensor([[[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
          ... ]]]])

Sentiment Analysis with the IMDb Dataset#

The IMDb dataset (50k reviews, binary labels) is a standard sentiment-analysis benchmark. A positive review example from the dataset:

I turned over to this film in the middle of the night and very nearly skipped right passed it. It was only because there was nothing else on that I decided to watch it. In the end, I thought it was great. An interesting storyline, good characters, a clever script and brilliant directing makes this a fine film to sit down and watch.

Step 1: Load dataset and tokenizer#

print("\n[1/7] Loading and tokenizing IMDb dataset...", flush=True)
from chop.tools import get_tokenized_dataset, get_trainer

dataset_name = "imdb"
# Example checkpoint; the run shown above used a tiny BERT variant
# (parameter counts in the log match a 2-layer, 128-hidden model).
tokenizer_checkpoint = "prajjwal1/bert-tiny"

dataset, tokenizer = get_tokenized_dataset(
    dataset=dataset_name,
    checkpoint=tokenizer_checkpoint,
    return_tokenizer=True,
)
print(f"      Dataset loaded: {len(dataset['train'])} train / {len(dataset['test'])} test", flush=True)

Generate a MaseGraph with Custom Arguments#

For HuggingFace models, the MaseGraph tracer can be driven with explicit hf_input_names. In this tutorial we trace with input_ids, attention_mask, and labels. Including labels ensures the loss computation is part of the traced graph.

Step 2: Build MaseGraph#

print("\n[2/7] Loading model and building MaseGraph...", flush=True)
from transformers import AutoModelForSequenceClassification
import chop.passes as passes
from chop import MaseGraph

# Must match the tokenizer checkpoint from Step 1 (a tiny BERT variant here).
checkpoint = "prajjwal1/bert-tiny"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.config.problem_type = "single_label_classification"

mg = MaseGraph(
    model,
    hf_input_names=["input_ids", "attention_mask", "labels"],
)
mg, _ = passes.init_metadata_analysis_pass(mg)
mg, _ = passes.add_common_metadata_analysis_pass(mg)
print("      MaseGraph ready  ✓", flush=True)

Task: Remove attention_mask and labels from hf_input_names, regenerate the graph, and compare topology differences. Explain why the graph changes.
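The effect of including or excluding labels can be reproduced on a toy model with plain torch.fx, independently of MASE (a sketch, not the MaseGraph tracer itself): when labels is traced as a symbolic input, the loss computation appears in the graph; when it is pinned to None via concrete_args, the loss branch is cut away.

```python
import torch
import torch.nn as nn
from torch import fx


class TinyClassifier(nn.Module):
    """Minimal stand-in for a sequence classifier with an optional loss path."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 2)

    def forward(self, inputs, labels=None):
        logits = self.proj(inputs)
        if labels is not None:
            # Loss path: only traced when `labels` is a graph input.
            return nn.functional.cross_entropy(logits, labels), logits
        return logits


# Trace with `labels` as a symbolic input: the loss branch is taken.
with_labels = fx.symbolic_trace(TinyClassifier())
# Trace with `labels` fixed to None: the loss branch disappears.
without_labels = fx.symbolic_trace(TinyClassifier(), concrete_args={"labels": None})

has_loss = lambda gm: any("cross_entropy" in str(n.target) for n in gm.graph.nodes)
print(has_loss(with_labels), has_loss(without_labels))  # True False
```

The same mechanism explains why dropping labels from hf_input_names shrinks the BERT graph: the cross-entropy subgraph is simply never traced.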

Full Supervised Finetuning (SFT)#

Before training, inspect how the trainable parameters are distributed across submodules. Most of them sit in the embeddings, so we freeze the embedding parameters before the main comparison.

Step 3: Report trainable parameters#

print("\n[3/7] Reporting trainable parameters (full model)...", flush=True)
from chop.passes.module import report_trainable_parameters_analysis_pass

_, _ = report_trainable_parameters_analysis_pass(mg.model)

# Freeze embeddings
for param in mg.model.bert.embeddings.parameters():
    param.requires_grad = False
trainable = sum(p.numel() for p in mg.model.parameters() if p.requires_grad)
print(f"      Trainable after freezing embeddings: {trainable:,}", flush=True)

Before fine-tuning, accuracy is close to random guessing for a binary dataset (around 50%).
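That expectation can be sanity-checked with a quick simulation (a sketch, not part of the tutorial script): a classifier guessing uniformly at random on a balanced binary dataset lands near 50% accuracy, which is why the untrained baseline hovers around that mark.

```python
import random

random.seed(0)
n = 25_000                                        # same size as the IMDb test split
labels = [i % 2 for i in range(n)]                # perfectly balanced binary labels
preds = [random.randint(0, 1) for _ in range(n)]  # uniform random guesses

accuracy = sum(p == y for p, y in zip(preds, labels)) / n
print(f"Random-guess accuracy: {accuracy:.4f}")   # close to 0.50
```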

Step 4: Baseline evaluation#

print("\n[4/7] Evaluating baseline accuracy (before training)...", flush=True)
trainer = get_trainer(
    model=mg.model,
    tokenized_dataset=dataset,
    tokenizer=tokenizer,
    evaluate_metric="accuracy",
)
eval_results = trainer.evaluate()
print(f"      [Baseline] Accuracy: {eval_results['eval_accuracy']:.4f}", flush=True)

Step 5: Run one epoch of full SFT#

from pathlib import Path

print("\n[5/7] Running full SFT (1 epoch)...", flush=True)
trainer.train()
eval_results = trainer.evaluate()
print(f"      [SFT] Accuracy after 1 epoch: {eval_results['eval_accuracy']:.4f}", flush=True)

mg.export(f"{Path.home()}/tutorial_2_sft")
print(f"      SFT checkpoint saved to {Path.home()}/tutorial_2_sft", flush=True)

Parameter Efficient Finetuning (PEFT) with LoRA#

LoRA freezes the pretrained weights and trains a pair of low-rank matrices A and B whose product forms a small additive update to each adapted layer. This reduces the trainable-parameter count and optimizer memory footprint while retaining strong task performance.

../../../_images/lora_adapter.png

LoRA adapter structure.#
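The core mechanism fits in a few lines of plain PyTorch (a sketch of the idea, not the actual insert_lora_adapter_transform_pass implementation): the frozen layer output Wx is augmented with a scaled low-rank term (alpha/r)·B·A·x, and B starts at zero so the adapted layer initially behaves exactly like the original.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 6, alpha: float = 1.0, dropout: float = 0.5):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weight and bias
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no effect at start
        self.scaling = alpha / rank
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Frozen path plus scaled low-rank update: Wx + (alpha/r) * B A x
        return self.base(x) + self.dropout(x) @ self.lora_A.T @ self.lora_B.T * self.scaling


base = nn.Linear(128, 128)
lora = LoRALinear(base)           # rank/alpha/dropout default to the tutorial's pass_args
lora.eval()                       # disable dropout for the equivalence check
x = torch.randn(4, 128)
assert torch.allclose(lora(x), base(x))  # B == 0, so the adapted layer matches the original
```

Only lora_A and lora_B carry gradients, which is where the small trainable-parameter budget of PEFT comes from.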

Step 6: Inject LoRA adapter and train#

print("\n[6/7] Injecting LoRA adapter and training (1 epoch)...", flush=True)
mg, _ = passes.insert_lora_adapter_transform_pass(
    mg,
    pass_args={"rank": 6, "alpha": 1.0, "dropout": 0.5},
)

trainable_lora = sum(p.numel() for p in mg.model.parameters() if p.requires_grad)
print(f"      Trainable params with LoRA: {trainable_lora:,}", flush=True)

trainer = get_trainer(
    model=mg.model,
    tokenized_dataset=dataset,
    tokenizer=tokenizer,
    evaluate_metric="accuracy",
    num_train_epochs=1,
)
trainer.train()
eval_results = trainer.evaluate()
print(f"      [LoRA] Accuracy after training: {eval_results['eval_accuracy']:.4f}", flush=True)
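The adapter's contribution to the parameter count is easy to check by hand: a rank-r adapter on a linear layer of shape (d_in, d_out) adds r·(d_in + d_out) trainable parameters, versus d_in·d_out for the full weight. For a 128-dimensional hidden size (the dimensions of a tiny BERT variant are an assumption here) and the rank of 6 used above:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one rank-`rank` adapter: A is (rank, d_in), B is (d_out, rank)."""
    return rank * d_in + d_out * rank


d = 128      # assumed hidden size of a tiny BERT variant
rank = 6     # rank used in this tutorial

full = d * d                     # full weight matrix of one linear layer
adapter = lora_params(d, d, rank)
print(full, adapter)             # 16384 1536 -> the adapter is under 10% of the full weight
```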

After LoRA training, we fuse adapter weights back into linear layers for inference efficiency.
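Fusion itself amounts to a single weight update (a sketch of the idea, not the fuse_lora_weights_transform_pass internals): fold the low-rank product back into the base weight, W' = W + (alpha/r)·B·A, after which the adapter can be dropped with no change in outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, rank, alpha = 128, 6, 1.0
base = nn.Linear(d, d)
A = torch.randn(rank, d) * 0.01   # trained adapter factors (random here for illustration)
B = torch.randn(d, rank) * 0.01

x = torch.randn(4, d)
adapted = base(x) + x @ A.T @ B.T * (alpha / rank)   # adapter active at inference

# Fuse: fold the low-rank update into the base weight, then drop the adapter.
fused = nn.Linear(d, d)
with torch.no_grad():
    fused.weight.copy_(base.weight + (alpha / rank) * B @ A)
    fused.bias.copy_(base.bias)

assert torch.allclose(fused(x), adapted, atol=1e-6)  # identical outputs, no adapter overhead
```

This is why the fused accuracy in the log matches the pre-fusion LoRA accuracy exactly: fusion changes the weight layout, not the function being computed.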

Step 7: Fuse LoRA weights and export#

print("\n[7/7] Fusing LoRA weights and exporting...", flush=True)
mg, _ = passes.fuse_lora_weights_transform_pass(mg)
eval_results = trainer.evaluate()
print(f"      [LoRA fused] Accuracy: {eval_results['eval_accuracy']:.4f}", flush=True)

mg.export(f"{Path.home()}/tutorial_2_lora")
print(f"      LoRA checkpoint saved to {Path.home()}/tutorial_2_lora", flush=True)

print("\n" + "=" * 60, flush=True)
print("Tutorial 2 complete!", flush=True)
print("=" * 60, flush=True)

Conclusion#

Tutorial 2 demonstrates the trade-off between full SFT and LoRA-based PEFT:

  • Full SFT lifts accuracy well above the near-random baseline (0.8193 after one epoch, up from 0.4923).

  • LoRA reaches comparable or better accuracy (0.8350) while confining the encoder updates to small low-rank adapters.

  • Both the SFT and LoRA checkpoints are exported for use in follow-up tutorials.