Mixed-precision search on manual model

This tutorial shows how to search for a mixed-precision quantization strategy for the OPT model on the WikiText2 dataset.

Note: A manual model is a model named <model_arch>_quantized under mase-tools/machop/chop/models/manual. These are usually models that cannot be directly converted to MASE Graph.

Search for Mixed-Precision Quantization Scheme

What is included in this search:

  • The checkpoint “facebook/opt-125m” is loaded from HuggingFace.

  • A search space is built for OPT-125M, where each matmul/linear layer operand may have a distinct precision.

  • The search is launched. In each trial:

    • A quantization config (q_config) is sampled from the search space.

    • The pretrained OPT-125M is quantized with q_config.

    • The software runner evaluates the quantized OPT and returns some metrics. In this example, the perplexity on WikiText2 is returned.

    • The hardware runner evaluates the quantized OPT and returns some metrics. In this example, the average bitwidth is returned.

    • The trial objective is calculated.
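The steps above can be sketched in plain Python. This is only an illustrative outline, not the MASE implementation: the layer names are hypothetical, uniform random sampling stands in for the TPE sampler, the quantization and perplexity steps are left as a comment, and the average-bitwidth formula is a simplified assumption.

```python
import random

# Candidate values copied from the search config in this tutorial.
SEARCH_SPACE = {
    "data_in_width": [2, 4, 8, 10],
    "data_in_frac_width": [2, 4, 6],
    "weight_width": [2, 4, 8, 10],
    "weight_frac_width": [2, 4, 6],
    "bias_width": [2, 4, 8, 10],
    "bias_frac_width": [2, 4, 6],
}

# Hypothetical layer names, for illustration only.
LAYERS = ["layer_0.q_proj", "layer_0.k_proj", "layer_0.fc1"]


def sample_q_config(rng):
    """Draw one independent precision choice per layer and per operand."""
    return {
        layer: {key: rng.choice(choices) for key, choices in SEARCH_SPACE.items()}
        for layer in LAYERS
    }


def average_bitwidth(q_config):
    """Stand-in hardware metric: mean of the data_in and weight widths."""
    widths = [
        cfg[key]
        for cfg in q_config.values()
        for key in ("data_in_width", "weight_width")
    ]
    return sum(widths) / len(widths)


def run_search(n_trials, seed=0):
    """Run n_trials sampling steps and record the metric of each trial."""
    rng = random.Random(seed)
    trials = []
    for trial_id in range(n_trials):
        q_config = sample_q_config(rng)
        # A real run would quantize OPT-125M here and measure perplexity.
        trials.append(
            {
                "id": trial_id,
                "q_config": q_config,
                "average_bitwidth": average_bitwidth(q_config),
            }
        )
    return trials


trials = run_search(n_trials=10)
```

Because every layer draws its own values, two layers in the same trial can end up with different precisions, which is what makes the search mixed-precision.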

The search targets fixed-point precision on the WikiText2 dataset.
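To make the meaning of a fixed-point precision concrete, here is a minimal sketch of signed fixed-point quantization of a single value, given a total width and a fractional width. The function name fixed_point_quantize is hypothetical, and the rounding and clamping rules are an assumption about a typical signed fixed-point format rather than a copy of the MASE quantizer.

```python
def fixed_point_quantize(x, width, frac_width):
    """Quantize x to a signed fixed-point grid.

    width is the total number of bits, frac_width the number of
    fractional bits. Values outside the representable range saturate.
    """
    scale = 2 ** frac_width
    int_min = -(2 ** (width - 1))
    int_max = 2 ** (width - 1) - 1
    # Python's round() rounds exact halves to the nearest even integer.
    q = round(x * scale)
    q = max(int_min, min(int_max, q))
    return q / scale


print(fixed_point_quantize(0.3, width=8, frac_width=4))     # 0.3125
print(fixed_point_quantize(100.0, width=8, frac_width=4))   # 7.9375
print(fixed_point_quantize(-100.0, width=8, frac_width=4))  # -8.0
```

With 8 bits and 4 fractional bits the grid step is 1/16 and the range is [-8.0, 7.9375], so 0.3 snaps to 0.3125 and large magnitudes saturate at the range limits.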

Search config

The search part of configs/examples/search_opt_quantized_tpe_search.toml looks like the following.

# the search space name defined in mase
name = "module/manual_hf/quantize/llm_mixed_precision_ptq"

model_parallel = false

# Since we are doing mixed-precision search.
# Only one "name" is allowed (len(name) == 1)
name = ["integer"]
# precision search space is specified using the following lists
data_in_width = [2, 4, 8, 10]
data_in_frac_width = [2, 4, 6]
weight_width = [2, 4, 8, 10]
weight_frac_width = [2, 4, 6]
bias_width = [2, 4, 8, 10]
bias_frac_width = [2, 4, 6]

name = "optuna"
eval_mode = true

# the software (sw) runner and hardware (hw) runner evaluate the quantized model to guide the search
# here we evaluate the perplexity and average bitwidth of the quantized model
data_loader = "val_dataloader"
num_samples = 512

compare_to = 32 # compare to FP32

# evaluating perplexity requires GPUs so we only launch 1 job.
n_jobs = 1
# we run 10 trials in total for demonstration.
n_trials = 10
timeout = 20000
# Optuna supports a range of search algorithms, including Random, TPE, Genetic, etc.
sampler = "TPE"
model_parallel = false
sum_scaled_metrics = false # false for multi-objective, true for single objective

perplexity.scale = 1.0
perplexity.direction = "minimize"
average_bitwidth.scale = 1.0
average_bitwidth.direction = "minimize"
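The following sketch shows one way the metric settings above could be turned into a trial objective. It assumes that each metric is multiplied by its scale and that sum_scaled_metrics chooses between returning one summed value and returning one value per metric. The function name trial_objective and the metric values are made up for illustration.

```python
# Scales and directions copied from the config above.
METRIC_CFG = {
    "perplexity": {"scale": 1.0, "direction": "minimize"},
    "average_bitwidth": {"scale": 1.0, "direction": "minimize"},
}


def trial_objective(metrics, metric_cfg, sum_scaled_metrics):
    """Scale each metric, then either sum them or keep them separate."""
    scaled = [metrics[name] * cfg["scale"] for name, cfg in metric_cfg.items()]
    if sum_scaled_metrics:
        # Single-objective search: one number per trial.
        return sum(scaled)
    # Multi-objective search: one number per metric.
    return tuple(scaled)


# Made-up metric values for one trial.
example_metrics = {"perplexity": 30.0, "average_bitwidth": 6.0}

multi = trial_objective(example_metrics, METRIC_CFG, sum_scaled_metrics=False)
single = trial_objective(example_metrics, METRIC_CFG, sum_scaled_metrics=True)
print(multi)   # (30.0, 6.0)
print(single)  # 36.0
```

In the multi-objective case the sampler has to trade perplexity against bitwidth instead of optimizing one combined score, so the scales matter mostly when the metrics are summed.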

Search Logs

The complete search results will be saved in mase/mase_output/opt_quantized_wikitext2/software/search_ckpts/log.json.

Here is part of the log.json recording all search details.

For example, log["0"]["user_attrs_sampled_config"] is the sampled quantization config of trial 0. Expand it to see the precision of each matmul/linear layer’s operands.
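A short sketch of reading a sampled config back from log.json follows. Only the top-level trial key and the user_attrs_sampled_config key come from the description above. The helper name load_sampled_config is hypothetical, and the toy log contents (layer name and values) are made up so that the snippet runs without a real search.

```python
import json
import tempfile
from pathlib import Path


def load_sampled_config(log_path, trial_id):
    """Return the sampled quantization config recorded for one trial."""
    with open(log_path, "r", encoding="utf-8") as f:
        log = json.load(f)
    # Trial ids are stored as string keys, e.g. log["0"].
    return log[str(trial_id)]["user_attrs_sampled_config"]


# Toy stand-in for log.json; the layer name and values are made up.
toy_log = {
    "0": {
        "user_attrs_sampled_config": {
            "hypothetical_layer": {"weight_width": 4, "weight_frac_width": 2}
        }
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = Path(tmp_dir) / "log.json"
    log_path.write_text(json.dumps(toy_log), encoding="utf-8")
    sampled = load_sampled_config(log_path, trial_id=0)

print(json.dumps(sampled, indent=2))
```

For a real run, point log_path at the log.json location given above instead of the temporary file.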
