Lab 4 (Software Stream) Performance Engineering#

ELEC70109/EE9-AML3-10/EE9-AO25
Written by Aaron Zhao, Cheng Zhang, Pedro Gimenes

General introduction#

In this lab, you will learn how to optimize ML code. We will go through three approaches:

  1. Use a high-level compilation framework such as torch.compile to optimize user code, and understand the basic building blocks of such optimization frameworks (see the sketch after this list).

  2. Understand the effect of fused kernels, and test them against existing upstream implementations in PyTorch.

  3. Understand how to port custom CUDA kernels into PyTorch, and test their performance.
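
As a rough reference for the first approach, the snippet below shows the basic torch.compile workflow; the toy model and shapes are illustrative placeholders, not the lab's actual code.

```python
import torch
import torch.nn as nn

# Illustrative toy model; the lab uses its own model definition.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# torch.compile traces the model into a graph and hands it to a backend
# (TorchInductor by default) for kernel-level optimization.
compiled_model = torch.compile(model)

x = torch.randn(64, 256)
out = compiled_model(x)  # first call triggers compilation; later calls reuse the compiled graph
```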

Learning tasks#

  1. Go through “Lab 4 for Advanced Deep Learning Systems (ADLS) - Software Stream” to understand how to optimize your ML model.

Implementation tasks#

  1. In the first part of Lab 4 (torch.compile), we did not observe real runtime speedups with torch.compile.

    1. Modify the code and investigate why this is the case (the benchmarking sketch at the end of this section may help).

    2. If you change the device to cuda, do you observe the same thing?

  2. In the second part of Lab 4 (kernel fusion), we looked at a fused SDPA kernel.

    1. Now, extend the profiling to the SDPA kernel and compare its runtime behavior with that of the naive implementation (see the profiling sketch at the end of this section).

    2. If you change the device to cuda, do you observe the same thing?

  3. In the third part of Lab 4 (custom kernels), we went through how to write an MXINT8 dequantization kernel and bind it to Python.

    1. How does MXINT8 benefit custom hardware if both the activations and weights in a linear layer are quantized to MXINT8? (The dequantization sketch at the end of this section illustrates the format.)

    2. What is the purpose of the variables dont_need_abs and bias in the C++ for loop?

    3. How does cta_tiler partition data for copying to shared memory in the CUDA kernel? How does layout_sX partition threads in a threadblock for computation? (Challenge)

    4. Why is the saved GPU memory not exactly (32 - (4 + 8/32))/32 = 86.7% of the FP32 model?
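
For implementation task 1, a common pitfall is timing the one-off compilation cost or, on GPU, reading the timer before queued kernels have finished. The sketch below is one reasonable way to benchmark; the helper name, warm-up count, and iteration count are arbitrary choices, not part of the lab.

```python
import time
import torch

def benchmark(fn, x, device, n_iters=100, n_warmup=5):
    # Warm-up: the first call to a torch.compile'd function pays the
    # compilation cost, so it must be excluded from the timed region.
    for _ in range(n_warmup):
        fn(x)
    if device == "cuda":
        torch.cuda.synchronize()  # CUDA kernels launch asynchronously
    start = time.perf_counter()
    for _ in range(n_iters):
        fn(x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return (time.perf_counter() - start) / n_iters
```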
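For implementation task 2, torch.profiler can expose the per-operator breakdown that distinguishes the fused SDPA kernel from the unfused version. A minimal sketch, assuming CPU profiling (add ProfilerActivity.CUDA and move the tensors when profiling on GPU); the shapes and the naive implementation here are illustrative.

```python
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

def naive_sdpa(q, k, v):
    # Unfused attention: every op below launches separately and
    # materializes an intermediate tensor.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(8, 16, 128, 64)  # (batch, heads, seq_len, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

for name, fn in [("naive", naive_sdpa), ("fused", F.scaled_dot_product_attention)]:
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        fn(q, k, v)
    print(name)
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```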
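For implementation task 3, the sketch below reconstructs the core idea of microscaling (MX) dequantization in pure PyTorch: each group of 32 elements stores small integer mantissas plus one shared 8-bit exponent, so per-element storage is far below 32 bits. This is a simplified illustration under assumed conventions (the function name, exponent bias, and scaling rule are all placeholders), not the lab's CUDA kernel, and it omits the dont_need_abs/bias handling that the questions ask about.

```python
import torch

GROUP_SIZE = 32  # elements sharing one 8-bit exponent

def dequantize_mx(mantissas: torch.Tensor, shared_exps: torch.Tensor) -> torch.Tensor:
    """mantissas: (n_groups, GROUP_SIZE) signed integers.
    shared_exps: (n_groups,) biased 8-bit exponents (bias 127 assumed here)."""
    scale = torch.pow(2.0, shared_exps.float() - 127.0)  # one scale per group
    return mantissas.float() * scale.unsqueeze(-1)       # broadcast over the group
```

Counting bits in this simplified format gives (mantissa_bits + 8/GROUP_SIZE) bits per value versus 32 for FP32, which is where back-of-envelope figures like the one in question 4 come from.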