Lab 4 (Software Stream) Performance Engineering#
ELEC70109/EE9-AML3-10/EE9-AO25
Written by
Aaron Zhao,
Cheng Zhang,
Pedro Gimenes
General introduction#
In this lab, you will learn how to optimize ML code. We will go through three approaches:
Use a high-level framework such as torch.compile to optimize user code, and understand the basic building blocks of such optimization frameworks.
Understand the effect of fused kernels, and test them with existing upstream implementations in PyTorch.
Understand how to port custom CUDA kernels into PyTorch, and test their performance.
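To see the basic building blocks of torch.compile, you can register a custom backend: TorchDynamo captures the Python function into an FX graph and hands it to the backend, which can inspect it before returning a callable. This is a minimal sketch; the function names `inspect_backend` and `f` are illustrative, not from the lab.

```python
import torch

def inspect_backend(gm, example_inputs):
    # TorchDynamo passes the captured torch.fx.GraphModule here.
    # Print each node to see how the Python code was traced.
    for node in gm.graph.nodes:
        print(node.op, node.target)
    return gm.forward  # run the captured graph unmodified

@torch.compile(backend=inspect_backend)
def f(x):
    return torch.relu(x) + 1

x = torch.randn(8)
y = f(x)  # first call triggers graph capture and prints the nodes
```

The printed nodes (placeholder, call_function, output) are the intermediate representation that backends like Inductor lower into fused kernels.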
Learning tasks#
Go through “Lab 4 for Advanced Deep Learning Systems (ADLS) - Software Stream” to understand how to optimize your ML model.
Implementation tasks#
In the lab, we did not observe real run-time speedups with torch.compile.
Modify the code and investigate why this is the case.
If you change the device to cuda, do you observe the same thing?
In the second part of Lab 4, we looked at a fused SDPA kernel.
Now, extend the profiling to the SDPA kernel and compare its runtime behavior with the naive implementation.
If you change the device to cuda, do you observe the same thing?
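The profiling comparison can be sketched with torch.profiler: run a naive attention implementation and the fused torch.nn.functional.scaled_dot_product_attention under the profiler and compare the operator tables. The shapes and the name `naive_sdpa` are assumptions for illustration, not the lab's exact code.

```python
import math
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

def naive_sdpa(q, k, v):
    # Materialises the full (L, L) attention matrix in memory,
    # unlike the fused kernel, which avoids that intermediate.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(4, 8, 256, 64, device=device)  # (batch, heads, seq, dim)
k, v = torch.randn_like(q), torch.randn_like(q)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

for name, fn in [("naive", naive_sdpa),
                 ("fused", F.scaled_dot_product_attention)]:
    with profile(activities=activities) as prof:
        out = fn(q, k, v)
    print(f"--- {name} ---")
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The naive table should show separate matmul and softmax operators, whereas the fused path dispatches to a single SDPA kernel; on cuda the difference in kernel count and memory traffic is more pronounced.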