Lab 1 for Advanced Deep Learning Systems (ADLS)#
ELEC70109/EE9-AML3-10/EE9-AO25
Written by Aaron Zhao, Cheng Zhang, Pedro Gimenes
General introduction#
In this lab, you will learn how to use the basic functionalities of the software stack of MASE. There are five tasks in total that you need to finish.
Preparation and installations#
Make sure you have read and understood the installation instructions for the framework, detailed here.
Both streams (software and hardware) start on the software side. We begin with simple toy networks that can be trained on most laptops.
We recommend forking this repository to your own GitHub account. If you are not familiar with git, GitHub and terminology like fork, you might find this webpage useful. You might also want to work on your own branch. Please be aware that pushing to the main branch is blocked, and your final project will be submitted as a pull request.
Getting familiar with the Machop Command Line#
The software framework of MASE is called Machop; you can find the reason for the naming in the Machop Readme.
After installation, you should be able to run
./ch --help
You should see a printout of a list of options and a usage guide; this is also a good test of your installation. If your installation is incorrect, this command will throw an error.
The printout includes the following lines:
main arguments:
action action to perform. One of (train|test|transform|search)
model name of a supported model. Required if configuration NOT provided.
dataset name of a supported dataset. Required if configuration NOT provided.
This means the ch command line tool expects three compulsory inputs: action, model and dataset. These three components must be defined either on the command line interface or in a configuration file.
# example
# train is the action, toy is the model name and mnist is the dataset.
# you do not have to run the command for now.
./ch train toy mnist
The command line interface also allows you to pass additional arguments to control the training and testing flow.
# example
# setting the maximum training epochs and batch size through the cmd line interface.
# you do not have to run the command for now.
./ch train toy mnist --max-epochs 200 --batch-size 256
Training your first network#
In this section, we are interested in training a small network and evaluating it through the command line flow.
The dataset we look at is the Jet Substructure Classification (JSC) dataset.
Note
[A bit of physics] Jets are collimated showers of particles that result from the decay and hadronization of quarks q and gluons g. At the Large Hadron Collider (LHC), due to the high collision energy, a particularly interesting jet signature emerges from overlapping quark-initiated showers produced in decays of heavy standard model particles. It is the task of jet substructure to distinguish the various radiation profiles of these jets from backgrounds consisting mainly of quark (u, d, c, s, b) and gluon-initiated jets. The tools of jet substructure have been used to distinguish interesting jet signatures from backgrounds that have production rates hundreds of times larger than the signal.
In short, the dataset contains inputs with a feature size of 16 and 5 output classes.
The train command#
To train a network for the JSC dataset, you would need to run:
# You will need to run this command
./ch train jsc-tiny jsc --max-epochs 10 --batch-size 256
--max-epochs sets the maximum number of epochs to train for, and --batch-size defines the batch size for training.
You should see a printout of the training configuration in a table:
+-------------------------+--------------------------+-----------------+--------------------------+
| Name | Default | Manual Override | Effective |
+-------------------------+--------------------------+-----------------+--------------------------+
| task | classification | | classification |
| load_name | None | | None |
| load_type | mz | | mz |
| batch_size | 128 | 256 | 256 |
| to_debug | False | | False |
| log_level | info | | info |
| seed | 0 | | 0 |
| training_optimizer | adam | | adam |
| trainer_precision | 32 | | 32 |
| learning_rate | 1e-05 | | 1e-05 |
| weight_decay | 0 | | 0 |
| max_epochs | 20 | 10 | 10 |
| max_steps | -1 | | -1 |
| accumulate_grad_batches | 1 | | 1 |
| log_every_n_steps | 50 | | 50 |
| num_workers | 16 | 0 | 0 |
| num_devices | 1 | | 1 |
| num_nodes | 1 | | 1 |
| accelerator | auto | | auto |
| strategy | ddp | | ddp |
| is_to_auto_requeue | False | | False |
| github_ci | False | | False |
| disable_dataset_cache | False | | False |
| target | xcu250-figd2104-2L-e | | xcu250-figd2104-2L-e |
| num_targets | 100 | | 100 |
| is_pretrained | False | | False |
| max_token_len | 512 | | 512 |
| project_dir | /Users/aaron/Projects/ma | | /Users/aaron/Projects/ma |
| | se-tools/mase_output | | se-tools/mase_output |
| project | None | | None |
| model | None | jsc-tiny | jsc-tiny |
| dataset | None | jsc | jsc |
+-------------------------+--------------------------+-----------------+--------------------------+
There is also a summary of the model:
| Name | Type | Params
-------------------------------------------------
0 | model | JSC_Tiny | 127
1 | loss_fn | CrossEntropyLoss | 0
2 | acc_train | MulticlassAccuracy | 0
3 | acc_val | MulticlassAccuracy | 0
4 | acc_test | MulticlassAccuracy | 0
5 | loss_val | MeanMetric | 0
6 | loss_test | MeanMetric | 0
-------------------------------------------------
127 Trainable params
0 Non-trainable params
127 Total params
0.001 Total estimated model params size (MB)
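If you want to double-check a parameter count yourself, the standard PyTorch idiom is to sum p.numel() over the trainable parameters. Below is a minimal sketch, using a stand-in two-layer model of our own rather than the registered jsc-tiny:
import torch.nn as nn

# Stand-in model for illustration: BatchNorm1d(16) has 16 weights and
# 16 biases, and Linear(16, 5) has 16*5 weights and 5 biases.
model = nn.Sequential(nn.BatchNorm1d(16), nn.Linear(16, 5))
print(sum(p.numel() for p in model.parameters() if p.requires_grad))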
Logging on tensorboard#
As you can see, this is a toy model and it is very small. There is another printed line that is also very useful:
Project will be created at /home/cheng/GTA/adls/mase-tools/mase_output/jsc-tiny_classification_jsc_2023-10-30
Missing logger folder: /Users/aaron/Projects/mase-tools/mase_output/jsc-tiny_classification_jsc_2023-10-19/software/training_ckpts/logs
For any training command executed, a logging directory is created, and you can use TensorBoard to check the training trajectory.
# You need to run the following command with your own edits
# the actual name of the log directory created for you will differ from the example shown here, because the name contains a time-stamp.
# --port 16006 is declaring the port on localhost
tensorboard --logdir ../mase_output/jsc-tiny_classification_jsc_2023-10-19/software --port 16006
Open http://localhost:16006/ in your preferred browser and explore the entries that have been logged.
The test command#
Under the same folder, ../mase_output/jsc-tiny_classification_jsc_2023-10-19/software, there are also saved checkpoint files for the trained models. These are essentially the trained parameters of the model; you can find more detail on PyTorch model checkpointing here and on Lightning checkpointing here.
./ch test jsc-tiny jsc --load ../mase_output/jsc-tiny_classification_jsc_2023-10-19/software/training_ckpts/best.ckpt --load-type pl
The above command returns the performance of the trained model on the test set. --load-type pl tells Machop that the checkpoint was saved by PyTorch Lightning; for PyTorch Lightning, see this section. The saved checkpoint can also be used to resume training.
The definition of the JSC dataset#
Datasets are defined under the dataset folder in chop; one should take a look at the __init__.py to understand how different datasets are declared. The JSC dataset is defined and detailed in this file:
@add_dataset_info(
    name="jsc",
    dataset_source="manual",
    available_splits=("train", "validation", "test"),
    physical_data_point_classification=True,
    num_classes=5,
    num_features=16,
)
class JetSubstructureDataset(Dataset):
    def __init__(self, input_file, config_file, split="train"):
        super().__init__()
        ...
The decorator (if you do not know what a Python decorator is, click the link and learn) defines the dataset information required. The class JetSubstructureDataset has Dataset as its parent class. If you are still concerned about your proficiency in OOP (object-oriented programming), you should check this link.
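To make the Dataset interface concrete, here is a minimal sketch of a custom subclass in the same spirit; the class name and the random stand-in data are our own inventions, not part of MASE:
import torch
from torch.utils.data import Dataset

# Minimal sketch of the interface JetSubstructureDataset builds on: a
# Dataset subclass provides __len__ and __getitem__. The tensors here
# are random stand-ins, not the real JSC files.
class ToyJetDataset(Dataset):
    def __init__(self, num_samples=1000, num_features=16, num_classes=5):
        super().__init__()
        self.x = torch.randn(num_samples, num_features)
        self.y = torch.randint(0, num_classes, (num_samples,))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]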
The definition of the JSC Tiny network#
The network definition can also be found in the __init__.py:
class JSC_Tiny(nn.Module):
    def __init__(self, info):
        super(JSC_Tiny, self).__init__()
        self.seq_blocks = nn.Sequential(
            # 1st LogicNets Layer
            nn.BatchNorm1d(16),  # batch norm layer
            nn.Linear(16, 5),  # linear layer
        )

    def forward(self, x):
        return self.seq_blocks(x)
Network definitions in PyTorch normally contain two components: an __init__ method and a forward method. All networks and custom layers in PyTorch have to be subclasses of nn.Module. The neural network layers are initialised in __init__, and every nn.Module subclass implements the operations on input data in the forward method.
nn.Sequential is a container used for wrapping a number of layers together; more information on this container can be found at this link.
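To see the container in action, here is a minimal sketch that wraps the same two layers and pushes a random batch through them; the batch of 8 dummy samples is our own invention:
import torch
import torch.nn as nn

# The same two layers wrapped in nn.Sequential; layers are applied in
# the order they are listed.
seq_blocks = nn.Sequential(nn.BatchNorm1d(16), nn.Linear(16, 5))
x = torch.randn(8, 16)  # a batch of 8 dummy samples with 16 features
logits = seq_blocks(x)
print(logits.shape)  # torch.Size([8, 5]): one logit per class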
Varying the parameters#
We have executed the following training command:
./ch train jsc-tiny jsc --max-epochs 10 --batch-size 256
We can tune a number of parameters; the obvious ones to tune are:
batch-size
max-epochs
learning-rate
Tune these parameters by hand (a scripted sweep is sketched after the questions below) and answer the following questions:
What is the impact of varying batch sizes and why?
What is the impact of varying maximum epoch number?
What happens with a large learning rate, what happens with a small learning rate, and why? What is the relationship between learning rates and batch sizes?
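If you would rather script the sweep than re-type the command, here is a minimal sketch using subprocess. Note that the --learning-rate flag name is an assumption based on the parameter list above; confirm the exact spelling with ./ch train --help before running.
import subprocess

# Hedged sketch: sweep batch size and learning rate by invoking the CLI
# repeatedly. The --learning-rate flag name is assumed; check
# `./ch train --help` for the exact spelling.
for batch_size in (64, 256, 1024):
    for lr in (1e-2, 1e-4, 1e-6):
        subprocess.run(
            [
                "./ch", "train", "jsc-tiny", "jsc",
                "--max-epochs", "10",
                "--batch-size", str(batch_size),
                "--learning-rate", str(lr),
            ],
            check=True,
        )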
A deeper dive into the framework#
When you execute ./ch, what really happens is that the ch file gets executed, and from the import statements you can tell it calls into cli.py.
The entry point for the train/test action#
When you choose to execute ./ch train, we are executing the train action and invoking train.py. The entire training flow is orchestrated using PyTorch Lightning, and the detailed Lightning-related wrapping occurs in jet_substructure.py.
PyTorch Lightning's checkpointing callbacks save the model parameters (torch.nn.Module.state_dict()), the optimizer states, and other hyper-parameters specified in the lightning.pytorch.LightningModule, so that training can be resumed from the last checkpoint. The saved checkpoint has the extension .ckpt, which is why we pass --load-type pl to the ./ch test command.
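If you are curious what such a checkpoint actually contains, below is a minimal sketch of inspecting one with torch.load; the path is the example one from this lab and will differ on your machine.
import torch

# Inspect a Lightning checkpoint. The path is the example from this lab;
# yours will contain a different time-stamp. On recent torch versions
# weights_only=False is needed, since the file holds more than tensors.
ckpt = torch.load(
    "../mase_output/jsc-tiny_classification_jsc_2023-10-19/software/training_ckpts/best.ckpt",
    map_location="cpu",
    weights_only=False,
)
print(list(ckpt.keys()))  # typically 'state_dict', 'optimizer_states', 'epoch', ...
print(list(ckpt["state_dict"].keys()))  # the torch.nn.Module parameters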
The test action has a similar implementation based on PyTorch Lightning (test.py).
The entry point for the model#
All models are defined in the __init__.py under the model folder. The get_model function is called inside actions (such as train) to retrieve the requested model.
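To make that lookup pattern concrete, here is a hypothetical illustration of a name-to-class registry; the names MODEL_REGISTRY and get_model_sketch are inventions for this sketch, not MASE's actual internals.
import torch.nn as nn

# Hypothetical illustration only: MASE's real registry lives in the
# model folder's __init__.py and may be structured differently.
class JSC_Tiny(nn.Module):  # same shape as the model shown earlier
    def __init__(self, info=None):
        super().__init__()
        self.seq_blocks = nn.Sequential(nn.BatchNorm1d(16), nn.Linear(16, 5))

    def forward(self, x):
        return self.seq_blocks(x)

MODEL_REGISTRY = {"jsc-tiny": JSC_Tiny}  # maps CLI model names to classes

def get_model_sketch(name, info=None):
    """Look up a model class by its CLI name and instantiate it."""
    if name not in MODEL_REGISTRY:
        raise ValueError(f"unknown model: {name}")
    return MODEL_REGISTRY[name](info)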
The entry point for the dataset#
Similar to the model definitions, all datasets are defined in the __init__.py under the dataset folder.
Train your own network#
Now you are familiar with the different components of the tool.
Implement a network that has in total around 10x more parameters than the toy network (one possible starting point is sketched below).
Test your implementation and evaluate its performance.
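As a starting point only (the layer sizes here are our own choice, not a prescribed solution), a wider variant with one 64-unit hidden layer has 32 + 1088 + 325 = 1445 trainable parameters, roughly 11x JSC_Tiny:
import torch.nn as nn

# One possible starting point, not a required solution:
# BatchNorm1d(16) = 32 params, Linear(16, 64) = 1088, Linear(64, 5) = 325,
# for 1445 trainable parameters in total.
class JSC_Bigger(nn.Module):
    def __init__(self, info=None):
        super().__init__()
        self.seq_blocks = nn.Sequential(
            nn.BatchNorm1d(16),
            nn.Linear(16, 64),
            nn.ReLU(),
            nn.Linear(64, 5),
        )

    def forward(self, x):
        return self.seq_blocks(x)
Remember that, as described above, a new model has to be registered in the model folder's __init__.py before ./ch can find it by name.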
Google Colab Adaptation#
lab1.ipynb contains an adaptation that sets up the same flow on Google Colab. You should repeat the exercise there, because you will definitely need a powerful GPU for the later labs and your Team Projects.