Lectures on ML Systems
This page serves as a reference for understanding Machine Learning Systems (ML Systems). The material consists of video lectures, slides, and suggested readings.
Description
Machine learning models are used pervasively to solve problems in varied fields such as vision, robotics, NLP, and scientific discovery. The increased capabilities of these models have come with an increase in their size and compute requirements. Moreover, the use of these models in real-world applications imposes strict requirements on performance metrics such as latency, throughput, and hardware usage efficiency.
The focus of this course is on exploring these systems-related challenges during the training and serving of large language models (LLMs), with special emphasis on the Transformer architecture. Topics include GPU architecture and hardware-aware algorithms, ML frameworks and compilers, techniques to parallelize LLMs over multiple GPUs, and reduction of computational complexity and memory footprint.
Textbook: The following book can be helpful for parts of the course:
How to Scale Your Model by Austin, J., Douglas, S., Frostig, R., Levskaya, A., Chen, C., Vikram, S., Lebron, F., Choy, P., Ramasesh, V., Webson, A., & Pope, R. (2025).
Credits: Part of the content in the lectures is based on material from CSCI 1390 at Brown, created by Deepti Raghavan, and CS 15-442 at CMU, created by Tianqi Chen.
Disclaimer: Since this is the first offering of this class, please anticipate technical difficulties. This is not an official course webpage of Plaksha University; it is maintained personally by the instructor.
Feedback: If you have found the material useful, or have suggestions on how it can be improved, I would be happy to hear from you. Please email me at pankaj.pansari@plaksha.edu.in
Lectures
Lecture 1: Introduction
Suggested Reading:
- Gholami, Amir, et al. “AI and Memory Wall.” IEEE Micro (arXiv link)
This very readable paper presents the interplay between throughput, bandwidth, and end-to-end runtime via a case study on Transformer models.
- Kaplan, Jared, et al. “Scaling laws for neural language models.” arXiv:2001.08361 (2020) (arXiv link)
An empirical study quantifying how LLM performance improves with model size, training dataset size, and amount of compute. These results are commonly referred to as the Kaplan scaling laws; a small sketch of their power-law form appears after this list.
- Austin, Jacob, et al. How to Scale Your Model (Part 0: Intro)
A useful summary of why the study of ML Systems is important.
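For concreteness, here is a minimal sketch of the power-law form of the Kaplan scaling law for loss as a function of model size. The fitted constants are approximate values reported in the paper and are used here only for illustration.

```python
# Power-law form of the Kaplan et al. scaling law for loss vs. model size
# (when data and compute are not the bottleneck): L(N) = (N_c / N) ** alpha_N.
# The constants below are approximate fitted values from the paper (illustrative only).
def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}: predicted loss ~ {loss_vs_params(n):.2f}")
```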
Lecture 2: Automatic Differentiation
Suggested Reading:
- Kevin Clark, “Computing Neural Network Gradients” (link)
A concise refresher on analytical gradient computation for neural networks in terms of matrices and vectors.
- Roger Grosse, “CSC321 Lecture 10: Automatic Differentiation” (Slides)
A presentation of automatic differentiation in the context of the Autograd library. It is instructive to see how Autograd builds the computation graph in a different manner than PyTorch or Needle.
- Chen, Tianqi, et al. “Training Deep Nets with Sublinear Memory Cost” (arXiv link)
The seminal paper that introduced the idea of gradient checkpointing.
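As a concrete illustration of the idea from the Chen et al. paper, here is a minimal sketch using PyTorch's torch.utils.checkpoint module; the layer sizes and segment count are arbitrary choices for illustration, not a recommended configuration.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of layers whose activations would normally all be kept for backward.
layers = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(32)])
x = torch.randn(8, 1024, requires_grad=True)

# Split the stack into 4 segments: only segment-boundary activations are stored;
# the rest are recomputed during the backward pass, trading compute for memory.
out = checkpoint_sequential(layers, 4, x)
out.sum().backward()
```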
Practical 1: Automatic Differentiation Implementation
We do a code walkthrough of how reverse-mode automatic differentiation is implemented in a modern ML framework. We choose Needle, an educational framework developed at CMU with an interface similar to PyTorch's.
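For intuition ahead of the walkthrough, here is a toy scalar reverse-mode autodiff in Python. This is not Needle's actual code, just a minimal sketch of the underlying idea: each operation records its parent nodes and a local backward function, and backward() replays those functions in reverse topological order.

```python
# Toy scalar reverse-mode autodiff (a sketch, not Needle's or PyTorch's implementation).
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents      # nodes this value was computed from
        self.backward_fn = None     # propagates self.grad to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out.backward_fn = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out.backward_fn = _backward
        return out

    def backward(self):
        # Build a topological order of the graph, then propagate gradients in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v.parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            if v.backward_fn is not None:
                v.backward_fn()

x, y = Value(2.0), Value(3.0)
z = x * y + x           # dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)   # 4.0 2.0
```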
Lecture 3: Understanding GPU Bottlenecks for ML
Suggested Reading:
- Stephen Jones, “How GPU Computing Works”, GTC 2021 (Video)
An excellent introduction to the principles behind GPU design and the CUDA architecture. Especially good is the discussion of the co-design of GPU hardware and CUDA.
- Nvidia Docs, “GPU Performance Background User’s Guide” (Link)
A useful discussion of how different deep learning operations are limited by either the compute or the memory bandwidth of the GPU.
- Horace He, “Making Deep Learning Go Brrrr From First Principles” (Blog post)
Nice diagrams and figures that compactly illustrate the three places where an ML program spends its time: compute, memory, and overhead.
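The compute-versus-memory distinction can be made concrete with a back-of-the-envelope arithmetic-intensity calculation. The sketch below compares a large matrix multiplication against a matrix-vector-like product; the peak FLOP/s and bandwidth figures are rough, illustrative values for an A100-class GPU, not exact specifications.

```python
# Back-of-the-envelope arithmetic intensity for a matmul C = A @ B in fp16.
def matmul_intensity(m, k, n, bytes_per_elem=2):
    flops = 2 * m * k * n                                     # one multiply + one add per (i, j, k)
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)    # read A and B, write C
    return flops / bytes_moved

peak_flops = 312e12              # ~312 TFLOP/s fp16 tensor-core peak (rough)
peak_bandwidth = 2.0e12          # ~2 TB/s HBM bandwidth (rough)
ridge = peak_flops / peak_bandwidth   # intensity needed to become compute-bound (~156 FLOP/byte)

for shape in [(4096, 4096, 4096), (4096, 4096, 1)]:           # large GEMM vs. GEMV-like product
    ai = matmul_intensity(*shape)
    print(shape, f"-> {ai:.1f} FLOP/byte,",
          "compute-bound" if ai > ridge else "memory-bound")
```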
Lecture 4: GPU Programming Model
Example code: Vector addition, Simple matrix multiplication, Tiled matrix multiplication
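As a rough Python-side analogue of the simple matrix multiplication example, here is a naive kernel written with Numba's CUDA JIT; this is a sketch with arbitrary sizes, not the course's example code.

```python
import numpy as np
from numba import cuda

@cuda.jit
def matmul_naive(A, B, C):
    # Each thread computes one element of C = A @ B.
    row, col = cuda.grid(2)
    if row < C.shape[0] and col < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[row, k] * B[k, col]
        C[row, col] = acc

M = K = N = 1024
A = cuda.to_device(np.random.rand(M, K).astype(np.float32))
B = cuda.to_device(np.random.rand(K, N).astype(np.float32))
C = cuda.device_array((M, N), dtype=np.float32)

threads = (16, 16)                             # 256 threads per block
blocks = ((M + 15) // 16, (N + 15) // 16)      # enough blocks to cover all of C
matmul_naive[blocks, threads](A, B, C)
result = C.copy_to_host()
```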
Suggested Reading:
- Mark Harris, “An Even Easier Introduction to CUDA” (Nvidia Technical Blog post)
Introduces the GPU execution model through a vector-addition example. One useful concept is Unified Memory, a pool of managed memory shared by the GPU and CPU that simplifies memory management for the programmer. Mark also introduces a CUDA programming pattern called the grid-stride loop (a sketch of this pattern appears after this list).
- Jeremy Howard, “Getting Started With CUDA for Python Programmers” (Video)
Jeremy shows how to call a CUDA kernel from PyTorch using its cpp_extension module. He introduces kernels for RGB-to-grayscale conversion and matrix multiplication, written in a more Pythonic way.
- Sasha Rush, “GPU Puzzles” (Link)
A nice collection of puzzles for implementing CUDA kernels in Python, using the Numba JIT compiler.
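Here is a minimal sketch of the grid-stride loop pattern mentioned above, written with Numba's CUDA JIT rather than CUDA C++; the grid and block sizes are arbitrary illustrative choices.

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_grid_stride(x, y, out):
    # Grid-stride loop: each thread handles elements i, i+stride, i+2*stride, ...
    # so a fixed-size grid can process arrays of any length.
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]

n = 1 << 20
x = cuda.to_device(np.ones(n, dtype=np.float32))
y = cuda.to_device(np.full(n, 2.0, dtype=np.float32))
out = cuda.device_array(n, dtype=np.float32)

add_grid_stride[64, 256](x, y, out)    # 64 blocks * 256 threads << n; the loop covers the rest
print(out.copy_to_host()[:3])          # [3. 3. 3.]
```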
Practical 2: GPU Profiling
We run the Nvidia Nsight Systems and Nsight Compute profilers on our naive and tiled matrix multiplication examples. We identify the key information to look for when using these tools and see how they are valuable in finding bottlenecks.
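One practical tip when using Nsight Systems is to annotate regions of interest with NVTX ranges so they show up as named spans on the timeline. Below is a minimal PyTorch sketch; the matrix sizes and range name are arbitrary, and the script would typically be run under `nsys profile python script.py`.

```python
import torch

# Wrap a region of interest in an NVTX range so it appears as a named span
# in the Nsight Systems timeline.
A = torch.randn(4096, 4096, device="cuda")
B = torch.randn(4096, 4096, device="cuda")

torch.cuda.nvtx.range_push("naive_matmul")
C = A @ B
torch.cuda.nvtx.range_pop()

torch.cuda.synchronize()   # kernel launches are asynchronous; wait for completion
```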
Lecture 5: Transformer FLOPs Math and Introduction to LLM Inference
Slides, Video lectures: Transformer Math, LLM Inference
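As a taste of the FLOPs math, here is a sketch of the common rule of thumb that training a dense Transformer costs roughly 6 × (parameters) × (tokens) FLOPs; the model size, token count, and utilization figure below are purely illustrative.

```python
# Rule-of-thumb training compute for a dense Transformer:
# about 6 * (parameters) * (training tokens) FLOPs (2 for forward, 4 for backward).
def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

n_params = 7e9            # a 7B-parameter model (illustrative)
n_tokens = 1e12           # trained on 1T tokens (illustrative)
flops = training_flops(n_params, n_tokens)

sustained = 0.4 * 312e12  # assume ~40% utilization of a 312 TFLOP/s GPU (rough)
gpu_days = flops / sustained / 86400
print(f"{flops:.2e} FLOPs ~ {gpu_days:,.0f} GPU-days at that sustained throughput")
```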
Suggested Reading:
- Einsum is a notation for concisely expressing complex tensor operations such as multiplication and summation. Einops is a library for reshaping and manipulating tensors, for example rearranging axes and reducing dimensions. With einsum and einops, one does not have to remember all the different PyTorch functions for tensor manipulation (a short example appears after this list).
- Niels Rogge, “How a Transformer works at inference vs training time” (Video)
- Austin, Jacob, et al. How to Scale Your Model (Part 4: Transformers, Part 7: Inference)
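Here is a small illustration of einsum and einops applied to an attention-style computation; the tensor shapes and axis names are arbitrary choices for the example.

```python
import torch
from einops import rearrange

# Attention-style score computation with einsum, plus an einops reshape
# for splitting the hidden dimension into heads.
batch, seq, heads, d_head = 2, 128, 8, 64
x = torch.randn(batch, seq, heads * d_head)

q = rearrange(x, "b s (h d) -> b h s d", h=heads)   # split hidden dim into heads
k = rearrange(x, "b s (h d) -> b h s d", h=heads)

# Contract over the per-head feature dimension d to get pairwise scores.
scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / d_head ** 0.5
print(scores.shape)   # torch.Size([2, 8, 128, 128])
```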