Lectures on ML Systems


Plaksha University

This page serves as a reference for understanding Machine Learning Systems (ML Systems). The material consists of video lectures, slides, and suggested readings.

Description

Machine learning models are pervasively used to solve problems in fields as varied as vision, robotics, NLP, and scientific discovery. The increased capabilities of these models have come with a corresponding increase in their size and compute requirements. Moreover, the use of these models in real-world applications imposes strict requirements on performance metrics such as latency, throughput, and hardware utilization.

The focus of this course is on exploring these systems-related challenges in training and serving large language models (LLMs), with special emphasis on the Transformer architecture. Topics include GPU architecture and hardware-aware algorithms, ML frameworks and compilers, techniques to parallelize LLMs over multiple GPUs, and methods to reduce computational complexity and memory footprint.

Textbook: The following book can be helpful for parts of the course:

How to Scale Your Model by Austin, J., Douglas, S., Frostig, R., Levskaya, A., Chen, C., Vikram, S., Lebron, F., Choy, P., Ramasesh, V., Webson, A., & Pope, R. (2025).

Credits: Part of the content in the lectures is based on material from CSCI 1390 at Brown, created by Deepti Raghavan, and CS 15-442 at CMU, created by Tianqi Chen.

Disclaimer: Since this is the first offering of this class, please anticipate technical difficulties. This is not an official course webpage of Plaksha University; it is maintained personally by the instructor.

Feedback: If you have found the material useful, or have suggestions on how it can be improved, I would be happy to hear from you. Please email me at pankaj.pansari@plaksha.edu.in.


Lectures

Lecture 1: Introduction

Slides, Video lecture

Suggested Reading:

  1. Gholami, Amir, et al. “AI and Memory Wall.” IEEE Micro (arXiv link)

This very readable paper examines the interplay between throughput, bandwidth, and end-to-end runtime via a case study on Transformer models.

  2. Kaplan, Jared, et al. “Scaling Laws for Neural Language Models.” arXiv:2001.08361 (2020) (arXiv link)

An empirical study quantifying how LLM performance improves with model size, training dataset size, and amount of compute. These results are often referred to as the Kaplan scaling laws; a sketch of their power-law form is given after this reading list.

  3. Austin, Jacob, et al. How to Scale Your Model (Part 0: Intro)

A useful summary of why the study of ML Systems is important.
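
For quick reference, here is a sketch of the power-law form of the Kaplan scaling laws, with L the test loss, N the number of non-embedding parameters, D the dataset size in tokens, and C_min the compute budget assuming an optimally sized model. The exponents are the approximate values reported in the paper; N_c, D_c, and C_c are fitted constants.

```latex
% Approximate power-law form of the scaling laws from Kaplan et al. (2020).
\begin{align*}
  L(N) &\approx \left(\frac{N_c}{N}\right)^{\alpha_N}, & \alpha_N &\approx 0.076
    && \text{(non-embedding parameters)} \\
  L(D) &\approx \left(\frac{D_c}{D}\right)^{\alpha_D}, & \alpha_D &\approx 0.095
    && \text{(training tokens)} \\
  L(C_{\min}) &\approx \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C^{\min}}, & \alpha_C^{\min} &\approx 0.050
    && \text{(compute, optimally sized model)}
\end{align*}
```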

Lecture 2: Automatic Differentiation

Slides, Video lecture

Suggested Reading:

  1. Kevin Clark, “Computing Neural Network Gradients” (link)

A concise refresher on analytical gradient computation for neural networks in terms of matrices and vectors.

  2. Roger Grosse, “CSC321 Lecture 10: Automatic Differentiation” (Slides)

A presentation of automatic differentiation in the context of the Autograd library. It is instructive to see how Autograd builds the computation graph in a different manner from PyTorch or Needle.

  3. Chen, Tianqi, et al. “Training Deep Nets with Sublinear Memory Cost” (arXiv link)

The seminal paper that introduced the idea of gradient checkpointing; a brief code sketch of the idea follows this reading list.
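
To make the idea concrete, below is a minimal sketch of gradient checkpointing using PyTorch's torch.utils.checkpoint. This is an illustrative example, not the paper's original implementation; the model and segment sizes are arbitrary choices. Inside each checkpointed block, activations are discarded after the forward pass and recomputed during the backward pass, trading extra compute for a smaller memory footprint.

```python
# Minimal sketch of gradient checkpointing in PyTorch (illustrative only;
# not the implementation from the Chen et al. paper). Activations inside a
# checkpointed block are not stored during forward and are recomputed
# during backward, trading compute for memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Arbitrary example model: a stack of small MLP blocks.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
)

def forward_with_checkpointing(x):
    # Only each block's input is saved for the backward pass; activations
    # inside the block are recomputed when gradients are needed.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(32, 1024, requires_grad=True)
loss = forward_with_checkpointing(x).sum()
loss.backward()  # triggers recomputation of per-block activations
```

Splitting a depth-n network into k checkpointed segments keeps roughly O(k + n/k) activations live at once, which is minimized at k = sqrt(n); this is the O(sqrt(n)) memory trade-off analyzed in the paper, at the cost of one extra forward pass.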

Lecture 3: Understanding GPU Bottlenecks for ML

Slides, Video lecture

Suggested Reading:

  1. Stephen Jones, “How GPU Computing Works”, GTC 2021 (Video)

An excellent introduction to the principles behind GPU design and the CUDA architecture. Especially good is the discussion of the co-design of GPU hardware and CUDA.

  2. NVIDIA Docs, “GPU Performance Background User’s Guide” (Link)

A useful discussion of how different deep learning operations are limited by either the compute or the memory bandwidth of the GPU; a back-of-the-envelope sketch of this distinction follows this reading list.

  3. Horace He, “Making Deep Learning Go Brrrr From First Principles” (Blog post)

Nice diagrams and figures compactly illustrate the three components where an ML program spends its time: compute, memory, and overhead.
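
As a back-of-the-envelope illustration of the compute-bound vs. memory-bound distinction discussed in the second reading, the sketch below estimates the arithmetic intensity (FLOPs per byte of memory traffic) of a matrix multiplication and compares it with a GPU's FLOPs-to-bandwidth ratio. The peak-FLOPs and bandwidth numbers are illustrative placeholders, not the specification of any particular GPU.

```python
# Back-of-the-envelope check of whether a matmul is compute- or memory-bound.
# Hardware numbers below are illustrative placeholders, not a real GPU spec.

def matmul_arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], assuming each matrix is
    read from or written to device memory exactly once (fp16 elements)."""
    flops = 2 * m * k * n                              # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

PEAK_FLOPS = 100e12      # placeholder: 100 TFLOP/s
PEAK_BANDWIDTH = 2e12    # placeholder: 2 TB/s
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs/byte needed to saturate compute

for shape in [(4096, 4096, 4096),   # large square matmul
              (1, 4096, 4096)]:     # matrix-vector product (e.g. batch-1 decoding)
    ai = matmul_arithmetic_intensity(*shape)
    bound = "compute-bound" if ai > ridge_point else "memory-bound"
    print(f"{shape}: intensity = {ai:.1f} FLOPs/byte -> {bound}")
```

The large square matmul lands far above the ridge point (compute-bound), while the batch-1 matrix-vector product lands far below it (memory-bound), which is one reason LLM decoding tends to be limited by memory bandwidth rather than compute.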