Lectures on ML Systems

Author
Affiliation

Plaksha University

This page serves as a reference for understanding Machine Learning Systems (ML Systems). The material consists of video lectures, slides, and additional readings.

Description

Machine learning models are pervasively used to solve problems in varied fields such as vision, robotics, NLP, and scientific discovery. The increased capabilities of these models have corresponded with an increase in their size and compute requirements. Moreover, deploying these models in real-world applications imposes strict requirements on performance metrics such as latency, throughput, and hardware utilization efficiency.

The focus of this course is on exploring these systems-related challenges during the training and serving of large language models (LLMs), with special emphasis on the Transformer architecture. Topics include GPU architecture and hardware-aware algorithms, ML frameworks and compilers, techniques to parallelize LLMs over multiple GPUs, and methods to reduce computational complexity and memory footprint.

Textbook: The following book can be helpful for parts of the course:

How to Scale Your Model by Austin, J., Douglas, S., Frostig, R., Levskaya, A., Chen, C., Vikram, S., Lebron, F., Choy, P., Ramasesh, V., Webson, A., & Pope, R. (2025).

Credits: Part of the content in the lectures is based on material from CSCI 1390 at Brown University, created by Deepti Raghavan, and CS 15-442 at CMU, created by Tianqi Chen.

Disclaimer: Since this is the first offering of this class, please anticipate occasional technical difficulties. This is not an official course webpage of Plaksha University; it is maintained personally by the instructor.

Feedback: If you have found the material useful, or have suggestions on how it can be improved, I would be happy to hear from you. Please email me at pankaj.pansari@plaksha.edu.in.


Lectures

Topic 1: Introduction

Slides, Video lecture

Topic 2: Automatic Differentiation

Slides, Video lecture

Practical 1: Automatic Differentiation Implementation

Video

We do a code walkthrough of how reverse-mode automatic differentiation is implemented in a modern ML framework. We choose Needle, an educational framework with an interface similar to PyTorch's, developed at CMU.
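To give a flavor of the core idea before watching the walkthrough, here is a minimal tape-based reverse-mode AD sketch in Python. The `Value` class and its methods are illustrative only, not Needle's actual API: each operation records its parents and local derivatives, and `backward()` propagates gradients through the graph in reverse topological order.

```python
# Minimal sketch of tape-based reverse-mode automatic differentiation.
# Names (Value, backward) are hypothetical; Needle's real API differs.

class Value:
    """Scalar node in a computation graph, recording how it was produced."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents      # upstream Value nodes
        self.grad_fns = grad_fns    # local derivative w.r.t. each parent

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Topologically order the graph, then push gradients backwards.
        topo, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                topo.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(topo):
            for parent, fn in zip(v.parents, v.grad_fns):
                parent.grad += fn(v.grad)   # accumulate, don't overwrite

# Example: y = x1 * x2 + x1, so dy/dx1 = x2 + 1 and dy/dx2 = x1
x1, x2 = Value(3.0), Value(4.0)
y = x1 * x2 + x1
y.backward()
# x1.grad == 5.0, x2.grad == 3.0
```

Real frameworks operate on tensors rather than scalars and build the graph lazily or eagerly, but the topological-sort-then-accumulate structure is the same.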

Topic 3: Understanding GPU Bottlenecks for ML

Slides, Video lecture

Topic 4: GPU Programming Model

Slides, Video lecture

Example code: Vector addition, Simple matrix multiplication, Tiled matrix multiplication
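As a language-agnostic illustration of the tiling idea behind the CUDA examples above, the sketch below expresses tiled matrix multiplication in plain Python. The loop structure is the point: on a GPU, each `TILE x TILE` block of A and B staged in the inner loops would be loaded into fast shared memory by a thread block and reused `TILE` times, cutting global-memory traffic. The tile size here is an arbitrary small value for demonstration.

```python
# Pure-Python sketch of tiled (blocked) matrix multiplication.
# Illustrates the blocking pattern used by shared-memory CUDA kernels.

TILE = 2  # tile width; real CUDA kernels typically use 16 or 32

def tiled_matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):          # tile row of C
        for j0 in range(0, m, TILE):      # tile column of C
            for p0 in range(0, k, TILE):  # slide tiles along the shared dim
                # In CUDA, the TILE x TILE sub-blocks of A and B touched
                # here would be staged in shared memory and reused.
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, m)):
                        for p in range(p0, min(p0 + TILE, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C

# Usage:
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = tiled_matmul(A, B)  # [[19.0, 22.0], [43.0, 50.0]]
```

The result is identical to a naive triple loop; only the memory access pattern changes, which is exactly what the tiled CUDA example exploits.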

Practical 2: GPU Profiling

Video

We run the NVIDIA Nsight Systems and Nsight Compute profilers on our naive and tiled matrix multiplication examples. We identify the key information to look for when using these tools and see how they help pinpoint bottlenecks.

Topic 5: Transformer FLOPs Math

Slides, Video lecture
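A quick back-of-the-envelope calculation in the spirit of this topic, using the widely used approximations that a dense transformer's forward pass costs about 2 FLOPs per parameter per token, and training (forward plus backward) about 6. The model and dataset sizes below are hypothetical examples, not figures from the lecture.

```python
# Back-of-the-envelope transformer FLOPs using the standard 2N / 6N
# approximations (N = parameter count). Sizes below are assumptions.

def forward_flops(n_params, n_tokens):
    # ~2 FLOPs per parameter per token for the forward pass
    return 2 * n_params * n_tokens

def training_flops(n_params, n_tokens):
    # ~6 FLOPs per parameter per token including the backward pass
    return 6 * n_params * n_tokens

# e.g. a hypothetical 7B-parameter model trained on 1T tokens:
total = training_flops(7 * 10**9, 10**12)  # 4.2e22 FLOPs
```

Dividing such a total by a GPU's sustained FLOP/s (and a realistic utilization factor) gives a first estimate of training time, which is the kind of reasoning this topic develops in detail.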

Topic 6: Introduction to LLM Inference

Slides, Video lecture
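One recurring calculation in LLM inference is sizing the KV cache: during autoregressive decoding, each layer caches one key and one value vector per token. The sketch below computes its memory footprint; the model shape used in the example is an assumption for illustration, not taken from the lecture.

```python
# Sketch: KV-cache memory for autoregressive decoding. Each layer caches
# one key and one value vector (n_heads * head_dim elements each) per
# token. Model shape in the example below is an assumption.

def kv_cache_bytes(batch, seq_len, n_layers, n_heads, head_dim,
                   bytes_per_elem=2):
    # leading 2 accounts for storing both keys and values
    return 2 * batch * seq_len * n_layers * n_heads * head_dim * bytes_per_elem

# e.g. a hypothetical 32-layer model with 32 heads of dim 128, fp16,
# batch 1, 4096-token context:
size = kv_cache_bytes(1, 4096, 32, 32, 128)
print(size / 2**30, "GiB")  # 2.0 GiB
```

Calculations like this explain why long contexts and large batches quickly become memory-bound during serving, a central theme of this topic.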

Note: The remaining lectures will be uploaded shortly.