Topic 1: Introduction

  1. Gholami, Amir, et al. “AI and Memory Wall.” IEEE Micro (arxiv link)

This very readable paper examines the interplay between compute throughput, memory bandwidth, and end-to-end runtime through a case study of Transformer models.

  2. Kaplan, Jared, et al. “Scaling laws for neural language models.” arXiv:2001.08361 (2020) (arxiv link)

An empirical study quantifying how LLM performance improves with model size, training dataset size, and amount of compute. The resulting power laws are often called the Kaplan scaling laws.
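
The fitted power law for model size can be sketched numerically. The constants below (alpha_N ≈ 0.076, N_c ≈ 8.8e13 non-embedding parameters) are the values reported in the paper; the function itself is an illustrative sketch, not the authors' code:

```python
# Sketch of the parameter-count scaling law from Kaplan et al. (2020):
# L(N) ~ (N_c / N)^alpha_N, with the fitted constants reported in the paper.

def loss_from_params(n_params: float,
                     n_c: float = 8.8e13,
                     alpha_n: float = 0.076) -> float:
    """Predicted cross-entropy loss (nats/token) for a model with n_params
    non-embedding parameters, holding data and compute non-limiting."""
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10):
    print(f"N={n:.0e}: predicted loss ~ {loss_from_params(n):.3f}")
```

Each 10x increase in parameters multiplies the predicted loss by the same factor (10^-0.076), which is what "power law" means in practice.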

  3. Austin, Jacob, et al. How to Scale Your Model (Part 0: Intro)

A useful summary of why the study of ML systems is important.

Topic 2: Automatic Differentiation

  1. Kevin Clark, “Computing Neural Network Gradients” (link)

A concise refresher on analytical gradient computation for neural networks in terms of matrices and vectors.
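
As an illustrative sketch (not from Clark's notes themselves): for a linear layer y = xW with upstream gradient G = dL/dy, the matrix-form identity dL/dW = xᵀG can be checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))   # batch of inputs
W = rng.standard_normal((3, 2))   # weight matrix

def loss(W):
    # A simple scalar loss: sum of the linear layer's outputs.
    return (x @ W).sum()

# Analytical gradient: with L = sum(xW), the upstream gradient dL/dy is
# all ones, so dL/dW = x.T @ G.
G = np.ones((4, 2))
dW_analytic = x.T @ G

# Numerical check via central finite differences, entry by entry.
eps = 1e-6
dW_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(dW_analytic, dW_numeric, atol=1e-4))  # True
```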

  2. Roger Grosse, “CSC321 Lecture 10: Automatic Differentiation” (Slides)

A presentation of automatic differentiation in the context of the Autograd library. It’s instructive to see how Autograd builds the computation graph in a different manner than PyTorch or Needle.
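
To make the graph-building idea concrete, here is a minimal tape-style reverse-mode sketch. This is an illustrative toy, not Autograd's actual implementation: the graph is recorded as operations execute, then the chain rule is applied in reverse topological order:

```python
class Value:
    """Minimal reverse-mode autodiff node; the graph is recorded as ops run."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # leaves have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():                # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():                # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the recorded graph, then apply the chain rule
        # from the output back toward the leaves.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# d(a*b + a)/da = b + 1 = 4;  d(a*b + a)/db = a = 2
a, b = Value(2.0), Value(3.0)
c = a * b + a
c.backward()
print(a.grad, b.grad)  # 4.0 2.0
```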

  3. Chen, Tianqi, et al. “Training Deep Nets with Sublinear Memory Cost” (arxiv link)

The seminal paper that introduced gradient checkpointing: storing only a subset of activations during the forward pass and recomputing the rest during the backward pass, reducing activation memory to O(√n) for an n-layer network at the cost of one extra forward computation.
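
The core trade-off can be sketched in a few lines (an illustrative toy, not the paper's implementation): keep an activation only every k layers, and recompute the span between two checkpoints when the backward pass needs it.

```python
import math

def layer(x):
    """Stand-in for an expensive layer."""
    return math.tanh(x)

def forward_checkpointed(x, n_layers, k):
    """Run the chain, keeping only every k-th activation (the checkpoints)."""
    checkpoints = {0: x}
    for i in range(n_layers):
        x = layer(x)
        if (i + 1) % k == 0:
            checkpoints[i + 1] = x
    return x, checkpoints

def recompute_segment(checkpoints, start, end):
    """Recompute activations between two checkpoints, as backward would."""
    x = checkpoints[start]
    acts = [x]
    for _ in range(start, end):
        x = layer(x)
        acts.append(x)
    return acts

out, ckpts = forward_checkpointed(1.0, n_layers=16, k=4)
# Memory: 16 stored activations without checkpointing vs 5 checkpoints here.
print(len(ckpts))  # 5
acts = recompute_segment(ckpts, 4, 8)
```

Choosing k ≈ √n balances the number of checkpoints against the length of each recomputed segment, which is where the paper's O(√n) memory bound comes from.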

Topic 3: Understanding GPU Bottlenecks for ML

  1. Stephen Jones, “How GPU Computing Works”, GTC 2021 (Video)

An excellent introduction to the principles behind GPU design and the CUDA architecture. Especially good is the discussion of the co-design of GPU hardware and CUDA.

  2. Nvidia Docs, “GPU Performance Background User’s Guide” (Link)

A useful discussion of how different deep learning operations end up limited by either the GPU’s compute throughput or its memory bandwidth.
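
The guide's rule of thumb can be sketched numerically: an operation is math-limited when its arithmetic intensity (FLOPs per byte moved) exceeds the GPU's ratio of compute throughput to memory bandwidth. The hardware figures below are approximate A100-class numbers used only for illustration:

```python
def arithmetic_intensity_matmul(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for an (m,k) @ (k,n) matmul in FP16 (2 bytes/elem)."""
    flops = 2 * m * n * k                          # one multiply-add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Machine balance for an A100-like GPU: ~312e12 FLOP/s over ~1.555e12 B/s.
machine_balance = 312e12 / 1.555e12                # ~200 FLOPs/byte

for shape in [(4096, 4096, 4096), (4096, 4096, 32)]:
    ai = arithmetic_intensity_matmul(*shape)
    kind = "compute-bound" if ai > machine_balance else "memory-bound"
    print(shape, f"AI ~ {ai:.0f} FLOPs/byte -> {kind}")
```

The same matmul flips from compute-bound to memory-bound as one dimension shrinks, which is exactly the distinction the guide draws for deep learning workloads.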

  3. Horace He, “Making Deep Learning Go Brrrr From First Principles” (Blog post)

Nice diagrams and figures that compactly illustrate the three places where an ML program spends its time: compute, memory, and overhead.

Topic 4: GPU Programming Model

  1. Mark Harris, “An Even Easier Introduction to CUDA” (Nvidia Technical Blog post)

Introduces the GPU execution model through a vector addition example. One useful concept is Unified Memory: a pool of managed memory shared by the GPU and CPU, which simplifies memory management for the programmer. Mark also introduces the grid-stride loop, a common CUDA programming pattern.
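
The grid-stride pattern can be illustrated with a CPU analogue in plain Python (a sketch of the idea, not CUDA code): instead of assuming one thread per element, each thread starts at its global index and strides by the total thread count, so any grid size covers any array size.

```python
def add_kernel(thread_id, num_threads, x, y, out):
    """What a single thread would do in a grid-stride vector add."""
    for i in range(thread_id, len(x), num_threads):
        out[i] = x[i] + y[i]

n, num_threads = 10, 4          # array larger than the "grid"
x = list(range(n))
y = [2 * v for v in x]
out = [0] * n
for tid in range(num_threads):  # sequential stand-in for parallel threads
    add_kernel(tid, num_threads, x, y, out)
print(out)  # [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
```

In real CUDA the start index would be `blockIdx.x * blockDim.x + threadIdx.x` and the stride `gridDim.x * blockDim.x`; the loop structure is the same.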

  2. Jeremy Howard, “Getting Started With CUDA for Python Programmers” (Video)

Jeremy shows how to call a CUDA kernel from PyTorch using its cpp_extension module. He develops kernels for RGB-to-grayscale conversion and matrix multiplication in a Pythonic way.
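
For reference, the grayscale operation itself is a per-pixel weighted sum of the color channels. The NumPy sketch below is not Jeremy's kernel; it assumes the common ITU-R BT.601 luminance weights and just shows what the CUDA kernel computes:

```python
import numpy as np

def rgb_to_grayscale(img):
    """Luminance conversion with BT.601 weights: one output per pixel,
    computed from the three color channels of that pixel."""
    weights = np.array([0.2989, 0.5870, 0.1140])
    return img @ weights            # (H, W, 3) @ (3,) -> (H, W)

img = np.random.default_rng(0).random((4, 4, 3))
gray = rgb_to_grayscale(img)
print(gray.shape)  # (4, 4)
```

Because each output pixel depends only on the matching input pixel, the operation is embarrassingly parallel, which is what makes it a good first CUDA kernel.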

  3. Sasha Rush, “GPU Puzzles” (Link)

A nice collection of puzzles for implementing CUDA kernels in Python, using the Numba JIT compiler.

Topic 5: Transformer FLOPs Math

  1. Einops tutorial (Link), Einsum tutorial (Link)

Einsum is a notation for concisely expressing complex tensor operations such as multiplication and summation. Einops is a library for reshaping and manipulating tensors, for example rearranging axes and reducing dimensions. With einsum and einops, one doesn’t have to remember all the different PyTorch functions for tensor manipulation.
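
Two small NumPy examples give the flavor of the notation (illustrative sketches; the tutorials cover much more): axes that appear in both inputs but not the output are summed over, and axes omitted from the output are reduced.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))
B = rng.standard_normal((2, 4, 5))

# Batched matrix multiply: contract the shared axis j, keep batch axis b.
C = np.einsum("bij,bjk->bik", A, B)
print(np.allclose(C, A @ B))                 # True

# Row sums: the j axis is omitted from the output, so it is summed away.
M = rng.standard_normal((3, 4))
row_sums = np.einsum("ij->i", M)
print(np.allclose(row_sums, M.sum(axis=1)))  # True
```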

Topic 6: Introduction to LLM Inference

  1. Niels Rogge, “How a Transformer works at inference vs training time” (Video)

  2. Austin, Jacob, et al. How to Scale Your Model (Part 4: Transformers, Part 7: Inference)
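
A central idea in the inference material is the KV cache: during autoregressive decoding, the keys and values of past tokens are cached, so each new token computes attention only against the cache rather than re-running the whole sequence. A minimal single-head NumPy sketch of this (an illustration of the idea, not any library's implementation):

```python
import numpy as np

def attention(q, K, V):
    """Single-head attention for one query against all cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d, seq = 8, 5
Q = rng.standard_normal((seq, d))
K = rng.standard_normal((seq, d))
V = rng.standard_normal((seq, d))

# Decode step t: append token t's key/value, attend over the cache so far.
K_cache, V_cache, outputs = [], [], []
for t in range(seq):
    K_cache.append(K[t])
    V_cache.append(V[t])
    outputs.append(attention(Q[t], np.array(K_cache), np.array(V_cache)))
print(len(outputs), outputs[0].shape)  # 5 (8,)
```

Each decode step does O(t) work against the cache instead of O(t²) recomputation, which is why decoding tends to be memory-bandwidth-bound: the step reads the whole cache but does little math per byte.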