AI4: Scalable AI: Big-Data Algorithms

About this course

Learn algorithms and systems for machine learning at scale, where datasets and models exceed a single machine.

Course format. Thirteen weeks, four contact hours each: a two-hour lecture (concepts and theory) and a two-hour practice session. The course is project-based; teams carry one running project end to end and present it three times, in weeks 5, 8, and 13.

What you will build

Build a scalable training and serving pipeline in Python with PyTorch Distributed, DeepSpeed, Apache Spark, and Kafka. The pipeline layers data and pipeline parallelism, ZeRO sharding, sketch-based approximate aggregation, and streaming ingestion, with throughput and cost measured at each stage.

Expected outcomes

Analyze the complexity and communication costs of distributed algorithms
Derive data, model, and pipeline parallelism for large-scale training
Explain synchronous and asynchronous stochastic gradient descent and their convergence behavior
Build big-data pipelines with MapReduce and dataflow frameworks
Implement approximate algorithms including sketches, hashing, and sampling
Design streaming algorithms with bounded memory over unbounded data
Quantify scaling laws relating model loss to data, parameters, and compute
Apply mixed precision, sharding, and gradient accumulation for memory efficiency
Evaluate throughput, latency, and cost trade-offs in distributed systems
Deploy a scalable training and serving pipeline on a cluster

Key topics

Distributed training
Big-data frameworks
Approximate & streaming algorithms
Scaling laws

Theoretical foundations

The concepts and results this course rests on.

Amdahl's law and the memory, compute, and communication walls
the MapReduce model, dataflow graphs, and fault tolerance
synchronous stochastic gradient descent and gradient all-reduce
model, pipeline, and tensor parallelism with ZeRO sharding
count-min sketch, HyperLogLog, and locality-sensitive hashing
the streaming model and the Johnson-Lindenstrauss lemma
neural scaling laws and compute-optimal resource allocation

Prerequisites

This is a Year-3 course. It assumes the mandatory CS core: data structures and algorithms, operating systems, computer networks, databases, software engineering, and the core mathematics (linear algebra, probability and statistics, calculus, discrete mathematics). It additionally requires the specific prior courses listed below.

Course-specific prerequisites:

Machine Learning
Algorithms and data structures
Distributed systems or databases basics

Weekly schedule 13 weeks · lecture + practice

Scaling foundations

Wk 1

Why scale and what breaks

Lecture: We analyze the limits of single-machine computation, Amdahl's law, and the memory, compute, and communication walls.

Practice: Profile a single-node training job and identify its bottlenecks.

Project: Choose the scalable pipeline target and establish a single-node baseline.

Watch: Stanford CS336 Lecture 6: Kernels, Triton
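
As a minimal sketch of Amdahl's law from this week's lecture, the function below computes the speedup bound for a job in which only part of the runtime parallelizes. The function name and the 95% parallel fraction are illustrative assumptions, not measurements from any course workload.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the runtime scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unchanged; the parallel part is divided across workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical job: 95% of the runtime parallelizes perfectly.
    for n in (1, 8, 64, 1024):
        print(f"{n:>5} workers -> {amdahl_speedup(0.95, n):.2f}x")
```

With a 5% serial fraction the speedup can never exceed 1 / 0.05 = 20x, however many workers are added, which is why profiling the serial bottleneck comes before scaling out.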

Wk 2

Big-data frameworks

Lecture: We cover the MapReduce model, dataflow graphs, partitioning, and fault tolerance.

Practice: Implement a MapReduce-style aggregation and run it on a Spark cluster.

Project: Build a distributed data-ingestion stage for the project.
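
The map, shuffle, and reduce phases can be illustrated on a single machine before moving to Spark. The sketch below is a toy, in-memory word count; `map_reduce`, `tokenize`, and `sum_counts` are hypothetical names, and the code does not use or imitate the Spark API.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
    num_partitions: int = 4,
) -> Dict[str, int]:
    """Run map, shuffle, and reduce in one process to show the data flow."""
    # Shuffle target: one key -> values table per partition.
    partitions: List[Dict[str, List[int]]] = [
        defaultdict(list) for _ in range(num_partitions)
    ]
    for record in records:
        # Map phase: each record emits zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: the key's hash decides which partition owns it.
            partitions[hash(key) % num_partitions][key].append(value)
    result: Dict[str, int] = {}
    for partition in partitions:
        # Reduce phase: each partition reduces its own keys independently.
        for key, values in partition.items():
            result[key] = reducer(key, values)
    return result


def tokenize(line: str) -> Iterable[Tuple[str, int]]:
    return [(word.lower(), 1) for word in line.split()]


def sum_counts(key: str, values: List[int]) -> int:
    return sum(values)


if __name__ == "__main__":
    print(map_reduce(["a b a", "b c"], tokenize, sum_counts))
```

Because every key lands in exactly one partition, the reduce step needs no coordination between partitions; on a cluster, each partition would live on a different worker.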

Wk 3

Distributed storage and data formats

Lecture: We cover columnar formats, sharding, partitioning, and the CAP trade-offs of data-intensive systems.

Practice: Convert the dataset to a columnar format and benchmark scan throughput.

Project: Optimize the project data layout for parallel access.

Distributed training

Wk 4

Data-parallel training

Lecture: We derive synchronous SGD, gradient all-reduce, and the effect of large batch sizes on convergence.

Practice: Run data-parallel training across multiple workers and measure scaling efficiency.

Project: Make the training stage data-parallel.

Watch: MIT 6.5940 Lecture 17: Distributed Training (Part I) · Stanford CS336: Parallelism 1
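
The structure of synchronous data-parallel SGD can be simulated in a single process. In the sketch below each "worker" computes a gradient on its own shard, a stand-in `all_reduce_mean` averages them, and every worker applies the same update. All names and the toy data (y = 3x) are assumptions for illustration; real training would use a collective from a library such as PyTorch Distributed.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (x, y)


def local_gradient(w: float, shard: Sequence[Example]) -> float:
    """Gradient of the mean of 0.5 * (w*x - y)**2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: Sequence[float]) -> float:
    """Stand-in for a collective all-reduce: every worker receives the mean."""
    return sum(values) / len(values)


def sync_sgd(
    shards: List[Sequence[Example]], lr: float = 0.1, steps: int = 200
) -> float:
    w = 0.0  # every worker starts from the same parameters
    for _ in range(steps):
        # 1. Each worker computes a gradient on its local shard.
        grads = [local_gradient(w, shard) for shard in shards]
        # 2. Gradients are averaged across workers.
        g = all_reduce_mean(grads)
        # 3. All workers apply the identical update, so replicas stay in sync.
        w -= lr * g
    return w


if __name__ == "__main__":
    shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
    print(sync_sgd(shards))
```

With equal-size shards the averaged gradient equals the full-batch gradient, so adding workers changes the effective batch size rather than the update rule.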

Wk 5

Model and pipeline parallelism · Presentation

Lecture: We cover model sharding, pipeline parallelism, tensor parallelism, and the bubble overhead they incur.

Practice: Team presentation: each team defends its scaling specification and target metrics.

Project: Lock the specification and prototype a sharded model stage.

Watch: MIT 6.5940 Lecture 18: Distributed Training (Part II) · Stanford CS336: Parallelism 2
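
The bubble overhead of pipeline parallelism can be estimated with a one-line formula. The sketch below assumes a GPipe-style schedule in which every stage takes the same time per micro-batch; under that assumption, with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). The function name is hypothetical.

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a GPipe-style pipeline with equal-cost stages."""
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be at least 1")
    # One batch occupies (microbatches + stages - 1) time slots per stage,
    # of which (stages - 1) are spent waiting for the pipeline to fill or drain.
    return (stages - 1) / (microbatches + stages - 1)


if __name__ == "__main__":
    for m in (1, 4, 12, 64):
        frac = pipeline_bubble_fraction(stages=4, microbatches=m)
        print(f"4 stages, {m:>2} micro-batches -> {frac:.1%} idle")
```

Splitting a batch into more micro-batches shrinks the bubble, at the cost of smaller per-step kernels and more activation bookkeeping.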

Wk 6

Memory and precision efficiency

Lecture: We cover mixed precision, activation checkpointing, ZeRO sharding, and gradient accumulation.

Practice: Apply mixed precision and sharding to fit a larger model in memory.

Project: Scale the model size with memory-efficient training.

Watch: MIT 6.5940 Lecture 19: Distributed Training
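
A back-of-the-envelope calculation shows why ZeRO sharding matters. The sketch below follows the accounting in the ZeRO paper listed under Primary literature: mixed-precision Adam keeps 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. It ignores activations, temporary buffers, and fragmentation, and the function name is an assumption.

```python
def zero_model_state_bytes(
    params: float, devices: int, stage: int, optimizer_multiplier: int = 12
) -> float:
    """Per-device bytes for model states under ZeRO stage 0, 1, 2, or 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if devices < 1:
        raise ValueError("devices must be at least 1")
    param_bytes = 2.0 * params                     # fp16 weights
    grad_bytes = 2.0 * params                      # fp16 gradients
    optim_bytes = optimizer_multiplier * params    # fp32 copy, momentum, variance
    if stage >= 1:
        optim_bytes /= devices   # stage 1 shards optimizer state
    if stage >= 2:
        grad_bytes /= devices    # stage 2 also shards gradients
    if stage >= 3:
        param_bytes /= devices   # stage 3 also shards the weights
    return param_bytes + grad_bytes + optim_bytes


if __name__ == "__main__":
    # A 7.5B-parameter model on 64 devices, as in the ZeRO paper's example.
    for stage in range(4):
        gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per device")
```

Stage 0 is plain data parallelism, where every device holds the full 16 bytes per parameter; each later stage removes one source of replication.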

Approximate algorithms

Wk 7

Sketches and hashing

Lecture: We derive count-min sketch, HyperLogLog, and locality-sensitive hashing with their error bounds.

Practice: Implement a count-min sketch and HyperLogLog and validate accuracy versus memory.

Project: Add approximate aggregation to the data pipeline.

Watch: MMDS Lecture 13: Minhashing · MMDS Lecture 38: Bloom Filters
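
A count-min sketch fits in a few dozen lines. The version below is a minimal sketch, not a reference implementation: it sizes the table as width = ceil(e / epsilon) and depth = ceil(ln(1 / delta)), and it uses salted BLAKE2b as a convenient stand-in for the pairwise-independent hash family the analysis assumes.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate frequency counter that never underestimates."""

    def __init__(self, epsilon: float, delta: float) -> None:
        if epsilon <= 0.0 or not 0.0 < delta < 1.0:
            raise ValueError("need epsilon > 0 and 0 < delta < 1")
        self.width = math.ceil(math.e / epsilon)
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # sum of all counts added so far

    def _index(self, item: str, row: int) -> int:
        # Assumption: one salted hash per row approximates independent hashes.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


if __name__ == "__main__":
    cms = CountMinSketch(epsilon=0.01, delta=0.01)
    for word in "to be or not to be".split():
        cms.add(word)
    print(cms.estimate("to"), cms.estimate("question"))
```

Under the standard analysis, the estimate exceeds the true count by more than epsilon times the total count with probability at most delta; memory is width times depth counters regardless of how many distinct items arrive.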

Wk 8

Sampling and dimensionality reduction · Presentation

Lecture: We cover reservoir sampling, random projections, and the Johnson-Lindenstrauss lemma.

Practice: Team presentation: interim demo of the scaled pipeline with throughput numbers.

Project: Add sampling and random projection to reduce data volume.

Watch: MMDS Lecture 39: Sampling a Stream
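
Reservoir sampling keeps a uniform random sample of fixed size from a stream whose length is not known in advance. The sketch below implements the classic Algorithm R; the function name and the optional `rng` parameter (used to make runs reproducible) are choices made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return k items chosen uniformly at random from a single pass over stream."""
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill phase: the first k items are always kept.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), k=5, rng=random.Random(0)))
```

Memory stays at k items no matter how long the stream runs, which is the property the project needs when it reduces data volume ahead of training.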

Streaming

Wk 9

Streaming algorithms

Lecture: We define the streaming model, bounded memory over unbounded input, and windowed computation.

Practice: Build a streaming aggregation over a simulated event stream.

Project: Add a streaming ingestion path to the pipeline.

Watch: MMDS Lecture 36: Mining Data Streams · MMDS Lecture 37: Counting 1's
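
Windowed computation can be sketched as a generator that emits one aggregate per fixed-size (tumbling) window and then discards that window's state. The example assumes events arrive in timestamp order and skips empty windows; the event layout and function name are hypothetical, and the code is independent of Kafka or Flink.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Optional, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, key)


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_index, per-key counts) each time a window closes."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    current: Optional[int] = None
    counts: Counter = Counter()
    for timestamp, key in events:
        window = int(timestamp // window_seconds)
        if current is None:
            current = window
        elif window != current:
            # The previous window is complete: emit it and drop its state.
            yield current, dict(counts)
            counts = Counter()
            current = window
        counts[key] += 1
    if current is not None:
        yield current, dict(counts)  # flush the final, possibly partial window


if __name__ == "__main__":
    simulated = [(0.5, "a"), (1.0, "b"), (9.9, "a"), (10.0, "a"), (25.0, "c")]
    for index, window_counts in tumbling_window_counts(simulated, 10.0):
        print(index, window_counts)
```

State here grows with the number of distinct keys inside one window, not with the length of the stream; replacing the per-window counter with a sketch would bound it further.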

Wk 10

Online and approximate learning

Lecture: We cover online gradient descent, regret bounds, and incremental model updates.

Practice: Implement online learning that updates the model as data streams in.

Project: Enable continuous online updates in the project.
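
Online gradient descent updates the model once per arriving example and never revisits old data. The sketch below applies it to a linear model with squared loss and a step size that decays as 1 / sqrt(t), the schedule used in the standard O(sqrt(T)) regret analysis for convex losses. It omits the projection onto a bounded set that the analysis assumes, and the function name and synthetic stream are illustrative.

```python
import math
import random
from typing import Iterable, List, Sequence, Tuple


def online_gradient_descent(
    stream: Iterable[Tuple[Sequence[float], float]], dim: int, eta0: float = 0.5
) -> Tuple[List[float], float]:
    """Process (features, target) pairs one at a time; return weights and total loss."""
    w = [0.0] * dim
    cumulative_loss = 0.0
    for t, (x, y) in enumerate(stream, start=1):
        # Predict first, then observe the loss: the online learning protocol.
        error = sum(wi * xi for wi, xi in zip(w, x)) - y
        cumulative_loss += 0.5 * error * error
        eta = eta0 / math.sqrt(t)  # decaying step size
        # Gradient of 0.5 * error**2 with respect to w is error * x.
        w = [wi - eta * error * xi for wi, xi in zip(w, x)]
    return w, cumulative_loss


def synthetic_stream(n: int, seed: int = 0):
    """Hypothetical noiseless stream generated by the weights [2, -1]."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        yield x, 2.0 * x[0] - 1.0 * x[1]


if __name__ == "__main__":
    weights, loss = online_gradient_descent(synthetic_stream(5000), dim=2)
    print(weights, loss)
```

Only the weight vector is kept between examples, so memory is independent of how much data has streamed past.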

Scaling laws

Wk 11

Scaling laws and compute budgets

Lecture: We derive empirical scaling laws relating loss to data, parameters, and compute, and use them to find the compute-optimal allocation.

Practice: Fit a small scaling-law curve from runs at several model sizes.

Project: Use scaling-law analysis to choose the project compute budget.

Watch: Stanford CS336: Scaling Laws
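
Fitting a scaling-law curve can start with ordinary least squares in log-log space. The sketch below assumes the simplest form, loss = a * size^(-b), with no irreducible-loss term; the data points in the usage example are synthetic, generated from known constants purely to check the fit, and are not results from any real training run.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(
    sizes: Sequence[float], losses: Sequence[float]
) -> Tuple[float, float]:
    """Fit loss = a * size**(-b) by linear regression on logs; return (a, b)."""
    if len(sizes) != len(losses) or len(sizes) < 2:
        raise ValueError("need at least two (size, loss) pairs")
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals -b in log-log space
    a = math.exp(mean_y - slope * mean_x)  # the intercept is log(a)
    return a, -slope


if __name__ == "__main__":
    # Synthetic points from a = 5.0, b = 0.1 (illustrative constants only).
    sizes = [1e6, 1e7, 1e8, 1e9]
    losses = [5.0 * s ** -0.1 for s in sizes]
    print(fit_power_law(sizes, losses))
```

On real runs the points will be noisy and may flatten at large sizes, in which case a form with an additive constant and a nonlinear fit is the more defensible choice.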

Serving

Wk 12

Scalable inference and serving

Lecture: We cover batching, caching, autoscaling, and latency-throughput-cost trade-offs at serving time.

Practice: Deploy the model behind a scalable serving layer and load-test it.

Project: Stand up scalable serving for the trained model.

Watch: Stanford CS336: Inference

Capstone

Wk 13

Final defense · Presentation

Lecture: We synthesize distributed training, approximate and streaming algorithms, and scaling laws, and survey open problems.

Practice: Team presentation: final demo with scaling benchmarks and an oral defense of design choices.

Project: Deliver the complete scalable training-and-serving pipeline with benchmarks.

AI tools in this course.

Students use AI assistants to generate and refactor PyTorch Distributed and DeepSpeed launch scripts, Spark and Kafka pipeline code, and sketch-algorithm implementations, vibe-coding the all-reduce and ZeRO-sharded training stages. They prompt AI to write count-min sketch and HyperLogLog tests, synthesize streaming event data, and draft autoscaling and serving configs. AI also helps interpret throughput profiles, scaling-efficiency curves, and cost logs to locate the bottleneck a job hit.

Student project

Teams build one scalable AI pipeline that ingests, processes, trains on, and serves a large dataset, growing from a single node to a distributed cluster. The project layers in data parallelism, memory-efficient training, approximate and streaming algorithms, and scaling-law analysis, with throughput and cost measured at every stage.

Requirements

Build a working system, not a set of disconnected exercises.
Be original: a new system that solves a real problem, not a re-implementation of a tutorial or course demo.
Show real depth: real data, real users or realistic load, and engineering trade-offs that are measured rather than assumed.
Carry one running project from specification to a deployed, defensible result across the whole term.
Work in a team of three or four and defend the design at each of the three presentations (weeks 5, 8, and 13).

Example projects

Distributed image classifier training
Large-scale recommendation system
Streaming clickstream analytics
Approximate near-duplicate detection at web scale
Distributed embedding indexing service
Real-time fraud-detection pipeline
Petabyte log aggregation and alerting
Compute-optimal language model pretraining study

Assessment & grading

Grading is project-based, with no written exam. Teams of three or four present one running project three times.

Component	What it covers	Weight
Project · Specification	Presentation 1 (week 5): problem, objectives, and architecture	20%
Project · Interim	Presentation 2 (week 8): the working system demonstrated live	30%
Project · Final	Presentation 3 (week 13): end-to-end demo with oral defense	50%

Tools & platforms

PyTorch: training and distributed primitives
PyTorch Distributed: data and model parallelism
DeepSpeed: ZeRO sharding and large-model training
Apache Spark: distributed data processing
Ray: distributed Python and ML workloads
Dask: parallel dataframes and arrays
Apache Kafka: streaming event ingestion
Apache Flink: stateful stream processing
Apache Parquet: columnar storage format
Hugging Face Accelerate: multi-device training
NVIDIA NCCL: collective communication for GPUs
Weights and Biases: distributed experiment tracking

Free online courses

Existing free, video-based courses this course can build on, for self-study or as a teaching basis.

YouTube: Mining Massive Datasets, Stanford CS246 [Full Course]
MapReduce, LSH, PageRank, clustering at scale
MIT OCW: Mathematics of Big Data and Machine Learning (MIT RES.LL-005)
Big-data ML and D4M, free streamable videos

In Hebrew · בעברית

HIT - Holon Institute of Technology (Campus IL): מבוא למדעי הנתונים: כלים ושיטות (Introduction to Data Science: Tools and Methods)
Free Hebrew video course on data acquisition, automated data analysis in Python, and introductory machine learning methods.
Prof. Yossi Keshet (YouTube): למידת מכונה (Machine Learning)
Machine learning lecture series in Hebrew; methods that scale to large datasets.

Primary literature

Seminal works to read for graduate-level depth.

Paper: Scaling Laws for Neural Language Models
Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, Amodei, 2020
Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Rajbhandari, Rasley, Ruwase, He, 2019
Paper: GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Huang, Cheng, Bapna, Firat, Chen, Chen, Lee, Ngiam, Le, Wu, Chen, 2018
Paper: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Amodei, Anubhai, Battenberg, Case, Casper, Catanzaro, Chen, Chrzanowski, Coates, Diamos, et al., 2015

References

Books and resources link to an online or publisher page.

Textbook: Mining of Massive Datasets, 3rd edition
Leskovec, Rajaraman, Ullman, 2020
Textbook: Designing Data-Intensive Applications
Kleppmann, 2017
Paper: Scaling Laws for Neural Language Models
Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, Amodei, 2020
Textbook: Deep Learning
Goodfellow, Bengio, Courville, 2016
Textbook: Dive into Deep Learning
Zhang, Lipton, Li, Smola, 2023
Documentation: Ray Documentation
Anyscale, 2026
Documentation: PyTorch Documentation
PyTorch Foundation, 2026

Role in each concentration

Concentration	Role
Intelligent Software Systems	Elective
Networking & Cyber Security	Elective
AI & Robotics	Core · Semester 1
AI and Quantum Computing for Finance	Elective
Immersive Systems & Game Development	Elective
Defense Technologies & Autonomous Systems	Elective
