About this course
Learn algorithms and systems for machine learning at scale, where datasets and models exceed a single machine.
Build a scalable training and serving pipeline in Python with PyTorch Distributed, DeepSpeed, Apache Spark, and Kafka, layering in data and pipeline parallelism, ZeRO sharding, sketch-based approximate aggregation, and streaming ingestion, and measure its throughput and cost.
Expected outcomes
- Analyze the complexity and communication costs of distributed algorithms
- Derive data, model, and pipeline parallelism for large-scale training
- Explain synchronous and asynchronous stochastic gradient descent and their convergence
- Build big-data pipelines with MapReduce and dataflow frameworks
- Implement approximate algorithms including sketches, hashing, and sampling
- Design streaming algorithms with bounded memory over unbounded data
- Quantify scaling laws relating model loss to data, parameters, and compute
- Apply mixed precision, sharding, and gradient accumulation for memory efficiency
- Evaluate throughput, latency, and cost trade-offs in distributed systems
- Deploy a scalable training and serving pipeline on a cluster
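As an illustration of the bounded-memory streaming and sampling outcomes above, here is a minimal sketch of reservoir sampling (Algorithm R) using only the Python standard library. The function name `reservoir_sample` and its parameters are hypothetical choices for this example, not part of the course materials. It keeps at most `k` items in memory no matter how long the stream is.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream of unknown length.

    Memory use is O(k) regardless of how many items the stream yields.
    """
    rng = random.Random(seed)  # local generator so results are reproducible per seed
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1):
            # draw j uniformly from 0..i and keep the item only if j lands in the reservoir.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 values from a stream that is never materialized as a list.
    print(reservoir_sample(range(100_000), k=5, seed=0))
```

The same loop works on any iterator, for example messages read from a queue, because it never needs the stream length in advance.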
Key topics
- Distributed training
- Big-data frameworks
- Approximate & streaming algorithms
- Scaling laws
Theoretical foundations
The concepts and results this course rests on.
- Amdahl's law and the memory, compute, and communication walls
- the MapReduce model, dataflow graphs, and fault tolerance
- synchronous stochastic gradient descent and gradient all-reduce
- model, pipeline, and tensor parallelism with ZeRO sharding
- count-min sketch, HyperLogLog, and locality-sensitive hashing
- the streaming model and the Johnson-Lindenstrauss lemma
- neural scaling laws and compute-optimal resource allocation
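To make one of the foundations listed above concrete, the following is a minimal count-min sketch in standard-library Python. It is a teaching sketch under stated assumptions: the class name, the default `width` and `depth`, and the use of salted BLAKE2b as the per-row hash are illustrative choices, not a prescribed course implementation. With non-negative updates, the estimate never undercounts; it can only overcount when items collide in every row.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Approximate frequency counter using a depth x width table of integers."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        # Assumed defaults for illustration; real sizes follow from the error targets.
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, obtained by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        """Record `count` occurrences of `item` (count must be non-negative)."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        """Return an upper bound on the true count: the minimum over all rows."""
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for word in ["spark", "kafka", "spark", "flink", "spark"]:
        sketch.add(word)
    print(sketch.estimate("spark"), sketch.estimate("kafka"), sketch.estimate("ray"))
```

Taking the minimum across rows is what limits the overcount: a collision inflates a single row, and the estimate is wrong only by the smallest inflation among all rows.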
Prerequisites
Course-specific prerequisites:
- Machine Learning
- Algorithms and data structures
- Distributed systems or databases basics
Weekly schedule
13 weeks · lecture + practice
Students use AI assistants to generate and refactor PyTorch Distributed and DeepSpeed launch scripts, Spark and Kafka pipeline code, and sketch-algorithm implementations, vibe-coding the all-reduce and ZeRO-sharded training stages. They prompt AI to write count-min sketch and HyperLogLog tests, synthesize streaming event data, and draft autoscaling and serving configs. AI also helps interpret throughput profiles, scaling-efficiency curves, and cost logs to locate the bottleneck a job hit.
Student project
Teams build one scalable AI pipeline that ingests, processes, trains on, and serves a large dataset, growing from a single node to a distributed cluster. The project layers in data parallelism, memory-efficient training, approximate and streaming algorithms, and scaling-law analysis, with throughput and cost measured at every stage.
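One possible way to report the throughput and cost measurements the project calls for is sketched below. The `Run` record, its field names, and the figures in the usage example are hypothetical placeholders, not measured results; the formulas are the usual definitions of scaling efficiency (observed speedup divided by ideal linear speedup) and cost per unit of work.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One measured configuration of the pipeline (all fields are assumed names)."""
    workers: int
    samples_per_sec: float
    dollars_per_worker_hour: float


def scaling_efficiency(baseline: Run, run: Run) -> float:
    """Observed speedup over the baseline divided by the ideal linear speedup."""
    speedup = run.samples_per_sec / baseline.samples_per_sec
    ideal = run.workers / baseline.workers
    return speedup / ideal


def cost_per_million_samples(run: Run) -> float:
    """Dollars spent to process one million samples at the measured throughput."""
    dollars_per_sec = run.workers * run.dollars_per_worker_hour / 3600.0
    return dollars_per_sec / run.samples_per_sec * 1_000_000


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    single = Run(workers=1, samples_per_sec=1000.0, dollars_per_worker_hour=3.6)
    cluster = Run(workers=8, samples_per_sec=6000.0, dollars_per_worker_hour=3.6)
    print(f"efficiency: {scaling_efficiency(single, cluster):.2f}")
    print(f"cost per 1M samples: ${cost_per_million_samples(cluster):.2f}")
```

Tracking both numbers at each stage makes the trade-off visible: adding workers can raise throughput while lowering efficiency, which shows up as a higher cost per million samples.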
Requirements
- Build a working system, not a set of disconnected exercises.
- Be original: a new system that solves a real problem, not a re-implementation of a tutorial or course demo.
- Show real depth: real data, real users or realistic load, and engineering trade-offs that are measured rather than assumed.
- Carry one running project from specification to a deployed, defensible result across the whole term.
- Work in a team of three or four and defend the design at each of the three presentations (weeks 5, 8, and 13).
Example projects
Assessment & grading
Grading is project-based, with no written exam. Teams of three or four present one running project three times.
| Component | What it covers | Weight |
|---|---|---|
| Project · Specification | Presentation 1 (week 5): problem, objectives, and architecture | 20% |
| Project · Interim | Presentation 2 (week 8): the working system demonstrated live | 30% |
| Project · Final | Presentation 3 (week 13): end-to-end demo with oral defense | 50% |
Tools & platforms
- PyTorch: training and distributed primitives
- PyTorch Distributed: data and model parallelism
- DeepSpeed: ZeRO sharding and large-model training
- Apache Spark: distributed data processing
- Ray: distributed Python and ML workloads
- Dask: parallel dataframes and arrays
- Apache Kafka: streaming event ingestion
- Apache Flink: stateful stream processing
- Apache Parquet: columnar storage format
- Hugging Face Accelerate: multi-device training
- NVIDIA NCCL: collective communication for GPUs
- Weights and Biases: distributed experiment tracking
Free online courses
Existing free, video-based courses this course can build on, for self-study or as a teaching basis.
- YouTube: Mining Massive Datasets, Stanford CS246 [Full Course]
- MIT OCW: Mathematics of Big Data and Machine Learning (MIT RES.LL-005)
In Hebrew
- HIT - Holon Institute of Technology (Campus IL): מבוא למדעי הנתונים: כלים ושיטות (Introduction to Data Science: Tools and Methods)
- Prof. Yossi Keshet (YouTube): למידת מכונה (Machine Learning)
Primary literature
Seminal works to read for graduate-level depth.
References
Books and resources link to an online or publisher page.
- Textbook: Mining of Massive Datasets, 3rd edition
- Textbook: Designing Data-Intensive Applications
- Paper: Scaling Laws for Neural Language Models
- Textbook: Deep Learning
- Textbook: Dive into Deep Learning
- Documentation: Ray Documentation
- Documentation: PyTorch Documentation
Role in each concentration
| Concentration | Role |
|---|---|
| Intelligent Software Systems | Elective |
| Networking & Cyber Security | Elective |
| AI & Robotics | Core · Semester 1 |
| AI and Quantum Computing for Finance | Elective |
| Immersive Systems & Game Development | Elective |
| Defense Technologies & Autonomous Systems | Elective |