About this course
Architect large systems whose core is data and machine learning, balancing throughput, latency, consistency, and the demands of AI workloads.
By the end of the course, you will have architected a data-intensive system that ingests events through Apache Kafka, processes them with Spark batch and Flink streaming pipelines into a replicated, partitioned Cassandra store, serves predictions from an MLflow-tracked model, and guards quality with Evidently drift monitoring.
Expected outcomes
- Explain the foundations of reliability, scalability, and maintainability
- Analyze data models, storage engines, and indexing trade-offs
- Design batch and streaming data pipelines
- Reason about distributed storage, replication, and partitioning
- Apply the theory of consistency, consensus, and the CAP trade-off
- Integrate machine learning models into production systems
- Evaluate architectural trade-offs for AI-based and data-intensive workloads
- Address technical debt and operational concerns in ML systems
- Design for fault tolerance and exactly-once processing semantics
- Assess data quality, lineage, and governance in pipelines
Key topics
- Data pipelines & streaming
- System architecture & trade-offs
- Distributed storage
- ML system integration
Theoretical foundations
The concepts and results this course rests on.
- Reliability, scalability, and maintainability as system properties
- Storage-engine internals and data-model trade-offs
- The MapReduce model of large-scale batch computation
- Event-time dataflow, watermarks, and windowing theory
- Replication, partitioning, and the CAP and PACELC trade-offs
- Distributed consensus and exactly-once processing semantics
- Hidden technical debt and operational concerns in ML systems
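To make the event-time, watermark, and windowing item above concrete, here is a minimal pure-Python sketch of tumbling windows closed by a watermark. It is a simplification for intuition, not Flink's API: the names `WINDOW_SIZE`, `ALLOWED_LATENESS`, `window_start`, and `run` are hypothetical, and the watermark is assumed to trail the largest event time seen so far by a fixed delay.

```python
from collections import defaultdict

WINDOW_SIZE = 10       # seconds per tumbling window (illustrative value)
ALLOWED_LATENESS = 5   # assumed fixed delay between max event time and watermark


def window_start(event_time):
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SIZE)


def run(events):
    """Process (event_time, value) pairs in arrival order.

    Returns (emitted, dropped). emitted lists (window_start, sum) in the
    order windows closed; dropped lists events that arrived after the
    watermark had already passed the end of their window.
    """
    open_windows = defaultdict(int)
    emitted, dropped = [], []
    watermark = float("-inf")

    for event_time, value in events:
        start = window_start(event_time)
        if start + WINDOW_SIZE <= watermark:
            # The window this event belongs to has already been closed.
            dropped.append((event_time, value))
            continue
        open_windows[start] += value

        # The watermark only moves forward, even if events arrive out of order.
        watermark = max(watermark, event_time - ALLOWED_LATENESS)

        # Close every open window whose end is at or before the watermark.
        for s in sorted(open_windows):
            if s + WINDOW_SIZE <= watermark:
                emitted.append((s, open_windows.pop(s)))

    # End of input: flush whatever is still open.
    for s in sorted(open_windows):
        emitted.append((s, open_windows.pop(s)))
    return emitted, dropped


if __name__ == "__main__":
    stream = [(1, 1), (4, 2), (12, 3), (8, 4), (17, 5), (3, 6), (25, 7)]
    print(run(stream))
```

In this sample stream the event at time 8 arrives out of order but is still counted, because the watermark has not yet passed the end of its window, while the event at time 3 arrives too late and is dropped. Raising `ALLOWED_LATENESS` trades higher output latency for fewer dropped events.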
Prerequisites
Course-specific prerequisites:
- Databases
- Algorithms and data structures
- Machine Learning
Weekly schedule · 13 weeks · lecture + practice
Students use AI assistants to generate Spark and Flink transformations, refactor pipeline code, and draft Kafka producers, consumers, and schema definitions. They prompt tools to build data-quality checks, synthesize event streams, and write tests for exactly-once and fault-tolerance paths, and they connect agents to database, MLflow, and pipeline MCP servers to inspect state and metrics. AI also helps wire up model serving, feature processing, and Evidently drift monitoring, and turn lineage and operational telemetry into remediation steps. Because generated distributed code can hide consistency or replay bugs, students validate every pipeline change against partitioning, replication, and delivery-guarantee behavior.
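One way to check a pipeline change for replay bugs is a test that redelivers events, as at-least-once delivery does after a crash, and compares the result with a clean run. The sketch below is a plain-Python simulation, not Kafka or Flink code; `NaiveSink`, `IdempotentSink`, and `deliver_with_replay` are hypothetical names, and events are assumed to carry a unique id.

```python
class NaiveSink:
    """Hypothetical sink that applies every delivery, including duplicates."""

    def __init__(self):
        self.totals = {}

    def write(self, event):
        _event_id, key, amount = event
        self.totals[key] = self.totals.get(key, 0) + amount


class IdempotentSink(NaiveSink):
    """Hypothetical sink that applies each event id at most once."""

    def __init__(self):
        super().__init__()
        self.seen_ids = set()

    def write(self, event):
        event_id = event[0]
        if event_id in self.seen_ids:
            return  # duplicate delivery: ignore it
        self.seen_ids.add(event_id)
        super().write(event)


def deliver_with_replay(events, sink, crash_after, replay_from):
    """Simulate at-least-once delivery around a crash.

    The first crash_after events are processed, then the consumer restarts
    from replay_from (the last committed offset) and processes to the end,
    so events in between are delivered twice.
    """
    for event in events[:crash_after]:
        sink.write(event)
    for event in events[replay_from:]:
        sink.write(event)
    return sink.totals


if __name__ == "__main__":
    events = [("e1", "a", 10), ("e2", "b", 5), ("e3", "a", 7), ("e4", "b", 1)]
    print("naive:     ", deliver_with_replay(events, NaiveSink(), 3, 1))
    print("idempotent:", deliver_with_replay(events, IdempotentSink(), 3, 1))
```

The naive sink double-counts the two replayed events, while the idempotent sink matches the clean-run totals. A real pipeline would need the deduplication state to be durable and updated atomically with the write, which this in-memory sketch does not model.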
Student project
Teams build one data-intensive system that ingests data, processes it through batch and streaming pipelines, and stores it in a replicated, partitioned distributed store. They integrate a machine learning model for serving, add data-quality and drift monitoring, and reason explicitly about consistency, fault tolerance, and architectural trade-offs throughout.
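As a rough illustration of what "replicated, partitioned" means for the project's store, the sketch below hashes a key to a partition and places that partition on consecutive nodes. This is a deliberately simplified scheme, not how Cassandra assigns tokens; `partition_for`, `replicas_for`, and the node names are hypothetical.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash.

    Python's built-in hash() is randomized per process for strings, so a
    cryptographic digest is used here purely for its stability.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replicas_for(partition, nodes, replication_factor):
    """Place a partition on replication_factor consecutive nodes."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    first = partition % len(nodes)
    return [nodes[(first + i) % len(nodes)] for i in range(replication_factor)]


if __name__ == "__main__":
    nodes = ["node-1", "node-2", "node-3"]  # hypothetical cluster
    for key in ["user:17", "user:42", "order:9"]:
        p = partition_for(key, num_partitions=8)
        print(key, "-> partition", p, "->", replicas_for(p, nodes, 2))
```

With plain modulo placement, changing the number of partitions remaps most keys, which is one reason production stores use consistent hashing or token rings instead. Measuring that kind of trade-off is the sort of reasoning the project asks for.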
Requirements
- Build a working system, not a set of disconnected exercises.
- Be original: a new system that solves a real problem, not a re-implementation of a tutorial or course demo.
- Show real depth: real data, real users or realistic load, and engineering trade-offs that are measured rather than assumed.
- Carry one running project from specification to a deployed, defensible result across the whole term.
- Work in a team of three or four and defend the design at each of the three presentations (weeks 5, 8, and 13).
Assessment & grading
Grading is project-based, with no written exam. Teams of three or four present one running project three times.
| Component | What it covers | Weight |
|---|---|---|
| Project · Specification | Presentation 1 (week 5): problem, objectives, and architecture | 20% |
| Project · Interim | Presentation 2 (week 8): the working system demonstrated live | 30% |
| Project · Final | Presentation 3 (week 13): end-to-end demo with oral defense | 50% |
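The weights in the table combine into a single project grade as a weighted average. The short sketch below shows the arithmetic; it assumes each presentation is scored on a 0 to 100 scale, which the table does not state, and `WEIGHTS` and `project_grade` are hypothetical names.

```python
# Weights in percent, taken from the table above.
WEIGHTS = {"specification": 20, "interim": 30, "final": 50}


def project_grade(scores):
    """Return the weighted average of the three presentation scores.

    Assumes every score is on a 0-100 scale (an assumption, not a rule
    stated in the grading table).
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly: " + ", ".join(WEIGHTS))
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    print(project_grade({"specification": 80, "interim": 70, "final": 90}))
```

For scores of 80, 70, and 90, the result is (20·80 + 30·70 + 50·90) / 100 = 82.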
Tools & platforms
- Apache Kafka: stream and buffer event data
- Apache Spark: run batch and distributed processing
- Apache Flink: process streams with event-time semantics
- PostgreSQL: store relational and derived data
- Apache Cassandra: provide distributed partitioned storage
- Apache Airflow: orchestrate data pipelines
- dbt: transform and model warehouse data
- MLflow: track and serve machine learning models
- Feast: manage a feature store
- Docker: package pipeline and serving components
- Evidently: monitor data and model drift
Free online courses
Existing free, video-based courses this course can build on, for self-study or as a teaching basis.
- YouTube · MIT 6.824 Distributed Systems (Spring 2020)
- MIT OCW · Distributed Computer Systems Engineering (6.824)
In Hebrew
- Campus IL · מבוא למדעי הנתונים: כלים ושיטות (Introduction to Data Science: Tools and Methods)
Primary literature
Seminal works to read for graduate-level depth.
- Paper · MapReduce: Simplified Data Processing on Large Clusters
- Paper · Dynamo: Amazon's Highly Available Key-value Store
- Paper · Spanner: Google's Globally-Distributed Database
- Paper · In Search of an Understandable Consensus Algorithm (Raft)
- Paper · The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
- Paper · Hidden Technical Debt in Machine Learning Systems
References
Books and resources link to an online or publisher page.
- Textbook · Designing Data-Intensive Applications
- Textbook · Streaming Systems
- Paper · MapReduce: Simplified Data Processing on Large Clusters
- Paper · Dynamo: Amazon's Highly Available Key-value Store
- Paper · Hidden Technical Debt in Machine Learning Systems
- Textbook · Designing Machine Learning Systems
- Documentation · Apache Kafka Documentation
Role in each concentration
| Concentration | Role |
|---|---|
| Intelligent Software Systems | Core · Semester 2 |
| Networking & Cyber Security | Elective |
| AI & Robotics | Core · Semester 2 |
| AI and Quantum Computing for Finance | Core · Semester 2 |
| Immersive Systems & Game Development | Elective |
| Defense Technologies & Autonomous Systems | Elective |