SE7: Design of AI-based & Data-Intensive Systems

About this course

Architect large systems whose core is data and machine learning, balancing throughput, latency, consistency, and the demands of AI workloads.

Course format. Thirteen weeks, four contact hours each: a two-hour lecture (concepts and theory) and a two-hour practice session. The course is project-based; teams carry one running project end to end and present it three times, in weeks 5, 8, and 13.

What you will build

A data-intensive system that ingests events through Apache Kafka, processes them with Spark batch and Flink streaming pipelines into a replicated, partitioned Cassandra store, serves predictions from an MLflow-tracked model, and guards quality with Evidently drift monitoring.

Expected outcomes

Explain the foundations of reliability, scalability, and maintainability
Analyze data models, storage engines, and indexing trade-offs
Design batch and streaming data pipelines
Reason about distributed storage, replication, and partitioning
Apply the theory of consistency, consensus, and the CAP trade-off
Integrate machine learning models into production systems
Evaluate architectural trade-offs for AI-based and data-intensive workloads
Address technical debt and operational concerns in ML systems
Design for fault tolerance and exactly-once processing semantics
Assess data quality, lineage, and governance in pipelines

Key topics

Data pipelines & streaming
System architecture & trade-offs
Distributed storage
ML system integration

Theoretical foundations

The concepts and results this course rests on.

Reliability, scalability, and maintainability as system properties
Storage-engine internals and data-model trade-offs
The MapReduce model of large-scale batch computation
Event-time dataflow, watermarks, and windowing theory
Replication, partitioning, and the CAP and PACELC trade-offs
Distributed consensus and exactly-once processing semantics
Hidden technical debt and operational concerns in ML systems

Prerequisites

This is a Year-3 course. It assumes the mandatory CS core: data structures and algorithms, operating systems, computer networks, databases, software engineering, and the core mathematics (linear algebra, probability and statistics, calculus, discrete mathematics). It additionally requires the specific prior courses listed below.

Course-specific prerequisites:

Databases
Algorithms and data structures
Machine Learning

Weekly schedule

13 weeks · lecture + practice

Foundations

Wk 1

Data-Intensive System Foundations

Lecture: Introduce reliability, scalability, and maintainability and the challenges of data at scale.

Practice: Set up the project environment and define the data domain and pipeline goals.

Project: Project repository and data-system design baseline are established.

Watch: MIT 6.824 Lecture 1: Introduction

Wk 2

Data Models and Storage Engines

Lecture: Examine relational, document, and graph models and the internals of storage engines.

Practice: Model the project data and select appropriate storage for its workload.

Project: Project data model and storage choice are implemented and justified.

Watch: MIT 6.824 Lecture 3: GFS

Wk 3

Encoding and Data Movement

Lecture: Discuss serialization formats, schema evolution, and the dataflow between systems.

Practice: Define schemas and ingest raw data into the project storage layer.

Project: Project ingests and stores data with a versioned schema.

Watch: Master Data Serialization: Building Data Models with Avro
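
Schema evolution with a versioned schema can be sketched in a few lines. The example below is a simplified illustration loosely modeled on reader-schema resolution with default values; it does not use the Avro library, and the field names, `READER_SCHEMA_V2`, and `resolve` are hypothetical.

```python
# Sentinel marking a field that has no default and is therefore required.
_REQUIRED = object()

# Hypothetical reader schema, version 2: "country" was added with a default
# so that records written under version 1 can still be decoded.
READER_SCHEMA_V2 = {
    "user_id": _REQUIRED,
    "event_type": _REQUIRED,
    "country": "unknown",
}


def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a stored record onto the reader schema.

    Fields unknown to the reader are dropped, missing fields take the
    reader's default, and a missing required field is an error.
    """
    resolved = {}
    for field, default in reader_schema.items():
        if field in record:
            resolved[field] = record[field]
        elif default is not _REQUIRED:
            resolved[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return resolved
```

Adding a field with a default keeps old data readable, while adding a required field without one would break every record written before the change.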

Pipelines

Wk 4

Batch Processing

Lecture: Cover the batch processing model, MapReduce, and the theory of large-scale computation.

Practice: Build a batch pipeline that transforms project data into derived datasets.

Project: Project produces derived datasets through a batch pipeline.

Watch: MIT 6.824 Lecture 15: Big Data, Spark · MIT 6.824 Lecture 2: RPC and Threads
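
The MapReduce model named in this week's lecture can be illustrated with a single-process sketch of its three phases. This is a teaching aid under simplifying assumptions (everything fits in memory, no distribution or fault handling); the function names and the word-count task are illustrative, not a framework API.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(record: str) -> Iterator[tuple[str, int]]:
    """Map: emit one (word, 1) pair per word in an input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key, as the framework
    would do between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(records: Iterable[str]) -> dict[str, int]:
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because `map_phase` looks at one record at a time and `reduce_phase` at one key at a time, both phases can in principle be spread across machines, which is the property the batch frameworks in this course build on.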

Wk 5

Specification Milestone (Presentation)

Lecture: Review streaming concepts and how architecture choices shape data systems.

Practice: Student teams present their project specification: data sources, pipeline architecture, and scaling goals.

Project: Approved specification with pipeline architecture is delivered.

Watch: Apache Kafka: What It Is and Where It Is Going (Tim Berglund)

Wk 6

Stream Processing

Lecture: Examine streaming systems, event time versus processing time, and windowing theory.

Practice: Add a streaming pipeline that processes project events in near real time.

Project: Project processes events through a streaming pipeline.

Watch: Intro to Stream Processing with Apache Flink, Flink 101 · What is Apache Kafka? Confluent Lightboard (Tim Berglund)
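
The difference between event time and processing time can be made concrete with a small sketch of tumbling windows driven by a watermark. This is a pure-Python illustration and not the Flink API; the function name, the watermark rule (largest event time seen minus an allowed lateness), and the drop-late-events policy are assumptions chosen for brevity.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_size, allowed_lateness):
    """Sum values per event-time tumbling window.

    `events` is an iterable of (event_time, value) pairs in ARRIVAL order,
    which may differ from event-time order. A window [start, start + size)
    is emitted once the watermark reaches its end; events that arrive for
    an already-closed window are reported as dropped.
    """
    open_windows = defaultdict(int)
    emitted, dropped = [], []
    watermark = float("-inf")

    for event_time, value in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            dropped.append((event_time, value))  # too late: window closed
            continue
        open_windows[start] += value
        # Assumed heuristic: watermark trails the largest event time seen.
        watermark = max(watermark, event_time - allowed_lateness)
        for window_start in sorted(open_windows):
            if window_start + window_size <= watermark:
                emitted.append((window_start, open_windows.pop(window_start)))

    # End of input: flush whatever is still open.
    for window_start in sorted(open_windows):
        emitted.append((window_start, open_windows[window_start]))
    return emitted, dropped
```

With `window_size=10` and `allowed_lateness=5`, an event at time 3 that arrives after an event at time 12 is still counted in window 0, while an event at time 8 that arrives after an event at time 25 is dropped because the watermark has already passed the end of its window.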

Distributed Storage

Wk 7

Replication and Partitioning

Lecture: Analyze replication strategies, partitioning, and rebalancing in distributed stores.

Practice: Configure replication and partitioning for the project data store.

Project: Project data store is replicated and partitioned for scale.

Watch: MIT 6.824 Lecture 4: Primary-Backup Replication · MIT 6.824 Lecture 8: Zookeeper
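
How partitioning and replication interact can be sketched with a small consistent-hash ring. This is an illustration in the spirit of the token rings described in the Dynamo paper, not the implementation used by Cassandra or any other store; the class name, the use of MD5 for stable hashing, and the virtual-node count are assumptions.

```python
import bisect
import hashlib


def _token(key: str) -> int:
    """Stable 64-bit token; MD5 is used only for its even spread."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical sketch)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns `vnodes` positions on the ring.
        self._ring = sorted(
            (_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]
        self._node_count = len(set(nodes))

    def replicas(self, key: str, replication_factor: int) -> list:
        """Return the distinct nodes holding `key`, walking clockwise
        from the key's position on the ring."""
        wanted = min(replication_factor, self._node_count)
        index = bisect.bisect_right(self._tokens, _token(key)) % len(self._ring)
        owners = []
        while len(owners) < wanted:
            node = self._ring[index][1]
            if node not in owners:
                owners.append(node)
            index = (index + 1) % len(self._ring)
        return owners
```

The first node returned acts as the key's primary owner and the rest as replicas. Because positions depend only on hashes, adding or removing a node moves only the keys adjacent to its tokens, which is what keeps rebalancing cheap.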

Wk 8

Interim Demo Milestone (Presentation)

Lecture: Cover consistency models, consensus, and the CAP and PACELC trade-offs.

Practice: Student teams present an interim demo of the data pipeline and storage layer.

Project: Working pipeline and distributed storage are demonstrated.

Watch: MIT 6.824 Lecture 6: Fault Tolerance, Raft Part 1

Wk 9

Consistency and Fault Tolerance

Lecture: Discuss transactions across nodes, exactly-once semantics, and fault-tolerance design.

Practice: Add fault tolerance and delivery guarantees to the project pipeline.

Project: Project pipeline tolerates faults with defined delivery guarantees.

Watch: MIT 6.824 Lecture 7: Fault Tolerance, Raft Part 2 · MIT 6.824 Lecture 12: Distributed Transactions · MIT 6.824 Lecture 11: Cache Consistency, Frangipani
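
One common way to approach exactly-once results is to combine at-least-once delivery with an idempotent sink. The sketch below shows that idea in memory only; the class name and the choice of (partition, offset) as the deduplication key are assumptions, and a real system would have to commit the state change and the deduplication record atomically.

```python
class IdempotentSink:
    """Applies each message at most once, keyed by (partition, offset).

    Redelivered messages are detected and skipped, so replays after a
    failure do not change the final state.
    """

    def __init__(self):
        self.balances = {}
        self._applied = set()

    def apply(self, partition, offset, account, amount):
        """Return True if the message changed state, False if it was a replay."""
        message_id = (partition, offset)
        if message_id in self._applied:
            return False
        self.balances[account] = self.balances.get(account, 0) + amount
        # In this sketch both updates live in one process; durably, they
        # would need to be written in a single transaction.
        self._applied.add(message_id)
        return True
```

Replaying the same (partition, offset) after a simulated crash leaves `balances` unchanged, which is the behavior a test for the fault-tolerance path would assert.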

ML Systems

Wk 10

ML System Integration

Lecture: Examine the anatomy of ML systems, feature pipelines, and serving architectures.

Practice: Integrate a trained model into the project as a serving component.

Project: Project serves predictions from an integrated ML model.

Watch: Stanford MLSys Seminar Episode 5: Chip Huyen

Wk 11

Feature Stores and Data Quality

Lecture: Discuss feature engineering pipelines, data lineage, and the cost of poor data quality.

Practice: Add feature processing and data-quality checks to the project pipeline.

Project: Project enforces data-quality checks feeding the model.

Watch: The Future of Feature Stores and Platforms (MLOps Community Podcast)

Wk 12

Technical Debt and Operations

Lecture: Analyze hidden technical debt in ML systems, monitoring, and drift over time.

Practice: Add monitoring and drift detection to the project ML pipeline.

Project: Project monitors model and pipeline health in operation.

Watch: Machine Learning, Technical Debt, and You, D. Sculley (Google) · ML Monitoring, Stanford CS329S with Alessya Visnjic (WhyLabs)
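
Drift detection can be illustrated with one widely used statistic, the Population Stability Index (PSI), computed between a reference sample and a current sample of a numeric feature. This sketch is a stand-in for what a monitoring tool computes and does not use the Evidently API; the function name, the equal-width binning, and the `eps` floor for empty bins are assumptions.

```python
import math


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two numeric samples.

    Bins are equal-width over the reference range; current values outside
    that range are clamped into the first or last bin. Empty bins are
    floored at `eps` so the logarithm stays defined.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        return [max(count / len(values), eps) for count in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

Identical samples give a PSI of zero. A frequently quoted rule of thumb treats values above roughly 0.25 as a substantial shift, but any threshold should be tuned to the project's own data.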

Capstone

Wk 13

Final Demo and Defense (Presentation)

Lecture: Synthesize data-intensive and AI system architecture and review the trade-offs made.

Practice: Student teams present the final demo with an oral defense of architecture, scaling, and ML integration decisions.

Project: Final data-intensive AI system is delivered with documentation and defense.

AI tools in this course.

Students use AI assistants to generate Spark and Flink transformations, refactor pipeline code, and draft Kafka producers, consumers, and schema definitions. They prompt tools to build data-quality checks, synthesize event streams, and write tests for exactly-once and fault-tolerance paths, and they connect agents to database, MLflow, and pipeline MCP servers to inspect state and metrics. AI also helps wire up model serving, feature processing, and Evidently drift monitoring, and turn lineage and operational telemetry into remediation steps. Because generated distributed code can hide consistency or replay bugs, students validate every pipeline change against partitioning, replication, and delivery-guarantee behavior.

Student project

Teams build one data-intensive system that ingests data, processes it through batch and streaming pipelines, and stores it in a replicated, partitioned distributed store. They integrate a machine learning model for serving, add data-quality and drift monitoring, and reason explicitly about consistency, fault tolerance, and architectural trade-offs throughout.

Requirements

Build a working system, not a set of disconnected exercises.
Be original: a new system that solves a real problem, not a re-implementation of a tutorial or course demo.
Show real depth: real data, real users or realistic load, and engineering trade-offs that are measured rather than assumed.
Carry one running project from specification to a deployed, defensible result across the whole term.
Work in a team of three or four and defend the design at each of the three presentations (weeks 5, 8, and 13).

Example projects

Real-time analytics platform
Recommendation system pipeline
Fraud detection stream
Log analytics and search
Clickstream insights engine
Predictive maintenance system
Social media trend tracker
Sensor data lakehouse

Assessment & grading

Grading is project-based, with no written exam. Teams of three or four present one running project three times.

Component	What it covers	Weight
Project · Specification	Presentation 1 (week 5): problem, objectives, and architecture	20%
Project · Interim	Presentation 2 (week 8): the working system demonstrated live	30%
Project · Final	Presentation 3 (week 13): end-to-end demo with oral defense	50%

Tools & platforms

Apache Kafka: stream and buffer event data
Apache Spark: run batch and distributed processing
Apache Flink: process streams with event-time semantics
PostgreSQL: store relational and derived data
Apache Cassandra: provide distributed partitioned storage
Apache Airflow: orchestrate data pipelines
dbt: transform and model warehouse data
MLflow: track and serve machine learning models
Feast: manage a feature store
Docker: package pipeline and serving components
Evidently: monitor data and model drift

Free online courses

Existing free, video-based courses this course can build on, for self-study or as a teaching basis.

YouTube: MIT 6.824 Distributed Systems (Spring 2020)
MIT lectures on replication, consistency, and data-intensive systems
MIT OCW: Distributed Computer Systems Engineering (6.824)
MIT OCW course on storage, fault tolerance, and distributed data systems

In Hebrew

Campus IL: Introduction to Data Science: Tools and Methods (מבוא למדעי הנתונים: כלים ושיטות)
Free Hebrew-language video course on data engineering, Holon Institute of Technology

Primary literature

Seminal works to read for graduate-level depth.

Paper: MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat, 2004
Paper: Dynamo: Amazon's Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels, 2007
Paper: Spanner: Google's Globally-Distributed Database
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, and others, 2012
Paper: In Search of an Understandable Consensus Algorithm (Raft)
Diego Ongaro, John Ousterhout, 2014
Paper: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernandez-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, Sam Whittle, 2015
Paper: Hidden Technical Debt in Machine Learning Systems
D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison, 2015

References

Books and resources link to an online or publisher page.

Textbook: Designing Data-Intensive Applications
Martin Kleppmann, 2017, Core textbook
Textbook: Streaming Systems
Tyler Akidau, Slava Chernyak, Reuven Lax, 2018
Paper: MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat, 2004, OSDI 2004
Paper: Dynamo: Amazon's Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al., 2007, SOSP 2007
Paper: Hidden Technical Debt in Machine Learning Systems
D. Sculley, Gary Holt, Daniel Golovin, et al., 2015, NeurIPS 2015
Textbook: Designing Machine Learning Systems
Chip Huyen, 2022
Documentation: Apache Kafka Documentation
Apache Software Foundation, 2026, Official, continuously updated

Role in each concentration

Concentration	Role
Intelligent Software Systems	Core · Semester 2
Networking & Cyber Security	Elective
AI & Robotics	Core · Semester 2
AI and Quantum Computing for Finance	Core · Semester 2
Immersive Systems & Game Development	Elective
Defense Technologies & Autonomous Systems	Elective
