About this course
Apply DevOps discipline to machine-learning and LLM systems, automating training, deployment, monitoring, and continuous delivery of models in production.
Course format. Thirteen weeks, four contact hours each: a two-hour lecture (concepts and theory) and a two-hour practice session. The course is project-based; teams carry one running project end to end and present it three times, in weeks 5, 8, and 13.
What you will build
Build, deploy, and operate a production AI service end to end: a containerized, CI/CD-gated REST API on Kubernetes, fed by a medallion data pipeline, with an MLflow model registry, a gateway-fronted RAG feature, live drift monitoring, and a security review against the OWASP LLM Top 10.
Expected outcomes
- Explain why production AI systems fail more often in operations than in modelling, and define SLIs, SLOs, and error budgets.
- Use cloud compute, storage, and networking, and choose a deployment model from IaaS to serverless.
- Design and operate a CI/CD pipeline with tests, containers, infrastructure as code, and a versioned REST API.
- Run services under an orchestrator with health checks, rollout patterns, and observability based on RED dashboards.
- Build trustworthy data pipelines with a medallion lake, quality gates, data contracts, and dataset versioning.
- Operate the model lifecycle: experiment tracking, a model registry, serving, monitoring, and drift detection.
- Build and operate RAG services behind gateways, with evaluation suites, tracing, and guardrails.
- Reason about the LLM token economy and engineer for cost and latency.
- Build, trace, and bound tool-using agents, and apply AI security and governance.
- Carry one running service from specification through to a governed production deployment.
Key topics
- CI/CD pipelines
- Model serving & versioning
- Monitoring & drift detection
- LLMOps & agent operations
Theoretical foundations
The concepts and results this course rests on.
- Service-level objectives, error budgets, and reliability theory
- The medallion data architecture, data contracts, and dataset versioning
- The reproducibility triple: code version, data version, and environment
- Data drift versus concept drift and statistical detection (PSI, KS tests)
- Retrieval-augmented generation and grounded prompting
- The agent loop, function calling, and step-level evaluation
- Software, data, and model supply-chain security and the OWASP LLM Top 10
Prerequisites
This is a Year-3 course. It assumes the mandatory CS core: data structures and algorithms, operating systems, computer networks, databases, software engineering, and the core mathematics (linear algebra, probability and statistics, calculus, discrete mathematics). It additionally requires the specific prior courses listed below.
Course-specific prerequisites:
- Machine Learning
- Software engineering and Python
- Operating systems and networking
Weekly schedule
13 weeks · lecture + practice
Part I: Foundations & the Cloud
Wk 1
Production Engineering & the Ops Landscape
Lecture: The prototype-to-production gap and the 90/10 inversion; SLIs, SLOs, SLAs, and error budgets; toil, day-one vs day-two operations, blameless postmortems, and the five operational layers.
Practice: Set up a team repository from a template with branch protection and containerise a hello-service with pinned versions.
Project: Create the team repo, containerise the hello-service, and shortlist two use-case domains.
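The error-budget arithmetic named in the week 1 lecture can be sketched in a few lines. This is a minimal illustration, not course material: the function names `error_budget_minutes` and `budget_remaining`, the 30-day window, and the 99.9% target are all assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO permits in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns 1.0 when no request failed, 0.0 when the budget is exactly
    spent, and a negative number when the SLO has been violated.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = (1.0 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good                 # failures actually observed
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% availability SLO over 30 days allows roughly 43.2 minutes down.
    print(round(error_budget_minutes(0.999), 1))
    # 500 failed requests out of one million spends about half the budget.
    print(round(budget_remaining(0.999, good=999_500, total=1_000_000), 2))
```

A team would normally compute `good` and `total` from the SLI it chose, for example successful requests over all requests on the health-checked endpoints.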
Wk 2
Cloud Computing Fundamentals
Lecture: Cloud primitives (compute, storage, networking); IaaS, PaaS, SaaS, and serverless; regions and availability zones; the shared-responsibility model; the cost model and blast radius.
Practice: Provision a cloud footprint on a free tier with a budget alert, deploy the hello-service in two deployment models, and tear it down.
Project: Create the team cloud space with a budget alert and provision the project storage bucket accessed from code.
Part II: DevOps
Wk 3
CI/CD, Testing & REST Services
Lecture: DORA metrics and the testing pyramid; trunk-based development; infrastructure as code with Terraform (desired state, idempotency); REST design, API versioning, and health and readiness endpoints.
Practice: Build a CI pipeline that lints, tests, builds, and publishes an artifact, and design a versioned REST API skeleton.
Project: Commit to the project use case; REST API skeleton with two endpoints, a health check, validation, and tests; CI gating every merge.
Wk 4
Orchestration, Deployment Patterns & Observability
Lecture: Kubernetes desired-state and reconciliation; deployment patterns (blue-green, canary) and GitOps; the three pillars of observability, the RED method, and reasoning about tail latency in terms of percentiles.
Practice: Run the service under an orchestrator with probes and scaling, execute a canary rollout and rollback, and build a RED dashboard.
Project: Deploy with probes and three replicas, demonstrate canary plus rollback, and record a baseline p95 on a RED dashboard.
Part III: DataOps
Wk 5
Data Lakes, Pipelines & Versioning · Presentation
Lecture: Warehouse vs lake vs lakehouse; the medallion architecture (bronze, silver, gold); orchestration and idempotency; data versioning and lineage.
Practice: Student Presentation 1 (Specification): each team presents the problem statement and success metrics (SLOs), the system and data architecture, DevOps status, and a risk and governance register, then submits a written report and a tagged release.
Project: An orchestrated pipeline with retries and backfills, and a versioned dataset reproduced from a pinned snapshot.
Wk 6
Data Quality, Contracts, Streaming & Feature Stores
Lecture: Validation as code; data contracts between producer and consumer; streaming with Kafka (topics, partitions, consumer groups, at-least-once delivery, idempotent consumers); feature stores and train/serve skew.
Practice: Add validation gates from bronze to silver with a quarantine table, enforce a data contract, and land a live stream into the lake.
Project: Validation gates with quarantine, an enforced data contract, and gold feature tables built on one shared definition.
Part IV: MLOps
Wk 7
Experiment Tracking, Model Registry & Serving
Lecture: The reproducibility triple (git SHA, data version, environment); experiment tracking; the model registry and model cards; serving patterns (online, batch, streaming) and safe rollout (shadow, canary, A/B).
Practice: Instrument training with tracking, register a model with a model card, and serve it behind REST with a safe rollout.
Project: Tracked training with the reproducibility triple pinned; model v1 registered and served; v2 canaried against v1.
Wk 8
Monitoring, Model Drift & Governance · Presentation
Lecture: Data drift vs concept drift and how to detect each (PSI, KS tests, embedding distance); retraining triggers paired with documented actions; audit trails and governance.
Practice: Student Presentation 2 (Interim): teams give a live demonstration of a working pipeline, a tracked and versioned model in the registry, the model served with canary and a live RED dashboard, a change landing through CI/CD during the talk, and a monitoring and drift plan, then submit a report and a tagged release.
Project: Drift detectors each paired with a documented action, and retraining triggers defined.
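As a rough illustration of the PSI detector mentioned in the week 8 lecture, the sketch below computes the Population Stability Index between a reference sample and a current sample using equal-width bins taken from the reference range. The function name `psi`, the ten-bin default, the `eps` floor for empty bins, and the 0.25 alert threshold are assumptions for this example; the threshold is a common rule of thumb, not a course requirement.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` relative to `reference`."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(sample: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in sample:
            # Values outside the reference range are clamped into the
            # first or last bin so they still count as drift.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each share at eps so the logarithm below is defined.
        return [max(c / len(sample), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


if __name__ == "__main__":
    baseline = [i / 100 for i in range(1000)]   # training-time feature values
    shifted = [x + 3.0 for x in baseline]       # the same feature after a shift
    score = psi(baseline, shifted)
    # Assumed rule of thumb: a PSI above 0.25 signals a significant shift.
    print("retrain review needed" if score > 0.25 else "stable")
```

In the project, each such detector would be paired with a documented action, for example opening a retraining review when the score crosses the chosen threshold.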
Part V: LLMOps & AgentOps
Wk 9
LLM Foundations: AI APIs, Tokens & the Token Economy
Lecture: Next-token prediction and tokenization; API anatomy and structured outputs; the token economy and its cost levers (shorter context, prompt caching, batching); managed AI services.
Practice: Call an AI API with structured outputs and schema validation, build a token-economy cost model, and add a retry-with-backoff wrapper.
Project: Wire an LLM feature through the course proxy with structured outputs, a cost-per-request and monthly projection, and a pinned model version.
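The cost-per-request and monthly projection asked for in week 9 reduce to simple arithmetic over token counts. The sketch below is one possible shape for such a cost model; the `Pricing` class, the function names, and every price and traffic figure are hypothetical placeholders, not real provider rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price per one million tokens, in the team's billing currency."""
    input_per_million: float
    output_per_million: float


def cost_per_request(input_tokens: int, output_tokens: int,
                     pricing: Pricing) -> float:
    """Cost of a single call given its prompt and completion token counts."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def monthly_projection(requests_per_day: int, input_tokens: int,
                       output_tokens: int, pricing: Pricing,
                       days: int = 30) -> float:
    """Projected spend, assuming every request has the same token profile."""
    per_request = cost_per_request(input_tokens, output_tokens, pricing)
    return requests_per_day * days * per_request


if __name__ == "__main__":
    # Hypothetical rates: 1.00 per million input, 4.00 per million output.
    rates = Pricing(input_per_million=1.0, output_per_million=4.0)
    print(cost_per_request(2_000, 500, rates))
    print(monthly_projection(10_000, 2_000, 500, rates))
```

Shortening the context lowers `input_tokens` directly, which is why it appears among the cost levers in the lecture; a fuller model could add a separate, cheaper rate for cached prompt tokens.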
Wk 10
RAG & Serving LLMs: Vector Databases & Gateways
Lecture: Retrieval-augmented generation end to end (embeddings, vector databases, chunking, grounded prompts); prompts as versioned code; hosted vs self-hosted serving with vLLM; the gateway pattern.
Practice: Build a RAG service over a real corpus and route all LLM traffic through a gateway with fallbacks and a budget cap.
Project: The project's RAG or extraction pipeline runs behind the gateway with fallback and a budget cap.
Wk 11
LLM Evaluation, Guardrails & Observability
Lecture: Evaluation that means something (faithfulness, answer relevance, retrieval recall); LLM-as-judge biases and calibration; LLM tracing and observability; guardrails and prompt injection.
Practice: Build a representative eval set and run it as a regression suite, add request tracing, and demonstrate a guardrail against injection.
Project: An eval set of at least fifty items wired into CI, live tracing of prompt version, tokens, cost, and latency, and one guardrail with a measured response cache.
Wk 12
Agents & AgentOps: Tools, MCP & Managed Agents
Lecture: The agent loop (plan, act, observe) and function calling; the Model Context Protocol and tool ecosystem; AgentOps (tracing, step-level evaluation, mandatory bounds); managed agent services.
Practice: Build and trace a tool-using agent, bound it with step, cost, and permission limits, and expose a tool as an MCP server.
Project: One agentic capability with tracing, bounds, and step caps, and at least one tool exposed via MCP.
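The plan, act, observe loop with mandatory bounds from week 12 can be sketched without any LLM at all. In the example below the model is replaced by a plain `policy` callable, and the names `Bounds` and `run_agent`, the flat `cost_per_step`, and the outcome strings are all assumptions made for illustration; a real agent would call a model and meter actual token cost.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List, Optional, Tuple

Action = Tuple[str, Any]  # (tool name or "finish", argument)


@dataclass(frozen=True)
class Bounds:
    max_steps: int
    max_cost: float
    allowed_tools: FrozenSet[str]


def run_agent(policy: Callable[[Optional[Any]], Action],
              tools: Dict[str, Callable[[Any], Any]],
              bounds: Bounds,
              cost_per_step: float) -> Tuple[Any, List[dict]]:
    """Run the loop until the policy finishes or a bound stops it."""
    trace: List[dict] = []
    spent = 0.0
    observation: Optional[Any] = None
    for step in range(bounds.max_steps):
        # Cost bound: refuse to start a step the budget cannot cover.
        if spent + cost_per_step > bounds.max_cost:
            return "stopped: cost cap", trace
        spent += cost_per_step
        name, arg = policy(observation)           # plan
        if name == "finish":
            trace.append({"step": step, "tool": "finish", "arg": arg})
            return arg, trace
        # Permission bound: only explicitly allowed tools may run.
        if name not in bounds.allowed_tools or name not in tools:
            trace.append({"step": step, "tool": name, "denied": True})
            return "stopped: tool not permitted", trace
        observation = tools[name](arg)            # act, then observe
        trace.append({"step": step, "tool": name, "arg": arg,
                      "observation": observation})
    return "stopped: step cap", trace             # step bound reached


if __name__ == "__main__":
    def looping_policy(observation: Optional[Any]) -> Action:
        return ("echo", "ping")  # a policy that never finishes on its own

    limits = Bounds(max_steps=3, max_cost=1.0,
                    allowed_tools=frozenset({"echo"}))
    outcome, steps = run_agent(looping_policy, {"echo": lambda x: x},
                               limits, cost_per_step=0.01)
    print(outcome, len(steps))
```

The returned trace is a list of step records, which is the kind of data a step-level evaluation would inspect.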
Part VI: Security & Governance
Wk 13
Security, Governance & Synthesis · Presentation
Lecture: Software, data, and model supply-chain security (secrets, dependency tracking, provenance, SBOM); the OWASP Top 10 for LLM applications; governance (privacy, audit trails, model cards, NIST AI RMF); synthesis of the five layers.
Practice: Student Presentation 3 (Final, with oral defense): teams deliver an end-to-end production demo (data in, decision out, live); observability with an actionable alert and runbook; evaluation, guardrails, and a cost and latency report; and a security and governance review against the OWASP LLM Top 10.
Project: A governed production deployment with an audit trail, and the repository tagged v1.0.
Student project
Teams of three or four carry one running AI service from specification to a governed production deployment, presenting it three times across the term. Grading weights the parts an AI assistant cannot do for the student: operating a system under load, interpreting telemetry, and defending design decisions. Example domains include IoT telemetry, document question-answering, and document processing.
Requirements
- Build a working system, not a set of disconnected exercises.
- Be original: a new system that solves a real problem, not a re-implementation of a tutorial or course demo.
- Show real depth: real data, real users or realistic load, and engineering trade-offs that are measured rather than assumed.
- Carry one running project from specification to a deployed, defensible result across the whole term.
- Work in a team of three or four and defend the design at each of the three presentations (weeks 5, 8, and 13).
Example projects
- Predictive-maintenance IoT monitor
- Support-docs Q&A chatbot
- Invoice or form processor
- Menu or receipt nutrition estimator
- Smart-home energy advisor
- Code-review assistant
- Research-paper summariser
- Churn-prediction service
Assessment & grading
Grading is project-based, with no written exam. Teams of three or four present one running project three times.
| Component | What it covers | Weight |
|---|---|---|
| Project · Specification | Presentation 1 (week 5): problem, objectives, and architecture | 20% |
| Project · Interim | Presentation 2 (week 8): the working system demonstrated live | 30% |
| Project · Final | Presentation 3 (week 13): end-to-end demo with oral defense | 50% |
References
Books and resources link to an online or publisher page.
- Textbook: Designing Data-Intensive Applications. Martin Kleppmann, 2017. Foundations of reliable data systems.
- Textbook: Site Reliability Engineering. Beyer, Jones, Petoff, Murphy (eds.), 2016. Free online; SLOs and error budgets.
- Textbook: The Site Reliability Workbook. Beyer, Murphy, Rensin, Kawahara, Thorne (eds.), 2018. Practical companion to the SRE book.
- Documentation: AWS Well-Architected Framework. Amazon Web Services, current. Cloud architecture pillars.
- Paper: Hidden Technical Debt in Machine Learning Systems. Sculley et al., 2015. NeurIPS 2015.
- Documentation: MLflow Documentation. MLflow project, current. Tracking and model registry.
- Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Lewis et al., 2020. The original RAG paper.
- Documentation: OWASP Top 10 for LLM Applications. OWASP, 2025. LLM application security risks.
- Documentation: Building Effective Agents. Anthropic, 2024. Agent design patterns.
Role in each concentration