Interpretable and Explainable AI

Course for Graduate students, Spring 2026

Machine learning models increasingly shape decisions that affect real people: medical diagnoses, hiring, legal and lending decisions, and scientific discovery, to name a few. At the same time, the models we rely on are increasingly black boxes: more complex, less transparent, and harder to interpret. This tension raises a fundamental question: how can we understand, trust, and responsibly deploy models whose internal logic is opaque, that can be easily fooled, and that may be right for the wrong reasons?

Three distinct paradigms have emerged in response. Interpretable AI focuses on designing models that are inherently understandable; Explainable AI (XAI) seeks to provide post-hoc summaries of complex “black-box” models; and Mechanistic Interpretability attempts to reverse-engineer the internal circuits of neural networks. These approaches offer different guarantees and serve different goals.

This course is motivated by three observations: (i) accuracy alone is no longer sufficient; (ii) not all explanations are created equal; and (iii) many distinct models can fit the same data equally well yet behave very differently. Understanding these effects is essential for building trustworthy machine learning systems.

The goal of this course is therefore not merely to introduce a collection of interpretability and explainability methods, but to develop judgment:

  • When should we prefer inherently interpretable models over post-hoc explanations?
  • What do popular methods like SHAP or LIME actually guarantee, and what do they not?
  • How do explanations interact with robustness, fairness, or privacy?
  • How do we move from treating models as black boxes toward mechanistic interpretability and circuit-level understanding?
  • What does “understanding” mean for large language models and other generative systems?
  • How should understandability be evaluated, and for whom?

By the end of the course, students should be able to critically assess interpretability claims, choose appropriate methods for a given context, and reason clearly about the limits of explanation in modern AI systems.

Time and room. Mondays 12:10–3:10 pm, Busch Campus, SEC-210 (T. Alexander Pond Science & Engineering Resource Center, SERC).

Prerequisites. This is a graduate-level course. Students should be comfortable with basic linear algebra, probability, and core machine learning concepts. You should also be able to implement and debug ML experiments in Python (e.g., NumPy, scikit-learn, and PyTorch), including working with real datasets and writing reasonably organized, reproducible code.
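
As a rough self-check of these prerequisites, the short scikit-learn sketch below (illustrative only, not course material; the dataset and model choices are arbitrary) trains a sparse linear classifier on a built-in dataset and prints its nonzero coefficients, the kind of small, reproducible experiment the course assumes you can write and debug on your own.

    # Illustrative self-check only (not part of the course): a small, reproducible
    # experiment with a sparse, inherently interpretable linear model.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )

    # L1 regularization drives many coefficients to zero, keeping the model small.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
    )
    model.fit(X_train, y_train)
    print(f"test accuracy: {model.score(X_test, y_test):.3f}")

    # The surviving coefficients double as a simple global summary of the model.
    coefs = model.named_steps["logisticregression"].coef_.ravel()
    for name, weight in sorted(zip(X.columns, coefs), key=lambda t: -abs(t[1])):
        if weight != 0:
            print(f"{name:30s} {weight:+.3f}")

If a snippet like this is comfortable to read, run, and modify, you meet the programming bar; the deep learning portions of the course assume comparable fluency with PyTorch.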

Workload. The class workload consists of two main components: (1) in-class paper discussions with student-led presentations, and (2) a semester-long course project completed individually or in a small team. Grading: paper presentation 30%, in-class engagement 20%, project 50%. Student paper presentations begin in week 4. Syllabus: CS 671 Interpretable and Explainable AI – Spring 2026 (PDF).

Schedule. The schedule below is tentative. Topics, readings, and ordering may be adjusted during the semester based on class interests, guest lectures, and pacing.

Week 1 (January 26): Introduction (Remote: Zoom Meeting URL, Meeting ID: 977 4177 3620, Passcode: 891706)
  Syllabus overview; introduction to interpretability and explainability. Reading: Doshi-Velez and Kim, 2017, Towards a Rigorous Science of Interpretable Machine Learning. Guest lecture: Alina Jade Barnett (University of Rhode Island) with a talk on Inherently Interpretable Neural Networks for Scientific Discovery and High-Stakes Decision Support.
Week 2 (February 2): Global understanding and the human in the loop
  Data visualization, feature importance, exploratory analysis; stakeholders, use cases, and failure modes. Reading: Hong et al., 2020, Human Factors in Model Interpretability: Industry Practices, Challenges, and Needs; Kaur et al., 2020, Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning.
Week 3 (February 9): Interpretability and explainability landscape & project directions
  Overview of project spaces: interpretable models, post-hoc explanations, counterfactuals and recourse, mechanistic interpretability, large language models, and model multiplicity (Rashomon effect).
Week 4 (February 16): Inherently interpretable models
  Linear models, sparsity, generalized additive models, scoring systems, decision trees, and rule-based models. Seminar papers: Caruana et al., 2015, Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission; Liu et al., 2022, FasterRisk: Fast and Accurate Interpretable Risk Scores. Guest lecture: Hayden McTavish and Varun Babbar (Duke University) with a talk on modern decision tree algorithms.
Week 5 (February 23): Inherently interpretable models & project proposals
  Prototype-based and concept-based approaches; project proposal presentations. Seminar papers: Chen et al., 2019, This Looks Like That: Deep Learning for Interpretable Image Recognition; Koh et al., 2020, Concept Bottleneck Models.
Week 6 (March 2): Post-hoc explanations I: feature and data attributions
  What feature attributions approximate; local vs. global explanations; LIME, SHAP, gradient-based methods.
Week 7 (March 9): Post-hoc explanations II: evaluation and pitfalls
  Explanation disagreement, sensitivity to baselines and perturbations, sanity checks, manipulation and robustness. Guest lecture: Suraj Srinivas (Bosch AI) on feature attribution failures.
Spring break (March 16)
  No class.
Week 8 (March 23): Counterfactual explanations and algorithmic recourse
  Counterfactual definitions; actionability and feasibility; robustness and human constraints.
Week 9 (March 30): Mechanistic interpretability & mid-project updates
  Representations, features, and circuits; how mechanistic interpretability differs from post-hoc explanations; mid-project updates.
Week 10 (April 6): Mechanistic interpretability and large language models
  LLM internals; probing vs. circuits; steering vs. understanding; limits of current approaches.
Week 11 (April 13): Understanding and reasoning in large language models
  Chain-of-thought and explanation-based prompting; faithfulness vs. usefulness of explanations; reasoning, abstraction, and failure modes.
Week 12 (April 20): Multiplicity, underspecification, and the Rashomon effect
  Model sets rather than single models; why explanations do not resolve ambiguity; implications for fairness and trust.
Week 13 (April 27): Interpretability, trustworthiness, and deployment
  Fairness, robustness, privacy, unlearning; regulation, accountability, and open problems.
Week 14 (May 4): Final project presentations
  In-class final presentations.