Interpretable and Explainable AI

Course for Graduate students, Spring 2026

Machine learning models increasingly shape decisions that affect real people: medical diagnoses, hiring, legal and lending decisions, and scientific discovery, to name a few. At the same time, the models we rely on are increasingly black boxes: more complex, less transparent, and harder to interpret. This tension raises a fundamental question: how can we understand, trust, and responsibly deploy models whose internal logic is opaque, that can be easily fooled, and that may be right for the wrong reasons?

Three distinct paradigms have emerged in response. Interpretable AI focuses on designing models that are inherently understandable; Explainable AI (XAI) seeks to provide post-hoc summaries of complex “black-box” models; and Mechanistic Interpretability attempts to reverse-engineer the internal circuits of neural networks. These approaches offer different guarantees and serve different goals.

This course is motivated by three observations: (i) accuracy alone is no longer sufficient; (ii) not all explanations are created equal; and (iii) many distinct models can fit the same data equally well yet behave very differently. Understanding these effects is essential for building trustworthy machine learning systems.

The goal of this course is therefore not merely to introduce a collection of interpretability and explainability methods, but to develop judgment:

  • When should we prefer inherently interpretable models over post-hoc explanations?
  • What do popular methods like SHAP or LIME actually guarantee, and what do they not?
  • How do explanations interact with robustness, fairness, or privacy?
  • How do we move from treating models as black boxes toward mechanistic interpretability and circuit-level understanding?
  • What does “understanding” mean for large language models and other generative systems?
  • How should understandability be evaluated, and for whom?

By the end of the course, students should be able to critically assess interpretability claims, choose appropriate methods for a given context, and reason clearly about the limits of explanation in modern AI systems.

Time and room. Mondays 12:10–3:10 pm, Busch Campus, SEC-210 (T. Alexander Pond Science & Engineering Resource Center, SERC).

Prerequisites. This is a graduate-level course. Students should be comfortable with basic linear algebra, probability, and core machine learning concepts. You should also be able to implement and debug ML experiments in Python (e.g., NumPy, scikit-learn, and PyTorch), including working with real datasets and writing reasonably organized, reproducible code.
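
As a rough self-check of these prerequisites, the short scikit-learn sketch below (illustrative only, not course material; the dataset and model choices are arbitrary) trains a sparse linear classifier on a built-in dataset and prints its nonzero coefficients, the kind of small, reproducible experiment the course assumes you can write and debug on your own.

    # Illustrative self-check only (not part of the course): a small, reproducible
    # experiment with a sparse, inherently interpretable linear model.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )

    # L1 regularization drives many coefficients to zero, keeping the model small.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
    )
    model.fit(X_train, y_train)
    print(f"test accuracy: {model.score(X_test, y_test):.3f}")

    # The surviving coefficients double as a simple global summary of the model.
    coefs = model.named_steps["logisticregression"].coef_.ravel()
    for name, weight in sorted(zip(X.columns, coefs), key=lambda t: -abs(t[1])):
        if weight != 0:
            print(f"{name:30s} {weight:+.3f}")

If a snippet like this is comfortable to read, run, and modify, you meet the programming bar; the deep learning portions of the course assume comparable fluency with PyTorch.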

Workload. The class workload consists of two main components: (1) in-class paper discussions with student-led presentations, and (2) a semester-long course project completed individually or in a small team. Grading: paper presentation 30%, in-class engagement 20%, project 50%. Student paper presentations begin in week 4. Syllabus: CS 671 Interpretable and Explainable AI – Spring 2026 (PDF).

Schedule. The schedule below is tentative. Topics, readings, and ordering may be adjusted during the semester based on class interests, guest lectures, and pacing.

Week 1 (January 26): Introduction (Remote: Zoom Meeting URL, Meeting ID: 977 4177 3620, Passcode: 891706)
  Syllabus overview; introduction to interpretability and explainability. Reading: Doshi-Velez and Kim, 2017, Towards a Rigorous Science of Interpretable Machine Learning. Guest lecture: Alina Jade Barnett (University of Rhode Island) with a talk on Inherently Interpretable Neural Networks for Scientific Discovery and High-Stakes Decision Support.
Week 2 (February 2): Global understanding and the human in the loop
  Data visualization, feature importance, exploratory analysis; stakeholders, use cases, and failure modes. Reading: Hong et al., 2020, Human Factors in Model Interpretability: Industry Practices, Challenges, and Needs; Kaur et al., 2020, Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning.
Week 3 (February 9): Interpretability and explainability landscape & project directions
  Overview of project spaces: interpretable models, post-hoc explanations, counterfactuals and recourse, mechanistic interpretability, large language models, and model multiplicity (Rashomon effect).
Week 4 (February 16): Inherently interpretable models
  Linear models, sparsity, generalized additive models, scoring systems, decision trees, and rule-based models. Seminar papers: Caruana et al., 2015, Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission; Liu et al., 2022, FasterRisk: Fast and Accurate Interpretable Risk Scores. Guest lecture: Hayden McTavish and Varun Babbar (Duke University) with a talk on modern decision tree algorithms.
Week 5 (February 23): Inherently interpretable models & project proposals
  Prototype-based and concept-based approaches; project proposal presentations. Seminar papers: Chen et al., 2019, This Looks Like That: Deep Learning for Interpretable Image Recognition; Koh et al., 2020, Concept Bottleneck Models.
Week 6 (March 2): Post-hoc explanations I: feature and data attributions
  What feature attributions approximate; local vs. global explanations; LIME, SHAP, gradient-based methods.
Week 7 (March 9): Post-hoc explanations II: evaluation and pitfalls
  Explanation disagreement, sensitivity to baselines and perturbations, sanity checks, manipulation and robustness. Guest lecture: Suraj Srinivas (Bosch AI) on feature attribution failures.
Spring break (March 16)
  No class.
Week 8 (March 23): Counterfactual explanations and algorithmic recourse
  Counterfactual definitions; actionability and feasibility; robustness and human constraints.
Week 9 (March 30): Mechanistic interpretability & mid-project updates
  Representations, features, and circuits; how mechanistic interpretability differs from post-hoc explanations; mid-project updates.
Week 10 (April 6): Mechanistic interpretability and large language models
  LLM internals; probing vs. circuits; steering vs. understanding; limits of current approaches.
Week 11 (April 13): Understanding and reasoning in large language models
  Chain-of-thought and explanation-based prompting; faithfulness vs. usefulness of explanations; reasoning, abstraction, and failure modes.
Week 12 (April 20): Multiplicity, underspecification, and the Rashomon effect
  Model sets rather than single models; why explanations do not resolve ambiguity; implications for fairness and trust.
Week 13 (April 27): Interpretability, trustworthiness, and deployment
  Fairness, robustness, privacy, unlearning; regulation, accountability, and open problems.
Week 14 (May 4): Final project presentations
  In-class final presentations.