Systems Thinking for ML Engineers
Most ML education is about models: how to train them, how to evaluate them, how to squeeze out another percentage point of accuracy. What it doesn’t prepare you for is deploying a model into a system that changes in response to the model.
This is the part that bites you.
The classic example: recommendation systems
You train a recommendation model on user behavior. You deploy it. It changes what content users see, which changes their behavior, which changes the training data for the next version of the model. Your training distribution has shifted — not because the world changed, but because you changed it.
This is a feedback loop. The model is not a passive predictor; it’s an actor that reshapes the environment it predicts on.
Systems thinking gives you a vocabulary for this. Jay Forrester’s system dynamics framework rests on three core concepts:
- Stocks — quantities that accumulate over time (user watch history, model weights, cached features)
- Flows — rates of change (new user interactions, gradient updates, feature refresh cadence)
- Feedback loops — where an output becomes an input (model outputs influence future training data)
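To make the three concepts concrete, here is a minimal stock-and-flow sketch. It assumes a hypothetical stock (say, the number of interaction records kept for training) with a constant inflow of new records and an outflow that expires a fixed fraction of stored records each step. Because the outflow depends on the stock itself, the sketch also contains a small feedback loop: the stock settles where inflow equals outflow, at `inflow / decay_rate`. The function name and numbers are illustrative only.

```python
def simulate_stock(initial: float, inflow: float, decay_rate: float, steps: int) -> list[float]:
    """Step a single stock forward in discrete time and return its trajectory.

    stock[t + 1] = stock[t] + inflow - decay_rate * stock[t]
    """
    stock = initial
    history = [stock]
    for _ in range(steps):
        outflow = decay_rate * stock  # feedback: the outflow depends on the current stock
        stock += inflow - outflow     # a stock changes only through its flows
        history.append(stock)
    return history


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 new records per step, 10% of stored records expire per step.
    trajectory = simulate_stock(initial=0.0, inflow=1000.0, decay_rate=0.1, steps=100)
    print(round(trajectory[-1]))  # prints 10000, i.e. inflow / decay_rate
```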
Reinforcing vs. balancing loops
Feedback loops come in two flavors:
Reinforcing loops amplify change. In a recommendation system: the model surfaces popular content → popular content gets watched more → it gets even more popular. This is the rich-get-richer dynamic behind popularity bias and filter bubbles, and it emerges naturally from the architecture without anyone designing it.
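A toy simulation of this loop, assuming a deliberately naive, hypothetical recommender that shows the `k` most-viewed items each round. Shown items pick up extra views (`impressions * ctr`, split across the `k` slots) on top of a small organic trickle that every item gets. Nothing in the code targets concentration; it falls out of the loop.

```python
import random


def simulate_popularity_loop(n_items: int = 100, k: int = 10, rounds: int = 50,
                             impressions: int = 1000, ctr: float = 0.1,
                             organic: float = 1.0, seed: int = 0) -> list[float]:
    """Return, for each round, the share of all views held by the k most-viewed items."""
    rng = random.Random(seed)
    # Start from nearly uniform popularity, with tiny random differences.
    views = [100.0 + rng.random() for _ in range(n_items)]
    top_k_share = []
    for _ in range(rounds):
        # The hypothetical "model": recommend whatever is currently most viewed.
        shown = sorted(range(n_items), key=lambda i: views[i], reverse=True)[:k]
        for i in shown:
            views[i] += impressions * ctr / k  # recommended items get watched more
        for i in range(n_items):
            views[i] += organic  # every item gets a small organic trickle
        top = sorted(views, reverse=True)[:k]
        top_k_share.append(sum(top) / sum(views))
    return top_k_share


if __name__ == "__main__":
    shares = simulate_popularity_loop()
    print(f"after round 1: {shares[0]:.3f}, after round {len(shares)}: {shares[-1]:.3f}")
```

With the defaults, the 10 recommended items go from about 11% of all views after the first round to just under a third after 50 rounds, rising every round. Nothing here looks like a model failure: the recommended items really are the most-viewed ones.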
Balancing loops resist change. Churn prediction is an example: the model flags at-risk users → retention effort is applied → churn decreases → the model sees fewer positive examples and becomes underconfident, because the users it correctly flagged were retained and show up in the next training set labeled as non-churners. Left unchecked, this degrades the signal the model depends on.
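A deterministic sketch of this loop, under strong simplifying assumptions that are mine, not a description of any real churn system: the model ranks users perfectly, flags a share of users equal to its estimated churn rate, retention saves a fixed fraction (`save_rate`) of the flagged churners, and each retrain simply learns the observed churn rate as its new estimate.

```python
def simulate_churn_loop(true_rate: float = 0.2, save_rate: float = 0.5,
                        rounds: int = 20) -> list[float]:
    """Return the model's estimated churn rate after each retrain.

    true_rate is the churn rate the population would show with no intervention.
    """
    estimate = true_rate  # the first model is trained on pre-deployment data
    estimates = [estimate]
    for _ in range(rounds):
        # Perfect ranking assumed: flagging estimate * N users catches
        # min(estimate, true_rate) * N of the true churners.
        recall = min(estimate, true_rate) / true_rate
        # Saved churners are logged as non-churners, so observed churn drops.
        observed = true_rate * (1 - save_rate * recall)
        estimate = observed  # retrain: the learned base rate is the observed rate
        estimates.append(estimate)
    return estimates


if __name__ == "__main__":
    print([round(e, 4) for e in simulate_churn_loop(rounds=6)])
```

With these numbers the estimate drops from 0.2 to 0.1, rebounds to 0.15, and settles near 0.133: the damped oscillation typical of a balancing loop, converging to a base rate well below the 0.2 the model would see without its own intervention.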
The dangerous thing is that both kinds of loops are invisible if you’re only looking at model metrics. Your accuracy metrics can look great while the system is drifting badly.
Emergence
Emergence is when system-level behavior isn’t predictable from component-level behavior. It’s not a bug; it’s an inherent property of complex systems.
A concrete ML example: you train separate models for search ranking and ad targeting. Each is well-behaved in isolation. Combined, they create an incentive structure that surfaces emotionally provocative content — not because either model was optimized for that, but because provocation drives both clicks (search signal) and dwell time (ad signal). The emergent behavior is nobody’s fault and nobody’s design.
The implication: you cannot evaluate an ML system by evaluating its components. You have to observe the system running to know what it does.
Practical consequences
A few things I now think about differently because of this:
Online evaluation over offline evaluation. A/B tests in production reveal things held-out test sets can’t, because the test set doesn’t change in response to your model. The distribution shift that will actually hurt you is usually the one caused by your own deployment.
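A minimal sketch of reading out such a test, assuming the live metric is binary per user (a click, a retained user) and both arms are large enough for a normal approximation. This is the standard two-proportion z-test, not the API of any particular experimentation platform; the function name is hypothetical.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float, float]:
    """Compare conversion rates of arm B (treatment) against arm A (control).

    Returns (absolute lift, z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


if __name__ == "__main__":
    lift, z, p = two_proportion_z_test(conv_a=1000, n_a=10000, conv_b=1100, n_b=10000)
    print(f"lift={lift:.4f} z={z:.2f} p={p:.4f}")
```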
Instrument the stocks, not just the flows. Most ML monitoring focuses on prediction accuracy (a flow: how well the model is doing right now). The more load-bearing signals are often the stocks: training data distributions, feature staleness, user behavior drift. These change slowly and then suddenly.
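One common way to instrument a distribution-type stock is the population stability index (PSI), which compares the current binned distribution of a feature or label against a reference snapshot. A minimal sketch, assuming you already have counts per bin over the same bins for both snapshots; the small `eps` floor for empty bins is my choice, and the often-quoted cutoffs (around 0.1 for "worth a look" and 0.25 for "significant shift") are rules of thumb, not guarantees.

```python
import math


def population_stability_index(expected: list[float], actual: list[float],
                               eps: float = 1e-6) -> float:
    """PSI between a reference (expected) and current (actual) binned distribution.

    Both arguments are counts over the same bins, in the same order.
    PSI = sum over bins of (actual_share - expected_share) * ln(actual_share / expected_share)
    """
    total_e = sum(expected)
    total_a = sum(actual)
    psi = 0.0
    for e, a in zip(expected, actual):
        pe = max(e / total_e, eps)  # floor empty bins so the log stays finite
        pa = max(a / total_a, eps)
        psi += (pa - pe) * math.log(pa / pe)
    return psi


if __name__ == "__main__":
    # Hypothetical counts of training examples per content category:
    # a reference quarter vs. the current week.
    reference = [25, 25, 25, 25]
    current = [10, 20, 30, 40]
    print(f"PSI = {population_stability_index(reference, current):.3f}")
```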
Draw the causal graph before you instrument. What does this model affect? What affects it? Where are the feedback paths? Five minutes at a whiteboard surfaces assumptions that would otherwise stay implicit for months.
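Once the graph is on the whiteboard, it can be worth writing it down as data so feedback paths can be checked mechanically as the system grows. A minimal sketch, assuming a hypothetical adjacency-list graph of a recommender (node names are illustrative). The depth-first search reports one cycle per back edge it finds, which is enough to show that feedback paths exist but is not an exhaustive enumeration of every cycle; Johnson’s algorithm does that.

```python
def find_feedback_loops(graph: dict[str, list[str]]) -> list[list[str]]:
    """Return cycles found by DFS, one per back edge, as node paths that end where they start."""
    loops: list[list[str]] = []
    visited: set[str] = set()
    path: list[str] = []      # nodes on the current DFS path, in order
    on_path: set[str] = set()

    def dfs(node: str) -> None:
        visited.add(node)
        path.append(node)
        on_path.add(node)
        for nxt in graph.get(node, []):
            if nxt in on_path:
                # Back edge: the path from nxt to here, plus nxt again, is a cycle.
                loops.append(path[path.index(nxt):] + [nxt])
            elif nxt not in visited:
                dfs(nxt)
        path.pop()
        on_path.discard(node)

    for node in graph:
        if node not in visited:
            dfs(node)
    return loops


# Hypothetical causal graph: "X": [Y, ...] means X affects each Y.
causal_graph = {
    "model": ["recommendations"],
    "recommendations": ["user_behavior"],
    "user_behavior": ["training_data", "engagement_metrics"],
    "training_data": ["model"],
    "engagement_metrics": [],
}

if __name__ == "__main__":
    for loop in find_feedback_loops(causal_graph):
        print(" -> ".join(loop))
```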
Degrade gracefully. Every ML system needs a fallback that doesn’t depend on model outputs — a rule-based default, a popularity ranking, anything that doesn’t inherit the model’s failure modes. When the feedback loop destabilizes, you want a circuit breaker.
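A minimal sketch of such a circuit breaker, assuming a hypothetical `CircuitBreakerRanker` wrapper, rankers that map a user id to a list of item ids, and a caller-supplied `is_healthy` check (for example "returned a non-empty list", or a check on score distributions). Recovery here is a simple request-count cooldown, one of several ways to decide when to retry the model.

```python
from collections.abc import Callable

# Hypothetical interface: a ranker maps a user id to an ordered list of item ids.
Ranker = Callable[[str], list[str]]


class CircuitBreakerRanker:
    """Serve the model's ranking unless it keeps failing, then fall back for a while.

    A request fails if the model raises or its output fails the health check.
    After `max_failures` consecutive failures the breaker opens, and the next
    `cooldown` requests go straight to the fallback without calling the model.
    """

    def __init__(self, model: Ranker, fallback: Ranker,
                 is_healthy: Callable[[list[str]], bool],
                 max_failures: int = 3, cooldown: int = 100) -> None:
        self.model = model
        self.fallback = fallback
        self.is_healthy = is_healthy
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.consecutive_failures = 0
        self.open_for = 0  # requests left before the model is tried again

    def rank(self, user_id: str) -> list[str]:
        if self.open_for > 0:  # breaker open: skip the model entirely
            self.open_for -= 1
            return self.fallback(user_id)
        try:
            result = self.model(user_id)
            healthy = self.is_healthy(result)
        except Exception:
            healthy = False
        if healthy:
            self.consecutive_failures = 0
            return result
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.open_for = self.cooldown  # trip the breaker
            self.consecutive_failures = 0
        return self.fallback(user_id)


if __name__ == "__main__":
    popular = ["item_1", "item_2", "item_3"]  # e.g. a precomputed popularity ranking
    ranker = CircuitBreakerRanker(
        model=lambda user_id: [],             # a collapsed model that returns nothing
        fallback=lambda user_id: popular,
        is_healthy=lambda items: len(items) > 0,
    )
    print(ranker.rank("user_42"))  # prints ['item_1', 'item_2', 'item_3']
```

The design choice that matters is that the fallback path never calls the model, so it cannot inherit the model’s failure modes.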
What to read
If this framing is new, the most useful introduction is Donella Meadows’ Thinking in Systems. It was written with no ML in mind, but it applies directly. For the ML-specific angle, the “Rules of Machine Learning” doc from Google’s ML team covers a lot of this in practical terms.
The main thing is to start looking for the loops. They’re in every system. The question is whether you see them before or after they bite you.