It was, by all appearances, a standard enterprise AI implementation.
The summaries looked clean.
At the top of the screen was a concise paragraph capturing a customer interaction: what was requested, what was explained, and what follow-up was required. Action items were listed neatly below. It was the kind of output you could screenshot for a slide deck. Efficient. Polished. Convincing.
The premise was simple. If employees spent less time documenting interactions, they could spend more time serving customers. Efficiency would increase. Costs would decrease. The model worked in the demo. It summarized transcripts fluently and quickly. The business case felt straightforward.
It moved forward.
The strain didn’t appear in the demo. It appeared in real use.
Transcripts did not always flow through the system in the way the workflow assumed. Attribution of who said what, acceptable in curated samples, became less reliable amid the variability of real conversations. When attribution shifted, the summary shifted with it. For some stakeholders, that was inconvenient. For others, it introduced risk.
Then something more structural surfaced.
The assumption had been that there was a single summary for each interaction. In practice, different stakeholders needed different things from the same conversation. Someone preparing for the next engagement cared about context and commitments. Someone evaluating performance cared about adherence to the process. Leadership cared about patterns across many interactions.
One summary could not satisfy all of those needs equally well.
The original framing of saving time on notes began to feel incomplete. Documentation was only one part of the job that documentation performed. Good records preserve continuity. They prevent repeated effort. They carry context forward to the next conversation, the next decision, the next relationship moment. If a generated summary omitted a critical detail and someone had to go back to the original interaction to find it, the downstream cost could easily outweigh the time saved up front. And unlike writing notes, which happens once, the cost of a missing detail can repeat itself across every subsequent interaction with that customer.
Under light use, the system worked. Under sustained use, the edges became visible.
The model had done what it was designed to do. The surrounding system had not yet fully defined its requirements.
It’s tempting to treat generative AI as an easy button.
Providers will say they do summarization. And they do. Models can summarize text. They can condense transcripts. They can produce coherent output from messy inputs.
But capability in isolation is different from capability in context.
The gap isn’t whether the model works. It’s whether the system around it is ready.
I’ve seen this play out repeatedly. The hard questions aren’t technical. They’re the ones that should have been answered before anyone opened a laptop. What is the actual job this tool is supposed to do? Not the elevator pitch version. The operational one. Is the goal speed? Accuracy? Compliance? Relationship continuity? Performance management? Each of those implies a different design, a different metric, and a different definition of done.
Who owns the output if it’s wrong? What happens when accuracy and speed pull in opposite directions and someone has to choose? What does good actually look like, and how will anyone know when they’ve reached it?
These weren’t philosophical questions. They were the kind of questions that get answered eventually, either intentionally before you build or expensively after you scale.
AI lowers the barrier to building. It does not lower the barrier to clarity.
When the summarization tool moved from demonstration to deployment, it functioned less like a feature and more like a pressure test. Variability in data pipelines surfaced. Differences in stakeholder needs became more pronounced. Cost assumptions changed once usage expanded beyond a controlled subset. Metrics that seemed sufficient in theory proved inadequate in practice.
The pressure did not create the weaknesses. It revealed them.
I’ve watched the same pattern unfold in other contexts.
In one case, a generative model was introduced to help draft customer communications. The demo was compelling. With curated prompts and examples, the system produced usable content. It hinted at real scale, and the leadership team liked what they saw.
The stated goal was efficiency. Produce more output in less time.
But efficiency was a proxy for something nobody had fully defined. Was success higher engagement? Improved response rates? Stronger brand consistency? Faster turnaround? The system could generate text, but it couldn’t determine which message was right for which audience segment. It couldn’t encode organizational voice without deliberate structure. It couldn’t tell you whether what it produced was actually better, because nobody had agreed on what better meant.
The complexity didn’t disappear when the tool was adopted. It surfaced.
Measurement frameworks had to be built from scratch. Editorial standards had to be written down for the first time. Experiments had to be designed carefully enough to mean something. The promise of speed ran well ahead of the work required to turn speed into value.
The technology functioned. The surrounding system required definition.
There is a broader pattern here.
AI doesn’t introduce ambiguity into organizations. It finds the ambiguity that was already there and makes it move faster. Unclear ownership becomes a bottleneck overnight. Imprecise metrics become arguments about whether anything worked. Inconsistent data becomes a reliability issue in production. The model doesn’t create these conditions. It removes the slack that had been quietly absorbing them.
I think about stress tests in engineering. They aren’t performed to prove a system works under ideal conditions. They’re performed to understand how it behaves under load, where the weak points are, what fails first, and why.
Generative AI acts as a similar test inside organizations.
The demo proves possibility. Deployment applies pressure.
Under that pressure, organizations discover whether they defined the job clearly enough, whether their measurement systems are disciplined enough, whether their governance structures can absorb additional complexity, and whether they’re willing to slow down long enough to align before they scale.
The promise of AI was not inherently wrong. Many of the projected gains were directionally sound. But the promise assumed a level of structural readiness that most organizations had never examined, because nothing had ever required them to.
That readiness is what it took.
This is not a story about bad technology or careless leadership. It’s a story about what happens when building gets easier before thinking does.
When a working model exists, momentum builds quickly. The demo impresses the room. The business case gets approved. The roadmap shifts. And the slower work, the kind that requires sitting with hard questions before anyone writes a line of code, starts to look like unnecessary delay.
Under acceleration, patience feels irresponsible.
But ambiguity doesn’t disappear under pressure. It compounds.
In both of these initiatives, the most significant challenges were not technical. They were definitional. What exactly were we trying to improve? For whom? How would we know when we got there? What tradeoffs were acceptable once we operated at scale?
Those questions don’t disappear because a model performs well in a demo. They become more urgent.
AI does not eliminate the need for product leadership. It intensifies it.
So what does clarity actually look like before you build?
It starts with the job. Not the efficiency narrative or the cost reduction story that fits neatly into a business case, but the real work the tool is supposed to do and for whom. In the summarization example, that meant asking not just whether time could be saved writing notes, but what those notes were actually for. Who reads them next? What decision do they support? What happens downstream when they’re incomplete? A summary isn’t valuable because it exists. It’s valuable because of what it carries forward.
It extends to the people who will live with the output. Not just the ones in the demo. Different stakeholders interact with the same artifact in fundamentally different ways. Designing for one and discovering the others in production is an expensive way to learn something that a few deliberate conversations could have surfaced earlier.
It forces agreement on what success means before the first model is trained. Not directionally, but specifically. What metric moves? By how much? Over what timeframe? What would failure look like, and how would you know? These conversations are uncomfortable because they expose tradeoffs. But they are far less expensive than months of development followed by a room full of people debating whether anything worked.
And it requires honesty about the foundation. Clean data. Clear ownership. Defined workflows. Realistic cost assumptions at scale. These aren’t bureaucratic hurdles. They are the conditions that determine whether what gets built is worth sustaining.
None of this is slow for its own sake. It’s the work that makes speed durable. Organizations that did it well weren’t cautious. They were precise. They moved quickly once they knew what they were building and why. The ones that skipped it moved fast too, right up until the moment they didn’t.
Clarity before speed isn’t a philosophy. It’s the actual cost of doing this right.
The summaries looked clean.
Under pressure, the gaps appeared.
The model did what it was designed to do.
The question was whether the organization around it was ready to carry the weight.
You were promised everything.
What it took was clarity before speed.
