This is a guest article from Utkarsh Kanwat. Utkarsh specializes in production AI systems and enterprise deployments. He designs and builds AI agent architectures, automated workflows, and multi-agent systems, and helps teams navigate complex AI implementation challenges.
Before we dive into his article, let's get to know Utkarsh a little bit more:
1. How far do you think we are from seeing agents delivering real and reliable benefits in the enterprise?
We're already seeing real benefits. What's working are agents with human oversight: customer service, code generation, etc. These deliver ROI by solving specific problems with humans making final decisions. The real value will come when we solve reliability across longer workflows. Then we'll see agents autonomously doing work, handling the cognitive load of interconnected business operations. Until we solve error compounding in long-running processes, agents remain powerful tools with human oversight rather than autonomous operators. But today's AI systems are building the foundation for everything that comes next.
2. What's your recommended approach for highly regulated organizations building agent foundations?
Treat AI agents like any other critical system: extensive logging, human oversight, and clear rollback mechanisms. My insight from regulated environments is that compliance actually helps with reliability. The governance frameworks that satisfy regulators, human-in-the-loop (HITL) workflows, extensive testing, and so on, are the same ones that make agents work reliably. Start with low-risk use cases: internal document processing, code generation, and similar areas where mistakes are containable but you can still learn operational patterns.
3. What are you reading these days?
Recently, I have been diving into research papers on LLM reasoning and the LangChain State of AI Agents report. I also follow engineering blogs like those from Anthropic and OpenAI's research team.
As an AI engineer, I'd been building our internal AI systems for multiple years: agentic RAG chatbots, automated DevOps agents, the works. During the project’s spend review call, we confidently discussed our Google Cloud Vertex AI and infrastructure costs: a few tens of thousands of dollars for the quarter. Clean, predictable, totally reasonable.
Then, I started digging deeper.
I spent considerable time tracking down everything we weren't measuring. The hundreds of engineering hours on prompt and agentic flow optimization. The infrastructure engineering that scaling demanded. The security team’s weeks-long compliance reviews. The monitoring and observability stack we had to create.
That "reasonable" API bill? Total business cost impact was 5x to 10x higher at least. I wasn't looking at an iceberg, I was only staring at the tip of a cost structure. I've learned something that most engineering teams discover too late: We're optimizing for token costs when the real expense is system complexity.
Here's why your API bill is just 10% to 20% of your real AI costs, and here are the hidden layers that blindside engineering teams:
The Visible Costs
Every AI budget starts the same way: LLM API fees, cloud services, and initial development team salaries.
These aren't your full AI costs. They're just the only costs we know how to track. The procurement team can negotiate API pricing, the cloud team understands compute costs, and the engineering manager can estimate developer time. So, naturally, these become the numbers we optimize for.
But here's the problem: What you can measure becomes what you manage. These visible cost estimates are just a fraction of what you'll actually spend.
The Production Reality
When we built our internal agentic chat system, we made a mistake: We assumed scaling up a working demo meant handling more requests and adding error handling just like any other software project. We were very wrong because scaling up AI systems means rebuilding everything.
Let’s look at what actually happens when your customer support demo meets production. Your demo chatbot says "Your bill is $X," but your production bot thinks: “Let me check payment history... verify customer tier... review previous complaints... validate against the billing system... confirm tone matches support guidelines... provide answer…” Same answer, but at least 6x the tokens.
This is what we need to handle in production:
- Context Explosion: Production needs conversation history, user profiles, and system state where demos used clean inputs.
- Error Handling: Graceful demo failures become complex recovery workflows that guide users through multiple fallback paths.
- Validation Overhead: Every response gets fact-checked, tone-verified, and policy-validated.
Your token calculation assumes perfect inputs and simple outputs, but production assumes everything breaks and requires proof that it works. This shows up in months of engineering time building validation frameworks, monitoring systems, and error recovery logic that your demo never needed. Your 200-token demo response becomes a 1,200-token production interaction, and suddenly your cost model is off by a factor of six.
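To make that arithmetic concrete, here's a back-of-the-envelope sketch in Python. The 200 and 1,200 token figures mirror the example above; the blended per-token price and the monthly volume are placeholder assumptions, not numbers from our deployment.

```python
# Rough cost comparison: demo-style estimate vs. production reality.
# PRICE_PER_1K_TOKENS and INTERACTIONS_PER_MONTH are illustrative
# placeholders -- substitute your provider's real input/output rates.

PRICE_PER_1K_TOKENS = 0.01        # assumed blended USD rate per 1K tokens
INTERACTIONS_PER_MONTH = 50_000   # assumed traffic

def monthly_cost(tokens_per_interaction: int) -> float:
    """Monthly spend at the assumed rate and volume."""
    return (tokens_per_interaction / 1000) * PRICE_PER_1K_TOKENS * INTERACTIONS_PER_MONTH

demo = monthly_cost(200)          # clean input, short answer
production = monthly_cost(1_200)  # history, validation, policy checks

print(f"demo-style estimate: ${demo:,.0f}/month")
print(f"production reality:  ${production:,.0f}/month")
print(f"cost model off by:   {production / demo:.0f}x")
```

The point isn't the specific numbers; it's that the multiplier comes from tokens you never modeled, not from the rate card.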
The Agent Integration
Your business systems weren't built for AI. AI wasn't built for your business systems. Welcome to integration complexity.
When I built our internal knowledge search agent, I thought the hard part was the AI. The "simple" task of connecting to our knowledge sources turned into building APIs that didn't exist, creating data pipelines for multiple different platforms, and implementing permission logic that could understand which agent should access which information.
Here's what I learned: Every agent integration point becomes a custom software development project. Your agents need to connect to:
- Legacy systems: APIs with non-standard schemas that predate REST conventions
- Internal tools: Software that never anticipated AI integration
- Permission systems: Custom authentication and authorization logic
- Tribal knowledge: Workflows that exist nowhere except in senior employees' heads
Each AI agent needs its own permission system, and those permissions have to change based on who's using it. Suddenly you're not just managing agent permissions, you're managing agent-user permission combinations. The security review process becomes its own cost center: every integration point needs a separate assessment, every agent needs its own threat model, and every data pathway needs evaluation for potential misuse.
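Here's a minimal sketch of that combined check, assuming a simple allow-list model. The agent names, roles, and data sources are made up for illustration; a real system would sit on top of your identity provider and existing RBAC rather than a hard-coded dictionary.

```python
# Effective access = intersection of what the agent may ever touch
# and what the current user is allowed to see.
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    name: str
    allowed_sources: frozenset  # data sources this agent is ever allowed to read

@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset            # roles granted by your identity provider

# Illustrative role-to-source mapping; in practice this comes from your RBAC system.
ROLE_ACCESS = {
    "engineering": {"eng-docs", "runbooks"},
    "hr": {"hr-policies"},
}

def can_access(agent: Agent, user: User, source: str) -> bool:
    """Allow only when both the agent and the current user are cleared for the source."""
    if source not in agent.allowed_sources:
        return False
    return any(source in ROLE_ACCESS.get(role, set()) for role in user.roles)

kb_agent = Agent("knowledge-search", frozenset({"eng-docs", "hr-policies"}))
alice = User("alice", frozenset({"engineering"}))

print(can_access(kb_agent, alice, "hr-policies"))  # False: the agent may, Alice may not
print(can_access(kb_agent, alice, "eng-docs"))     # True: both are cleared
```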
The integration work never ends. Business systems evolve, APIs change, and your agents may break until someone fixes the connections. You're not just building AI agents, you're also building a custom integration layer between AI and your entire software stack.
The Human Oversight
Here is something to think about: You can't evaluate what you don't understand. And the better your agent gets, the harder it becomes to evaluate.
When we launched our internal knowledge search agent, I thought validation would be straightforward, just check if the answers look right. But "looks right" isn't the same as "is right." Our agent would confidently cite internal documents that existed but misrepresent their content in subtle ways. It would combine information from different sources correctly but draw conclusions that didn’t make sense.
We ended up needing domain experts from different teams to dedicate significant time to validation and feedback. Not just anyone could evaluate the agent's responses; you needed someone who understood both the specific domain and how AI models fail in that context.
A legal team member could spot when the agent misinterpreted compliance requirements, but they couldn't explain why the AI consistently made that type of error. An AI engineer could identify the pattern, but they couldn't catch the legal nuances. This multiplies across every agent type. A code review agent might approve code that compiles and follows style guidelines but has security vulnerabilities that only experienced developers would catch.
The maintenance burden is relentless because agent behavior drifts over time. Models update, business requirements change, edge cases emerge. An agent that worked perfectly before might suddenly start making subtle errors in how it interprets documents, errors that only someone with both domain expertise and AI experience would recognize as model drift rather than source material issues.
You're not just building AI agents, you're creating ongoing oversight responsibilities that require specialists who understand both your business domain and AI failure modes.
The Evolution
Traditional software scales with usage but AI systems expand with discovery. When software succeeds, you get more users doing the same thing. When AI succeeds, you get users discovering entirely new things it can do. Your document search agent may become a customer support tool. Your code review bot could start helping with deployment decisions. Your internal chatbot may become the company's knowledge management system.
This isn't scope creep; it's AI revealing problems you didn't know you had. Here's what happens when your AI succeeds beyond its original scope:
Scope Expansion
Your "simple" document search now ingests HR policies, engineering docs, and customer feedback. Each new data source brings its own challenges and complexity.
Context Problems
As your agents handle more complex and varied tasks, conversations grow longer and require more context. What started as simple Q&A becomes multi-turn reasoning that hits token limits and forces expensive context management.
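A small example of the kind of context management this forces on you: keeping only the newest turns that still fit a token budget. This is a rough sketch; the four-characters-per-token estimate stands in for a real tokenizer, and production systems typically layer summarization or retrieval on top.

```python
# Keep the newest conversation turns that fit next to the system prompt.
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token; use a real tokenizer in practice.
    return max(1, len(text) // 4)

def trim_history(system_prompt: str, turns: list[str], budget: int) -> list[str]:
    """Return the most recent turns that fit within the token budget."""
    used = estimate_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(turns):      # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))       # restore chronological order

history = [f"turn {i}: user asked about billing edge case #{i}" for i in range(500)]
print(len(trim_history("You are an internal support agent.", history, budget=2_000)))
```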
Infrastructure Complexity
The infrastructure supporting your AI grows its own complexity. Logging systems that tracked simple queries now audit complex decisions. Monitoring tools now track agent reasoning patterns across multiple systems.
Every solved problem reveals new problems your AI could theoretically solve. Users don't just adopt your AI, they adapt it to use cases you never imagined. Your budget assumes linear growth, but you may be dealing with exponential scope expansion.
Strategies That Help
After building systems that hit these cost layers, here's what I've learned works:
Start with your hardest production scenario: Teams build a simple demo that works, then try to scale it up. Instead, identify your most complex production requirement and design for that from day one.
Build shared infrastructure: Every AI system needs the same foundational capabilities: validation frameworks, domain expert feedback loops, permission management, monitoring systems, and integrations. Build these as shared services from day one. This foundation makes subsequent agents much easier to deploy.
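As a sketch of what those shared services can look like, here's one shape for it: a single wrapper that every agent call passes through, so logging, validation, and human-review routing are built once instead of per agent. The validation and routing functions are placeholders you'd implement against your own policies.

```python
import logging
import time
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-platform")

def validate_response(agent: str, prompt: str, response: str) -> list[str]:
    """Return a list of policy violations; empty means it passed. Placeholder."""
    return []

def needs_human_review(agent: str, violations: list[str]) -> bool:
    """Placeholder routing rule: anything with violations goes to a human."""
    return bool(violations)

def run_agent(agent_name: str, call_model: Callable[[str], str], prompt: str) -> str:
    """Shared entry point: every agent call gets timed, validated, and logged."""
    start = time.time()
    response = call_model(prompt)
    violations = validate_response(agent_name, prompt, response)
    log.info("agent=%s latency=%.2fs violations=%s",
             agent_name, time.time() - start, violations)
    if needs_human_review(agent_name, violations):
        log.info("agent=%s routed to human review queue", agent_name)
    return response

# Usage: every agent, whatever it does, goes through the same wrapper.
print(run_agent("doc-search", lambda p: "stub answer", "Where is the VPN setup guide?"))
```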
Make domain expert validation efficient: You can't eliminate the need for human validation, but you can make it systematic. Create structured feedback loops where domain experts can quickly review and provide feedback. Build interfaces that make expert feedback easy to capture and integrate back into the system.
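One way to keep that feedback structured rather than scattered across chat threads is a consistent record that your evals and dashboards can consume later. The field names below are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ExpertFeedback:
    """A single expert review of one agent interaction (illustrative fields)."""
    agent: str
    interaction_id: str
    verdict: str              # e.g. "correct", "wrong_citation", "wrong_conclusion"
    domain: str               # e.g. "legal", "billing", "engineering"
    note: str = ""
    reviewed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

review = ExpertFeedback(
    agent="knowledge-search",
    interaction_id="abc-123",
    verdict="wrong_citation",
    domain="legal",
    note="Cites the right retention policy but inverts the retention period.",
)
print(review.verdict, review.reviewed_at.isoformat())
```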
Plan for deliberate scope expansion: Your AI may evolve beyond its original purpose. Build explicit boundaries or processes for evaluating new use cases.
The Real Problem
The deeper issue isn't that AI is expensive, it's that we're using simple mental models to budget and optimize for production AI solutions.
Consumer AI tools like ChatGPT succeed because they've absorbed all the complexity costs (infrastructure, integration, monitoring) into their subscription price. When you pay $20/month for ChatGPT Plus, you're not just paying for tokens, you're paying for all the hidden infrastructure that makes those tokens useful.
Building your own AI forces you to rebuild all that hidden infrastructure yourself, so organizations shouldn’t budget as if they're just buying tokens and cloud services. It's like budgeting to buy a car engine and being surprised when you also need to build the transmission, suspension, and electrical systems.
The organizations that succeed with enterprise AI are the ones that budget for the full engineering problem, not just the visible API costs. They treat AI implementation like building a critical and complex system, because that's what it actually is.
The Bottom Line
Successful AI tools in production can cost 5-10 times more than their pilot versions. This aligns with IDC research, reported by CIO, showing that 88% of AI pilots fail to reach production, often due to these hidden costs. This isn't a failure of the technology; it's what success looks like. The difference between a demo and a production system is all the boring, invisible work that makes it reliable, secure, scalable, and actually useful for real business workflows.
I believe the companies that thrive with AI aren't the ones that build the coolest demos. They're the ones that budget for the full iceberg from day one, design systems that can handle coordination complexity, and build teams that can maintain AI systems.
Your OpenAI bill isn't lying to you, it's just not telling you the whole story. The question isn't whether you can afford to pay attention to the hidden costs. It's whether you can afford to ignore them.