Keeping Agentic Applications Safe: Insights From O’Reilly & Dataiku

Scaling AI Kurt Muehmel

As GenAI makes its way deeper into enterprise systems, the bar for safety is rising fast. Today’s agentic applications — powered by large language models (LLMs) — aren’t just generating responses; they’re making decisions, interacting with tools, and operating in complex, high-stakes environments.

In the just-released sneak peek of Chapter 5 of our upcoming technical guide in partnership with O'Reilly Media, "The LLM Mesh: An Architecture for Building Agentic Applications in the Enterprise,” we focus on safety: what it means, why it matters, and how IT leaders can design for it using an LLM Mesh architecture. With just two chapters remaining, now is the time to ensure your AI systems are not only smart and scalable, but safe by design.

→ Read Now: The First 5 Early Release Chapters

Let's Back Up: What Does "Safe" Really Mean?

In enterprise AI, safety is about more than just accuracy. A truly safe agentic application is:

  • Reliable: It behaves predictably and avoids hallucinations.
  • Harmless: It avoids generating toxic, biased, or offensive content.
  • Compliant: It respects laws, regulations, and company policies.

While safety is technically a subset of quality (discussed in Chapter 4), it must be treated as a top-level concern. An answer that is factually correct but violates your company’s ethical guidelines — or worse, regulatory obligations — is a critical failure. In the six essential pillars for AI safety, an LLM Mesh is essential for efficiently ensuring three of them, and supports the remaining three.

An LLM Mesh’s contribution to providing six pillars of AI safety.

An LLM Mesh’s contribution to providing six pillars of AI safety.

Designing for Safety in an LLM Mesh

Safety isn’t a single checkpoint; it’s a layered system. In an LLM Mesh, safety filters are embedded throughout the architecture:

  • At the input layer, to catch harmful or malicious user prompts before they reach the LLM.
  • At the model and tool invocation level, to prevent misuse of external systems.
  • At the output layer, to filter or flag unsafe generated content before it reaches users.

These filters are modular and reusable, making it easy for development teams to enforce safety policies consistently across applications. Some filters operate in a “self-correcting” mode (e.g., redacting PII automatically), while others work in a “fail-safe” mode — halting execution or escalating for human review when high-risk content is detected.

The key is defense in depth: using multiple safety layers to create overlapping protections. If one filter misses something, another can catch it.

Moderation and Hallucination Detection 

The LLM Mesh supports multiple moderation strategies, which can be mixed and matched depending on the application’s risk profile.

Content moderation can include:

  • Rules-based techniques e.g., keyword blacklists or regex patterns to detect PII
  • ML classifiers: Trained to flag hate speech, harassment, or other policy violations
  • Hosted APIs such as OpenAI’s Moderation API, Azure AI Content Safety, or Google’s Perspective API
  • Self-hosted models including Microsoft Presidio, NVIDIA’s NemoGuard, or Google’s ShieldGemma.

Combining lightweight, self-hosted classifiers for everyday use with more powerful hosted services for complex edge cases can offer both cost efficiency and strong coverage.

Hallucinations — outputs that sound right but are factually wrong — are another major risk. An LLM Mesh can integrate established and emerging techniques such as:

  • Semantic Entropy Probes (SEPs), which monitor a model’s internal uncertainty.
  • Claim extraction and verification pipelines, which cross-check LLM outputs against trusted knowledge bases.
  • Retrieval-augmented generation (RAG), which grounds responses in factual documents and reduces the chance of made-up content.

When needed, an LLM Mesh can flag suspect content, warn the user, or automatically route the prompt to a more grounded answer.

Operational Transparency in an Opaque System

LLMs are black boxes. You can’t fully explain why a model chose a particular word or response. But the LLM Mesh offers transparency, even when full explainability is out of reach. It does this by logging:

  • The origin and structure of prompts.
  • Which tools, agents, and retrieval systems were involved.
  • Which LLM model was used and why.
  • What filters were triggered at each stage.

This creates an auditable trail, allowing compliance, legal, or operations teams to understand exactly what happened — even if the model’s internal logic remains opaque. Without an LLM Mesh, these events are fragmented across teams and tools. With one, safety and traceability are built in.

Adversarial and Robustness Testing

No safety system is complete without testing. Agentic applications must be validated under pressure, just like software is stress-tested before production. Adversarial testing includes:

  • Simulating jailbreak attempts using tricky or role-play prompts.
  • Embedding malicious commands inside legitimate-looking inputs.
  • Using tools like HarmBench to test against known harmful behavior scenarios.

Robustness testing pushes the system with unexpected inputs like extremely long conversations, code snippets or malformed characters, and context switches or contradictory instructions.

In a well-architected LLM Mesh, adversarial testing is automated and continuous. Red-team prompts can be run through agentic evaluators to spot vulnerabilities before bad actors find them.

Human Feedback as a Final Safeguard

Even the most sophisticated safety systems can’t catch every nuance. That’s why human-in-the-loop workflows are essential. Depending on the context, an LLM Mesh can:

  • Automatically escalate uncertain or high-risk responses to human reviewers.
  • Route certain topics — like medical or legal advice — to moderators or domain experts.
  • Learn from human corrections to improve filters, prompts, or fine-tuning data.

For example, in a customer support setting, an agentic system might draft a response, but flag it for review before sending it to a user. The human agent can then approve, edit, or replace the draft — and the system learns from that feedback. In less critical contexts, review might happen asynchronously or only when specific signals are triggered (e.g., a tone mismatch or edge-case classification).

Safety Is a Strategic Requirement

Safety isn’t just a technical detail — it’s a business requirement. It protects your users, your brand, and your legal standing. It should be treated on par with security, reliability, and performance.

With just two chapters left in this guide, we’re entering the final stretch. In Chapter 6, we’ll explore how to lock down your LLM Mesh with enterprise-grade security and access controls. And in our final chapter, we’ll walk through a full, concrete example of an agentic application, bringing together everything from cost and performance to safety and security.

You May Also Like

Why 46% of AI Models Fail — and How to Fix It

Read More

Create and Control AI Agents at Scale With Dataiku

Read More

How CoEs Can Use the 6S Framework to Scale Self-Service and AI Agents

Read More

5 AI Agent Use Cases to Kickstart Your Team's Transformation

Read More