
Opening the engine room
How we built our third-generation AI tool, balancing compliance, cost, and time to impact
Published Feb 11, 2026

Most organisations don’t just need a great AI chatbot; they need a secure, cost-efficient, and evolving platform that fits how they actually work. IM_GPT 3.0 is our approach: an application hosted in our own Azure environment, integrated with our existing company user profiles (Entra ID), that grounds answers in our knowledge, supports multiple frontier models, and lets us ship new assistants and tools quickly.
Why build your own when ChatGPT, Claude, and Gemini exist?
Most clients, as well as our own teams, don’t just ask whether they can use state-of-the-art AI; even more frequently, they ask how they can do so responsibly, cost-effectively, and within existing workflows. While off-the-shelf tools are useful, full enterprise compliance, as required by us and our clients, demands more:
- Full control over data residency
- Clear and auditable traceability
Furthermore, the classic ‘build or buy’ dilemma increasingly pointed us toward building, which became a cornerstone of our AI tool strategy:
- The ability to tailor the UX precisely to how we work
- Consumption-based pricing, reducing our costs by more than 95% compared to enterprise solutions like ChatGPT and Claude
- Model flexibility, allowing us to switch to whichever model best fits each task, without vendor lock-in – even when we have access to premium models such as $200/month GPT-5 Pro
- A company-owned platform, giving us full control of the data path, seamless integration with identity and content systems, and the ability to scale targeted assistants across the organisation
- Intrinsic value in going through the same thinking, design, and development process as our clients
The case for building platforms
- Data residency and sovereignty: Regulations and client expectations often require data in the EU and tenant-controlled storage. While major vendors have improved EU data residency options, many organisations still prefer end-to-end control of chats, files, logs, and indexes within their own cloud.
- Economics that reflect usage: Seat licenses are simple but blunt instruments. Usage is spiky. Some colleagues live in our tool; others open it once a week. Paying per token lets you reserve premium models for power users and keep the average cost per user low.
- Integration where people work: Single sign-on with Entra ID, SharePoint/OneDrive connectors, and logging/monitoring in your own observability stack. This enables governance and faster change management (a small sketch of the sign-on check follows below).
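To make the sign-on point concrete, here is a minimal sketch of how an Entra ID access token could be verified in a FastAPI dependency, assuming PyJWT and FastAPI; the tenant ID, audience, and route are placeholders, not our actual configuration.

```python
# Minimal sketch: validating an Entra ID access token in a FastAPI dependency.
# Tenant ID, audience, and route names are placeholders, not production values.
import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

TENANT_ID = "<your-tenant-id>"
AUDIENCE = "<your-app-client-id>"
JWKS_URL = f"https://login.microsoftonline.com/{TENANT_ID}/discovery/v2.0/keys"

app = FastAPI()
bearer = HTTPBearer()
jwks_client = jwt.PyJWKClient(JWKS_URL)  # fetches and caches Microsoft's signing keys

def current_user(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> dict:
    """Verify the bearer token against Entra ID and return its claims."""
    try:
        signing_key = jwks_client.get_signing_key_from_jwt(creds.credentials)
        return jwt.decode(
            creds.credentials,
            signing_key.key,
            algorithms=["RS256"],
            audience=AUDIENCE,
            issuer=f"https://login.microsoftonline.com/{TENANT_ID}/v2.0",
        )
    except jwt.PyJWTError as exc:
        raise HTTPException(status_code=401, detail="Invalid token") from exc

@app.get("/me")
def me(user: dict = Depends(current_user)) -> dict:
    # Claims such as "oid" and "preferred_username" identify the signed-in user.
    return {"user": user.get("preferred_username")}
```

Because every request carries a verified identity, downstream logging and permission-aware retrieval can be tied to the same user profile the rest of the organisation already manages.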
Finding the right balance between advanced architecture and rapid iteration
We moved from a Streamlit prototype to a React frontend and FastAPI backend, establishing clean architecture boundaries that let us iterate quickly without compromising stability. This shift brought a mobile-app-level user experience, with robust state management (provider/consumer patterns, background re-fetching for snappy UIs) and low-latency streaming that makes interaction feel immediate.
Our retrieval augmented (RAG) stack also matured. While early POCs used local vector databases, production required a solution that could support enterprise-scale workloads. We adopted Azure AI Search to enable hybrid/vector retrieval across curated SharePoint spaces, projects, and policy libraries, while surfacing relevant supporting information to improve trust.
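As an illustration of what a hybrid query looks like, here is a hedged sketch using the azure-search-documents SDK; the index name, field names, and the precomputed embedding are illustrative assumptions rather than our production schema.

```python
# Hedged sketch of a hybrid (keyword + vector) query against Azure AI Search.
# Index name, field names, and the embedding step are illustrative assumptions.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="knowledge-base",          # assumed index of curated SharePoint content
    credential=AzureKeyCredential("<api-key>"),
)

def hybrid_search(query: str, embedding: list[float], top: int = 5) -> list[dict]:
    """Combine BM25 keyword scoring with vector similarity in a single request."""
    results = client.search(
        search_text=query,                      # keyword leg
        vector_queries=[VectorizedQuery(
            vector=embedding,                   # vector leg (embedding computed upstream)
            k_nearest_neighbors=top,
            fields="content_vector",            # assumed vector field name
        )],
        select=["title", "chunk", "source_url"],  # assumed fields
        top=top,
    )
    # Keep provenance with every hit so answers can cite their sources.
    return [{"title": r["title"], "chunk": r["chunk"], "source": r["source_url"]}
            for r in results]
```

Running both legs in one request lets exact keyword matches and semantically similar passages compete on equal footing, which is what makes the surfaced supporting information feel trustworthy.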
Finally, Azure AI Foundry’s agent capabilities now allow assistants to execute structured task flows, call tools, and ground responses in enterprise data, with governance controls for scenarios that may involve sensitive information.
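To show the governance idea in isolation, here is a generic, self-contained sketch of a structured tool-call dispatch with a human-approval gate for sensitive actions; it is illustrative only and does not use the Azure AI Foundry SDK, and the tool names are invented for the example.

```python
# Illustrative only (not the Azure AI Foundry SDK): tool calls requested by an
# assistant are dispatched through a registry, and tools marked as sensitive
# require explicit human approval before they run.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    func: Callable[..., str]
    sensitive: bool = False   # e.g. touches client data or external systems

TOOLS: dict[str, Tool] = {
    "search_projects": Tool(lambda query: f"results for {query!r}"),
    "send_summary_email": Tool(lambda to, body: f"queued mail to {to}", sensitive=True),
}

def run_tool_call(name: str, approved: bool = False, **kwargs) -> str:
    """Execute a tool requested by the model, enforcing the approval gate."""
    tool = TOOLS[name]
    if tool.sensitive and not approved:
        return "Blocked: human approval required before this tool can run."
    return tool.func(**kwargs)

print(run_tool_call("search_projects", query="Q1 proposals"))
print(run_tool_call("send_summary_email", to="pm@example.com", body="..."))
```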
1. Web search compliance is easy to overlook but impossible to ignore
Even if your LLM stack is compliant, the surrounding ecosystem may not be. Consumer web search can introduce privacy and compliance risks. We solved this by ensuring enterprise-controlled grounding and explicit human-in-the-loop steps where needed, preventing unintended exposure of confidential terms or PII.
2. UX quality largely determines broad-based adoption
If the platform doesn’t feel fast, responsive, and dependable, users won’t integrate it into their daily work, regardless of how powerful it is. Treat UX as a first-class engineering concern rather than a late-stage polish task. And make sure to help new users navigate the platform and to provide dedicated training.
3. Your RAG architecture must scale and justify every answer
Enterprise RAG is about being both reliable and credible. The system must scale to a multitude of queries while providing clarity on why results were surfaced, ensuring users can trust outputs and operational teams can govern them.
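One way to make that justification tangible is to treat provenance as part of the answer contract itself. The sketch below shows the shape such a payload might take; the field names are illustrative, not our production schema.

```python
# Sketch of a "justified" answer: every response carries the chunks that
# grounded it, with scores and sources, so users and reviewers can audit it.
from dataclasses import dataclass

@dataclass
class Citation:
    source: str      # e.g. a SharePoint URL or document path
    snippet: str     # the retrieved text actually shown to the model
    score: float     # retrieval score explaining why it was surfaced

@dataclass
class GroundedAnswer:
    text: str
    citations: list[Citation]

answer = GroundedAnswer(
    text="Travel expenses above the threshold require pre-approval.",
    citations=[Citation(source="policies/travel.docx",
                        snippet="Expenses above the threshold require pre-approval.",
                        score=0.87)],
)
```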
The ‘build vs buy’ decision and how to make it
1. Choose a platform build if:
- You want strict residency and end-to-end auditability (identity, storage, vector search, prompts, telemetry).
- You expect multiple assistants (HR/IT/Finance support, proposal helpers, project knowledge copilots) and want one consistent UX.
- You can staff the triangle-of-AI-anno-2026: software engineering (async/streaming services), data/ML engineering (pipelines, embeddings, vector search), and product engineering (model chaining/prompt engineering).
2. Choose seat licenses if:
- You’re small or early (e.g., <100–150 users) without an engineering team to maintain a platform. At this size, the total cost of acquisition and ownership will likely be lower.
- Your near-term need is primarily simple chat and document Q&A.
The engineering patterns that make it work
We’ve built the platform around patterns that keep it responsive under load, adaptable as models and APIs change, and practical for enterprise use. The emphasis is on streaming performance, architectural flexibility, and supporting advanced agentic and retrieval functionality as the platform scales.
Performance and scaling
We use a streaming-first approach end to end, so users get low-latency feedback even when traffic spikes. Model outputs and tool results stream as they’re produced using async queues, and careful parallelisation avoids token and rate-limit bottlenecks, which is important when hundreds of users prompt, upload, and search at the same time.
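A minimal sketch of that producer/consumer pattern, assuming FastAPI: a task pushes tokens onto an asyncio.Queue while a StreamingResponse drains it, so the first tokens reach the browser before generation has finished. The fake producer stands in for a real model client.

```python
# Minimal sketch of the streaming-first pattern with an asyncio.Queue between
# the model producer and the HTTP response consumer.
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_model_stream(prompt: str, queue: asyncio.Queue) -> None:
    """Stand-in for an LLM client; real code would forward provider chunks."""
    for token in f"Echoing: {prompt}".split():
        await queue.put(token + " ")
        await asyncio.sleep(0.05)            # simulate generation latency
    await queue.put(None)                     # sentinel: stream finished

@app.get("/chat")
async def chat(prompt: str) -> StreamingResponse:
    queue: asyncio.Queue = asyncio.Queue(maxsize=64)   # bounded queue gives back-pressure
    asyncio.create_task(fake_model_stream(prompt, queue))

    async def drain():
        while (token := await queue.get()) is not None:
            yield token

    return StreamingResponse(drain(), media_type="text/plain")
```

The bounded queue is the important detail: under load, slow consumers apply back-pressure to producers instead of letting buffers grow unchecked.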
Flexibility of the platform
The platform is built to be as flexible as possible so it can absorb the constant flow of new capabilities in the models we serve. The backend is therefore built on FastAPI with a clean architecture. Dependencies are inverted across the stack: interfaces and adapters isolate search, storage, and model integrations so providers can be swapped with minimal code rewrites. This keeps the core stable while allowing changes at the edges as APIs, providers, and models evolve.
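Here is a small sketch of that dependency inversion in Python; the class and method names are illustrative, not our actual interfaces.

```python
# Sketch of dependency inversion: the core service depends only on small
# interfaces, and concrete providers are plugged in as adapters at the edges.
from typing import Protocol

class ChatModel(Protocol):
    async def complete(self, prompt: str) -> str: ...

class DocumentSearch(Protocol):
    async def search(self, query: str, top: int) -> list[str]: ...

class ChatService:
    """Core logic: unaware of which vendor sits behind either interface."""
    def __init__(self, model: ChatModel, search: DocumentSearch) -> None:
        self._model = model
        self._search = search

    async def answer(self, question: str) -> str:
        context = await self._search.search(question, top=3)
        prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
        return await self._model.complete(prompt)

# Swapping providers means writing a new adapter, not rewriting ChatService:
class AzureOpenAIModel:
    async def complete(self, prompt: str) -> str:
        return "..."  # call the model provider's SDK here

class AzureAISearchAdapter:
    async def search(self, query: str, top: int) -> list[str]:
        return ["..."]  # call Azure AI Search here

service = ChatService(AzureOpenAIModel(), AzureAISearchAdapter())
```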
Advanced functionality
We’ve designed the platform to support many tools, with a primary focus on getting search and document retrieval right. We use Azure AI Search, rather than hosting our own indexing stack, for its scalability and managed reliability, and we rely on its hybrid (vector + keyword) retrieval across our document bases. At the same time, we run custom indexers for SharePoint, proposals, project logs, and policies, with domain-specific chunking and metadata tagging aligned to each use case. This is where we lift quality over generic copilots: less noise and more relevant answers.
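The sketch below shows the kind of domain-specific chunking and metadata tagging a custom indexer might perform before uploading documents to the search index. The heading-based splitting and the metadata fields are illustrative choices under stated assumptions, not our exact pipeline.

```python
# Hedged sketch of domain-specific chunking with metadata tagging for a policy
# document, prior to upload into the search index.
import re

def chunk_policy_document(doc_id: str, source_url: str, text: str,
                          max_chars: int = 1500) -> list[dict]:
    """Split on headings first so chunks respect the document's own structure,
    then enforce a size cap; attach metadata used for filtering and citations."""
    sections = re.split(r"\n(?=#+ |\d+\.\s)", text)
    chunks: list[dict] = []
    for section in sections:
        for start in range(0, len(section), max_chars):
            body = section[start:start + max_chars].strip()
            if not body:
                continue
            chunks.append({
                "id": f"{doc_id}-{len(chunks)}",
                "content": body,
                "source_url": source_url,     # enables citations in answers
                "doc_type": "policy",         # enables per-domain filtering
            })
    return chunks
```

Chunking along the document's own structure, rather than at arbitrary character offsets, is a large part of why retrieval feels less noisy than a generic copilot over the same content.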
Front-end experience
The React front end manages app-level state for authentication, model selection, and multipanel UIs. Background data fetching keeps interactions responsive, and the UI is built for streaming: tokens, tool calls, and citations are shown as they arrive so users can see system activity and intervene when needed.
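For a sense of what the front end actually consumes, here is a small sketch of a streaming event envelope in which tokens, tool calls, and citations arrive as distinct server-sent events; the event names and payload fields are illustrative assumptions.

```python
# Sketch of the event envelope the UI consumes: distinct event kinds let the
# client render tokens, tool activity, and citations as each one arrives.
import json

def sse_event(kind: str, payload: dict) -> str:
    """Format one server-sent event; the client switches on the `event:` field."""
    return f"event: {kind}\ndata: {json.dumps(payload)}\n\n"

# Example of what a single turn might emit, in order:
stream = [
    sse_event("tool_call", {"name": "search_documents", "query": "travel policy"}),
    sse_event("citation", {"source": "policies/travel.docx", "score": 0.87}),
    sse_event("token", {"text": "Expenses above the threshold "}),
    sse_event("token", {"text": "require pre-approval."}),
    sse_event("done", {}),
]
print("".join(stream))
```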
What’s next
IM_GPT is evolving from a single app into a platform that supports everyday work end to end. In 2026, we plan to expand agentic workflows that can plan, call tools, and draft outputs for review, so routine desk research, content production, and analysis can run more autonomously. Everything remains grounded in company knowledge with citations, permission-aware retrieval, and auditability, while we add more enterprise connectors and reusable tools. The goal is to reduce friction in the workday without compromising governance.

Key takeaways
Rolling out AI is a change programme. We deliberately launched core functionality early, trained champions, and expanded features as colleagues progressed from “ask anything” to task solvers and then to department spearheads. The cadence matters: iterate fast enough to learn, but not so fast that users can’t keep up, and let usage decide the next feature release.
- Start from your workflows, not from a model choice.
- If compliance and cost control matter, a small platform layer can outperform enterprise solutions from frontier labs like OpenAI or Anthropic, provided you have the AI engineering triangle to run it.
- Ship core value fast, teach people how to use it, and scale what sticks.
- Maintain a modular stack and design your architecture so that changing providers does not disrupt operations.


