In this episode Craig McLuckie, co-creator of Kubernetes and founder/CEO of Stacklok, talks about how to improve security and reliability for AI agents using curated, optimized deployments of the Model Context Protocol (MCP). Craig explains why MCP is emerging as the API layer for AI‑native applications, how to balance short‑term productivity with long‑term platform thinking, and why great tools plus frontier models still drive the best outcomes. He digs into common adoption pitfalls (tool pollution, insecure NPX installs, scattered credentials), the necessity of continuous evals for stochastic systems, and the shift from “what the agent can access” to “what the agent knows.” Craig also shares how ToolHive approaches secure runtimes, a virtual MCP gateway with semantic search, orchestration and transactional semantics, a registry for organizational tooling, and a console for self‑service—along with pragmatic patterns for auth, policy, and observability.
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- When ML teams try to run complex workflows through traditional orchestration tools, they hit walls. Cash App discovered this with their fraud detection models - they needed flexible compute, isolated environments, and seamless data exchange between workflows, but their existing tools couldn't deliver. That's why Cash App relies on Prefect. Now their ML workflows run on whatever infrastructure each model needs across Google Cloud, AWS, and Databricks. Custom packages stay isolated. Model outputs flow seamlessly between workflows. Companies like Whoop and 1Password also trust Prefect for their critical workflows. But Prefect didn't stop there. They just launched FastMCP - production-ready infrastructure for AI tools. You get Prefect's orchestration plus instant OAuth, serverless scaling, and blazing-fast Python execution. Deploy your AI tools once, connect to Claude, Cursor, or any MCP client. No more building auth flows or managing servers. Prefect orchestrates your ML pipeline. FastMCP handles your AI tool infrastructure. See what Prefect and FastMCP can do for your AI workflows at aiengineeringpodcast.com/prefect today.
- Unlock the full potential of your AI workloads with a seamless and composable data infrastructure. Bruin is an open source framework that streamlines integration from the command line, allowing you to focus on what matters most - building intelligent systems. Write Python code for your business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. With native support for ML/AI workloads, Bruin empowers data teams to deliver faster, more reliable, and scalable AI solutions. Harness Bruin's connectors for hundreds of platforms, including popular machine learning frameworks like TensorFlow and PyTorch. Build end-to-end AI workflows that integrate seamlessly with your existing tech stack. Join the ranks of forward-thinking organizations that are revolutionizing their data engineering with Bruin. Get started today at aiengineeringpodcast.com/bruin, and for dbt Cloud customers, enjoy a $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Craig McLuckie about improving the security of your AI agents through curated and optimized MCP deployment
- Introduction
- How did you get involved in machine learning?
- MCP saw huge growth in attention and adoption over the course of this year. What are the stumbling blocks that teams run into when going to production with MCP servers?
- How do improperly managed MCP servers contribute to security problems in an agent-driven software development workflow?
- What are some of the problematic practices or shortcuts that you are seeing teams implement when running MCP services for their developers?
- What are the benefits of a curated and opinionated MCP service as shared infrastructure for an engineering team?
- You are building ToolHive as a system for managing and securing MCP services as a platform component. What are the strategic benefits of starting with that as the foundation for your company?
- There are several services for managing MCP server deployment and access control. What are the unique elements of ToolHive that make it worth adopting?
- For software-focused agentic AI, the command-line based approach of Claude Code and similar tools opens the door to an effectively unbounded set of tools. What are the benefits of MCP over arbitrary CLI execution in that context?
- What are the most interesting, innovative, or unexpected ways that you have seen ToolHive/MCP used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on ToolHive?
- When is ToolHive the wrong choice?
- What do you have planned for the future of ToolHive/Stacklok?
Contact Info
Parting Question
- From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?
- Stacklok
- MCP == Model Context Protocol
- Kubernetes
- CNCF == Cloud Native Computing Foundation
- SDLC == Software Development Life Cycle
- The Bitter Lesson
- TLA+
- Jepsen Tests
- ToolHive
- API Gateway
- Glean
The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
Hello, and welcome to the AI Engineering podcast, your guide to the fast moving world of building scalable and maintainable AI systems.
[00:00:19] Tobias Macey:
When ML teams try to run complex workflows through traditional orchestration tools, they hit walls. Cash App discovered this with their fraud detection models. They needed flexible compute, isolated environments, and seamless data exchange between workflows, but their existing tools couldn't deliver. That's why Cash App relies on Prefect. Now their ML workflows run on whatever infrastructure each model needs across Google Cloud, AWS, and Databricks. Custom packages stay isolated. Model outputs flow seamlessly between workflows. Companies like Whoop and 1Password also trust Prefect for their critical workflows, but Prefect didn't stop there. They just launched FastMCP, production ready infrastructure for AI tools.
You get Prefect's orchestration plus instant OAuth, serverless scaling, and blazing fast Python execution. Deploy your AI tools once. Connect to Claude, Cursor, or any MCP client. No more building auth flows or managing servers. Prefect orchestrates your ML pipeline. FastMCP handles your AI tool infrastructure. See what Prefect and FastMCP can do for your AI workflows at aiengineeringpodcast.com/prefect today.
[00:01:29] Tobias Macey:
Unlock the full potential of your AI workloads with a seamless and composable data infrastructure. Bruin is an open source framework that streamlines integration from the command line, allowing you to focus on what matters most, building intelligent systems. Write Python code for your business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. With native support for ML and AI workloads, Bruin empowers data teams to deliver faster, more reliable, and scalable AI solutions. Harness Bruin's connectors for hundreds of platforms, including popular machine learning frameworks like TensorFlow and PyTorch.
Build end to end AI workflows that integrate seamlessly with your existing tech stack. Join the ranks of forward thinking organizations that are revolutionizing their data engineering with Bruin. Get started today at aiengineeringpodcast.com/bruin. And for dbt Cloud customers, enjoy a $1,000 credit to migrate to Bruin Cloud. Your host is Tobias Macey, and today I'm interviewing Craig McLuckie about improving the security of your AI agents through curated and optimized MCP deployments. So, Craig, can you start by introducing yourself?
[00:02:35] Craig McLuckie:
Yeah. Hey, thanks for having me on. I'm Craig. I'm the founder and CEO of Stacklok. My background is almost entirely in the distributed systems and enterprise software domain. I worked at Google, where I was responsible for bootstrapping and shipping Google Compute Engine. I then had the opportunity to look beyond that and came up with the idea for Kubernetes with a couple of other cofounders of the project. That obviously worked out pretty well. I started the Cloud Native Computing Foundation as a spiritual home for Kubernetes and other open source projects.
This is currently my second company, so I've worked a lot in the relatively traditional distributed systems space. At Stacklok, we became really interested in AI as a disruptive moment, sort of this epoch boundary. And so we started focusing on what we knew, which is the integration and the interfaces between existing systems and these new assistive and agentic technologies. It's great to be on your show.
[00:03:38] Tobias Macey:
And do you remember how you first got started working in the ML and AI space?
[00:03:42] Craig McLuckie:
Yeah, it's interesting. For me, it was a journey of fascination and increasing interest. I think we all had this moment when ChatGPT was released, and it was like, wow, this is just a wholly different class of system. It was fascinating, and I started playing with it as a curiosity. And I was an AI denialist for quite some time. I think a lot of people probably were. I'm an old-school distributed systems guy, and I imagined that it was going to be asymptotically convergent on a set of capabilities that would make it interesting and a curiosity in certain domains.
And then we'd start to see it emerge as this kind of new class of system that you could use to augment what you were already building. As my own use of the technology grew, initially around research-based functions, but then increasingly as I wanted to do things that I didn't want to bother my engineers with, things that would enable me to actually start writing code, it became clear that I was deeply undervaluing the impact. This really was an epoch-defining moment. Every epoch of humanity is kind of defined by its dominant technology, right? The stone age, the iron age, the digital age. And it became clear to me that we were entering into that space. For me, it was interesting because we started this company around a certain supposition: the idea that 80% of the world's software has historically been written by random people on the Internet. Open source libraries are the staple. I started Stacklok as a way to help build the bridge between the way that people are consuming software and the communities that are producing it. And the journey for me as an AI maximalist, building an AI maximalist company, really started when I asked myself: are my assumptions for this company still going to be true in five years? And are they going to be true in two ways? One, are people still building software the same way? Are they still using the same SDLC that they've always used, or is the role of the developer going to change?
And the second piece was: is the open source ecosystem that powers this going to be consumed via package managers, or is it increasingly going to be consumed by a lot of that knowledge being parameterized and then used to render up entirely optimal expressions of what you want? It was clear that the disruption this represented is profound. I was pretty involved in what we call the cloud native disruption, the transition from building relatively complex but simply described, vertically integrated apps to using things like Kubernetes.
And this felt every bit as significant, in fact far more significant, than that. So that moment of realization really drove me to start looking at where to create value in this ecosystem. And I would say that that moment was defined by my own experiences with AI, but also by the recognition of just the profundity of the disruption it represented.
[00:06:56] Tobias Macey:
And one of the standout aspects of that knowledge curation and knowledge delivery piece of the, I guess we can call it, agentic revolution, for lack of a better term, is the Model Context Protocol, which has become broadly implemented and, in some cases, broadly adopted. The adoption is maybe a bit unequal. And I'm wondering if you could talk to some of the ways that you're seeing it as an exemplar of this new way of thinking about system design and application delivery, and the ramifications that it has on the development and delivery of software.
[00:07:39] Craig McLuckie:
Yeah. I think MCP is interesting. The best analog I can come up with, and I always think back to my own lived experiences, is the first time I saw Docker. I don't know how familiar this audience is with Docker, but it was a technology that enabled you to package up an application into a reusable entity and ship it and run it in any environment. And it was this moment of, wow, we now have the ability to unlock portability. We've created this lightning-in-a-bottle moment which redefines the developer experience. It just changed the way that people worked. And I think MCP is very similar to that. The beauty of what was produced was this incredibly simple specification that described a way whereby you could translate this ocean of APIs and data, everything that existed in the world's distributed systems, into a natural language format that was optimally consumable for models. It was just a very simple and elegant way to start describing it. Obviously, the protocol is constantly evolving. But if you cut to the chase, what this actually means in its most fundamental terms is that it's a gateway to building AI native apps and AI native experiences. I speak to a lot of enterprise organizations, and people are trying to do two things concurrently. One is: can we use AI to enable us to go faster, to do the work that we're doing today in a much better way? And the second is: can we build out the competencies that make us future proof, that allow us to keep delivering outcomes over time and outcompete our competition? It's interesting to see the tension there, because the first thing often says, hey, just go buy a vertically integrated AI application. And the second thing says, you actually have to start reasoning about this from a platform builder's perspective and think about what you need to do. Those two things are in tension. I think MCP is this beautiful technology that enables you to start spanning that, and what it leads to is the ability to build AI native applications. A lot of the applications being built today that are wildly successful and being consumed at a furious rate, because they actually solve the problem of rendering context in a way that models can understand to create business value, feel a little bit like horseless carriages. I'm not the first person to say this; I think VCs have been saying it for a while. When the internal combustion engine emerged, people just put it onto a carriage. It was a horseless carriage, not a modern automobile by any stretch of the imagination.
And I think AI native apps mean that people are interacting with a model in a multimodal form factor. Right now, when we say multimodality, people often think voice, image, video, whatever.
But multimodality could also be rendered UI. There are some things where you want to be able to see a calendar. You don't want something in plain text describing a calendar; you just want to see a calendar. Maybe you want to click the calendar. Maybe you want to move things around. And so as we look to where app development and app architectures are going, the model becomes the presentation and view model for modern apps. Which raises the question: if you're building a web application today, the API is the interface between your presentation layer and the back end systems where you organize logic. And so I really think that MCP is the API for AI native applications, and it's going to herald a whole new class of experience where you no longer have a bunch of vertically integrated apps that you're constantly context shifting between. It will offer up two things. One is the ability to establish and share context if you do need different experiences. But more importantly, it lets you establish the model as your working area. This is the place where you do your work, and it renders things up in a variety of different form factors. You could call it the browser for AI enabled applications.
And it means that developers will start to think not about how to create a vertically integrated story, but about how to organize workflows. What are the nouns and verbs associated with the things that I need to do? How do I express those in a way that something powered by natural language, and increasingly the other concepts showing up in the protocol, can consume in a meaningful way to create a great experience? And then, oh, by the way, I also have to do all of the details around authentication, authorization, observability, and all the other things that come with building applications.
But I think that is the most fundamental transition that's going to happen.
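To make that concrete, here is a minimal sketch of what "translating an existing API into something a model can consume" looks like in practice, using the FastMCP interface from the official MCP Python SDK. The server name, tool, and internal endpoint are hypothetical stand-ins; the docstring is the natural-language description the model reads when deciding whether to invoke the tool.

```python
# Minimal sketch: exposing an existing internal API as an MCP tool.
# Assumes the official MCP Python SDK (`pip install mcp`) and httpx;
# the endpoint and tool name are invented for illustration.
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("calendar")

@mcp.tool()
def get_calendar_events(user: str, day: str) -> str:
    """Return the user's calendar events for a given day (YYYY-MM-DD)."""
    # The docstring above is what the model sees when choosing tools,
    # so it deserves as much care as the implementation itself.
    resp = httpx.get(
        "https://internal.example.com/calendar",  # hypothetical API
        params={"user": user, "day": day},
    )
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default; clients attach from there
```

That is roughly all the glue the protocol requires to put an existing system in front of a model.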
[00:12:35] Tobias Macey:
And in that context of MCP, there are a variety of different frameworks and implementations around it. The protocol itself has gone through a few revisions, most notably going from not having any concept of authentication or authorization to it actually now being well integrated into the specification. And I'm just curious if you can talk to some of the ways that you're seeing people approach the overall exploration and adoption of MCP and then going from, hey. This seems like a useful little tool to, oh, this is actually going to be a substantial benefit to me, my team, my organization, and now I need to actually get it into production and just some of the, I guess, misconceptions or stumbling blocks that folks run into along that evaluation path.
[00:13:22] Craig McLuckie:
Yeah, there's a lot of ground to cover there, but maybe I can just describe the canonical journey for a lot of folks. The starting point is often: holy crow, this thing demos really well. Hey, I'm using Cursor and I want to integrate something else; there's probably a tool for that. And there's an incredible, burgeoning number of tools that can be used. So the starting point for a lot of people was: I'm using an assistive technology, I want to integrate it with something else, and I can just run this thing locally. I can fetch it from a package repo, effectively allow Cursor or Copilot or whatever to npx run the package, and then I provide that package a set of credentials, because now auth is part of the flow. And that package I'm running is now able to access a lot of these services, and it'll do the mapping for me. And that's great. It's really exciting. It's also absolutely terrifying. Just take a moment to think about that. Don't get me wrong, I'm a huge proponent of this technology, and I think there's so much potential in it. But it's also important to recognize one truth. I can't remember the exact statistics, but there are at the moment something like many thousands of malicious npm packages out there, and we're giving up a lot of common sense in terms of how we run these things. For real-world enterprise organizations that are starting to deploy these technologies, the sad truth is that the rate of MCP-based exploits is growing at a pace that is outpacing AI spend. Almost 20% of organizations that have embraced MCP have had some kind of exploit from it. So starting with this idea that, yes, you can npx run something: people do that, and it creates a tremendous amount of value, but it also exposes you to a lot of risk. So people will say, okay, let's not do that. The next logical step is: this is great,
but we need some kind of hardened surface so that we know this thing can be trusted. So the next consumption pattern tends to be what I think of as SaaS-based integrations. We're increasingly seeing a lot of this: you have a SaaS-based system that you want to use, and you need to connect that system to this thing. And conveniently, a lot of them are now starting to expose MCP-based endpoints. They're actually doing the work on the SaaS side to reason about how to transform their API into these basic primitives that can be operationalized.
And that addresses some of the problems. It certainly solves for the malicious MCP package and having to reason about that. What it does do is give you a nicely normalized set of interfaces, and you can get a certain way down the journey. The thing I see a lot of organizations and developers stumbling on is: okay, that's great, but how do I start to expose my data? It's a transition from, yes, there are all these SaaS integrations available and consumable, to: how do I expose my data specifically, and how do I start mixing my data into the toolset that people are using? That's problem one. And problem two is that people are starting to realize tool pollution is a real thing. Context windows are finite, and input tokens cost money. If you go and turn on one of these servers, you are going to burn a lot of tokens. The GitHub server went through this evolution where it grew to about 110 tools, and then GitHub started thinking hard and seriously about how to optimize the set of tool interfaces for the specific workflows that developers bring, so the tool count came down. We've seen some diminishment, but you're still probably adding something like 10,000 input tokens just by turning on that server. And once you get to a point where you have a lot of tools, selectivity becomes a problem: I have to start manually turning them on and off, and I have to reason about when I want them. So there's this problem of adding semantic search and semantic awareness, and we're seeing some systems start to implement it. Anthropic recently added their own search over tools so you could aggregate them and not have to carry all of those input tokens in every window. So the next point people hit on that journey is: how do I expose my data, and how do I make this halfway sustainable so I'm not just burning a tremendous number of tokens? And then the step that happens after that, and I'm happy to provide an example, is: how do I start mapping this to my actual workflow? What are the nouns and verbs that I care about? I'll give you an example: a recruiter. I manage a recruiter, and he lives in basically four tools. There are a couple of others he uses, but for simplicity's sake we'll call it four. He uses the Google productivity suite, so Gmail and Docs. He uses Greenhouse for candidate management. He uses LinkedIn, and he uses Calendly to schedule interviews. Now, he can get to a certain point by just integrating those systems into his work. He can integrate LinkedIn and integrate these other pieces. But there's still some amount of context shifting. "Candidate" does not mean anything in a normalized way across that system.
"Job description" is a relatively open-ended thing. But by looking at his workflow and starting to describe the things that he does, like, hey, I need to source a candidate: what does that mean? What is our process here? How do I start to describe that as part of the tool itself?
How do I move from "I want to write a job description, there's already a template, but I have to go and describe how to use the template" to describing "write a job description" as a tool itself: here's the base template, here are the transcripts of the interviews I'm using, and let's organize the structure of that around the work the person is actually doing. I think of that as the logical destination we need to get to, where it's not just a set of vanilla interfaces that put a lot of cognitive load on the model.
All of the payload being expressed there is relatively abstract and generic, but it starts to become much more precise through the use of tuned tool descriptions and specific tool-based workflows, to actually fit the work to the person that's driving it.
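As a hypothetical illustration of that nouns-and-verbs idea, here is what collapsing one step of the recruiter's workflow into a single domain-level MCP tool might look like. Every name here (the template store, the transcript fetcher, the tool itself) is invented for the sketch rather than taken from any real integration:

```python
# Sketch: a workflow-level tool ("draft a job description") instead of
# making the model orchestrate Docs + Greenhouse + templates itself.
# All helpers and data are stand-ins invented for illustration.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("recruiting")

TEMPLATES = {"job-description": "# Role\n# Responsibilities\n# Requirements"}

def fetch_transcript(interview_id: str) -> str:
    # Stand-in for pulling a scoping-interview transcript from storage.
    return f"(transcript of interview {interview_id})"

@mcp.tool()
def draft_job_description(role: str, interview_ids: list[str]) -> str:
    """Assemble everything needed to draft a job description for `role`:
    our standard template plus the scoping-interview transcripts."""
    parts = [f"Role: {role}", TEMPLATES["job-description"]]
    parts += [fetch_transcript(i) for i in interview_ids]
    # The server assembles precise context; the model writes the prose.
    return "\n\n".join(parts)

if __name__ == "__main__":
    mcp.run()
```

The design choice is that the tool carries the organization's process, so the model no longer has to rediscover what "job description" means on every call.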
[00:19:56] Tobias Macey:
On that point of the selection and evaluation of the tools that are exposed in these various servers, using GitHub as an example, it can be very manual to understand which tool does the thing that you actually care about for a given workflow. And to your point, I don't want to have to deal with toggling them on and off every time I want to do something different. I'm curious how you, particularly in the project that you're building with ToolHive, are thinking about that workflow of defining the tasks you want to do and mapping those to the tool selection within these various servers. I know that ToolHive, for instance, has a way of building a composite server that exposes the tools from multiple backing servers. What does that tool selection and context optimization workflow look like for somebody who is starting to figure out what they want to be able to do and what pieces of information they need for those different workflows?
[00:20:56] Craig McLuckie:
I mean, I tend to take a somewhat reductionist view on what really works out there right now. I'm sure a lot of folks in the audience have read the essay The Bitter Lesson. And the answer to what really works well right now is: frontier models plus great tools create great results. But you have to start really thinking about what a great tool is. There are a number of different steps in this regard. Tool selection is hard, and there is an art to describing a tool so that it is actually invoked in a specific context. There are things you can do right out of the gate. For instance, what we started with in ToolHive right out of the gate was just adding semantic search into the call path. And we did a bunch of analysis around this. We picked a bunch of the most popular tools out there, generated a big baseline of synthetic data, and then started running evals: this query should map to this tool.
Are the various models being effective in tool selection? It was kind of interesting, because when you look at the corpus of servers that are published, these are just generic servers. We see frontier models, the best of the best right now, achieving about a 93 to 94% selection rate based on our heuristics. So 93% of the time, when you have a query which maps to a tool, those models will select the tool. Interestingly enough, if you're in the business of using smaller models, less heroic models, because inferencing cost is expensive or you really care about latency when you're trying to do something in more real time, the performance of those systems degrades very quickly. When you move from a frontier model to a cheap and cheerful SLM, tool selectivity falls off to something like 34% based on our heuristics. So as a starting point, we just put a semantic search capability in front of these. Instead of saying, here's a flat set of tools, pick one, you say, find tool, and you're very careful and spend a lot of time building a very accurate description for that one endpoint, the endpoint the model calls every time. Then delivering the top k, like the top six tools, allowed us to achieve very high rates of tool selectivity across every model class. We were able to get to a point where a much more modest model could achieve the same level of tool selection for a given task. But even then, best case, about 93% of the time the actual tool will be mapped to the task that you're handling. And so this is where evals come in. This is one of the most significant points about operating with this class of systems. When you go back to the world of old-school distributed systems, proof of correctness was the unit test. You get the thing to run, and if you have enough test coverage, you can be pretty confident that you built a thing that works. These are stochastic systems. You're dealing with probability intervals and boundaries. So you have to start with an eval framework. If you're a generalist user just using these for convenience's sake, then 93% tool selectivity is fine: I just redo it, occasionally it falls down and irritates me, I burn a few tokens, no worries. But if you're serious and you're trying to build an agentic system that is performing a task, you have to see the whole loop, and it has to start with evals.
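A toy sketch of that find-tool pattern, assuming an off-the-shelf embedding model via sentence-transformers and a three-entry catalog standing in for the hundreds of tools a real deployment would index (this illustrates the general technique, not ToolHive's actual implementation):

```python
# Sketch: semantic search in the tool-call path. Instead of handing the
# model a flat list of every tool, expose one "find_tool" endpoint that
# embeds the query and returns the top-k closest tool descriptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

TOOLS = {
    "create_issue": "Create a new issue in a GitHub repository.",
    "merge_pull_request": "Merge an open pull request once checks pass.",
    "list_calendar_events": "List a user's calendar events for a date.",
    # ...hundreds more in a real deployment
}

names = list(TOOLS)
vecs = encoder.encode([TOOLS[n] for n in names], normalize_embeddings=True)

def find_tool(query: str, k: int = 6) -> list[str]:
    """Return the k tool names most semantically similar to the query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = vecs @ q  # cosine similarity, since vectors are normalized
    return [names[i] for i in np.argsort(scores)[::-1][:k]]

print(find_tool("open a bug report about the login page"))
```

Returning the top six candidates rather than one keeps the final choice with the model while shrinking its decision space dramatically, which is what lets smaller models recover frontier-level selection rates.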
So one of the things we always encourage people to do is: you want to start building tools? What is your eval framework? Have you got that set up? What does the set of queries that should result in a given tool selection look like? Can you set up an evaluation cycle so you can tune your tool description to match the task so that the tool is selected? And then, once the tool is selected, can you tune the payload so that the task is completed? You have to take this deconstructionist way of looking at it. So the second thing we tend to recommend when people are walking this journey is to run an eval framework. We're actually working on one, and we'll ship it momentarily, so that people have a nicely integrated eval system as part of the ToolHive platform. And then the question becomes how you start to structure workflows. You want to do two things. You want to optimize the context, and I think of it as context real estate management: you want to load that context window up with precisely the right information so that it can perform well. And it's interesting, because that's not always obvious. Let me give you an example. If you're working in the geospatial space, if you're building a GIS-enabled application and you say "feature", that means something very specific: a collection of vectors that describe an entity that actually exists in physical reality on the terrain. If you're a developer and you say "feature", you mean an atom of work that I'm working on. So there has to be a contextual awareness where the tools you're building have enough additional metadata to describe the thing they're presenting, so that when you're inferencing against them, they're actually generating the right results.
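In that spirit, a toy eval harness for tool selection might look like the following. It reuses the `find_tool` sketch above, and the cases and the 90% pass bar are illustrative assumptions, not the framework ToolHive ships:

```python
# Sketch: a tool-selection eval suite. Each case pairs a user query with
# the tool that should surface; the metric is the hit rate over cases.
EVAL_CASES = [
    ("file a bug about the checkout flow", "create_issue"),
    ("what's on my schedule tomorrow", "list_calendar_events"),
    ("land PR 42 now that CI is green", "merge_pull_request"),
]

def selection_rate(select, cases=EVAL_CASES, k: int = 6) -> float:
    """Fraction of cases where the expected tool appears in the top k."""
    hits = sum(expected in select(query, k=k) for query, expected in cases)
    return hits / len(cases)

rate = selection_rate(find_tool)  # `find_tool` from the sketch above
print(f"tool selection rate: {rate:.0%}")
assert rate >= 0.9, "tune tool descriptions before rolling this out"
```

Tuning then becomes a loop: adjust a tool description, rerun the suite, and watch whether the hit rate moves.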
And so that process of building tools becomes relatively complex, and I think there are really three things. One is getting the right nouns and verbs and optimally describing them. Two is starting to reason about workflow. It makes sense to think about the workflow associated with something and to represent an orchestrated set of actions that generates an optimally rendered view. By doing that in one shot with a structured workflow, you avoid a lot of round trips, so you don't get the cascading entropy that you get when you're running a lot of sequential operations. And the third piece of it is transactionality.
As you move from building MCP servers that are just about getting context into the context window so the model can answer questions, to actually performing work, sometimes that work needs to be committed in a transactional way. You need to start thinking: if I'm updating three things across two databases, how do I create a shared transactional semantic, so that if the operation fails, everything is reverted, and I'm not trying to debug weird orchestration failures on the back end? I don't know if I'm getting too far ahead of myself here, but I think there's a lot of richness in going from "I'm npx installing off the Internet, it's connecting to GitHub with my credentials, and I'm just going yeah, yeah, approve, approve, barely paying attention" to building a system that unlocks AI-enabled application development with these highly optimized, highly reasoned-about views that are tuned to the specific workflow you're trying to accomplish.
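One common way to get that "everything is reverted" behavior across systems that share no database is a saga: run each step with a registered compensation, and unwind in reverse order on failure. This is a generic pattern sketch with hypothetical steps, not how any particular MCP runtime implements it:

```python
# Sketch: saga-style transactional semantics for multi-system actions.
# Each step is (action, compensation); on failure, undo completed steps.
from typing import Callable

Step = tuple[Callable[[], None], Callable[[], None]]

def run_saga(steps: list[Step]) -> None:
    done: list[Callable[[], None]] = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):  # undo in reverse order
            compensate()
        raise

run_saga([
    (lambda: print("mark refund issued in CRM"),
     lambda: print("unmark refund in CRM")),
    (lambda: print("disburse funds via payments API"),
     lambda: print("void the disbursement")),
])
```

The point is that the orchestration layer, not the model, owns commit-or-revert, so a stochastic failure partway through never leaves two systems disagreeing about whether the refund happened.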
[00:27:49] Tobias Macey:
And there are a couple of directions that I want to go from here. One thing that you mentioned that is worth digging into a bit more is that concept of evals and the relationship to distributed systems practices, where you have unit tests, but for cases where you wanted to be even more thorough in the validation and verification of those systems, you would likely reach for something like TLA+ for describing the way the system is supposed to work and running that through mathematical proofs of the ways it should behave in that distributed systems context. And then the idea of Jepsen tests, for being able to see, okay, but how does this thing actually break when I put it into production and really put it under extreme load? And are those behaviors, when it does break, acceptable for the use cases that I'm planning for? I'm wondering how you think about the ways those two approaches map to this newer world of orchestrated systems using MCP, context engineering, and these probabilistic workflows, and how we actually achieve that same degree of rigor.
[00:28:51] Craig McLuckie:
Yeah, it's an interesting question, and I've thought about this a little bit. Look, don't get me wrong, I'm not super deep on formal method analysis. But here's the way I reason about it. I find myself falling into this trap all the time, and I think it's a common one. I grew up in distributed systems. The first distributed system I worked on was Windows NT 3.51 clustering, which kind of dates me. I've been a distributed systems guy since time immemorial. And one of the things I've learned is that we have this pattern where we try to apply what we know. So it's like: hey, you're dealing with a stochastic system, the stochastic system is generating imperative code, and I can at least reason about that imperative code. This is what I think of as the normal systems engineering thinking. And there's value and veracity to that. Certainly, these tools give us the ability to start expressing behavior in terms of formal mathematical language. Amazon's been really good at this. They've hired up this kick-ass team of PhDs.
They've done a bunch of formal method analysis, the stuff that was previously reserved for folks building flight controllers for Boeing or control systems for nuclear reactors. They've built up the muscle to apply that to things like the S3 serving path, and they find some very subtle, very nuanced, very interesting things. And they're certainly looking at ways to genericize that. I don't know what they're doing, but I have to imagine they're looking at this, because suddenly you have systems that behave like a Stanford PhD; they can actually do the work of describing these things in a formal-methods way. And so there are going to be classes of integrations where, and there's always something new and in vogue here, rather than using formal MCP, you just describe your APIs through a JavaScript-based format or decorate them with an OpenAPI spec, allow the agent to vibe code something, run it in an isolated sandbox, and, if you really, really care about it, apply formal method analysis to it and reason about where the breakage is. But I think that's missing the point. The thing I come back to is that these are stochastic systems. What you're doing when you're inferencing is effectively sampling a probability field. I'm sure there are AI experts listening who are gnashing their teeth because I'm probably using the wrong words. But with a frontier model, a way to think about it is that it's an encapsulation of the world's public domain information. Something like 4% of the world's knowledge is in the public domain, and I'm reasonably sure all of it is being used to generate these frontier models. But 96% of it isn't in the public domain. So a lot of what you have to do is craft context that gives you the highest probability, during inferencing, of finding the logically connected set of tokens that have the lowest cosine distance from what you want. It's a stochastic process. And the problem with that is that anytime something changes, anytime you add entropy, and that can come in a lot of different ways, the behavior shifts. I had the system set up and I was using three tools; now I'm using five tools, now ten, and now I'm putting pressure on the context window, so the self-attention mechanisms start to decay. That's a change I wasn't necessarily aware was going to happen. User patterns change. People start to talk about things differently; new slang emerges. There are a lot of different sources of entropy that show up in these systems. And the parametric data changes too: you spend a whole bunch of time running evals and getting things just right, a new model comes along with new parametric data, and what was optimized is no longer optimized. So I think there's definitely veracity in saying there are ways to introduce more structure.
There are ways to reason about these things so that when I really care about something, I can create more formal boundaries around it by isolating it, or create more determinism by describing it. But those are imperative coding concepts, and they only apply to the imperative parts of the system. The rest of it, the thing that's actually driving most of the value creation, is a stochastic system, and that is given to the vagaries of entropy.
And there are a lot of sources of entropy. So I don't think you can ever give up on evals. You have to change the psychology of building and running these systems. You can get it working, but it's not like a distributed system where you can reasonably assume it's just going to continue to work. You have to get it working, and then you have to have a continuous eval framework in place: to get it working, you run evals; to keep it working, you keep running evals. And you have to have the ability to constantly tune it, because the behavior is going to change. This is sort of the SRE for AI systems. And maybe the SRE for AI systems is other AI systems; I'm not saying this has to be a human function. But you have to have the watcher. It's not enough to just deploy it. It won't continue to run if you don't continue to watch it.
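A hedged sketch of what that watcher might reduce to: rerun the same eval suite on a schedule and page someone when quality drifts below a floor. The threshold, interval, and alert sink are all illustrative assumptions:

```python
# Sketch: continuous evals as "SRE for AI". Rerun the suite on a
# schedule; alert when the selection rate drops below an agreed floor.
import time

FLOOR = 0.85  # illustrative alert threshold

def run_evals() -> float:
    # Stand-in: rerun the eval harness sketched earlier and return the
    # current tool-selection rate.
    return 0.93

def alert(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for a real paging/alert sink

while True:
    if (rate := run_evals()) < FLOOR:
        alert(f"tool selection degraded to {rate:.0%}")
    time.sleep(3600)  # hourly; tune to your traffic and token budget
```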
[00:33:56] Tobias Macey:
And then another interesting development that still seems fairly niche, and hasn't gained a lot of adoption or even really proven out its overall capability, is the idea of these large action models, as opposed to the current transformer models, where rather than being built for conversational purposes, their goal is to take instruction and then take some sort of action on it. I'm wondering how you're seeing the evolution of these generative models, the new architectural paradigms, and the interfaces that those models presume changing the types of tooling that we need to think about, and whether MCP will eventually be obviated, whether there will be other interfaces we need to support for maintaining context for different model architectures, etcetera.
[00:34:48] Craig McLuckie:
Yeah, it is interesting, and it's something that I have certainly given some thought, though I'm not an expert on the evolution of model architectures. I do take The Bitter Lesson pretty seriously. And I think this is the danger, and a lot of enterprises are really struggling with it. It feels kind of hopeless, right? You start a proof of concept on something, you think you can create value that's going to differentiate you versus your competition, you get to a certain point, and then you're just overwhelmed when the next frontier model comes out and its capabilities are disproportionately different. Things are moving so quickly, and we certainly haven't seen asymptotic convergence around the capabilities of these systems yet. So what I really focus on, and I always describe this in two ways: there's the Emerald City, and there's the Yellow Brick Road.
We don't know what the Emerald City is going to look like yet. We haven't seen it. We can see a glimmer on the horizon, and we know it's going to be fantastic, because every single thing points that way. So the question is, how do we make progress in the meantime, and what is the set of things we can do? I think of this as a future-proofing exercise. There are hard, immutable truths about bringing systems into existing environments. One hard immutable truth is that what is regulated will stay regulated. There may be ways to help you deal with regulatory oversight, but regulation trails technology by a pretty significant margin, and that regulated landscape isn't going to change anytime soon. So there's a set of things you just have to be able to do. You have to be able to authenticate, authorize, assert policy, and generate observability and auditability.
You have to be able to track the provenance of things. There's a bunch of ooky enterprise things that need to be done. And as we see this disruption, the slowest-moving part of the disrupted landscape is the thing that touches your existing systems. Go talk to a bank and ask them to describe their workflow: there's always a mainframe, every time. If you haven't seen a mainframe yet, you just haven't asked the right questions. There's always a mainframe in there somewhere. Technology is fantastically slow moving, because at the end of the day a company is this weird enmeshed union of its people, its processes, and its technology, and you can't change just one of them. They all have to change together. And that is what is running the world right now. What's running the world isn't these models; it's the existing systems. If I want to get money out of a bank account, I'm not touching a model, I'm touching those existing systems. And so I think the safest place to, well, not so much innovate as invest: obviously, have access to these models and figure out what they can do. But at the end of the day, you have to have this interface, and it has to be controlled, and you have to be able to reason about it. I spend a lot of time talking to grown-up companies, and we can't have these things npx installing random shit off the Internet. We can't have them running unsupervised code that's touching production systems. Maybe one day we will. Maybe in five years. But that's not going to happen today. What we can do today is put a boundary around it, start to reason about the identity so that the actions of this thing show up in its own context, and reason about auditability and traceability.
We can start to assert controls around those pieces. And look, 96% of the world's information is behind the firewall. You have to be able to access it, and you have to have a pragmatic way to do it. MCP might be it. Maybe it's not; maybe we'll come up with some cool new thing next year. But that's thing one: you have to unlock that in a way that's pragmatic, and I think investing in an MCP-based platform is the best way to do that right now. The second thing is that the 96% plus the 4% of public domain knowledge only represents the tiny fraction of the world's information that has been synthesized into human-consumable tokens. The rest of it is time series data, video streams, all these other things. You're not going to want to inference on that; you probably can't inference on that; it doesn't even make sense. So you are going to need systems that can translate that ocean of information-that-is-not-knowledge into something you can actually feed into these systems. And so I do think that starting the journey with MCP is entirely logical, because there's so much obvious value right now. Yes, the state of the art will change. It will be different in a year; it will be different in five years. But you will still need to authenticate, authorize, and observe. You will still need to assert policy. You will still need to do all of these things. Building your muscle on that now is sensible, because whatever new technology comes along, you will probably be able to reuse a lot of it, and if you make a platform investment in the space, the platform will evolve to keep up with it.
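Those controls tend to reduce to a chokepoint that every tool call flows through: the agent gets its own identity, policy is asserted before the call, and an audit record is emitted after. A hedged sketch, with an invented denylist standing in for real policy and a print standing in for a real audit log:

```python
# Sketch: a policy/audit wrapper around tool calls. The denylist policy
# and agent identity scheme are invented for illustration.
import json
import time
from typing import Any, Callable

DENYLIST = {"delete_repository", "drop_table"}  # illustrative policy

def guarded(agent_id: str, name: str, tool: Callable[..., Any]):
    """Wrap a tool so each call is policy-checked and audited."""
    def call(**kwargs: Any) -> Any:
        if name in DENYLIST:
            raise PermissionError(f"agent {agent_id} may not call {name}")
        started = time.time()
        result = tool(**kwargs)
        # Append-only audit record: who did what, with which arguments.
        print(json.dumps({
            "agent": agent_id,
            "tool": name,
            "args": kwargs,
            "duration_ms": int((time.time() - started) * 1000),
        }))
        return result
    return call

# The agent acts under its own identity, not the user's raw credentials.
create_issue = guarded("agent:recruiting-bot", "create_issue",
                       lambda title: f"created issue: {title}")
print(create_issue(title="update onboarding docs"))
```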
[00:39:46] Tobias Macey:
And digging now more into ToolHive and that platform question: what are some of the core capabilities and core ideals that you're focusing on as a means of building that foundational layer for context engineering and context exposure, to allow teams to have a reliable means of interfacing with that broad corpus of information that isn't already synthesized into knowledge?
[00:40:14] Craig McLuckie:
Yeah. So, I mean, the way I think about it and this is something that, you know, comes up in every conversation I have. Like, there's four constituent pieces that you need when you want to basically render a like, a I I love the context engineering narrative. Like, it's we call it just an MCP platform, but it's a it's a context engineering platform. It it's a platform that enables people to start building these next gen AI enabled apps. Part one, secure runtime. And when I say secure runtime, I mean, like, hey. Fetch server's great. You know, a lot of people just install a fetch server and you can access them. Do you really want one of these agents having unfettered access to every resource on your corporate Internet that isn't behind the north system, or do you want them to be able to access that using the user's auth? Like, probably not. So being able to just, you know, describe a runtime environment into which you can deploy, you know, a server with all of its associated tools and then, you know, start to to reason about that is is is pretty important. And so, it always starts with that secure runtime. The second piece of it is, you know, this gateways are dime a dozen, but you do need a gateway technology. Right? Like, an like, the way I think about this is you need a gateway not only as a way to like, a lot of gateway technologies are built around this ideal of, like, hey. Run all your like, the it's like an API gateway. Like, just run all of your service through this API gateway, and you can now start to, you know, assert security, and you can deal with authentication, authorization, and you can have some basic policy assertion. I think that's the wrong path. Right? Like, what you really want is an aggregation endpoint that you can start to map and and recontextualize.
So, like, when this tool is calling this system with this thing, like, these are the tool descriptions that map those those kind of atomic units that are gonna produce the best work. Here's an endpoint where I can start to apply semantic search. Right? Like, you know, like, hey. I've been using a a frontier model. Let's find a tool selection, but it's killing me on inferencing costs. I wanna kind of thunk down to a low model, but I don't wanna have to rematch my tools. I need semantic search. So that that gateway then becomes a single endpoint with semantic search, the ability to start, you know, driving a lot of the tuning and and composition narrative for specific workflows or specific users or specific groups, you know, based on on, you know, some other configuration setting. And then the third piece of it is is orchestration. You need to be able to start snapping these things together. You know, pushing orchestration into the model is very expensive, and it creates the possibility of of entropy cascade where where you're asking something that's stochastic to do a set of relatively precise sequential tasks. You're either going to have a lot of redo, which generates token wastage because it's just trying and trying and trying, or you're gonna get in unpredictable results. So so the ability to start describing structured orchestration and then transactional behavior. Like, hey. I want this thing to do this action, but I I I really can't afford it to you know, like, I'm I'm I'm issuing a refund. I really can't afford it to, you know, mark this thing as issued a refund and then, you know, do a financial disbursement. Like, those two things have to happen concurrently, and they're they're two different systems. Right? Like, so you you need that you need that that that sort of capability. So, you know, I think of it as going beyond a gateway to kind of virtual MCP construct. There's it's it's a way to start optimizing and and rendering tools. And there's a lot of other things you can do in that space. Like, you know, we're working on things like context compression, you know, other ways to sort sort of do context augmentation. There's a lot of other systems that belong in that in that layer. The third layer is, the registry. Right? So, you know, this is this is you know, we've been working very closely with option communities, particularly around the registry specification, the registry implementation.
The next layer is the registry. We've been working very closely with the upstream community there, particularly around the registry specification and the registry implementation. As MCP becomes a protocol, you need to be able to describe the set of servers that are available to your organization. Some of them can be run locally in a certain environment. Some of them run remotely as hosted systems, and there's a lifecycle associated with that. And some of them are proxied, because there are some great SaaS-based MCP servers, but I want them all aggregated, and I need to be able to describe that. I need to let my organization know what can and can't be consumed and what's available right now. So the registry is an important part of it. And the last piece is the console. If I'm a developer and I want these servers, where do I go to actually set up that surface area? Where do I go to start tuning it for my workflow? Some of that will be driven by UI, but over time most of it will just be driven through a meta-MCP server, where it just happens: hey, I need a server to do something, can you find me one? Sure, here it is, and we go from there. Those are the constituent pieces that we see a lot of demand for, and that's what we've built with ToolHive.
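To illustrate the three deployment shapes Craig lists (local, hosted, proxied), here's a toy registry description. Every field name here is invented for illustration; this is not the upstream MCP registry schema or ToolHive's format.

```python
# Hypothetical organizational registry entries; field names are illustrative.
ORG_REGISTRY = [
    {
        "name": "internal/fetch",
        "run": "local",                     # runs in a workstation sandbox
        "image": "ghcr.io/example/fetch-mcp:1.2.0",
        "status": "approved",
    },
    {
        "name": "internal/knowledge",
        "run": "hosted",                    # operated centrally, has a lifecycle
        "endpoint": "https://mcp.example.com/knowledge",
        "auth": "oauth",
        "status": "approved",
    },
    {
        "name": "vendor/crm",
        "run": "proxied",                   # SaaS server, aggregated via gateway
        "upstream": "https://mcp.vendor.example/sse",
        "status": "restricted",             # visible, but gated by policy
    },
]


def consumable(entry: dict) -> bool:
    """Answer 'what can my organization consume right now?'"""
    return entry["status"] == "approved"
```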
[00:44:25] Tobias Macey:
Another interesting aspect of the tool equation, particularly when you're using these AI agents in a local context for software engineering, where things like Claude Code, Copilot CLI, and Gemini CLI are at the forefront of usage: they do have the capacity to interact with MCP, but they also have your entire suite of command-line tools available, as well as ad hoc and arbitrary tools for bringing in context and performing actions. I'm wondering how you think about that juxtaposition of MCP as a protocol for curated or remote actions versus just having a suite of "do whatever you want and figure it out as you go" as a means of tool use.
[00:45:15] Craig McLuckie:
You know, it's interesting, because I always joke that with developers it's safety third. It's very hard to sell a developer on a safety-first mindset, because developers love building; they don't love locking down and hardening and all that. And you can certainly make good inroads in an interactive mode with bash and whatnot. But there's a set of environments where this becomes problematic. One: how often have you had one of these things rm -rf your file system because you weren't paying attention? And, like I said before, on the security side, people are getting shut down because, at the end of the day, the cost of a security breach is dramatic, and the embarrassment associated with it is dramatic. At least the people I speak to are getting authentically gun-shy about relatively unfettered, root-level access to whatever resources the agent wants. Because while you might have it in your head that you can watch this thing and make sure it doesn't do anything stupid, you're human. That's a low-dopamine task. You're going to tune out, and sooner or later it's going to do something you regret. So part of it is helping people recognize that, but you have to provide an alternative that is just as useful. That's the key thing. For developers to actually embrace this stuff, the alternative needs to be as delightful and as capable. You need a relatively robust set of tools, you need to observe the set of things developers want, and you need to actually reason about delivering those tools. And this is something that's wonderful about open source: the best way to tap into a lot of people is to have an ecosystem emerge and start to formalize it. But there are also knowledge workers. There are a lot of citizen developers out there, and god help the world when they get going. A lot of organizations' cyber programs are built around the idea that developers go through training, and now you have citizen developers saying "I know how to do all this stuff," and that gets even more dangerous.
So I think, one, this will be naturally self-correcting as people recognize that the pace of exploitation is outpacing the pace of spend, and the cost of exploitation is becoming a line item you need to worry about as much as the actual spend on these systems. And a lot of it is also just nascent. We haven't fully shown what this can do; right now it's a curiosity. That changes when organizations, or individuals, sit down and ask hard questions like: what are the actual nouns and verbs that dominate my life, and how do I start to generate tools around them? I'm not saying I'm going to write the tools; it's entirely reasonable to have an agent build those tools based on the known set of nouns and verbs. That code can then be reasoned about, hardened, isolated, secured, published, and reused inside an organization.
So I do agree: most developers aren't attracted to anything that's going to inhibit their ability to vibe-code their way into productivity. But I do think there are options that will initially satisfy and eventually delight developers, giving them tools they can use to radically improve their lives, but also not get hurt.
[00:48:28] Tobias Macey:
In that context of security, to your point, developers always put it last because "we'll figure it out later." In terms of what you're doing with ToolHive, obviously there is authentication at the access layer, but another interesting piece you're bringing in is the question of how we actually expose credentials to the MCP servers themselves, because that's another piece that many MCP servers leave as an exercise for the reader. I'm wondering if you can talk through some of the ways you're thinking about that within the overall ecosystem of MCP.
[00:49:03] Craig McLuckie:
Yeah. This is where we started the journey. Look, I'm a grown-up. I want to use these things, but I'm not going to have an API key written in plain text on my system where an agent can read it. It's just not going to happen. So we started by serving people with a local system: being able to use one of the locally integrated keychains, like 1Password or the Apple keychain. You write your tokens in there, and they get mapped in by the system. That made a ton of sense, and it's certainly where we started.
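The local pattern Craig describes, keeping tokens in the OS keychain rather than in plain-text config, can be approximated with the cross-platform `keyring` library. The service name, account name, and server command below are made up for illustration.

```python
# Sketch of the local-secrets pattern: read an API token from the operating
# system keychain (macOS Keychain, Windows Credential Locker, etc.) instead
# of a plain-text file an agent could read. Names are illustrative.
import os
import subprocess

import keyring  # pip install keyring


def launch_mcp_server() -> None:
    token = keyring.get_password("github-mcp", "api-token")
    if token is None:
        raise RuntimeError('store it first: keyring.set_password("github-mcp", "api-token", "...")')
    env = dict(os.environ, GITHUB_TOKEN=token)
    # The secret exists only in this process's environment, never on disk.
    subprocess.run(["my-github-mcp-server"], env=env, check=True)
```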
But the place where it gets really interesting is on the server side. When you've got a set of clients that want to connect to a server, the MCP specification is pretty good at the client-to-server auth flow. The exercise of saying "here's a set of patterns that let you reason about server-to-backend auth," though, is left to the reader. I know friends at Okta have been doing quite a lot of work in the space, the Entra folks are doing a lot of work in the space, and there are conversations happening in the community, but you have to have a reference implementation. So what we did is ask: what are the four or five common patterns people have? Say you want to use an OAuth credential. How do you do a token exchange so you can move from a token scoped to everything a user can do down to a token scoped to a specific task? What does that look like in a canonical way, and how do you implement it without having to redo it for every MCP server you build?
Or: I want to use a durable credential to access a database. How do I inject that conditionally, and how do I set policy around it? So we've put a lot of time and thought into the authentication and authorization space, but we're also working with a lot of partners. You're not going to want to deploy a new identity provider. There's a reason perimeterless security and zero trust have had adoption difficulties in the enterprise: enterprises are living with what they have, and unless you're entirely greenfield it's difficult to adopt those patterns. So a lot of what we've been trying to do is figure out how to make it work with what you have, while recognizing that there has to be space for something new, because agents have to have their own identities.
It's not an OIDC credential, and it's not a service account. It's a delegated credential with a certain set of attributes. You can kind of make it fit into some of these systems, but eventually we'll figure out what that looks like, and we'll make sure we can support it over time. The other place we look, and this should be obvious from a security perspective, is that you need to be able to do malicious prompt detection, and you want to make sure you do PII redaction. There's a bunch of things like that. We've made sure the system is sufficiently principled that you can snap in whatever your favorite is, and over time ToolHive will come with batteries-included options for those pieces as well.
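The token-exchange pattern Craig mentions maps onto RFC 8693 (OAuth 2.0 Token Exchange). Here's a minimal sketch of narrowing a broad user token down to a task-scoped one; the token endpoint URL, client credentials, audience, and scope are placeholders for whatever your identity provider exposes, not a specific vendor's API.

```python
# Minimal sketch of RFC 8693 token exchange: trade a broad user token for one
# scoped to a single task. URL, credentials, audience, scope are placeholders.
import requests


def exchange_for_task_token(user_token: str) -> str:
    resp = requests.post(
        "https://idp.example.com/oauth2/token",  # your IdP's token endpoint
        data={
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "subject_token": user_token,
            "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
            # Narrow the blast radius: one backend, one capability.
            "audience": "https://payments.internal.example.com",
            "scope": "refunds:write",
        },
        auth=("mcp-server-client-id", "client-secret"),  # the server's own identity
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]
```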
[00:52:01] Tobias Macey:
And as you have been coming to grips with the current state of the ecosystem, understanding the usage patterns and the opportunities, and building out ToolHive to get it in front of people, what are some of the most interesting or innovative or unexpected ways that you're seeing the toolkit used, or MCP more broadly?
[00:52:19] Craig McLuckie:
You know, it's interesting; this caught me off guard. We did this ourselves, and we've worked with other people who have done the same thing. It started as kind of a silly thing: we wanted to build a knowledge server. What I mean by that is, a lot of people are buying Glean or one of these first-class systems that integrates with all their systems, and I set my team an exercise: what would it take to build one of our own? What would it look like? It turns out it was surprisingly easy. We built an agent that runs it. We take all of the content and we chunk it. We had to do a bunch of tuning, because the right chunk size for Discord messages versus Google Docs, to make sure the semantic relevance shows through, was surprising. But now we have an MCP-based knowledge endpoint, which we call the knowledge server, and you can integrate it into all your tools.
You can ask it questions, and it can see everything within the organization's shared content. It was a complete game changer. It's just shockingly useful. It's wonderful to ask, "By the way, what is this developer working on?" or "Has this been mentioned in Discord? I haven't seen anything on it." Oh yes, it has. So there's an immediate unlock of institutional knowledge through a relatively simple-to-implement service. We built this thing in a few weeks, and it changed how I work; it's changed how everyone in the company works. And as we started talking to other people, it became a sort of anchor: the ability to redefine the knowledge server and expose it as an endpoint, so that these agents suddenly have rich access to precise semantics.
I just didn't expect that. I thought it would be a huge Herculean lift, and it turns out a couple of engineers knocked out something damn useful in a couple of weeks, and it blew my mind.
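For a sense of how small the surface area of such a knowledge server can be, here's a sketch using the MCP Python SDK's FastMCP helper. The retrieval function is a hypothetical stand-in for a real chunking, embedding, and vector-store pipeline; only the FastMCP plumbing reflects the actual SDK.

```python
# Sketch of a minimal MCP knowledge server using the official Python SDK's
# FastMCP helper (pip install mcp). search_chunks is a placeholder.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("knowledge")


def search_chunks(query: str, k: int = 5) -> list[str]:
    """Placeholder: embed `query` and return the k nearest chunks from a
    vector store loaded with Discord, Google Docs, etc. (tuned per source)."""
    raise NotImplementedError("wire in your embedding model and index here")


@mcp.tool()
def ask_knowledge(question: str) -> str:
    """Answer questions from the organization's shared content."""
    chunks = search_chunks(question)
    # Return the retrieved context; the calling agent's model does the synthesis.
    return "\n---\n".join(chunks)


if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport
```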
[00:54:09] Tobias Macey:
And as you have been building in this space and growing a company around ToolHive and this overall work of empowering teams to build AI-native applications, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:54:28] Craig McLuckie:
I think the biggest challenge organizations face, and this should be obvious to anyone who's worked in the space for a while, is that it's not enough to just expose tools. I've seen this with a lot of folks: they say, okay, we're building a system. And here's the problem: Cursor is the same dumb intern every day. It's incredibly enthusiastic and incredibly prolific, but it knows nothing. And for an enterprise that wants this to work, you can't just NPX install these random tools, and you can't just use those integrations.
So they start working on the assumption that if we just created a server that made this doc available, and that doc is the specification, and we just did this and did that, everything would work. What happens is they see incrementalism, but they don't see a sea change. You have to start rethinking the SDLC, and you have to rethink where the value engineering is happening. Organizations are bringing a relatively traditional mindset to this and aren't taking the time to understand what problem they're solving and where the boundaries are. Let's run some very, very fast experiments.
Let's not assume that just rendering all these endpoints as tools is fundamentally going to change the workflow. Let's actually start by understanding what we're trying to accomplish and work back from that: what do we need the model to know, not just have access to? What do we need to do to contextualize it? Because it's interesting to watch how people work with these tools over time. There's this weird thing engineers do where they develop a ceremonial set of activities: do this first, because then it knows what packages are in play. They're effectively manually loading the context window by performing things in a certain order, and it gets to the point where they don't even realize they're doing it, but they know that it works.
So there's a process of implicit discovery of how to render context in a way that can actually complete a task, but it's not being written down or formalized or reasoned about in a first-class way. I think you have to start with an experimental mindset: put these things in a controlled environment, get them working, try it, see what happens. Play, play, play. And then ask: what's working, and why is it working now? What can we do to work back from that and actually start describing the tools that will get us into that state? One lightweight way to formalize those discovered ceremonies is to write the priming sequence down as an explicit, ordered recipe; see the sketch below.
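Here's a minimal sketch of that formalization: the "do this first" ritual captured as a reviewable, ordered context recipe instead of tribal knowledge. The file names and step labels are invented for illustration.

```python
# Hypothetical sketch: turning an engineer's implicit "do this first" ritual
# into an explicit, reviewable context recipe. Step names are illustrative.
from typing import Callable

ContextStep = Callable[[], str]


def read_file(path: str) -> ContextStep:
    def step() -> str:
        with open(path) as f:
            return f.read()
    return step


# The order encodes what the team learned by trial and error: the agent
# behaves better when it sees dependencies, then conventions, then the task.
ONBOARDING_RECIPE: list[tuple[str, ContextStep]] = [
    ("package manifest", read_file("pyproject.toml")),
    ("style conventions", read_file("CONTRIBUTING.md")),
    ("architecture notes", read_file("docs/architecture.md")),
]


def render_context() -> str:
    """Assemble the context window deliberately instead of ceremonially."""
    parts = [f"## {label}\n{step()}" for label, step in ONBOARDING_RECIPE]
    return "\n\n".join(parts)
```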
[00:57:08] Tobias Macey:
And for people who are interested in expanding their investment in MCP as a core capability, what are the situations where you would advise against using ToolHive specifically?
[00:57:20] Craig McLuckie:
That's a great question. I would say if you don't know what MCP is, don't use ToolHive. If you're not at the point where you know what problem you're trying to solve, I can't help you. This is a system that's solving two problems: the problem of control for a real-world company, and the problem of scale, when you want to go from one to ten to a hundred to a thousand to ten thousand servers. It's a good system for that. If you haven't yet tried the SaaS-based endpoints and experimented, if you're not convinced there's something there, ToolHive is probably not the right starting point. This is a tool for teams that are invested in building AI-native applications and want to unlock first-party data alongside other data sources, and it requires a fair bit of operational lift. It's not an incredibly simple, turnkey, turn-it-on-and-go kind of system. It's for when you're saying: I'm serious about this stuff, I need a platform for it, I want to operationalize it, and I have to reason about authentication and authorization. Maybe I haven't figured it out perfectly, but I know what MCP does now, I know what I want to start exposing, and I have a theory about my first projects. Then ToolHive is a great option. But I would definitely start by turning on the native integrations; all of these systems have them, and there are a lot of SaaS-based endpoints out there. Start there and see how far that gets you. If you see the value but don't know how to fully unlock it, if you like it but can't use it because you're an enterprise, and if you want to go beyond that and start building in earnest, reasoning about how to deliver not just a tool but a tool that works for the agent all the time, with a rigorous eval framework and tuned behavior; if you've got a lot of tools and need to reason about context pollution, or you want to start driving context compression, and a variety of things like that: come talk to us.
[00:59:28] Tobias Macey:
And as you continue to build and invest in and grow ToolHive and expand the overall set of offerings from Stacklok, what are some of the things you have planned for the near to medium term, or any particular problem areas you're excited to explore?
[00:59:42] Craig McLuckie:
Yeah, there are a lot of things we're interested in. We've got the core of the platform out there; by the way, it's an Apache 2.0 open source project. It's Stacklok, but Red Hat and a variety of other industry players are starting to participate in it. We've been very happy and excited to work with Red Hat, I should mention; they've put a fair number of engineers on it, and we actually have maintainers from Red Hat. Red Hat will have their own theory about where the runtime goes. But for us, it's about hardening the platform. I'm a Kubernetes guy, so when I say platform I have specific taste about what a platform should look like: a common API-based interface, well factored, well structured, with common naming conventions. So thing one is that we're working very hard on the platformification of the system, so it looks and feels like an actual context platform, or context engineering platform, or MCP platform, whatever you want to call it. The second thing is that we want to continue to enrich the capabilities at every level. We've built out the secure runtime, the proxy layer, the gateway layer, and a lot of the vMCP pieces, the virtual MCP constructs, and we'll continue to put effort into that: enriching the orchestration systems, enriching support for transactional semantics, and making sure the pipeline architecture supports capabilities like PII redaction and turnkey prompt and resource security. There's a lot of work that needs to happen in that space. On the registry, I think we were the first to have an upstream-compatible registry, because we wrote a decent portion of the upstream registry implementation for the recent release, and we certainly contributed a lot as that piece came together. The registry is there, but we'll continue to harden it, and there will be a lot of enterprise fundamentals associated with that. We've worked with the community to make sure there are extensibility pieces in there so you can get attestation; if you need signed images and want to reason about servers flowing through a more rigorous SDLC, those pieces are in place, and we'll continue to invest in them. But the place I'm most passionate about is what we call the cloud console: the ability to provide that interface. Right now, platform engineers wire in the vMCP piece and deliver tools, and people use them. We need to create a self-service experience for developers and citizen coders, and everyone's going to be a citizen coder at some point. How do you actually start to describe your work in a way that's more natural?
How do you start to participate in defining that interface in a much more natural way, both through an AI-native experience, so you can interact with the model and it will do a lot of that work for you, and also through the right visualization? I want to make the system an AI-native application itself, so it's consumable through an MCP endpoint. And then, as you start to get WebMCP-type modularity, you can have a really great experience interacting with it. That's going to be a big part of our focus for what's next.
[01:02:45] Tobias Macey:
And one thing that I saw just recently from the folks at Sentry is that they're working on making their MCP server its own little agent: rather than exposing all of the tools, it has an agent that takes the request and then determines which of the tools it knows about to expose to your agent. I'm wondering what your thoughts are on that pattern as well.
[01:03:09] Craig McLuckie:
Yeah, we like that pattern. We started with relatively simple semantic search; the implementation we have is really there to deal with the tool pollution problem. But there's this other piece which is very interesting, and we certainly spend a lot of time talking about it: how do you acquire privileges for a task? With a lot of these agentic tasks, you don't know a priori what resources they'll need to access. You don't know what they're actually going to do, and you don't necessarily want to give them unfettered access to everything. So there's a human-in-the-loop process initially, but eventually there will be agents watching agents and reasoning about that class of decisions. I don't know where it's going to go. Like I said earlier in this conversation, regulation and policy trail the state of the art by a pretty large amount, so for some classes of work there will have to be human supervision for a while. But I do like that pattern, I think it's very interesting, and as the authentication and authorization systems snap into focus, we will certainly be looking at some of those patterns.
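A minimal version of the semantic-search-over-tools idea looks something like the following: embed each tool description once, embed the incoming request, and expose only the nearest matches. The embedding model and toy tool list are illustrative; this is not ToolHive's or Sentry's implementation.

```python
# Sketch of semantic tool selection to fight tool pollution: embed tool
# descriptions, embed the request, expose only the top-k matches.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

TOOLS = {
    "create_ticket": "Create a new issue in the bug tracker",
    "refund_order": "Issue a refund for a customer order",
    "search_docs": "Search internal documentation for a topic",
    # ...hundreds more in a real deployment, which is exactly the problem
}

model = SentenceTransformer("all-MiniLM-L6-v2")
tool_names = list(TOOLS)
tool_vecs = model.encode(list(TOOLS.values()), normalize_embeddings=True)


def select_tools(request: str, k: int = 3) -> list[str]:
    """Return only the k tools most relevant to this request."""
    q = model.encode([request], normalize_embeddings=True)[0]
    scores = tool_vecs @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [tool_names[i] for i in top]


print(select_tools("the customer wants their money back", k=1))
# -> ['refund_order'] (with this toy data)
```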
[01:04:20] Tobias Macey:
Are there any other aspects of the work that you're doing on ToolHive and at Stacklok, or anything about the MCP ecosystem more broadly, that we didn't discuss yet that you would like to cover before we close out the show? We've covered a lot of ground, and this has been really great. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I would just like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[01:04:51] Craig McLuckie:
Oh, it's skills. It's the talent necessary to make the mindset shift: it's not about what the agent has access to, it's about what the agent knows, how it's contextualized. That takes time. People will build the expertise, but it's expensive to do so, so I think the biggest gap is skills. We provide forward-deployed engineering capabilities, because that's the best way to transfer the skills, and I really think of it as a bootstrapping problem. You need to be able to transfer skills into an organization, so we don't sell the product without FDE support, because we know that's what's required for people to succeed. But it's just bootstrapping: you need to do that enough to get the organization generating its own awareness, and then you can start providing the agents that actually perform those tasks. It's a cold-start problem. You can't deploy the agents until the organization has the capabilities to run them, and you can't necessarily describe the agents until you've spent enough time working with the organization to figure out what they're trying to do. So it all comes back to that activation-energy problem: you have to have someone on call who knows what this looks like and has done it before, because that will reduce your time to outcome by orders of magnitude. It's not hard to go from nothing to a running version of ToolHive with a well-formed set of MCP servers in place.
But getting to the point where those servers are actually tuned to the workflow, generating results, reducing token consumption, and improving agentic task completion requires a different way of thinking. That's always the hardest problem: the zero to one. The platform is easy; we'll deliver that. It's how you use it that requires the skills organizations are struggling with right now.
[01:06:41] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing on ToolHive. Standardizing and simplifying this overall ecosystem of MCP and context engineering for people who are trying to build AI-native applications is definitely a very important set of capabilities, and I appreciate all of the time and energy you're putting into that. I hope you enjoy the rest of your day.
[01:07:07] Craig McLuckie:
Likewise. Thank you so much.
[01:07:13] Tobias Macey:
Thank you for listening. Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management, and Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@AIengineeringpodcast.com with your story.
Introducing Craig McLuckie and Stacklok
From AI denialist to AI maximalist
Rethinking software creation and open source in the AI epoch
MCP as the API for AI‑native applications
Multimodality and the model as the new view layer
Enterprise paths to MCP: from dazzling demos to production
Security realities and the risks of NPX‑installed servers
Tool pollution, token costs, and selective activation
Designing for real workflows: nouns, verbs, and recruiters
ToolHive: composite servers and smarter tool selection
Semantic search, evals, and making smaller models excel
Context real estate and domain semantics
Structured orchestration and transactional guarantees
From unit tests to continuous evals: SRE for AI systems
Models, large action models, and future‑proofing with MCP
What a context engineering platform needs
CLI freedom vs. curated MCP: balancing safety and delight
Credentials, delegated identity, and policy in MCP
Surprising wins: building an MCP knowledge server
Hard lessons: beyond exposing endpoints to real value
When ToolHive is not the right fit
Roadmap: platform hardening, registry, and AI‑native console
Privilege acquisition and agents selecting tools
Closing thoughts and biggest gaps: skills and bootstrapping