AI Engineering Podcast

This show is your guidebook to building scalable and maintainable AI systems. You will learn how to architect AI applications, apply AI to your work, and the considerations involved in building or customizing new models. Everything that you need to know to deliver real impact and value with machine learning and artificial intelligence.

Support the show!

25 February 2026

Kubernetes, Compliance, and Control: The Operational Backbone of AI Sovereignty - E78


Summary 
In this episode of the AI Engineering Podcast, Stephen Watt, leader of the Office of the CTO at Red Hat, discusses practical paths to achieving AI sovereignty for organizations. He shares his two decades of experience in AI, highlighting how governments are building GPU platforms and protected data hubs to maintain control over AI workloads. Stephen emphasizes why self-managed infrastructure is becoming a strategic necessity as companies outgrow cloud costs and require tighter control over models, data, and compliance. The conversation explores the operational substrate for AI sovereignty, including Kubernetes as the scale-out backbone for LLM serving, bridging the gap with PyTorch ecosystems, observability and policy for non-deterministic systems, and emerging security needs such as confidential inference and agentic identity. They also discuss model and hardware optionality (GPUs, CPUs, and new accelerators), the growing demand for energy-efficient inference, and the importance of open models and post-training to create durable differentiation. Stephen identifies access to GPUs as the biggest gap hindering sovereign AI adoption today. The conversation also touches on evolving architectures beyond transformers, the interplay between AI and data sovereignty, consolidation pressures from pilot chaos to standardized platforms, and the societal triad of universities, startups, and sovereign infrastructure.


Announcements 
  • Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
  • Unlock the full potential of your AI workloads with a seamless and composable data infrastructure. Bruin is an open source framework that streamlines integration from the command line, allowing you to focus on what matters most - building intelligent systems. Write Python code for your business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. With native support for ML/AI workloads, Bruin empowers data teams to deliver faster, more reliable, and scalable AI solutions. Harness Bruin's connectors for hundreds of platforms, including popular machine learning frameworks like TensorFlow and PyTorch. Build end-to-end AI workflows that integrate seamlessly with your existing tech stack. Join the ranks of forward-thinking organizations that are revolutionizing their data engineering with Bruin. Get started today at aiengineeringpodcast.com/bruin, and for dbt Cloud customers, enjoy a $1,000 credit to migrate to Bruin Cloud.
  • Your host is Tobias Macey and today I'm interviewing Stephen Watt about how to adapt your existing infrastructure investments to support your AI workloads and gain "AI Sovereignty"

Interview
 
  • Introduction
  • How did you get involved in machine learning?
  • Can you describe what you mean by the term "AI sovereignty"?
  • What are the motivating factors for investing in that as an organizational capability?
  • What do you see as the scale, sophistication, or regulatory triggers that tip an organization from buying off-the-shelf AI services into operating its own AI stack?
  • There has been substantial investment in MLOps toolchains and patterns over the past decade, along with corresponding evolution of LLMOps techniques. What do you see as the areas of overlap between those technology patterns and the "traditional" infrastructure capabilities that organizations have matured over the past ~20 years?
  • What are the aspects that are disjoint and contribute to operational pain for DevOps/platform teams?
  • How do AI/agentic workloads strain the ability of existing security and governance frameworks that teams are operating for existing cloud-native workloads?
  • What are the options for extending those frameworks and what are the requirements that force a new approach? (e.g. guardrails, LLM interpretability, etc.)
  • What are the elements of cloud-native architecture that have left us (as an industry) well situated to absorb the complexity of AI/agentic workloads?
  • How does the complexity shift as you go along the continuum of model training to fine-tuning to inference?
  • Beyond the ability to host and execute inference on a model, it is the various data stores and available tools that make generative AI a competitive advantage. How much of that (e.g. agentic memory, vector stores, MCP/A2A tools, etc.) is actually net new vs. a new coat of paint on existing techniques?
  • What are the most interesting, innovative, or unexpected ways that you have seen teams operationalizing AI workloads on their infrastructure?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on empowering organizations to achieve AI sovereignty?
  • When is operating your own AI infrastructure the wrong choice?
  • What are your predictions for the future evolution of operational substrates for AI workloads?

Contact Info
 

Parting Question
 
  • From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?

Closing Announcements
 
  • Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links
 

The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0


© 2024 Boundless Notions, LLC.