Summary
In this episode of the AI Engineering podcast Anush Elangovan, VP of AI software at AMD, discusses the strategic integration of software and hardware at AMD. He emphasizes the open-source nature of their software, fostering innovation and collaboration in the AI ecosystem, and highlights AMD's performance and capability advantages over competitors like NVIDIA. Anush addresses challenges and opportunities in AI development, including quantization, model efficiency, and future deployment across various platforms, while also stressing the importance of open standards and flexible solutions that support efficient CPU-GPU communication and diverse AI workloads.
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- Your host is Tobias Macey and today I'm interviewing Anush Elangovan about AMD's work to expand the playing field for AI training and inference
Interview
- Introduction
- How did you get involved in machine learning?
- Can you describe what your work at AMD is focused on?
- A lot of the current attention on hardware for AI training and inference is focused on the raw GPU hardware. What is the role of the software stack in enabling and differentiating that underlying compute?
- CUDA has gained a significant amount of attention and adoption in the numeric computation space (AI, ML, scientific computing, etc.). What are the elements of platform risk associated with relying on CUDA as a developer or organization?
- The ROCm stack is the key element in AMD's AI and HPC strategy. What are the elements that comprise that ecosystem?
- What are the incentives for anyone outside of AMD to contribute to the ROCm project?
- How would you characterize the current competitive landscape for AMD across the AI/ML lifecycle stages? (pre-training, post-training, inference, fine-tuning)
- For teams who are focused on inference compute for model serving, what do they need to know/care about in regards to AMD hardware and the ROCm stack?
- What are the most interesting, innovative, or unexpected ways that you have seen AMD/ROCm used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AMD's AI software ecosystem?
- When is AMD/ROCm the wrong choice?
- What do you have planned for the future of ROCm?
Parting Question
- From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
- ImageNet
- AMD
- ROCm
- CUDA
- HuggingFace
- Llama 3
- Llama 4
- Qwen
- DeepSeek R1
- MI300X
- Nokia Symbian
- UALink Standard
- Quantization
- HIPIFY
- ROCm Triton
- AMD Strix Halo
- AMD Epyc
- Liquid Networks
- MAMBA Architecture
- Transformer Architecture
- NPU == Neural Processing Unit
- llama.cpp
- Ollama
- Perplexity Score
- NUMA == Non-Uniform Memory Access
- vLLM
- SGLang
[00:00:05]
Tobias Macey:
Hello, and welcome to the AI Engineering podcast, your guide to the fast moving world of building scalable and maintainable AI systems. Your host is Tobias Macey, and today I'm interviewing Anush Elangovan about AMD's work to expand the playing field for AI training and inference. So, Anush, can you start by introducing yourself?
[00:00:30] Anush Elangovan:
Hey, Tobias. This is Anush. I lead the software team at AMD for GPU software. Excited to be here.
[00:00:37] Tobias Macey:
And do you remember how you first got started working in ML and AI?
[00:00:42] Anush Elangovan:
Yeah. It's about ten years ago. Actually, more than ten years ago. We were building smart devices for IoT, and we were building technology that, you know — this was the early days of ImageNet and things like that. Right? So we were trying to do gesture recognition with the earliest AI concepts. And then one thing led to another, and we started building ML compilers about seven years ago. And then about two years ago, AMD saw the work we were doing in ML compilers and how we were able to target any ML workload on any AI accelerator. And they acquired Nod.ai, which I had founded with a strong team.
And we became a core part of the, AMD software strategy.
[00:01:35] Tobias Macey:
And so in terms of your work at AMD, obviously, you're very focused on the software layer. But for a very hardware focused company, I'm wondering what are the elements of that strategy and how they factor into the competitive advantage that you're able to gain by virtue of focusing on the software elements of a very hardware focused problem?
[00:01:58] Anush Elangovan:
Yeah. I think AMD has traditionally, you know, done a very good job with hardware. They have a pervasive AI strategy that goes from embedded, laptops, and gaming PCs all the way to the data center. And they have a very good cadence of execution on this hardware. Right? Like, they're just very relentless in building good hardware. And about two, three years ago, when the AI boom was taking off, you know, there was a focus to bring all of that hardware together with a software strategy that allowed for, like, a software platform to be built on top of all of this great hardware innovation.
And so that's part of what, you know, our team does, which is build and ship software like a software company, but be part of the AMD innovation engine that builds all of this awesome hardware. So when you pair that awesome hardware with great software, it really unlocks, you know, great business value for customers. And more importantly, it moves the industry forward with respect to the ecosystem it enables, because AMD provides a very open environment for collaboration. And all our source code and software is open source, which allows innovation to go at the pace the practitioners using it want. Right? It is not limited by what we put out in closed source form.
[00:03:32] Tobias Macey:
And on the hardware side of the AI ecosystem, obviously, the competitive landscape has been largely dominated, at least in terms of marketing, by NVIDIA. The CUDA ecosystem has become an entrenched player and gained a lot of adoption because of the fact that it was one of the early movers in the space for numeric computation, which was largely focused on scientific computing and then got parlayed into its current position in the AI landscape. And I'm wondering what you see as the current state of affairs as far as where the attention is going versus the realities as far as the actual capabilities that are provided both at the hardware and in the software layers that are built on top of that. So thinking in terms of the competitive advantages between what CUDA has and its relation to NVIDIA hardware, and the work that you're doing at AMD with the ROCm stack?
[00:04:34] Anush Elangovan:
Very good question. And the first way I'd answer that is, capability wise and performance wise, we are, you know, on par or better. And the data to back that up is we run, like, a million or two models from Hugging Face every night to make sure that all the models run out of the box, you know, without any problems on AMD hardware. So that's one piece of it. And then every model that has been released in the last year — like Llama 3, Llama 4, DeepSeek, Qwen 3 — everything worked out of the box with whatever ROCm version we were shipping at the time. And more importantly, it was performant. So right now, if you go look for, like, what is the best way to serve DeepSeek, which is the most popular open source model, it is on MI300. Right? That is, you know, just performance. Not even counting TCO advantages, nothing else. Right? Just performance wise, it's better than what NVIDIA provides. And so maturity wise, I think we are at a point where it's good. But there is also the perception part and, kind of, like you said, the entrenched ecosystem part. Right? So if you look back in history, you can always kind of think of similar transitions in technology. Right? You know, there was, like, Nokia and Symbian. Right? Like, that was, like, 98%. There was no way that anyone was gonna get anything else. Right? And it was a closed source operating system. There was obviously an inflection point, like, you know, when smartphones came out, and then we had the iPhone and we had Android, and, you know, just kind of, like, overnight, it took over. I wouldn't necessarily say it's exactly like that, but it's a process where once people realize that, hey, this just works, it clicks, and you're like, oh, why would I be doing something in a closed ecosystem? Why would I be paying more for less performance? Right?
Those are all, like, factors that just click, and then it's a snowball effect of, like, okay, now we just deploy with AMD. Right? And AMD has been through this journey with CPUs. In 2017, 2018, you know, we had just, like, 2% of the market share in data center CPUs. Right now, you know, depending on what's out there, almost half the market share is with us. But that was a journey too. Right? At that time, with 98% market share, the incumbent was, you know, like — there's no way that anything could change. We think, or I personally think, the transition and the adoption in GPUs and AI software will be a lot faster than what you'd see in CPUs, because AI is just about speed. It's about how fast you can move, how fast you can deploy, how fast you get the value for what you're using, and how fast the next technology shows up and whatever you deployed is no longer useful. Right? So you need the ability to move fast, and that velocity gives you the ability to iterate faster.
And if we focus on our execution, it's ours to take.
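As a concrete illustration of the "runs out of the box" claim, here is a minimal sketch of loading a Hugging Face model on an AMD GPU with a ROCm build of PyTorch. The model name is just a small, ungated example; the key point is that ROCm builds of PyTorch expose the GPU through the same `cuda` device alias, so no AMD-specific code changes are assumed or needed.

```python
# Minimal sketch: running a Hugging Face model on an AMD GPU with a ROCm build
# of PyTorch. ROCm builds reuse PyTorch's "cuda" device name, so the same code
# runs unmodified on NVIDIA or AMD hardware.
import torch
from transformers import pipeline

# On a ROCm install, torch.cuda.is_available() reports the AMD GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# "gpt2" is just a small, ungated example checkpoint; the Llama or Qwen
# models mentioned in the episode load the same way.
generator = pipeline("text-generation", model="gpt2", device=device)

print(generator("ROCm lets me run this model", max_new_tokens=20)[0]["generated_text"])
```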
[00:07:51] Tobias Macey:
And then another piece of that puzzle is the communication between the CPU and the GPU, and some of the amount of offload that you're able to get onto the GPU, as well as the level of efficiency that you're able to get for CPU compute for some of the less parallelizable elements of model inference and model training. And I'm wondering what are some of the ways that the work that you're doing at ROCm works to optimize some of that communication and offload from CPU to GPU, given that you're able to work both on the CPU and the GPU since it's all from the same manufacturer, versus the situation with Intel and NVIDIA chips where there are two different manufacturers, and so it requires more organizational communication?
[00:08:39] Anush Elangovan:
That's a very good question. The good thing, like, philosophically, is even though the silicon is from AMD — right, the CPU can be an AMD CPU, the GPU can be an AMD GPU, networking could also be from AMD; we make what we call AI NICs — the standards in which we connect them are open. Right? Like, AMD is part of the UALink standard and the UEC, the Ultra Ethernet Consortium. So we want to make sure that we have the ability to build a reference that has all of these, like, cool integrations. But also, if someone were to build something better, they should be able to drop into that ecosystem and unlock the ecosystem, unlike in the competitive space where it's like, hey, this is InfiniBand and only we do this, or, you know, you have to play by these rules. Right? So that's one part, which is the ecosystem aspect of it. The second part of what you asked is important because having the ability to build the CPU, which we have built multiple generations of, and the GPU, which is now, you know, intensely popular, gives us insight into both CPU workloads and GPU workloads, and at what point we transition from CPU to GPU. There's increasingly more focus on test-time compute — things like, you know, where you want to do some sort of post-inference work, or, like, the thinking models that are out there now. Right?
That allows for a new kind of scaling. And we are very well positioned for that because we have the ability to run some of these complex algorithms on the CPU, but we also have the ability to do the heavy lifting in the highly parallelizable infrastructure that the GPU provides.
[00:10:16] Tobias Macey:
Another element of the hardware and software collaboration is in terms of the quantization that you might do at inference time on a model where maybe, natively, it wants to be 16-bit floating point, but you wanna quantize it down for better speed or the ability to run on less capable hardware. I'm wondering how that hardware element factors into the ways that you think about the software layer for either enabling the work of doing that quantization, or being able to improve the efficiency of the compute maybe without having to quantize?
[00:10:54] Anush Elangovan:
Yeah. The way I would look at quantization is it almost gets to an art form, because you have so many hyperparameters to tweak. Right? But you want to be able to preserve the quality of service — like, you know, is your model outputting the same thing? And some of it gets very subjective. Right? Like, I've personally seen deployments where the customer is like, as long as the end user doesn't notice what's going on, you're good. You can quantize it. Right? So then we have some additional aggressive mechanisms to shrink from, like, you know, FP16 to FP8 or even lower. Right? Like, our next generation hardware will have FP4 support, which is four-bit support. The computation itself is in four bits. Right? So your numerical range is just four bits. And just a few years ago, we were doing everything in FP32. From 32 bits to four bits — the industry has shifted drastically.
So we still have some of those SLAs to deliver to and make sure that the model doesn't lose its accuracy. But that's an interesting area of both research and production. Right? Just as an example, there was a time, pre-LLMs, when int4 was considered — like, for a time it was considered that it's gonna come. People put it in hardware. I think the A100 had it or something, and then the H100 removed it because there was, like, no use for int4. And then suddenly int4 just became a thing. Right? LLMs showed up, and now int4 or FP4 is, like, it's real. So there was a cycle of, like, oh, quantization below a particular size is not gonna work out. And then similarly with DeepSeek, where DeepSeek showed the first actual training with FP8, with all the tricks they did to get FP8 training to work. So it's a very exciting space and one that we at AMD are, you know, heavily invested in.
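To make the trade-off concrete, here is a small, self-contained sketch of symmetric per-tensor int4 quantization in PyTorch. This is not AMD's production quantization flow — just an illustration of how a high-precision weight tensor is mapped onto a 4-bit grid and how much reconstruction error that introduces.

```python
# Illustrative sketch (not AMD's production quantizer): symmetric per-tensor
# int4 quantization of a weight matrix, plus a measure of how much error the
# round trip introduces -- the error budget an accuracy SLA has to absorb.
import torch

def quantize_int4(w: torch.Tensor):
    # int4 symmetric range is [-8, 7]: 16 levels, one scale for the whole tensor.
    scale = w.abs().max() / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)  # int8 container
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)              # stand-in for one layer's weights
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)

rel_err = (w - w_hat).abs().mean() / w.abs().mean()
print(f"mean relative error after the int4 round trip: {rel_err:.2%}")
```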
[00:12:54] Tobias Macey:
And then for the people who are in the process of designing their AI stack, they're figuring out what to invest in and what software components they want to rely on. How does the fact that ROCm is open source, versus the proprietary nature of CUDA, factor into the ways that they're thinking about platform risk and the ability to own more of the end to end capabilities of the system?
[00:13:23] Anush Elangovan:
I think the important thing is that AI capabilities are unlocked in a way where different parts of the system factor into what gets unlocked. Right? And let me give you an example. So, for example, AMD hardware — or Instinct hardware — has more than 50% more memory capacity and memory bandwidth. Right? And so what this allows us to do is that we can actually serve LLM decode tokens very fast, because it's all bandwidth limited. But then there are cases where you care about time to first token, which is, like, how long does it take to react first? Right? So it's a balance of, like, how do you get enough of this so that you don't degrade your SLA, but then you wanna run fast and generate, like, an entire story fast. And that requires balancing, not just computational optimization.
It could be, like, power. It could be various other end to end systems that we have to, you know, orchestrate. And the software to manage that gets very interesting and complicated, and we have to build that to be able to unlock this level of, like, customization.
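The "decode is bandwidth limited" point can be seen with back-of-the-envelope arithmetic: each generated token has to stream roughly the full set of model weights through the memory system, so a single stream's peak tokens per second is roughly bandwidth divided by bytes read per token. The numbers below are illustrative assumptions, not measured or official figures.

```python
# Back-of-the-envelope sketch of why LLM decode is memory-bandwidth bound.
# All numbers are illustrative assumptions, not measured figures.

hbm_bandwidth_tb_s = 5.3      # assumed peak HBM bandwidth of the accelerator, TB/s
params_billions = 70          # assumed model size
bytes_per_param = 1           # e.g. FP8/INT8 weights

# Each decoded token reads (roughly) every weight once.
bytes_per_token = params_billions * 1e9 * bytes_per_param
upper_bound_tok_s = hbm_bandwidth_tb_s * 1e12 / bytes_per_token

print(f"rough single-stream decode ceiling: {upper_bound_tok_s:.0f} tokens/sec")
# Real deployments batch many requests so the same weight read is amortized
# across users, which is why throughput versus time to first token is a
# balancing act rather than a single number.
```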
[00:14:38] Tobias Macey:
And then digging a bit more into the software stack that you're building, ROCm is the key element of that, but there are also a few other layers available, such as the HIPIFY tooling for translating CUDA into more portable C++. And I'm wondering if you can just give an overview of all of the components that exist in the ROCm stack specifically and some of the surrounding ecosystem that builds on top of it.
[00:15:07] Anush Elangovan:
So we approach this with, like, a multilayered approach. One is this: there are customers that have invested in, like, NVIDIA down to PTX-level code. Right? They've invested very heavily. So for them, we have the HIPIFY tool, which takes their CUDA code and converts it to portable HIP code. And portable HIP code can, in turn, be compiled back onto NVIDIA systems or onto AMD systems. Right? It literally is making it portable. But then there's a layer on top of it, which is like Triton, which provides you the ability to, like, start from Python and then target AMD or NVIDIA or other GPU platforms.
And Triton provides you this, you know, easy to use, abstracted GPU programming interface that allows for, like, very easy ramp up to get to almost good performance, and then lets the compiler do the rest. So we invest in both. The first layer is, hey, you've got some CUDA code and PTX code — we'll help you translate it. That's HIPIFY. But then if you're coming in on new greenfield deployments, you come in and you have Triton as the way that you implement it, because then it gives you the portability. And it's a modern kernel programming language that you could use to, you know, target either AMD or NVIDIA.
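For a sense of what the Triton path looks like, here is a minimal vector-add kernel; the same Python source compiles for AMD (ROCm) or NVIDIA (CUDA) GPUs through Triton's backends. The block size and grid sizing are just illustrative choices. For the HIPIFY path, the typical entry point is something like `hipify-perl my_kernel.cu > my_kernel.hip.cpp`, which rewrites CUDA API calls into their HIP equivalents.

```python
# Minimal Triton kernel sketch: the same Python source targets AMD (ROCm) or
# NVIDIA (CUDA) GPUs through Triton's compiler backends.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements          # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)       # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

if __name__ == "__main__":
    # "cuda" also addresses AMD GPUs on ROCm builds of PyTorch.
    a = torch.randn(1 << 20, device="cuda")
    b = torch.randn(1 << 20, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```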
[00:16:29] Tobias Macey:
And then one of the main reasons for releasing software as open source is to allow other participants in the ecosystem to contribute back to it, take it, fork it, customize it. And AMD is definitely the main driving force behind ROCm, given that it is targeting AMD hardware primarily. And I'm curious, what are the incentives that exist in the ecosystem for other people to want to contribute back to that and customize it?
[00:17:02] Anush Elangovan:
Yeah. Very good question. So what we've done in the past few months is actually make ROCm more accessible. It's not just open source in the sense of, hey, there's something out there, we'll do whatever we want, update it, and throw the source code over the wall. We truly are moving to an open development model. And what this provides is the ability for anyone to take all of the source code, modify it, and do what they want to. For example, AMD's Strix Halo laptops are very popular laptops and desktops because, you know, you get 128 gigs of RAM that can be shared between your CPU and GPU. And the ROCm builds were getting more and more robust, with a solid CI and the ability to build. And the community was able to add Windows support for ROCm, which we were slowly working towards, but the community was able to pick it up and accelerate that pace. And so now, as of yesterday, we have someone who just tweeted about how they are running PyTorch on Windows, without WSL, on ROCm. Right? Like, that's the power of the open source ecosystem.
And we don't want to hold up innovation and speed. And that is why we think, you know, we'll run the race faster as we speed up, because it's just gonna be more people scratching their own itch, doing the thing that they want to do. But then it lifts all boats, and then all boats move faster. Right?
[00:18:33] Tobias Macey:
Another aspect of the ecosystem for AI hardware is that there are different load patterns and different requirements around the different stages of the life cycle, from pretraining to post-training to fine tuning to just running it as inference. And different consumers at different stages of that life cycle have a different relationship to the underlying hardware, where as an end user who is just using an API for AI inference, I don't necessarily care what the actual hardware is under the covers there. I just care that I get some reasonable response times. Whereas if I'm somebody who is investing a lot into pretraining a substantial model, I'm going to be much more focused on what the hardware is, the software stack on top of it, etcetera. And then in the middle, if I'm a company that's running my own vLLM or SGLang instance for being able to customize my inference, I'm going to maybe be somewhere in the middle where I've got a cloud provider, and I just say, just give me some GPUs. And I'm wondering how you think about the ways that you're interacting with people along those different stages of that life cycle.
[00:19:46] Anush Elangovan:
That's a very good question, and you've, like, segmented the market. The thing for AMD is, like, we plan to go through with every one of those three users. Right? So there'll be different routes to market, but we'll get to the end user. And those end users, if they consume from an API — it's just gonna be, yeah, DeepSeek R1, the fastest and the most throughput, is on MI300 today. And so for the end user, they're just gonna see their tokens per dollar, you know, the cost per token, to be low, and it's just value provided with that. Right? And similarly, we're building out how you do distributed inference so you can get even better economies of scale by disaggregating, you know, prefill, decode, etcetera.
But then on the other side, we also help with, hey, you want GPU as a service, and then you need an entire stack on top. We provide the reference architecture of that stack so that it's not like, hey, we'll give you a GPU and we don't know what to do on top of it, so you gotta take all the pieces, put them together, tie it together, and then, you know, move it forward. And then the third one is, obviously, it's on prem or they have their own clusters, and, you know, they know exactly what layer of the software they need, and they'll replace something. They'll have their own Kubernetes deployment layer. They'll have their own training framework that they've modified, or, you know, JAX. They're, like, the super sophisticated frontier labs. Right? And I think if you look at, like, the top 10 customers, those would, you know, fall in that bucket. And a vast majority of them do have AMD deployments.
So it is, you know, a growing phase where, I think — it's just like what we did with EPYC. Right? Like, it's just a matter of time before it's an option. And then the others would see a TCO advantage by just saying, hey, yeah, if I serve on MI300, I get more performance and it costs me less, so I'm just gonna serve the request on an MI300 backend. So that's how, you know, I look at it.
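For the middle tier — teams that just want GPUs plus a serving layer — the workflow is typically an open source engine such as vLLM or SGLang on top of the ROCm stack. Here is a hedged sketch of what offline batch inference with vLLM looks like; the model name and sampling settings are only examples. For an OpenAI-compatible endpoint, the same engine is usually launched as a server instead (e.g. `vllm serve <model>`).

```python
# Minimal sketch of inference with vLLM; on AMD hardware this runs on top of
# the ROCm build of vLLM. Model name and sampling parameters are examples only.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")   # example checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what ROCm is in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)
```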
[00:21:53] Tobias Macey:
Another element of the software investment for powering these particularly numeric and parallelizable workloads that are epitomized by large language models is that there has been a lot of effort going into trying to reduce the computational load for training and for inference, either through different attention mechanisms, or the work that's going on with mixture of experts, and then also some of the newer architectural patterns where people are trying to move beyond transformers. I'm thinking in particular of Mamba, or the diffusion based models, or the work that the liquid networks folks are doing. And I'm wondering how that feeds back into the ways that you're thinking about the software stack that is able to improve the efficiency or the capabilities of these different model architectures and these different compute patterns that get translated to the underlying hardware.
[00:22:52] Anush Elangovan:
Yeah. That's a very good question. The way I look at it is, it's like layered cheese. Right? You just have to invest in each layer to be robust and solid so that when there's a switch from transformers to Mamba or SSMs or, you know, other architectures — it's not a question of if, it's a question of when — when that happens, we already have the underlying infrastructure and the core compute, you know, disaggregation, and how we network all of this compute. Those libraries and that infrastructure are robust enough for a switch at the top level, right, at the modeling level. But then on top, again, there'll be another level of, like, you know, bringing it back to an API. Right? Like, in the end, the last mile for the end user is gonna be an API. And that's gonna be, you know, something like an OpenAI-style API for LLMs. But that can also change and evolve over time. But there's a fan out in between the very low levels of the software stack, and then as you get to the middle layer, there are paths of innovation that, you know, we explore.
And I can tell you, at AMD and, you know, from my view, all of those paths that you talked about, we are very well invested in, and we bring them up there. So the way I approach it is: how do you be prepared to react, and how fast can you react? That's gonna be the currency for being relevant in this fast moving AI world.
[00:24:27] Tobias Macey:
And then the other potential target for an end user is people who wanna be able to do their own local inference on a laptop using these open models, and they wanna be able to do it reasonably and affordably without having to run their own power plant to power the cards. And I'm wondering how you're thinking about that consumer grade aspect of AI and the ways that people are able to run these models on their local laptops, on their desktop machines, or on edge compute — bringing this into more of a personalized mode without having to necessarily be an AI researcher and understand all of the things they need to know to quantize the models and fine tune them to run effectively on the silicon of whatever device they might have ready to hand.
[00:25:18] Anush Elangovan:
Yeah. So AMD has a very pervasive AI story. Right? Like, we go all the way from embedded and automotive — like, autopilot devices — to laptops, like the Strix Halo laptop that I was talking about. It has, like, 50 TOPS of compute on an NPU. So the NPU — think of it as: when you can trade programmability for efficiency, you go for an NPU. When you want the programmability, you use an iGPU. But then you also have a CPU, which is your fallback. Right? And all of those pieces of silicon are AMD built and in the laptop form factor. So, in fact, I have my Strix Halo laptop over here, and it's got all of that tied together, and the software stack that runs everything on it is open source. Right? So it's not like, hey, install CUDA 5.8 or something. Right? If you wanna build it, you can take it, build it, and tinker with it. Right? So it gives you that ability, and we see incredible innovation with, you know, the likes of llama.cpp and Ollama and the client side of inferencing, where they go in and make these big innovations, especially in quantization.
Right? So, going back to your original point, llama.cpp was the first one to make, like, Q8, Q6, Q5 — and you could change only a few tensors to be this, and your perplexity looks okay. So it almost gave you, like, per-tensor level fidelity to be like, okay, this model can now run on this. And then there's a big lever for how much VRAM you have. So they map VRAM to this, and they find a sweet spot of a model that can run locally. And I think, increasingly, we're gonna see even algorithmic innovations where you have, like, actor-critic models or student-teacher models. And I think, if I remember right, Llama 4 Behemoth was also used in a similar way, right, where you use the very large model to be the teacher, and then your small model distills and learns from the teacher. But it's a smaller model. So there are, like, innovations happening at each layer of the cheese. Right? At the hardware level, at the quantization level, at the modeling level. Put all of them together, and I think, increasingly, you'll see a lot of local AI.
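The "map VRAM to a quantization level" idea can be made concrete with a rough heuristic: the weight footprint is roughly parameter count times bits per weight, plus headroom for the KV cache and runtime. The helper below is hypothetical — it is not part of llama.cpp or Ollama — and the bits-per-weight figures are approximations of common llama.cpp quant formats.

```python
# Hypothetical helper (not part of llama.cpp or Ollama): pick the highest-fidelity
# quantization level whose rough weight footprint fits the available VRAM.
QUANT_BITS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}  # approx bits/weight

def pick_quant(params_billion: float, vram_gb: float, overhead_gb: float = 2.0) -> str:
    """Return the best-fitting quant level, trying the largest bits/weight first."""
    for name, bits in sorted(QUANT_BITS.items(), key=lambda kv: -kv[1]):
        weight_gb = params_billion * 1e9 * bits / 8 / 1e9
        if weight_gb + overhead_gb <= vram_gb:
            return name
    return "model too large for this VRAM budget"

# Example: an 8B model vs. a 70B model on a 16 GB GPU.
print(pick_quant(8, 16))    # a high-quality quant fits
print(pick_quant(70, 16))   # does not fit at any of these levels
```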
[00:27:32] Tobias Macey:
And in the work that you're doing of building this software ecosystem on top of the core silicon capabilities, and the work that you're doing to make it open and accessible and flexible, what are some of the most interesting or innovative or unexpected ways that you're seeing that combination of AMD hardware and the ROCm ecosystem applied?
[00:27:54] Anush Elangovan:
Innovative ways that we're seeing it used that we hadn't envisioned? So, the cool things I've seen — like, I've seen an AMD GPU, a normal PCIe card, stuck on, like, a RISC-V CPU, and they built a little thing with an AMD GPU that wasn't even designed to run on a host with RISC-V. But given it's open, they could actually port the driver and get it to work. And they now have a solid graphics card running on a completely new CPU architecture because of the open ecosystem that we have at AMD. There's also the Windows portability I mentioned. Right? Like, you know, the community did a lot of the work to get Windows portability for ROCm, and now we have ROCm that just works on Windows. We do have the likes of tinygrad, where they are tying a lot of AMD cards together to put together a cost efficient petaflop machine.
That's also very exciting. And there's a lot of innovation in the application space with local AI, where you can run your, like, coding assistant on your laptop, and it's answering questions in real time. That's really cool to see. You know, things like that — and there's more coming every day. You just have to keep up with it.
[00:29:21] Tobias Macey:
And in terms of your own experience of working in this ecosystem, working very closely with the AMD hardware units and the end users and the software contributors, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:29:40] Anush Elangovan:
I think, as ROCm and AMD software have a long history and have evolved, there's a perception hurdle that we had to get over. I literally am on tech Twitter listening for anyone complaining about AMD software or ROCm software — this is not working, or that's not working — and I try to navigate and get them the support that they need. So it requires a little bit of reeducating and asking folks that may have formed an opinion on AMD software to see where we are now. And that's always a little bit of a thing, right, because you've had an experience that was subpar at some point, but now you're asking for a second chance. Navigating that is interesting. And then just communicating what we are doing. Like, do people know about ROCm? Unlikely. Right? Do people know what's the best GPU to serve DeepSeek on? Unlikely.
But that should be straightforward. It should be like, the fastest way to serve DeepSeek is on AMD. That's a fact, and it's just about how we market that and get it out. Right? So, hopefully, we'll work on that and help people realize the value of what they can get with AMD hardware and software.
[00:30:52] Tobias Macey:
And talking about the speed of inference just made me think about another element of the serving piece, which is that oftentimes you wanna be able to run multiple copies of the model if you're serving a number of different users simultaneously, which then brings in questions around being able to split the models across multiple GPU instances, or being able to coordinate multiple GPU instances to pool together into a single batch of compute. And, obviously, AMD is being used in various supercomputing ecosystems. I'm wondering how you're thinking about the ease of pooling together multiple hardware units to manage this inference serving or for training.
[00:31:39] Anush Elangovan:
Very good question. But, you know, I would actually take it a level lower. AMD is the industry leader in chiplet design. Right? Our competition is gonna do that in a year or two, right, where they actually get chiplets. The MI300 today, and the MI250 before it — you know, for a few generations we have built multiple chips that go into one big package. And so what that provides is that we are already doing that at a chip level. Today's MI300 has something called CPX mode, which gives you eight chips with 24 gigs of HBM each that you can actually partition as independent GPUs, and you have NUMA locality to the CPU. It is actually, like, physically, you turn off the connections and you have a GPU. Right? And so what that provides is the ability to build from fundamental building blocks, even at the SoC level. And then that gets up to the GPU level, and you have eight GPUs at a UBB level. And then you get to a multi-node rack level, and then from a rack to a cluster level. And if you build each of those from first principles of how you do communication and compute, you start building a very, very robust, you know, software architecture. And there are challenges in how AMD had done it in the past that we have, you know, evolved past over time. But today, going back to chiplet designs, you could actually partition each one of these GPUs, and we have instances of vLLM or SGLang running on each one. So you have eight instances of vLLM serving Llama 8B, for example, and it's fully isolated. Right? So you get eight times the throughput just by parallelizing, you know, at the chiplet level. So it's very exciting for computer scientists and for folks that want to, like, deal with operating system principles — resources and allocation and how you're efficiently using compute and communication. Right? So AMD is built for that, is what I'd say. If you are in that space, we're obviously hiring. Drop me a note and blah blah blah. There's a lot of fun engineering to do with that.
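A hedged sketch of what that partition-per-instance pattern could look like operationally: launch one serving process per GPU partition, pinning each process to its partition through ROCm's device-visibility environment variable. The port numbers, model, and partition count are assumptions for illustration, not a documented AMD deployment recipe.

```python
# Hypothetical launcher sketch: one vLLM server per GPU partition, each pinned
# to its partition via ROCm's device-visibility environment variable. The ports,
# model, and partition count are illustrative assumptions.
import os
import subprocess

MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # example model
NUM_PARTITIONS = 8                           # e.g. an MI300 in a partitioned (CPX-style) mode

procs = []
for i in range(NUM_PARTITIONS):
    env = dict(os.environ, HIP_VISIBLE_DEVICES=str(i))  # expose only partition i
    procs.append(subprocess.Popen(
        ["vllm", "serve", MODEL, "--port", str(8000 + i)],
        env=env,
    ))

for p in procs:
    p.wait()
```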
[00:33:47] Tobias Macey:
And given all of the conversation we've had around the benefits of AMD hardware and the ROCm ecosystem, what are the cases where one or the other or both are the wrong choice?
[00:33:59] Anush Elangovan:
If there is a wrong choice, let me know. I'll make it the right choice. But I think there are cases where, you know — see, the ROCm software should become invisible. It should just become magic. Right? It should just be, hey, there are hardware resources, and there are APIs to do something. Right? And then you can do something on top. And what that something is, is like vLLM serving. And then on top of it, you build something else. Right? So ROCm as an interface should just disappear. Like, people should not have to think about it as a moat, because it's open source, and it should not be thought about as, like, buggy, because it's just there. It's like ambient enablement of the compute and networking.
And then you get to the GPUs themselves. GPUs have their pluses and minuses. Right? So you've always had — and this goes back to classic computer science trade-offs — like, hey, do you need programmability, or do you need specificity? Right? So do you go general purpose GPU, general purpose CPU, or do you go for ASICs? You always have that, you know, trade-off. And the pace of innovation keeps GPUs as the sweet spot where that innovation will happen for the foreseeable future. And then, yes, if you want to serve some specific customer with something and you wanna build an ASIC, it takes you two years. If you don't care that two years from now you still wanna be serving OPT 175B, fine, build an ASIC that serves OPT 175B. But it may be irrelevant. So then you try to make that a little more programmable. Then you're like, oh, but it's a little programmable, but it's not fully programmable.
Then it's a trade-off. Then you're like, okay, you're not an ASIC, you're not here, but you're gonna be halfway there. And then, you know, the halfway point is always, like, the one-off benchmark. You look at some of these ASIC vendors, and they're like, oh, we serve Llama 20x faster. But if you go read the fine print, it's like, yeah, Llama has to be only 2K context length, while GPUs are serving, like, million-token context lengths. So you're saying all your requests will have to fit into, like, a little envelope, and then we'll serve these envelopes very fast. Sure, if that's what is needed for your customer, great. But I think, in general, it's classic general purpose programmability, and I think there's a market for that. And that's a huge market. It'll be good for that. But that doesn't mean we don't do semi-custom or custom. Right? Like, you know, all of the big consoles use AMD. And so we are there for the entire journey, whether that's on the hardware side or the software side for AI.
[00:36:45] Tobias Macey:
As you continue to invest in the future direction and capabilities of the hardware layer and the software support to go along with that, and the ways that the AI ecosystem is continuing to stretch and strain those software components, what are the things you have planned for the near to medium term for your own work?
[00:37:06] Anush Elangovan:
Near to medium term, ROCm quality is really high on my mind. It should just be, like, nobody should complain about ROCm quality. That's number one. It should be easily hackable and usable. It should be, like, the hobbyists and developers need to end up there. Because why? Because it's easy to use, and I can just hack it. If something doesn't work, I'll go ahead and fix it. Right? Like, that should be the mentality. And then we have the developer outreach, where we wanna go and get everyone to taste and explore and feel, oh, this is what it feels like to use ROCm, and then go use that. So those are the three that I'd focus on. But then, obviously, on the other side, I wanna make sure customers are successful. Right? That's where, you know, the rubber meets the road. If the customer is not successful, then that's, you know, not good either.
[00:37:52] Tobias Macey:
Are there any other aspects of the ROCm ecosystem, the work that you're doing at AMD, or the role of AMD in this broader AI ecosystem that we didn't discuss yet that you'd like to cover before we close out the show?
[00:38:06] Anush Elangovan:
I think this is a good time to be in this space, and we are at an inflection point of both hardware innovation and software innovation — how we touch people and what it means for, like, humanity at the top level. Right? Because, increasingly, you're gonna be AI assisted in some form. Right? Like, composing an email to do something somewhere is just you talking to your device or your chatbot to say, hey, write this up for me, and it's gonna give you most of the framework, and then you're gonna go and add bells and whistles. Right? So it's gonna creep up on you in a way that is gonna be profound. And five, ten years from now, people will assume that's how things are done. Right?
So otherwise, you know, from an AMD standpoint, hardware and software, it's just one step in front of the other, and, you know, we're just gonna keep executing and be there for this transition in this industry.
[00:39:05] Tobias Macey:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[00:39:24] Anush Elangovan:
I think it's a combination of all of them. Right? And I think, when we look back after this transition — and this can be, like, twenty years, right, it can be over a longer arc of time — it's going to require a combination of all three of them. Like, people will have to retrain to be able to use a new way of communication. Right? It's like email, but it's assisted, and it already knows what the other person asked, and you've already got a draft that's composed saying this, this, this, or this. Right? And your job is, like, swipe left, swipe right on, like, okay, this draft or that draft. And so it may distill down to that. But then, when it comes back to the fundamentals of, like, compute, and power to power that compute and AI services, it's gonna be another layer of learnings and development of new tools. The way I look at AI is, it's like electricity.
And now we're like, oh, here's electricity. Go do what you have to do. And there's a whole wide world of what we have explored and done with electricity. And we are the same — you know, we are just at the start of the AI era, where we're like, oh, we got AI. Here are frontier models that can do this. And then, like, is it Cursor and, what is it, Windsurf for coding? Is it the next version of Tanvar to do AI-based something? How is salesforce.com gonna be working in an AI agentic world? Right? Like, that's up for someone to go innovate and figure out. So it's very exciting to be in this space.
[00:41:08] Tobias Macey:
Absolutely. Well, thank you very much for all of the time and effort you're putting into broadening the availability for people to expand the set of hardware that they're operating on and the ability to tinker with it and understand more about the end to end stack. It's definitely a very interesting problem space. It's great to see that AMD is being so open with their work there. So, appreciate the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.
[00:41:35] Anush Elangovan:
Thank you for having me, Tobias. Thank you.
[00:41:42] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the AI Engineering podcast, your guide to the fast moving world of building scalable and maintainable AI systems. Your host is Tobias Macy, and today I'm interviewing Anush Ilangavan about AMD's work to expand the playing field for AI training and inference. So, Anush, can you start by introducing yourself?
[00:00:30] Anush Elangovan:
Hey, Tobias. This is Anush. I lead the software team at AMD and for GPU software. Excited to be here.
[00:00:37] Tobias Macey:
And do you remember how you first got started working in ML and AI?
[00:00:42] Anush Elangovan:
Yeah. It's about, ten years ago. Actually, more than ten years ago. We were building smart devices, for IoT, and, and we were building technology that, you know, is this the early days of ImageNet and and things like that. Right? So we were trying to do a gesture recognition with, the earliest, early AI concepts. And then one thing led to another, and we started building ML compilers about, seven years ago. And then about two years ago, AMD saw the, the work we're doing in ML compilers and how we were able to target any ML workloads on any AI accelerator. And, they acquired, Nord dot AI, which I had founded with, with a strong team.
And we became a core part of the, AMD software strategy.
[00:01:35] Tobias Macey:
And so in terms of your work at AMD, obviously, you're very very focused on the software layer. But for a very hardware focused company, I'm wondering what are the elements of that strategy and how they factor into the competitive advantage that you're able to gain by virtue of focusing on the software elements of a very hardware focused problem?
[00:01:58] Anush Elangovan:
Yeah. I I think, AMD has traditionally, you know, done a very good job with hardware. They have a pervasive AI strategy that that goes from embedded laptops gaming PCs, all the way to the data center. And they have a very good cadence of execution of the this hardware. Right? Like, they they're just, like, very relentless in in building good hardware. And about two, three years ago, when, the AI boom was taken off, You know, there was a a focus to bring all of that hardware together with a, software strategy that allowed for, like, a software platform to be built on top of all of this great hardware innovation.
And so that's part of what, you know, our team does, which is build out software and ship software, like a software company, but be part of the AMD, like, innovation engine that builds all of this, awesome hardware. So when you pair that awesome hardware with great software, then it really unlocks, you know, great business value for customers. And and more importantly, it moves the industry forward with respect to, you know, the ecosystem it enables because AMD provides a very open environment for collaboration. And all our source code and software is open source, which allows for, innovation to go at the pace at which the people using our practitioners want to. Right? That is not, limited by the, the ability of what we we put out in closed source form.
[00:03:32] Tobias Macey:
And on the hardware side of the AI ecosystem, obviously, the competitive landscape has been largely dominated at least in terms of marketing by NVIDIA. The CUDA ecosystem has become an entrenched player and gained a lot of adoption because of the fact that it was one of the early movers in the space for numeric computation, which was largely focused on scientific computing and then got parlayed into its current position in the AI landscape. And I'm wondering what you see as the current state of affairs as far as where the attention is going versus the realities as far as the actual capabilities that are provided both at the hardware and in the software layers that are built on top of that. So thinking in terms of the competitive advantages between what CUDA has, it and its relation to NVIDIA hardware and the work that you're doing at AMD with the RACM stack?
[00:04:34] Anush Elangovan:
Very good question. And, just the the first way I'd answer that is capability wise and performance wise, we are, you know, on par or better. And the data to back that up is we we run, like, a million or two models from, from Hugging Face every night to make sure that all the, models run out of the box, you know, without any problems on AMD hardware. So that's that that's kind of pick a thing. And then, and then every model that has been released in the last year, like, llama three, llama four, deep sea, QEM three, everything worked out of the box with whatever ROC conversion we were shipping at the time. And more importantly, it was performant. So right now, if you go look for, like, what is the best way to serve DeepSeek, which is the most popular open source model, it is on MI 300 with Right? That that that is, you know, just performance. Not even no no TCO advantages, nothing else. Right? Just performance wise, it's better than what NVIDIA provides. And so maturity wise, I think we are at a point where it's it's, it's good. But there is also the perception part and and kind of, like, the, like you said, the entrenched, ecosystem part. Right? So if you look back in history, you can always, kind of think of similar transitions in in in technology. Right? You know, there was there was, like, Nokia and Symbian. Right? Like, that was, like, 98%. There was no way that anyone's gonna get anything else. Right? And it was a closed source operating system. It was obviously, there was a inflection point of, like, you know, when smartphones came out and, you know, then then we had the iPhone and we had Android and, you know, just kind of, like, overnight, it took over. But I wouldn't necessarily just say it's exactly like that, but it's a it's a process that once people realize that, hey. This just just works. It clicks and you're like, oh, why would I why would I be doing something in a closed ecosystem? Why would I be paying more for less performance? Right?
Those are all, like, factors that it just disconnects, and then then it's a, snowball effect of, like, okay. Now we just deploy with AMD. Right? And and and AMD has been through this journey with CPUs in 2017. I think 2017, 2018, you know, is just, like, 2% of the market share in data center CPU. Right now, it's, you know, like, it's almost I'd say, I I depending on what's, out there right now, it's almost half the market shares, like, with us. But, that was a journey too. Right? At that time, with 98% market share, the incumbent was, you know, like, there's no way it that anything could change. We think or I personally think the transition and the, adoption in, GPUs and AI software would be a lot faster than what you'd see in CPUs because AI is just about speed. It's about how fast you can move, how fast you can deploy, how fast you get the value for what you're using, and how fast the next technology shows up. And whatever you deployed is no longer useful. Right? So you need the ability to move fast, and that velocity gives you the ability to iterate faster.
And if we focus on our execution, it's, it's ours to, take.
[00:07:51] Tobias Macey:
And then another piece of that puzzle is the communication between the CPU and the GPU and some of the amount of offload that you're able to get onto the GPU as well as the level of efficiency that you're able to get for CPU compute for some of the less parallelizable elements of the model inference and model training. And I'm wondering what are some of the ways that the work that you're doing at ROC m works to optimize some of that communication and offload from CPU to GPU given that you're able to work both on the CPU and the GPU, given that it's all from the main same manufacturer versus the situation with Intel and NVIDIA chips where there are two different manufacturers, and so it requires more organizational communication?
[00:08:39] Anush Elangovan:
That's a very good question. The the good thing, like, philosophically, even though the silicon is from AMD, right, the CPU can be an, AMD CPU. The GPU can be an AMD GPU. Networking could also be from AMD. Right? We make what we call AI NICs. But the standards in which we connect them are open. Right? Like, AMD is part of the UALink standard, the UEC, consortium, the Ultra Ethernet consortium. So we want to make sure that we have the ability to build a reference that has all of these, like, cool integrations. But also, if someone were to build something better, they should be able to drop into that ecosystem and unlock the ecosystem, unlike in the competitive space where it's like, hey, this is the InfiniBand and only we do this. Or, you know, you can, you know, you have to play by these rules. Right? So that's, that's one, which is the ecosystem aspect of it. The second part of what you asked was important because having the ability to build the CPU, which we have built multiple generations of and GPU is now, you know, it's intensely popular. It gives us the insight into both CPU workloads, GPU workloads, and at what point do we transition from CPU to GPU. There's increasingly more focus on test and compute, things like, you know, where where you want to do some sort of post inference or, like, the thinking models that are out there now. Right?
Either that that allows for a a new kind of, scaling. And we are very well positioned for that because we have this ability to run some of these complex algorithms on the CPU, but we also have the ability to do the heavy lifting in the highly parallelizable, infrastructure that the GPU provides.
[00:10:16] Tobias Macey:
Another element of the hardware and software collaboration is in terms of the quantization that you might do for inference time on a model where maybe, natively, it wants to be floating point 16 bit, but you wanna quantize it down for better speed or the ability to run on less capable hardware. I'm wondering how that hardware element factors into the ways that you think about the software layer for either enabling the work of doing that quantization or being able to improve the efficiency of the compute maybe without having to quantize? Yeah.
[00:10:54] Anush Elangovan:
The way I would look at quantization is it almost gets to an art form because you have so many hyperparameters to tweak. Right? But you want to be able to preserve the quality of service of, like, you know, is your model outputting the same, thing? And and and some of it gets very subjective. Right? Like, I I've personally seen deployments where the customer is like, as long as the end user doesn't notice what's going on, you're good. You can quantize it. Right? So then we have some additional aggressive mechanisms to shrink from, like, you know, float 16 to float eight or even, lower. Right? Like, our next generation hardware will have FP four support, which is four bit support. The computation itself is in four bits. Right? And that increasingly gets, like, your your numerical range is just four bits. And and just a few years ago, we were doing everything in FP 32. From 32 to four bits is, like, the industry shifted drastically.
So we still have some of those, SLAs to deliver to and make sure that the model doesn't lose its accuracy. But but that's an interesting area of both research and and production. Right? Just an as an example, like, there was a time, pre LLMs, that int four was considered. Like, you know, int four is for for a time it was considered, it's gonna come. People put it in, in in hardware. I think a 100 had it or something, and then h 100 removed it because there's, like, there's no use for info. And then suddenly, info just became a thing. Right? And LLM showed up, and now info or FP four is, like, it's real. So there was a cycle of, like, oh, quantization below a a particular size is not gonna work out. And then similar to DeepSeek, where DeepSeek showed the first actual training with FP eight with all the tricks they did to get FP eight training to work. So it's a it's a it's a very exciting space and and one that we, at AMD, you know, heavily invested.
[00:12:54] Tobias Macey:
And then for the people who are in the process of designing their AI stack, figuring out what to invest in and which software components they want to rely on, how does the fact that ROCm is open source, versus the proprietary nature of CUDA, factor into the ways that they're thinking about platform risk and the ability to own more of the end-to-end capabilities of the system?
[00:13:23] Anush Elangovan:
I think the important thing is that AI capabilities get unlocked in a way where different parts of the system factor into what gets unlocked. Let me give you an example. AMD Instinct hardware has more than 50% more memory capacity and memory bandwidth. What that allows us to do is serve decode LLM tokens very fast, because decode is all bandwidth limited. But then there's also time to first token, which is, how long does it take to respond at all? So it's a balance: how do you get enough of this so that you don't degrade your SLA, but then run fast and generate an entire story quickly? And that requires balancing more than just a computational optimization.
It could be power. It could be various other end-to-end systems that we have to orchestrate. And the software to manage that gets very interesting and complicated, and we have to build that to unlock this level of customization.
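A back-of-envelope calculation shows why decode throughput is bandwidth bound while time to first token is not; the bandwidth and model-size figures below are placeholder assumptions, not measured numbers for any specific accelerator.

```python
# Back-of-envelope estimate of why decode is memory-bandwidth bound.
# The numbers are illustrative placeholders, not measured hardware specs.
hbm_bandwidth_gb_s = 5300        # assumed HBM bandwidth of one accelerator, GB/s
model_bytes = 70e9 * 2           # assumed 70B-parameter model at FP16 (2 bytes/param)

# Each generated token must stream roughly all of the weights through memory once,
# so an upper bound on single-stream decode speed is bandwidth / model size.
tokens_per_second = hbm_bandwidth_gb_s * 1e9 / model_bytes
print(f"~{tokens_per_second:.0f} tokens/s upper bound per decode stream")

# Prefill (time to first token) is compute bound instead: the whole prompt is
# processed in parallel, so FLOPS and batching, not bandwidth, dominate that SLA.
```

More memory capacity and bandwidth raises that decode ceiling and lets larger models or bigger batches fit on fewer devices, which is the balance being described here.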
[00:14:38] Tobias Macey:
And then digging a bit more into the software stack that you're building, ROCm is the key element of that, but there are also a few other layers available, such as the HIPIFY capability for translating CUDA into more portable C++. I'm wondering if you can give an overview of all of the components that exist in the ROCm stack specifically and some of the surrounding ecosystem that builds on top of it.
[00:15:07] Anush Elangovan:
So we approach this in a multilayered way. One is this: there are customers that have invested in NVIDIA down to PTX-level code. They've invested very heavily. So for them, we have the HIPIFY tool, which takes their CUDA code and converts it to portable HIP code. And portable HIP code can, in turn, be compiled back for NVIDIA systems or for AMD systems. It literally makes it portable. But then there's a layer on top of that, which is Triton, which gives you the ability to start from Python and then target AMD or NVIDIA or other GPU platforms.
And Triton provides this easy-to-use, abstracted GPU programming interface that allows a very easy ramp-up to almost-peak performance, and then lets the compiler do the rest. So we invest in both. The first layer is, hey, you've got some CUDA code and PTX code, we'll help you translate it. That's HIPIFY. But if you're coming in on new greenfield deployments, you come in with Triton as the way you implement it, because then it gives you the portability. And it's a modern kernel programming language that you can use to target either AMD or NVIDIA.
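For readers unfamiliar with Triton, here is the kind of kernel being described: written once in Python and compiled by Triton's backend for either AMD or NVIDIA GPUs. This is a standard vector-add example, shown purely as an illustration of the programming model.

```python
# A minimal Triton kernel (vector add): one Python source, portable across GPU vendors.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                 # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# On a ROCm build of PyTorch, the "cuda" device string maps to the AMD GPU.
x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
assert torch.allclose(add(x, y), x + y)
```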
[00:16:29] Tobias Macey:
And then one of the main reasons for releasing software as open source is to allow other participants in the ecosystem to contribute back to it, take it, fork it, customize it. And AMD is definitely the main driving force behind ROCm, given that it targets AMD hardware primarily. I'm curious what incentives exist in the ecosystem for other people to want to contribute back to it and customize it.
[00:17:02] Anush Elangovan:
Yeah. Very good question. So what we've done in the past few months is actually make ROCm more accessible. It's not just open source in the sense of, hey, there's something out there, we'll do whatever we want internally, update it, and throw the source code over the wall. We truly are moving to an open development model. And what this provides is the ability for anyone to take all of the source code, modify it, and do what they want with it. For example, AMD's Strix Halo laptops are very popular laptops and desktops because you get 128 gigs of RAM that can be shared between your CPU and GPU. And the ROCm builds were getting more and more robust, with a solid CI and the ability to build. The community was able to add Windows support on ROCm devices, which we were slowly working towards, but the community was able to pick that up and accelerate the pace. And so now, as of yesterday, we have someone who just tweeted about how they are running PyTorch on Windows, without WSL, on ROCm. That's the power of the open source ecosystem.
And we don't want to hold up innovation and speed. That is why we think we'll run the race faster as we open up, because it's just gonna be more people scratching their own itch, doing the thing that they want to do, but then it lifts all boats, and all boats move faster. Right?
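For context on what "running PyTorch on ROCm" looks like in practice: ROCm builds of PyTorch reuse the familiar torch.cuda API, so a quick sanity check like the sketch below is usually all that's needed to confirm the AMD GPU is visible. This is a generic illustration, not tied to any specific Strix Halo or Windows build.

```python
# Quick check that a ROCm build of PyTorch sees the AMD GPU. ROCm builds reuse the
# torch.cuda namespace, so existing CUDA-targeted scripts usually run unmodified.
import torch

print("HIP version:", torch.version.hip)          # populated on ROCm builds, None on CUDA builds
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    print("Matmul ok:", (x @ x).shape)
```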
[00:18:33] Tobias Macey:
Another aspect of the ecosystem for AI hardware is that there are different load patterns and different requirements around the different stages of the life cycle, from pretraining to post-training to fine tuning to just running inference. And different consumers at different stages of that life cycle have a different relationship to the underlying hardware. As an end user who is just using an API for AI inference, I don't necessarily care what the actual hardware is under the covers; I just care that I get reasonable response times. Whereas if I'm somebody who is investing a lot into pretraining a substantial model, I'm going to be much more focused on what the hardware is, the software stack on top of it, etcetera. And then in the middle, if I'm a company running my own vLLM or SGLang instance to customize my inference, I'm going to be somewhere in between, where I've got a cloud provider and I just say, give me some GPUs. I'm wondering how you think about the ways that you're interacting with people along those different stages of that life cycle.
[00:19:46] Anush Elangovan:
That's a very good question, and you've segmented the market well. The thing for AMD is that we plan to reach every one of those three kinds of users. There will be different routes to market, but we'll get to the end user. And if those end users consume from an API, it's just gonna be, yeah, DeepSeek R1, the fastest and the most throughput, is on MI300 today. So for the end user, they're just gonna see their tokens per dollar, the cost per token, stay low, and that's the value provided. And, similarly, we're building out how you do distributed inference so you can get even better economies of scale by disaggregating prefill, decode, etcetera.
But then on the other side, we also help with, hey, you want GPU as a service, and you need an entire stack on top. We provide the reference architecture of that stack so that it's not like, hey, we'll give you a GPU and we don't know what to do on top of it, so you gotta take all the pieces, put them together, tie them together, and move it forward. And then the third one is obviously on-prem, or they have their own clusters, and they know exactly what layer of the software they need and they'll replace something. They'll have their own Kubernetes deployment layer. They'll have their own training framework that they've modified, or JAX. They're the super sophisticated frontier labs. Right? And I think if you look at the top 10 customers, those would fall in that bucket. And a vast majority of them do have AMD deployments.
So it's a growing phase, and I think it's just like what we did with EPYC. Right? It's just a matter of time before it's an option. And then the others would see a TCO advantage by just saying, hey, if I serve on MI300, I get more performance and it costs me less, so I'm just gonna serve the request on an MI300 backend. So that's how I look at it.
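For the middle tier of users running their own serving stack, here is a minimal sketch of the vLLM offline API on a ROCm-backed GPU. The model name and parallelism settings are illustrative assumptions, not a recommended configuration.

```python
# Minimal sketch of serving an open model with vLLM on AMD Instinct GPUs via ROCm.
# The model choice and parallelism settings are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # hypothetical model choice
    tensor_parallel_size=1,                             # raise to shard across more GPUs
    dtype="float16",
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what ROCm is in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same engine can be exposed as an OpenAI-compatible HTTP server for the API-only consumers at the top of the stack, which is how the "different routes to market" converge on the same hardware underneath.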
[00:21:53] Tobias Macey:
Another element of the software investment for powering these very numeric and parallelizable workloads epitomized by large language models is that there has been a lot of effort going into reducing the computational load for training and inference, either through different attention mechanisms or the work that's going on with mixture of experts, and also some of the newer architectural patterns where people are trying to move beyond transformers. I'm thinking in particular of Mamba, or the diffusion-based models, or the work that the liquid networks folks are doing. I'm wondering how that feeds back into the ways that you're thinking about the software stack that has to improve the efficiency or the capabilities of these different model architectures and the different compute patterns that get translated to the underlying hardware.
[00:22:52] Anush Elangovan:
Yeah. That's a very good question. The way I look at it is that it's like layered cheese. Right? You just have to invest in each layer to be robust and solid so that when there's a switch from transformers to Mamba or SSMs or other architectures, it's not a question of if, it's a question of when. When that happens, we already have the underlying infrastructure, the core compute disaggregation, and the way we network all of that compute. Those libraries and that infrastructure are robust enough for a switch at the top level, at the modeling level. But then on top, again, there will be another level of bringing it back to an API. In the end, the last mile for the end user is gonna be an API, and that's gonna be something like an OpenAI-style API for LLMs. That can also change and evolve over time. But there's a fan-out between the very low levels of the software stack and then, as you get to the middle layer, the paths of innovation that we explore.
And I can tell you, at AMD, from my view, all of those paths that you talked about, we are very well invested in, and we bring them up there. So the way I approach it is: how do you stay prepared to react, and how fast can you react? That's gonna be the currency for being relevant in this fast-moving AI world.
[00:24:27] Tobias Macey:
And then the other potential target for an end user is people who want to do their own local inference on a laptop using these open models, and they want to do it reasonably and affordably, without having to run their own power plant to power the cards. I'm wondering how you're thinking about that consumer-grade aspect of AI and the ways that people are able to run models on their local laptops, on their desktop machines, or on edge compute, bringing this into more of a personalized mode without having to be an AI researcher and understand everything needed to quantize the models and fine tune them to run effectively on the silicon in whatever device they have ready to hand.
[00:25:18] Anush Elangovan:
Yeah. So AMD has a very pervasive AI story. Right? We go all the way from embedded and automotive, like autopilot devices, up through laptops, like the Strix Halo laptop I was talking about, which has around 50 TOPS of compute on an NPU. Think of the NPU this way: when you can trade programmability for efficiency, you go for the NPU. When you want the programmability, you use the iGPU. But then you also have the CPU, which is your fallback. And all of those pieces of silicon are AMD built, in a laptop form factor. In fact, I have my Strix Halo laptop here, and it's got all of that tied together, and the software stack that runs everything on it is open source. It's not like, hey, install this specific CUDA version. If you wanna build it, you can take it, build it, and tinker with it. So it gives you that ability, and we see incredible innovation from the likes of llama.cpp and Ollama and the client side of inferencing, where they go in and make these big innovations, especially in quantization.
Going back to your original point, llama.cpp was the first to do things like Q8, Q6, Q5, where you could change only a few tensors and your perplexity still looks okay. So it almost gives you per-tensor fidelity: okay, this model can now run on this hardware, and the big lever is how much VRAM you have. So they map VRAM to the quantization and find a sweet spot for a model that can run locally. And I think increasingly we're gonna see algorithmic innovations too, where you have an actor-critic setup or student-teacher models. If I remember right, Llama 4 Behemoth was used in a similar way, where the very large model is the teacher, and your small model distills and learns from the teacher, but it's a smaller model. So there are innovations happening at each layer of the cheese: at the hardware level, the quantization level, the modeling level. Put all of them together, and increasingly you'll see a lot of local AI.
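The "map VRAM to a quantization level" exercise described here is simple arithmetic; the sketch below works through it. The bits-per-weight figures are approximate community values for common GGUF quantization types, and the fixed overhead allowance is an assumption for illustration.

```python
# Rough arithmetic for picking a quantization level that fits a given amount of VRAM,
# in the spirit of the llama.cpp quants discussed above. Values are approximate.
bits_per_weight = {"F16": 16, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

def estimate_vram_gb(params_billions: float, quant: str, overhead_gb: float = 1.5) -> float:
    """Weights plus a fixed allowance for KV cache and runtime buffers (assumed)."""
    weight_gb = params_billions * 1e9 * bits_per_weight[quant] / 8 / 1e9
    return weight_gb + overhead_gb

for quant in bits_per_weight:
    print(f"8B model at {quant}: ~{estimate_vram_gb(8, quant):.1f} GB")
```

On a device with unified CPU/GPU memory, like the laptops described above, the same arithmetic applies to the shared pool rather than to dedicated VRAM.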
[00:27:32] Tobias Macey:
And in the work that you're doing to build this software ecosystem on top of the core silicon capabilities, and to make it open, accessible, and flexible, what are some of the most interesting or innovative or unexpected ways that you're seeing that combination of AMD hardware and the ROCm ecosystem applied?
[00:27:54] Anush Elangovan:
Innovative ways that we're seeing, that we hadn't envisioned people using it in. Among the cool things I've seen: an AMD GPU, a normal PCIe card, stuck onto a RISC-V CPU. They built a little system with an AMD GPU that wasn't even designed to run on a RISC-V host, but because it's open, they could actually port the driver and get it to work. So they now have a solid graphics card running on a completely new CPU architecture, because of the open ecosystem that we have at AMD. There's also the Windows portability I mentioned: the community did a lot of the work to get Windows portability for ROCm, and now we have ROCm that just works on Windows. We have the likes of tinygrad, who are putting a lot of AMD cards together to build a cost-efficient petaflop machine.
That's also very exciting. And there's a lot of innovation in the application space with local AI, where you can run your coding assistant on your laptop and it's answering questions in real time. That's really cool to see. Things like that, and there's more coming every day. You just have to keep up with it.
[00:29:21] Tobias Macey:
And in terms of your own experience of working in this ecosystem, working very closely with the AMD hardware units and the end users and the software contributors, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:29:40] Anush Elangovan:
I think ROCm and AMD software have a long history, and they've evolved, so there's a perception hurdle that we had to get over. I literally am on Twitter listening for anyone complaining about AMD software or ROCm software, this is not working or that's not working, and I try to navigate that and get them the support they need. So it requires a little bit of re-educating, asking folks that may have formed an opinion on AMD software to see where we are now. And that's always a little bit of a thing, right, because they had an experience that was subpar at some point, and now you're asking for a second chance. Navigating that is interesting. And then there's just communicating what we are doing. Do people know about ROCm? Unlikely. Do people know what's the best GPU to serve DeepSeek on? Unlikely.
But that should be straightforward. It should be: the fastest way to serve DeepSeek is on AMD. That's a fact, and it's just about how we market that and get it out there. So, hopefully, we'll work on that and help people realize the value of what they can get with AMD hardware and software.
[00:30:52] Tobias Macey:
And talking about the speed of inference made me think about another element of the serving piece, which is that oftentimes you want to run multiple copies of the model if you're serving a number of different users simultaneously. That brings in questions around splitting the model across multiple GPU instances, or coordinating multiple GPU instances pooled together into a single batch of compute. And, obviously, AMD is being used in various supercomputing ecosystems. I'm wondering how you're thinking about the ease of pooling together multiple hardware units to manage that inference serving, or training.
[00:31:39] Anush Elangovan:
Very good question. But, you know, I would actually take it a level lower. AMD is the industry leader in chiplet design. Our competition is gonna do that in a year or two, where they actually get to chiplets. The MI300 today, the MI250s, and for a few generations now, we have built multiple chips in one big package. So we are already doing that at the chip level. Today's MI300 has something called CPX mode, which gives you eight chiplets with 24 gigs of HBM each that you can partition as independent GPUs, each with locality to the CPU. Physically, you turn off the connections and you have a GPU. And what that provides is the ability to build from fundamental building blocks, even at the SoC level. Then that goes up to the GPU level, and you have eight GPUs at the UBB level. Then you get to a multi-node rack level, and then from a rack to a cluster level. And if you build each of those from first principles of how you do communication and compute, you start building a very, very robust software architecture. There are challenges in how AMD did it in the past that we have evolved past over time. But today, going back to chiplet designs, you can actually partition each one of these GPUs, and we have instances of vLLM or SGLang running on each one. So you have eight instances of vLLM serving Llama 8B, for example, fully isolated. You get 8x the throughput just by parallelizing at the chiplet level. So it's very exciting for computer scientists and for folks that want to deal with operating system principles like resource allocation and how you're efficiently using compute and communication. AMD is built for that. And if you are in that space, we're obviously hiring. Drop me a note and blah blah blah. It's a lot of fun engineering to do there.
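As a sketch of the "eight isolated serving instances" idea, the snippet below launches one vLLM server per GPU partition by pinning each process to a device via HIP_VISIBLE_DEVICES. The partition indices, ports, and model name are assumptions for illustration; the exact partitioning workflow depends on the platform's own tooling.

```python
# Sketch of running one vLLM server per GPU partition, as described above.
# Partition indices, ports, and model are illustrative assumptions.
import os
import subprocess

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice
procs = []
for partition in range(8):
    # ROCm exposes each partition as its own device; select it per process.
    env = dict(os.environ, HIP_VISIBLE_DEVICES=str(partition))
    procs.append(subprocess.Popen(
        ["vllm", "serve", MODEL, "--port", str(8000 + partition)],
        env=env,
    ))

# Each server is fully isolated on its own partition; a simple round-robin load
# balancer in front of ports 8000-8007 aggregates the ~8x throughput.
for p in procs:
    p.wait()
```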
[00:33:47] Tobias Macey:
And given all of the conversation we've had around the benefits of AMD hardware and the ROCm ecosystem, what are the cases where one or the other, or both, are the wrong choice?
[00:33:59] Anush Elangovan:
If there is a wrong choice, let me know. I'll make it the right choice. But I think there are cases where, you know, I wouldn't say... See, the ROCm software should become invisible. It should just become magic. It should just be: there are hardware resources, and there are APIs to do something, and then you can build something on top. And what that something is, is like vLLM serving. And then on top of that, you build something else. So ROCm as an interface should just disappear. People should not have to think about it as a moat, because it's open source, and it should not be thought of as buggy, because it's just there. It's like ambient enablement of the compute and networking.
And then you get to the GPUs themselves. GPUs have their pluses and minuses. This goes back to classic computer science trade-offs: do you need programmability, or do you need specificity? Do you go general-purpose GPU, general-purpose CPU, or do you go for ASICs? You always have that trade-off. And the pace of innovation keeps GPUs in the sweet spot where that innovation will happen for the foreseeable future. And then, yes, if you want to serve some specific customer with something and you wanna build an ASIC, it takes you two years. If you don't mind that two years from now you still wanna be serving OPT 175B, fine, build an ASIC that serves OPT 175B. But it may be irrelevant. So then you try to make it a little more programmable, and then it's a little programmable, but it's not fully programmable.
Then it's a trade-off. You're not an ASIC, you're not a GPU, you're halfway in between. And the halfway point is always the one-off benchmark. You look at some of these ASIC vendors and they say, oh, we serve Llama 20x faster. But if you read the fine print, it's like, yeah, Llama has to be only 2K context length, while GPUs are serving million-token context lengths. So you're saying all your requests have to fit into a little envelope, and then we'll serve those envelopes very fast. Sure. If that's what's needed for your customer, great. But in general, it's classic general-purpose programmability, and there's a market for that, a huge market, and we'll be good for that. But that doesn't mean we don't do semi-custom or custom. All of the big consoles use AMD. So we are there for the entire journey, whether that's on the hardware side or the software side for AI.
[00:36:45] Tobias Macey:
As you continue to invest in the future direction and capabilities of the hardware layer and the software support that goes along with it, and as the AI ecosystem continues to stretch and strain those software components, what are the things you have planned for the near to medium term for your own work?
[00:37:06] Anush Elangovan:
Near to medium term, ROCm quality is really high on my mind. It should just be that nobody complains about ROCm quality. That's number one. It should be easily hackable and usable. Hobbyists and developers should end up there. Why? Because it's easy to use, and I can just hack it; if something doesn't work, I'll go ahead and fix it. That should be the mentality. Then there's developer outreach, where we want to get everyone to taste and explore and feel, oh, this is what it feels like to use ROCm, and then go use it. So those are the three that I'd focus on. But then, obviously, on the other side, I want to make sure customers are successful. That's where the rubber meets the road. If the customer is not successful, that's not good either.
[00:37:52] Tobias Macey:
Are there any other aspects of the ROCm ecosystem, the work that you're doing at AMD, or the role of AMD in the broader AI ecosystem that we didn't discuss yet that you'd like to cover before we close out the show?
[00:38:06] Anush Elangovan:
I think this is a good time to be in this space, and we are at an inflection point of hardware innovation, software innovation, how we touch people, and what it means for humanity at the top level. Because increasingly, you're gonna be AI assisted in some form. Composing an email to do something somewhere becomes you talking to your device or your chatbot to say, hey, write this up for me, and it's gonna give you most of the framework, and then you go and add the bells and whistles. So it's gonna creep up on you in a way that is gonna be profound. And five, ten years from now, people will assume that's how things are done.
So otherwise, from an AMD standpoint, hardware and software, it's just one step in front of the other, and we're just gonna keep executing and be there for this transition in the industry.
[00:39:05] Tobias Macey:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[00:39:24] Anush Elangovan:
I think it's a combination of all of them. And I think, when we look back after this transition, which can be over a longer arc of time, like twenty years, it's going to have required a combination of all three. People will have to retrain to use a new way of communicating. It's like email, but it's assisted, and it already knows what the other person asked, and you've already got a draft composed saying this, this, this, or this, and your job is to swipe left or swipe right on this draft or that draft. It may distill down to that. But then when it comes back to the fundamentals of compute, and the power to drive that compute and the AI services, it's gonna be another layer of learnings and development of new tools. The way I look at AI is that it's like electricity.
And right now we're like, oh, here's electricity, go do what you have to do. And there's a whole wide world of what we have since explored and done with electricity. We're at the same point with AI: just the start of the AI era, where we're like, oh, we've got AI, here are frontier models that can do this. And then, is it Cursor or Windsurf for coding? Is it the next version of Tanvar to do AI-based something? How is salesforce.com gonna work in an AI agentic world? That's up for someone to go innovate and figure out. So it's very exciting to be in this space.
[00:41:08] Tobias Macey:
Absolutely. Well, thank you very much for all of the time and effort you're putting into broadening the set of hardware that people can operate on and their ability to tinker with it and understand more about the end-to-end stack. It's definitely a very interesting problem space, and it's great to see that AMD is being so open with their work there. So I appreciate the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.
[00:41:35] Anush Elangovan:
Thank you for having me, Tobias. Thank you.
[00:41:42] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to AI Engineering Podcast
Interview with Anush Elangovan from AMD
AMD's Software Strategy in AI
Competitive Landscape: AMD vs NVIDIA
Transition and Adoption in AI Hardware
Quantization and Model Efficiency
Open Source and Platform Risk
AI Hardware Ecosystem and Consumer Interaction
Local AI and Consumer Grade Applications
Innovative Applications of AMD Hardware
Pooling Hardware for AI Inference
Future Directions for AMD and ROCm
ROCm Quality and Developer Outreach
AMD's Role in the AI Ecosystem