Summary
In this episode of the AI Engineering podcast Anush Elangovan, VP of AI software at AMD, discusses the strategic integration of software and hardware at AMD. He emphasizes the open-source nature of their software, fostering innovation and collaboration in the AI ecosystem, and highlights AMD's performance and capability advantages over competitors like NVIDIA. Anush addresses challenges and opportunities in AI development, including quantization, model efficiency, and future deployment across various platforms, while also stressing the importance of open standards and flexible solutions that support efficient CPU-GPU communication and diverse AI workloads.
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- Your host is Tobias Macey and today I'm interviewing Anush Elangovan about AMD's work to expand the playing field for AI training and inference
Interview
- Introduction
- How did you get involved in machine learning?
- Can you describe what your work at AMD is focused on?
- A lot of the current attention on hardware for AI training and inference is focused on the raw GPU hardware. What is the role of the software stack in enabling and differentiating that underlying compute?
- CUDA has gained a significant amount of attention and adoption in the numeric computation space (AI, ML, scientific computing, etc.). What are the elements of platform risk associated with relying on CUDA as a developer or organization?
- The ROCm stack is the key element in AMD's AI and HPC strategy. What are the elements that comprise that ecosystem?
- What are the incentives for anyone outside of AMD to contribute to the ROCm project?
- How would you characterize the current competitive landscape for AMD across the AI/ML lifecycle stages? (pre-training, post-training, inference, fine-tuning)
- For teams who are focused on inference compute for model serving, what do they need to know/care about in regards to AMD hardware and the ROCm stack?
- What are the most interesting, innovative, or unexpected ways that you have seen AMD/ROCm used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AMD's AI software ecosystem?
- When is AMD/ROCm the wrong choice?
- What do you have planned for the future of ROCm?
Parting Question
- From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
- ImageNet
- AMD
- ROCm
- CUDA
- HuggingFace
- Llama 3
- Llama 4
- Qwen
- DeepSeek R1
- MI300X
- Nokia Symbian
- UALink Standard
- Quantization
- HIPIFY
- ROCm Triton
- AMD Strix Halo
- AMD Epyc
- Liquid Networks
- MAMBA Architecture
- Transformer Architecture
- NPU == Neural Processing Unit
- llama.cpp
- Ollama
- Perplexity Score
- NUMA == Non-Uniform Memory Access
- vLLM
- SGLang
[00:00:05]
Tobias Macey:
Hello, and welcome to the AI Engineering podcast, your guide to the fast moving world of building scalable and maintainable AI systems. Your host is Tobias Macey, and today I'm interviewing Anush Elangovan about AMD's work to expand the playing field for AI training and inference. So, Anush, can you start by introducing yourself?
[00:00:30] Anush Elangovan:
Hey, Tobias. This is Anush. I lead the software team at AMD for GPU software. Excited to be here.
[00:00:37] Tobias Macey:
And do you remember how you first got started working in ML and AI?
[00:00:42] Anush Elangovan:
Yeah. It's about ten years ago. Actually, more than ten years ago. We were building smart devices for IoT, and we were building technology that, you know — this was the early days of ImageNet and things like that. Right? So we were trying to do gesture recognition with the earliest AI concepts. And then one thing led to another, and we started building ML compilers about seven years ago. And then about two years ago, AMD saw the work we were doing in ML compilers and how we were able to target any ML workload on any AI accelerator. And they acquired Nod.ai, which I had founded with a strong team.
And we became a core part of the, AMD software strategy.
[00:01:35] Tobias Macey:
And so in terms of your work at AMD, obviously, you're very focused on the software layer. But for a very hardware focused company, I'm wondering what are the elements of that strategy and how they factor into the competitive advantage that you're able to gain by virtue of focusing on the software elements of a very hardware focused problem?
[00:01:58] Anush Elangovan:
Yeah. I think AMD has traditionally, you know, done a very good job with hardware. They have a pervasive AI strategy that goes from embedded, laptops, and gaming PCs all the way to the data center. And they have a very good cadence of execution on this hardware. Right? Like, they're just very relentless in building good hardware. And about two, three years ago, when the AI boom was taking off, you know, there was a focus to bring all of that hardware together with a software strategy that allowed for, like, a software platform to be built on top of all of this great hardware innovation.
And so that's part of what, you know, our team does, which is build and ship software like a software company, but be part of the AMD innovation engine that builds all of this awesome hardware. So when you pair that awesome hardware with great software, it really unlocks, you know, great business value for customers. And more importantly, it moves the industry forward with respect to the ecosystem it enables, because AMD provides a very open environment for collaboration. And all our source code and software is open source, which allows innovation to go at the pace the practitioners using it want. Right? It is not limited by what we put out in closed source form.
[00:03:32] Tobias Macey:
And on the hardware side of the AI ecosystem, obviously, the competitive landscape has been largely dominated, at least in terms of marketing, by NVIDIA. The CUDA ecosystem has become an entrenched player and gained a lot of adoption because of the fact that it was one of the early movers in the space for numeric computation, which was largely focused on scientific computing and then got parlayed into its current position in the AI landscape. And I'm wondering what you see as the current state of affairs as far as where the attention is going versus the realities as far as the actual capabilities that are provided both at the hardware and in the software layers that are built on top of that. So thinking in terms of the competitive advantages between what CUDA has and its relation to NVIDIA hardware, and the work that you're doing at AMD with the ROCm stack?
[00:04:34] Anush Elangovan:
Very good question. And the first way I'd answer that is, capability wise and performance wise, we are, you know, on par or better. And the data to back that up is we run, like, a million or two models from Hugging Face every night to make sure that all the models run out of the box, you know, without any problems on AMD hardware. So that's one piece of it. And then every model that has been released in the last year — like Llama 3, Llama 4, DeepSeek, Qwen 3 — everything worked out of the box with whatever ROCm version we were shipping at the time. And more importantly, it was performant. So right now, if you go look for, like, what is the best way to serve DeepSeek, which is the most popular open source model, it is on MI300. Right? That is, you know, just performance. Not even counting TCO advantages, nothing else. Right? Just performance wise, it's better than what NVIDIA provides. And so maturity wise, I think we are at a point where it's good. But there is also the perception part and, kind of, like you said, the entrenched ecosystem part. Right? So if you look back in history, you can always kind of think of similar transitions in technology. Right? You know, there was, like, Nokia and Symbian. Right? Like, that was, like, 98%. There was no way that anyone was gonna get anything else. Right? And it was a closed source operating system. There was obviously an inflection point, like, you know, when smartphones came out, and then we had the iPhone and we had Android, and, you know, just kind of, like, overnight, it took over. I wouldn't necessarily say it's exactly like that, but it's a process where once people realize that, hey, this just works, it clicks, and you're like, oh, why would I be doing something in a closed ecosystem? Why would I be paying more for less performance? Right?
Those are all, like, factors that just click, and then it's a snowball effect of, like, okay, now we just deploy with AMD. Right? And AMD has been through this journey with CPUs. In 2017, 2018, you know, we had just, like, 2% of the market share in data center CPUs. Right now, you know, depending on what's out there, almost half the market share is with us. But that was a journey too. Right? At that time, with 98% market share, the incumbent was, you know, like — there's no way that anything could change. We think, or I personally think, the transition and the adoption in GPUs and AI software will be a lot faster than what you'd see in CPUs, because AI is just about speed. It's about how fast you can move, how fast you can deploy, how fast you get the value for what you're using, and how fast the next technology shows up and whatever you deployed is no longer useful. Right? So you need the ability to move fast, and that velocity gives you the ability to iterate faster.
And if we focus on our execution, it's ours to take.
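As a concrete illustration of the "runs out of the box" claim, here is a minimal sketch of loading a Hugging Face model on an AMD GPU with a ROCm build of PyTorch. The model name is just a small, ungated example; the key point is that ROCm builds of PyTorch expose the GPU through the same `cuda` device alias, so no AMD-specific code changes are assumed or needed.

```python
# Minimal sketch: running a Hugging Face model on an AMD GPU with a ROCm build
# of PyTorch. ROCm builds reuse PyTorch's "cuda" device name, so the same code
# runs unmodified on NVIDIA or AMD hardware.
import torch
from transformers import pipeline

# On a ROCm install, torch.cuda.is_available() reports the AMD GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# "gpt2" is just a small, ungated example checkpoint; the Llama or Qwen
# models mentioned in the episode load the same way.
generator = pipeline("text-generation", model="gpt2", device=device)

print(generator("ROCm lets me run this model", max_new_tokens=20)[0]["generated_text"])
```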
[00:07:51] Tobias Macey:
And then another piece of that puzzle is the communication between the CPU and the GPU, and some of the amount of offload that you're able to get onto the GPU, as well as the level of efficiency that you're able to get for CPU compute for some of the less parallelizable elements of model inference and model training. And I'm wondering what are some of the ways that the work that you're doing at ROCm works to optimize some of that communication and offload from CPU to GPU, given that you're able to work both on the CPU and the GPU since it's all from the same manufacturer, versus the situation with Intel and NVIDIA chips where there are two different manufacturers, and so it requires more organizational communication?
[00:08:39] Anush Elangovan:
That's a very good question. The good thing, like, philosophically, is even though the silicon is from AMD — right, the CPU can be an AMD CPU, the GPU can be an AMD GPU, networking could also be from AMD; we make what we call AI NICs — the standards in which we connect them are open. Right? Like, AMD is part of the UALink standard and the UEC, the Ultra Ethernet Consortium. So we want to make sure that we have the ability to build a reference that has all of these, like, cool integrations. But also, if someone were to build something better, they should be able to drop into that ecosystem and unlock the ecosystem, unlike in the competitive space where it's like, hey, this is InfiniBand and only we do this, or, you know, you have to play by these rules. Right? So that's one part, which is the ecosystem aspect of it. The second part of what you asked is important because having the ability to build the CPU, which we have built multiple generations of, and the GPU, which is now, you know, intensely popular, gives us insight into both CPU workloads and GPU workloads, and at what point we transition from CPU to GPU. There's increasingly more focus on test-time compute — things like, you know, where you want to do some sort of post-inference work, or, like, the thinking models that are out there now. Right?
That allows for a new kind of scaling. And we are very well positioned for that because we have the ability to run some of these complex algorithms on the CPU, but we also have the ability to do the heavy lifting in the highly parallelizable infrastructure that the GPU provides.
[00:10:16] Tobias Macey:
Another element of the hardware and software collaboration is in terms of the quantization that you might do at inference time on a model where maybe, natively, it wants to be 16-bit floating point, but you wanna quantize it down for better speed or the ability to run on less capable hardware. I'm wondering how that hardware element factors into the ways that you think about the software layer for either enabling the work of doing that quantization, or being able to improve the efficiency of the compute maybe without having to quantize?
[00:10:54] Anush Elangovan:
Yeah. The way I would look at quantization is it almost gets to an art form, because you have so many hyperparameters to tweak. Right? But you want to be able to preserve the quality of service — like, you know, is your model outputting the same thing? And some of it gets very subjective. Right? Like, I've personally seen deployments where the customer is like, as long as the end user doesn't notice what's going on, you're good. You can quantize it. Right? So then we have some additional aggressive mechanisms to shrink from, like, you know, FP16 to FP8 or even lower. Right? Like, our next generation hardware will have FP4 support, which is four-bit support. The computation itself is in four bits. Right? So your numerical range is just four bits. And just a few years ago, we were doing everything in FP32. From 32 bits to four bits — the industry has shifted drastically.
So we still have some of those SLAs to deliver to and make sure that the model doesn't lose its accuracy. But that's an interesting area of both research and production. Right? Just as an example, there was a time, pre-LLMs, when int4 was considered — like, for a time it was considered that it's gonna come. People put it in hardware. I think the A100 had it or something, and then the H100 removed it because there was, like, no use for int4. And then suddenly int4 just became a thing. Right? LLMs showed up, and now int4 or FP4 is, like, it's real. So there was a cycle of, like, oh, quantization below a particular size is not gonna work out. And then similarly with DeepSeek, where DeepSeek showed the first actual training with FP8, with all the tricks they did to get FP8 training to work. So it's a very exciting space and one that we at AMD are, you know, heavily invested in.
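To make the trade-off concrete, here is a small, self-contained sketch of symmetric per-tensor int4 quantization in PyTorch. This is not AMD's production quantization flow — just an illustration of how a high-precision weight tensor is mapped onto a 4-bit grid and how much reconstruction error that introduces.

```python
# Illustrative sketch (not AMD's production quantizer): symmetric per-tensor
# int4 quantization of a weight matrix, plus a measure of how much error the
# round trip introduces -- the error budget an accuracy SLA has to absorb.
import torch

def quantize_int4(w: torch.Tensor):
    # int4 symmetric range is [-8, 7]: 16 levels, one scale for the whole tensor.
    scale = w.abs().max() / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)  # int8 container
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)              # stand-in for one layer's weights
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)

rel_err = (w - w_hat).abs().mean() / w.abs().mean()
print(f"mean relative error after the int4 round trip: {rel_err:.2%}")
```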
[00:12:54] Tobias Macey:
And then for the people who are in the process of designing their AI stack, they're figuring out what to invest in and what software components they want to rely on. How does the fact that ROCm is open source, versus the proprietary nature of CUDA, factor into the ways that they're thinking about platform risk and the ability to own more of the end to end capabilities of the system?
[00:13:23] Anush Elangovan:
I think the important thing is that AI capabilities are unlocked in a way where different parts of the system factor into what gets unlocked. Right? And let me give you an example. So, for example, AMD hardware — or Instinct hardware — has more than 50% more memory capacity and memory bandwidth. Right? And so what this allows us to do is that we can actually serve LLM decode tokens very fast, because it's all bandwidth limited. But then there are cases where you care about time to first token, which is, like, how long does it take to react first? Right? So it's a balance of, like, how do you get enough of this so that you don't degrade your SLA, but then you wanna run fast and generate, like, an entire story fast. And that requires balancing, not just computational optimization.
It could be, like, power. It could be various other end to end systems that we have to, you know, orchestrate. And the software to manage that gets very interesting and complicated, and we have to build that to be able to unlock this level of, like, customization.
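The "decode is bandwidth limited" point can be seen with back-of-the-envelope arithmetic: each generated token has to stream roughly the full set of model weights through the memory system, so a single stream's peak tokens per second is roughly bandwidth divided by bytes read per token. The numbers below are illustrative assumptions, not measured or official figures.

```python
# Back-of-the-envelope sketch of why LLM decode is memory-bandwidth bound.
# All numbers are illustrative assumptions, not measured figures.

hbm_bandwidth_tb_s = 5.3      # assumed peak HBM bandwidth of the accelerator, TB/s
params_billions = 70          # assumed model size
bytes_per_param = 1           # e.g. FP8/INT8 weights

# Each decoded token reads (roughly) every weight once.
bytes_per_token = params_billions * 1e9 * bytes_per_param
upper_bound_tok_s = hbm_bandwidth_tb_s * 1e12 / bytes_per_token

print(f"rough single-stream decode ceiling: {upper_bound_tok_s:.0f} tokens/sec")
# Real deployments batch many requests so the same weight read is amortized
# across users, which is why throughput versus time to first token is a
# balancing act rather than a single number.
```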
[00:14:38] Tobias Macey:
And then digging a bit more into the software stack that you're building, ROCm is the key element of that, but there are also a few other layers available, such as the HIPIFY tooling for translating CUDA into more portable C++. And I'm wondering if you can just give an overview of all of the components that exist in the ROCm stack specifically and some of the surrounding ecosystem that builds on top of it.
[00:15:07] Anush Elangovan:
So we approach this with, like, a multilayered approach. One is this: there are customers that have invested in, like, NVIDIA down to PTX-level code. Right? They've invested very heavily. So for them, we have the HIPIFY tool, which takes their CUDA code and converts it to portable HIP code. And portable HIP code can, in turn, be compiled back onto NVIDIA systems or onto AMD systems. Right? It literally is making it portable. But then there's a layer on top of it, which is like Triton, which provides you the ability to, like, start from Python and then target AMD or NVIDIA or other GPU platforms.
And Triton provides you this, you know, easy to use, abstracted GPU programming interface that allows for, like, very easy ramp up to get to almost good performance, and then lets the compiler do the rest. So we invest in both. The first layer is, hey, you've got some CUDA code and PTX code — we'll help you translate it. That's HIPIFY. But then if you're coming in on new greenfield deployments, you come in and you have Triton as the way that you implement it, because then it gives you the portability. And it's a modern kernel programming language that you could use to, you know, target either AMD or NVIDIA.
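For a sense of what the Triton path looks like, here is a minimal vector-add kernel; the same Python source compiles for AMD (ROCm) or NVIDIA (CUDA) GPUs through Triton's backends. The block size and grid sizing are just illustrative choices. For the HIPIFY path, the typical entry point is something like `hipify-perl my_kernel.cu > my_kernel.hip.cpp`, which rewrites CUDA API calls into their HIP equivalents.

```python
# Minimal Triton kernel sketch: the same Python source targets AMD (ROCm) or
# NVIDIA (CUDA) GPUs through Triton's compiler backends.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements          # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)       # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

if __name__ == "__main__":
    # "cuda" also addresses AMD GPUs on ROCm builds of PyTorch.
    a = torch.randn(1 << 20, device="cuda")
    b = torch.randn(1 << 20, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```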
[00:16:29] Tobias Macey:
And then one of the main reasons for releasing software as open source is to allow other participants in the ecosystem to contribute back to it, take it, fork it, customize it. And AMD is definitely the main driving force behind ROCm, given that it is targeting AMD hardware primarily. And I'm curious, what are the incentives that exist in the ecosystem for other people to want to contribute back to that and customize it?
[00:17:02] Anush Elangovan:
Yeah. Very good question. So what we've done in the past few months is actually make ROCm more accessible. It's not just open source in the sense of, hey, there's something out there, we'll do whatever we want, update it, and throw the source code over the wall. We truly are moving to an open development model. And what this provides is the ability for anyone to take all of the source code, modify it, and do what they want to. For example, AMD's Strix Halo laptops are very popular laptops and desktops because, you know, you get 128 gigs of RAM that can be shared between your CPU and GPU. And the ROCm builds were getting more and more robust, with a solid CI and the ability to build. And the community was able to add Windows support for ROCm, which we were slowly working towards, but the community was able to pick it up and accelerate that pace. And so now, as of yesterday, we have someone who just tweeted about how they are running PyTorch on Windows, without WSL, on ROCm. Right? Like, that's the power of the open source ecosystem.
And we don't want to hold up innovation and speed. And that is why we think, you know, we'll run the race faster as we speed up, because it's just gonna be more people scratching their own itch, doing the thing that they want to do. But then it lifts all boats, and then all boats move faster. Right?
[00:18:33] Tobias Macey:
Another aspect of the ecosystem for AI hardware is that there are different load patterns and different requirements around the different stages of the life cycle, from pretraining to post-training to fine tuning to just running it as inference. And different consumers at different stages of that life cycle have a different relationship to the underlying hardware, where as an end user who is just using an API for AI inference, I don't necessarily care what the actual hardware is under the covers there. I just care that I get some reasonable response times. Whereas if I'm somebody who is investing a lot into pretraining a substantial model, I'm going to be much more focused on what the hardware is, the software stack on top of it, etcetera. And then in the middle, if I'm a company that's running my own vLLM or SGLang instance for being able to customize my inference, I'm going to maybe be somewhere in the middle where I've got a cloud provider, and I just say, just give me some GPUs. And I'm wondering how you think about the ways that you're interacting with people along those different stages of that life cycle.
[00:19:46] Anush Elangovan:
That's a very good question, and you've, like, segmented the market. The thing for AMD is, like, we plan to go through with every one of those three users. Right? So there'll be different routes to market, but we'll get to the end user. And those end users, if they consume from an API — it's just gonna be, yeah, DeepSeek R1, the fastest and the most throughput, is on MI300 today. And so for the end user, they're just gonna see their tokens per dollar, you know, the cost per token, to be low, and it's just value provided with that. Right? And similarly, we're building out how you do distributed inference so you can get even better economies of scale by disaggregating, you know, prefill, decode, etcetera.
But then on the other side, we also help with, hey, you want GPU as a service, and then you need an entire stack on top. We provide the reference architecture of that stack so that it's not like, hey, we'll give you a GPU and we don't know what to do on top of it, so you gotta take all the pieces, put them together, tie it together, and then, you know, move it forward. And then the third one is, obviously, it's on prem or they have their own clusters, and, you know, they know exactly what layer of the software they need, and they'll replace something. They'll have their own Kubernetes deployment layer. They'll have their own training framework that they've modified, or, you know, JAX. They're, like, the super sophisticated frontier labs. Right? And I think if you look at, like, the top 10 customers, those would, you know, fall in that bucket. And a vast majority of them do have AMD deployments.
So it is, you know, a growing phase where, I think — it's just like what we did with EPYC. Right? Like, it's just a matter of time before it's an option. And then the others would see a TCO advantage by just saying, hey, yeah, if I serve on MI300, I get more performance and it costs me less, so I'm just gonna serve the request on an MI300 backend. So that's how, you know, I look at it.
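For the middle tier — teams that just want GPUs plus a serving layer — the workflow is typically an open source engine such as vLLM or SGLang on top of the ROCm stack. Here is a hedged sketch of what offline batch inference with vLLM looks like; the model name and sampling settings are only examples. For an OpenAI-compatible endpoint, the same engine is usually launched as a server instead (e.g. `vllm serve <model>`).

```python
# Minimal sketch of inference with vLLM; on AMD hardware this runs on top of
# the ROCm build of vLLM. Model name and sampling parameters are examples only.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")   # example checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what ROCm is in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)
```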
[00:21:53] Tobias Macey:
Another element of the software investment for powering these particularly numeric and parallelizable workloads that are epitomized by large language models is that there has been a lot of effort going into trying to reduce the computational load for training and for inference, either through different attention mechanisms, or the work that's going on with mixture of experts, and then also some of the newer architectural patterns where people are trying to move beyond transformers. I'm thinking in particular of Mamba, or the diffusion based models, or the work that the liquid networks folks are doing. And I'm wondering how that feeds back into the ways that you're thinking about the software stack that is able to improve the efficiency or the capabilities of these different model architectures and these different compute patterns that get translated to the underlying hardware.
[00:22:52] Anush Elangovan:
Yeah. That's a very good question. The way I look at it is, it's like layered cheese. Right? You just have to invest in each layer to be robust and solid so that when there's a switch from transformers to Mamba or SSMs or, you know, other architectures — it's not a question of if, it's a question of when — when that happens, we already have the underlying infrastructure and the core compute, you know, disaggregation, and how we network all of this compute. Those libraries and that infrastructure are robust enough for a switch at the top level, right, at the modeling level. But then on top, again, there'll be another level of, like, you know, bringing it back to an API. Right? Like, in the end, the last mile for the end user is gonna be an API. And that's gonna be, you know, something like an OpenAI-style API for LLMs. But that can also change and evolve over time. But there's a fan out in between the very low levels of the software stack, and then as you get to the middle layer, there are paths of innovation that, you know, we explore.
And I can tell you, at AMD and, you know, from my view, all of those paths that you talked about, we are very well invested in, and we bring them up there. So the way I approach it is: how do you be prepared to react, and how fast can you react? That's gonna be the currency for being relevant in this fast moving AI world.
[00:24:27] Tobias Macey:
And then the other potential target for an end user is people who wanna be able to do their own local inference on a laptop using these open models, and they wanna be able to do it reasonably and affordably without having to run their own power plant to power the cards. And I'm wondering how you're thinking about that consumer grade aspect of AI and the ways that people are able to run these models on their local laptops, on their desktop machines, or on edge compute — bringing this into more of a personalized mode without having to necessarily be an AI researcher and understand all of the things they need to know to quantize the models and fine tune them to run effectively on the silicon of whatever device they might have ready to hand.
[00:25:18] Anush Elangovan:
Yeah. So AMD has a very pervasive AI story. Right? Like, we go all the way from embedded and automotive — like, autopilot devices — to laptops, like the Strix Halo laptop that I was talking about. It has, like, 50 TOPS of compute on an NPU. So the NPU — think of it as: when you can trade programmability for efficiency, you go for an NPU. When you want the programmability, you use an iGPU. But then you also have a CPU, which is your fallback. Right? And all of those pieces of silicon are AMD built and in the laptop form factor. So, in fact, I have my Strix Halo laptop over here, and it's got all of that tied together, and the software stack that runs everything on it is open source. Right? So it's not like, hey, install CUDA 5.8 or something. Right? If you wanna build it, you can take it, build it, and tinker with it. Right? So it gives you that ability, and we see incredible innovation with, you know, the likes of llama.cpp and Ollama and the client side of inferencing, where they go in and make these big innovations, especially in quantization.
Right? So, going back to your original point, llama.cpp was the first one to make, like, Q8, Q6, Q5 — and you could change only a few tensors to be this, and your perplexity looks okay. So it almost gave you, like, per-tensor level fidelity to be like, okay, this model can now run on this. And then there's a big lever for how much VRAM you have. So they map VRAM to this, and they find a sweet spot of a model that can run locally. And I think, increasingly, we're gonna see even algorithmic innovations where you have, like, actor-critic models or student-teacher models. And I think, if I remember right, Llama 4 Behemoth was also used in a similar way, right, where you use the very large model to be the teacher, and then your small model distills and learns from the teacher. But it's a smaller model. So there are, like, innovations happening at each layer of the cheese. Right? At the hardware level, at the quantization level, at the modeling level. Put all of them together, and I think, increasingly, you'll see a lot of local AI.
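The "map VRAM to a quantization level" idea can be made concrete with a rough heuristic: the weight footprint is roughly parameter count times bits per weight, plus headroom for the KV cache and runtime. The helper below is hypothetical — it is not part of llama.cpp or Ollama — and the bits-per-weight figures are approximations of common llama.cpp quant formats.

```python
# Hypothetical helper (not part of llama.cpp or Ollama): pick the highest-fidelity
# quantization level whose rough weight footprint fits the available VRAM.
QUANT_BITS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}  # approx bits/weight

def pick_quant(params_billion: float, vram_gb: float, overhead_gb: float = 2.0) -> str:
    """Return the best-fitting quant level, trying the largest bits/weight first."""
    for name, bits in sorted(QUANT_BITS.items(), key=lambda kv: -kv[1]):
        weight_gb = params_billion * 1e9 * bits / 8 / 1e9
        if weight_gb + overhead_gb <= vram_gb:
            return name
    return "model too large for this VRAM budget"

# Example: an 8B model vs. a 70B model on a 16 GB GPU.
print(pick_quant(8, 16))    # a high-quality quant fits
print(pick_quant(70, 16))   # does not fit at any of these levels
```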
[00:27:32] Tobias Macey:
And in the work that you're doing of building this software ecosystem on top of the core silicon capabilities, and the work that you're doing to make it open and accessible and flexible, what are some of the most interesting or innovative or unexpected ways that you're seeing that combination of AMD hardware and the ROCm ecosystem applied?
[00:27:54] Anush Elangovan:
Innovative ways that we're seeing it used that we hadn't envisioned? So, the cool things I've seen — like, I've seen an AMD GPU, a normal PCIe card, stuck on, like, a RISC-V CPU, and they built a little thing with an AMD GPU that wasn't even designed to run on a host with RISC-V. But given it's open, they could actually port the driver and get it to work. And they now have a solid graphics card running on a completely new CPU architecture because of the open ecosystem that we have at AMD. There's also the Windows portability I mentioned. Right? Like, you know, the community did a lot of the work to get Windows portability for ROCm, and now we have ROCm that just works on Windows. We do have the likes of tinygrad, where they are tying a lot of AMD cards together to put together a cost efficient petaflop machine.
That's also very exciting. And there's a lot of innovation in the application space with local AI, where you can run your, like, coding assistant on your laptop, and it's answering questions in real time. That's really cool to see. You know, things like that — and there's more coming every day. You just have to keep up with it.
[00:29:21] Tobias Macey:
And in terms of your own experience of working in this ecosystem, working very closely with the AMD hardware units and the end users and the software contributors, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:29:40] Anush Elangovan:
I think, as ROCm and AMD software have a long history and have evolved, there's a perception hurdle that we had to get over. I literally am on tech Twitter listening for anyone complaining about AMD software or ROCm software — this is not working, or that's not working — and I try to navigate and get them the support that they need. So it requires a little bit of reeducating and asking folks that may have formed an opinion on AMD software to see where we are now. And that's always a little bit of a thing, right, because you've had an experience that was subpar at some point, but now you're asking for a second chance. Navigating that is interesting. And then just communicating what we are doing. Like, do people know about ROCm? Unlikely. Right? Do people know what's the best GPU to serve DeepSeek on? Unlikely.
But that should be straightforward. It should be like, the fastest way to serve DeepSeek is on AMD. That's a fact, and it's just about how we market that and get it out. Right? So, hopefully, we'll work on that and help people realize the value of what they can get with AMD hardware and software.
[00:30:52] Tobias Macey:
And talking about the speed of inference just made me think about another element of the serving piece, which is that oftentimes you wanna be able to run multiple copies of the model if you're serving a number of different users simultaneously, which then brings in questions around being able to split the models across multiple GPU instances, or being able to coordinate multiple GPU instances to pool together into a single batch of compute. And, obviously, AMD is being used in various supercomputing ecosystems. I'm wondering how you're thinking about the ease of pooling together multiple hardware units to manage this inference serving or for training.
[00:31:39] Anush Elangovan:
Very good question. But, you know, I would actually take it a level lower. AMD is the industry leader in chiplet design. Right? Our competition is gonna do that in a year or two, right, where they actually get chiplets. The MI300 today, and the MI250 before it — you know, for a few generations we have built multiple chips that go into one big package. And so what that provides is that we are already doing that at a chip level. Today's MI300 has something called CPX mode, which gives you eight chips with 24 gigs of HBM each that you can actually partition as independent GPUs, and you have NUMA locality to the CPU. It is actually, like, physically, you turn off the connections and you have a GPU. Right? And so what that provides is the ability to build from fundamental building blocks, even at the SoC level. And then that gets up to the GPU level, and you have eight GPUs at a UBB level. And then you get to a multi-node rack level, and then from a rack to a cluster level. And if you build each of those from first principles of how you do communication and compute, you start building a very, very robust, you know, software architecture. And there are challenges in how AMD had done it in the past that we have, you know, evolved past over time. But today, going back to chiplet designs, you could actually partition each one of these GPUs, and we have instances of vLLM or SGLang running on each one. So you have eight instances of vLLM serving Llama 8B, for example, and it's fully isolated. Right? So you get eight times the throughput just by parallelizing, you know, at the chiplet level. So it's very exciting for computer scientists and for folks that want to, like, deal with operating system principles — resources and allocation and how you're efficiently using compute and communication. Right? So AMD is built for that, is what I'd say. If you are in that space, we're obviously hiring. Drop me a note and blah blah blah. There's a lot of fun engineering to do with that.
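A hedged sketch of what that partition-per-instance pattern could look like operationally: launch one serving process per GPU partition, pinning each process to its partition through ROCm's device-visibility environment variable. The port numbers, model, and partition count are assumptions for illustration, not a documented AMD deployment recipe.

```python
# Hypothetical launcher sketch: one vLLM server per GPU partition, each pinned
# to its partition via ROCm's device-visibility environment variable. The ports,
# model, and partition count are illustrative assumptions.
import os
import subprocess

MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # example model
NUM_PARTITIONS = 8                           # e.g. an MI300 in a partitioned (CPX-style) mode

procs = []
for i in range(NUM_PARTITIONS):
    env = dict(os.environ, HIP_VISIBLE_DEVICES=str(i))  # expose only partition i
    procs.append(subprocess.Popen(
        ["vllm", "serve", MODEL, "--port", str(8000 + i)],
        env=env,
    ))

for p in procs:
    p.wait()
```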
[00:33:47] Tobias Macey:
And given all of the conversation we've had around the benefits of AMD hardware and the ROCm ecosystem, what are the cases where one or the other or both are the wrong choice?
[00:33:59] Anush Elangovan:
If there is a wrong choice, let me know. I'll make it the right choice. But I think there are cases where, you know — see, the ROCm software should become invisible. It should just become magic. Right? It should just be, hey, there are hardware resources, and there are APIs to do something. Right? And then you can do something on top. And what that something is, is like vLLM serving. And then on top of it, you build something else. Right? So ROCm as an interface should just disappear. Like, people should not have to think about it as a moat, because it's open source, and it should not be thought about as, like, buggy, because it's just there. It's like ambient enablement of the compute and networking.
And then you get to the GPUs themselves. GPUs have their pluses and minuses. Right? So you've always had — and this goes back to classic computer science trade-offs — like, hey, do you need programmability, or do you need specificity? Right? So do you go general purpose GPU, general purpose CPU, or do you go for ASICs? You always have that, you know, trade-off. And the pace of innovation keeps GPUs as the sweet spot where that innovation will happen for the foreseeable future. And then, yes, if you want to serve some specific customer with something and you wanna build an ASIC, it takes you two years. If you don't care that two years from now you still wanna be serving OPT 175B, fine, build an ASIC that serves OPT 175B. But it may be irrelevant. So then you try to make that a little more programmable. Then you're like, oh, but it's a little programmable, but it's not fully programmable.
Then it's a trade-off. Then you're like, okay, you're not an ASIC, you're not here, but you're gonna be halfway there. And then, you know, the halfway point is always, like, the one-off benchmark. You look at some of these ASIC vendors, and they're like, oh, we serve Llama 20x faster. But if you go read the fine print, it's like, yeah, Llama has to be only 2K context length, while GPUs are serving, like, million-token context lengths. So you're saying all your requests will have to fit into, like, a little envelope, and then we'll serve these envelopes very fast. Sure, if that's what is needed for your customer, great. But I think, in general, it's classic general purpose programmability, and I think there's a market for that. And that's a huge market. It'll be good for that. But that doesn't mean we don't do semi-custom or custom. Right? Like, you know, all of the big consoles use AMD. And so we are there for the entire journey, whether that's on the hardware side or the software side for AI.
[00:36:45] Tobias Macey:
As you continue to invest in the future direction and capabilities of the hardware layer and the software support to go along with that, and the ways that the AI ecosystem is continuing to stretch and strain those software components, what are the things you have planned for the near to medium term for your own work?
[00:37:06] Anush Elangovan:
Near to medium term, ROCm quality is really high on my mind. It should just be, like, nobody should complain about ROCm quality. That's number one. It should be easily hackable and usable. It should be, like, the hobbyists and developers need to end up there. Because why? Because it's easy to use, and I can just hack it. If something doesn't work, I'll go ahead and fix it. Right? Like, that should be the mentality. And then we have the developer outreach, where we wanna go and get everyone to taste and explore and feel, oh, this is what it feels like to use ROCm, and then go use that. So those are the three that I'd focus on. But then, obviously, on the other side, I wanna make sure customers are successful. Right? That's where, you know, the rubber meets the road. If the customer is not successful, then that's, you know, not good either.
[00:37:52] Tobias Macey:
Are there any other aspects of the ROCm ecosystem, the work that you're doing at AMD, or the role of AMD in this broader AI ecosystem that we didn't discuss yet that you'd like to cover before we close out the show?
[00:38:06] Anush Elangovan:
I think this is a good time to be in this space, and we are at an inflection point of both hardware innovation and software innovation — how we touch people and what it means for, like, humanity at the top level. Right? Because, increasingly, you're gonna be AI assisted in some form. Right? Like, composing an email to do something somewhere is just you talking to your device or your chatbot to say, hey, write this up for me, and it's gonna give you most of the framework, and then you're gonna go and add bells and whistles. Right? So it's gonna creep up on you in a way that is gonna be profound. And five, ten years from now, people will assume that's how things are done. Right?
So otherwise, you know, from an AMD standpoint, hardware and software, it's just one step in front of the other, and, you know, we're just gonna keep executing and be there for this transition in this industry.
[00:39:05] Tobias Macey:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[00:39:24] Anush Elangovan:
I think it's a combination of all of them. Right? And I think, when we look back after this transition — and this can be, like, twenty years, right, it can be over a longer arc of time — it's going to require a combination of all three of them. Like, people will have to retrain to be able to use a new way of communication. Right? It's like email, but it's assisted, and it already knows what the other person asked, and you've already got a draft that's composed saying this, this, this, or this. Right? And your job is, like, swipe left, swipe right on, like, okay, this draft or that draft. And so it may distill down to that. But then, when it comes back to the fundamentals of, like, compute, and power to power that compute and AI services, it's gonna be another layer of learnings and development of new tools. The way I look at AI is, it's like electricity.
And now we're like, oh, here's electricity. Go do what you have to do. And there's a whole wide world of what we have explored and done with electricity. And we are the same — you know, we are just at the start of the AI era, where we're like, oh, we got AI. Here are frontier models that can do this. And then, like, is it Cursor and, what is it, Windsurf for coding? Is it the next version of Tanvar to do AI-based something? How is salesforce.com gonna be working in an AI agentic world? Right? Like, that's up for someone to go innovate and figure out. So it's very exciting to be in this space.
[00:41:08] Tobias Macey:
Absolutely. Well, thank you very much for all of the time and effort you're putting into broadening the availability for people to expand the set of hardware that they're operating on and the ability to tinker with it and understand more about the end to end stack. It's definitely a very interesting problem space. It's great to see that AMD is being so open with their work there. So, appreciate the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.
[00:41:35] Anush Elangovan:
Thank you for having me, Tobias. Thank you.
[00:41:42] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the AI Engineering podcast, your guide to the fast moving world of building scalable and maintainable AI systems. Your host is Tobias Macy, and today I'm interviewing Anush Ilangavan about AMD's work to expand the playing field for AI training and inference. So, Anush, can you start by introducing yourself?
[00:00:30] Anush Elangovan:
Hey, Tobias. This is Anush. I lead the software team at AMD and for GPU software. Excited to be here.
[00:00:37] Tobias Macey:
And do you remember how you first got started working in ML and AI?
[00:00:42] Anush Elangovan:
Yeah. It's about, ten years ago. Actually, more than ten years ago. We were building smart devices, for IoT, and, and we were building technology that, you know, is this the early days of ImageNet and and things like that. Right? So we were trying to do a gesture recognition with, the earliest, early AI concepts. And then one thing led to another, and we started building ML compilers about, seven years ago. And then about two years ago, AMD saw the, the work we're doing in ML compilers and how we were able to target any ML workloads on any AI accelerator. And, they acquired, Nord dot AI, which I had founded with, with a strong team.
And we became a core part of the, AMD software strategy.
[00:01:35] Tobias Macey:
And so in terms of your work at AMD, obviously, you're very very focused on the software layer. But for a very hardware focused company, I'm wondering what are the elements of that strategy and how they factor into the competitive advantage that you're able to gain by virtue of focusing on the software elements of a very hardware focused problem?
[00:01:58] Anush Elangovan:
Yeah. I I think, AMD has traditionally, you know, done a very good job with hardware. They have a pervasive AI strategy that that goes from embedded laptops gaming PCs, all the way to the data center. And they have a very good cadence of execution of the this hardware. Right? Like, they they're just, like, very relentless in in building good hardware. And about two, three years ago, when, the AI boom was taken off, You know, there was a a focus to bring all of that hardware together with a, software strategy that allowed for, like, a software platform to be built on top of all of this great hardware innovation.
And so that's part of what, you know, our team does, which is build out software and ship software, like a software company, but be part of the AMD, like, innovation engine that builds all of this, awesome hardware. So when you pair that awesome hardware with great software, then it really unlocks, you know, great business value for customers. And and more importantly, it moves the industry forward with respect to, you know, the ecosystem it enables because AMD provides a very open environment for collaboration. And all our source code and software is open source, which allows for, innovation to go at the pace at which the people using our practitioners want to. Right? That is not, limited by the, the ability of what we we put out in closed source form.
[00:03:32] Tobias Macey:
And on the hardware side of the AI ecosystem, obviously, the competitive landscape has been largely dominated at least in terms of marketing by NVIDIA. The CUDA ecosystem has become an entrenched player and gained a lot of adoption because of the fact that it was one of the early movers in the space for numeric computation, which was largely focused on scientific computing and then got parlayed into its current position in the AI landscape. And I'm wondering what you see as the current state of affairs as far as where the attention is going versus the realities as far as the actual capabilities that are provided both at the hardware and in the software layers that are built on top of that. So thinking in terms of the competitive advantages between what CUDA has, it and its relation to NVIDIA hardware and the work that you're doing at AMD with the RACM stack?
[00:04:34] Anush Elangovan:
Very good question. And, just the the first way I'd answer that is capability wise and performance wise, we are, you know, on par or better. And the data to back that up is we we run, like, a million or two models from, from Hugging Face every night to make sure that all the, models run out of the box, you know, without any problems on AMD hardware. So that's that that's kind of pick a thing. And then, and then every model that has been released in the last year, like, llama three, llama four, deep sea, QEM three, everything worked out of the box with whatever ROC conversion we were shipping at the time. And more importantly, it was performant. So right now, if you go look for, like, what is the best way to serve DeepSeek, which is the most popular open source model, it is on MI 300 with Right? That that that is, you know, just performance. Not even no no TCO advantages, nothing else. Right? Just performance wise, it's better than what NVIDIA provides. And so maturity wise, I think we are at a point where it's it's, it's good. But there is also the perception part and and kind of, like, the, like you said, the entrenched, ecosystem part. Right? So if you look back in history, you can always, kind of think of similar transitions in in in technology. Right? You know, there was there was, like, Nokia and Symbian. Right? Like, that was, like, 98%. There was no way that anyone's gonna get anything else. Right? And it was a closed source operating system. It was obviously, there was a inflection point of, like, you know, when smartphones came out and, you know, then then we had the iPhone and we had Android and, you know, just kind of, like, overnight, it took over. But I wouldn't necessarily just say it's exactly like that, but it's a it's a process that once people realize that, hey. This just just works. It clicks and you're like, oh, why would I why would I be doing something in a closed ecosystem? Why would I be paying more for less performance? Right?
Those are all, like, factors that it just disconnects, and then then it's a, snowball effect of, like, okay. Now we just deploy with AMD. Right? And and and AMD has been through this journey with CPUs in 2017. I think 2017, 2018, you know, is just, like, 2% of the market share in data center CPU. Right now, it's, you know, like, it's almost I'd say, I I depending on what's, out there right now, it's almost half the market shares, like, with us. But, that was a journey too. Right? At that time, with 98% market share, the incumbent was, you know, like, there's no way it that anything could change. We think or I personally think the transition and the, adoption in, GPUs and AI software would be a lot faster than what you'd see in CPUs because AI is just about speed. It's about how fast you can move, how fast you can deploy, how fast you get the value for what you're using, and how fast the next technology shows up. And whatever you deployed is no longer useful. Right? So you need the ability to move fast, and that velocity gives you the ability to iterate faster.
And if we focus on our execution, it's, it's ours to, take.
[00:07:51] Tobias Macey:
And then another piece of that puzzle is the communication between the CPU and the GPU and some of the amount of offload that you're able to get onto the GPU as well as the level of efficiency that you're able to get for CPU compute for some of the less parallelizable elements of the model inference and model training. And I'm wondering what are some of the ways that the work that you're doing at ROC m works to optimize some of that communication and offload from CPU to GPU given that you're able to work both on the CPU and the GPU, given that it's all from the main same manufacturer versus the situation with Intel and NVIDIA chips where there are two different manufacturers, and so it requires more organizational communication?
[00:08:39] Anush Elangovan:
That's a very good question. The the good thing, like, philosophically, even though the silicon is from AMD, right, the CPU can be an, AMD CPU. The GPU can be an AMD GPU. Networking could also be from AMD. Right? We make what we call AI NICs. But the standards in which we connect them are open. Right? Like, AMD is part of the UALink standard, the UEC, consortium, the Ultra Ethernet consortium. So we want to make sure that we have the ability to build a reference that has all of these, like, cool integrations. But also, if someone were to build something better, they should be able to drop into that ecosystem and unlock the ecosystem, unlike in the competitive space where it's like, hey, this is the InfiniBand and only we do this. Or, you know, you can, you know, you have to play by these rules. Right? So that's, that's one, which is the ecosystem aspect of it. The second part of what you asked was important because having the ability to build the CPU, which we have built multiple generations of and GPU is now, you know, it's intensely popular. It gives us the insight into both CPU workloads, GPU workloads, and at what point do we transition from CPU to GPU. There's increasingly more focus on test and compute, things like, you know, where where you want to do some sort of post inference or, like, the thinking models that are out there now. Right?
Either that that allows for a a new kind of, scaling. And we are very well positioned for that because we have this ability to run some of these complex algorithms on the CPU, but we also have the ability to do the heavy lifting in the highly parallelizable, infrastructure that the GPU provides.
[00:10:16] Tobias Macey:
Another element of the hardware and software collaboration is in terms of the quantization that you might do for inference time on a model where maybe, natively, it wants to be floating point 16 bit, but you wanna quantize it down for better speed or the ability to run on less capable hardware. I'm wondering how that hardware element factors into the ways that you think about the software layer for either enabling the work of doing that quantization or being able to improve the efficiency of the compute maybe without having to quantize? Yeah.
[00:10:54] Anush Elangovan:
The way I would look at quantization is it almost gets to an art form because you have so many hyperparameters to tweak. Right? But you want to be able to preserve the quality of service of, like, you know, is your model outputting the same, thing? And and and some of it gets very subjective. Right? Like, I I've personally seen deployments where the customer is like, as long as the end user doesn't notice what's going on, you're good. You can quantize it. Right? So then we have some additional aggressive mechanisms to shrink from, like, you know, float 16 to float eight or even, lower. Right? Like, our next generation hardware will have FP four support, which is four bit support. The computation itself is in four bits. Right? And that increasingly gets, like, your your numerical range is just four bits. And and just a few years ago, we were doing everything in FP 32. From 32 to four bits is, like, the industry shifted drastically.
So we still have some of those, SLAs to deliver to and make sure that the model doesn't lose its accuracy. But but that's an interesting area of both research and and production. Right? Just an as an example, like, there was a time, pre LLMs, that int four was considered. Like, you know, int four is for for a time it was considered, it's gonna come. People put it in, in in hardware. I think a 100 had it or something, and then h 100 removed it because there's, like, there's no use for info. And then suddenly, info just became a thing. Right? And LLM showed up, and now info or FP four is, like, it's real. So there was a cycle of, like, oh, quantization below a a particular size is not gonna work out. And then similar to DeepSeek, where DeepSeek showed the first actual training with FP eight with all the tricks they did to get FP eight training to work. So it's a it's a it's a very exciting space and and one that we, at AMD, you know, heavily invested.
[00:12:54] Tobias Macey:
And then for the people who are in the process of designing their AI stack, figuring out what to invest in and which software components they want to rely on, how does the fact that ROCm is open source, versus the proprietary nature of CUDA, factor into the ways that they're thinking about platform risk and the ability to own more of the end-to-end capabilities of the system?
[00:13:23] Anush Elangovan:
I think the important thing is that AI capabilities get unlocked in a way where different parts of the system factor into what gets unlocked. Let me give you an example. AMD Instinct hardware has more than 50% more memory capacity and memory bandwidth. What that allows us to do is serve decode LLM tokens very fast, because decode is all bandwidth limited. But then there's also time to first token, which is, how long does it take to respond at all? So it's a balance: how do you get enough of this so that you don't degrade your SLA, but then run fast and generate an entire story quickly? And that requires balancing more than just a computational optimization.
It could be power. It could be various other end-to-end systems that we have to orchestrate. And the software to manage that gets very interesting and complicated, and we have to build that to unlock this level of customization.
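A back-of-envelope calculation shows why decode throughput is bandwidth bound while time to first token is not; the bandwidth and model-size figures below are placeholder assumptions, not measured numbers for any specific accelerator.

```python
# Back-of-envelope estimate of why decode is memory-bandwidth bound.
# The numbers are illustrative placeholders, not measured hardware specs.
hbm_bandwidth_gb_s = 5300        # assumed HBM bandwidth of one accelerator, GB/s
model_bytes = 70e9 * 2           # assumed 70B-parameter model at FP16 (2 bytes/param)

# Each generated token must stream roughly all of the weights through memory once,
# so an upper bound on single-stream decode speed is bandwidth / model size.
tokens_per_second = hbm_bandwidth_gb_s * 1e9 / model_bytes
print(f"~{tokens_per_second:.0f} tokens/s upper bound per decode stream")

# Prefill (time to first token) is compute bound instead: the whole prompt is
# processed in parallel, so FLOPS and batching, not bandwidth, dominate that SLA.
```

More memory capacity and bandwidth raises that decode ceiling and lets larger models or bigger batches fit on fewer devices, which is the balance being described here.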
[00:14:38] Tobias Macey:
And then digging a bit more into the software stack that you're building, ROCm is the key element of that, but there are also a few other layers available, such as the HIPIFY capability for translating CUDA into more portable C++. I'm wondering if you can give an overview of all of the components that exist in the ROCm stack specifically and some of the surrounding ecosystem that builds on top of it.
[00:15:07] Anush Elangovan:
So we approach this in a multilayered way. One is this: there are customers that have invested in NVIDIA down to PTX-level code. They've invested very heavily. So for them, we have the HIPIFY tool, which takes their CUDA code and converts it to portable HIP code. And portable HIP code can, in turn, be compiled back for NVIDIA systems or for AMD systems. It literally makes it portable. But then there's a layer on top of that, which is Triton, which gives you the ability to start from Python and then target AMD or NVIDIA or other GPU platforms.
And Triton provides this easy-to-use, abstracted GPU programming interface that allows a very easy ramp-up to almost-peak performance, and then lets the compiler do the rest. So we invest in both. The first layer is, hey, you've got some CUDA code and PTX code, we'll help you translate it. That's HIPIFY. But if you're coming in on new greenfield deployments, you come in with Triton as the way you implement it, because then it gives you the portability. And it's a modern kernel programming language that you can use to target either AMD or NVIDIA.
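For readers unfamiliar with Triton, here is the kind of kernel being described: written once in Python and compiled by Triton's backend for either AMD or NVIDIA GPUs. This is a standard vector-add example, shown purely as an illustration of the programming model.

```python
# A minimal Triton kernel (vector add): one Python source, portable across GPU vendors.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                 # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# On a ROCm build of PyTorch, the "cuda" device string maps to the AMD GPU.
x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
assert torch.allclose(add(x, y), x + y)
```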
[00:16:29] Tobias Macey:
And then one of the main reasons for releasing software as open source is to allow other participants in the ecosystem to contribute back to it, take it, fork it, customize it. And AMD is definitely the main driving force behind ROCm, given that it targets AMD hardware primarily. I'm curious what incentives exist in the ecosystem for other people to want to contribute back to it and customize it.
[00:17:02] Anush Elangovan:
Yeah. Very good question. So what we've done in the past few months is actually make ROCm more accessible. It's not just open source in the sense of, hey, there's something out there, we'll do whatever we want internally, update it, and throw the source code over the wall. We truly are moving to an open development model. And what this provides is the ability for anyone to take all of the source code, modify it, and do what they want with it. For example, AMD's Strix Halo laptops are very popular laptops and desktops because you get 128 gigs of RAM that can be shared between your CPU and GPU. And the ROCm builds were getting more and more robust, with a solid CI and the ability to build. The community was able to add Windows support on ROCm devices, which we were slowly working towards, but the community was able to pick that up and accelerate the pace. And so now, as of yesterday, we have someone who just tweeted about how they are running PyTorch on Windows, without WSL, on ROCm. That's the power of the open source ecosystem.
And we don't want to hold up innovation and speed. That is why we think we'll run the race faster as we open up, because it's just gonna be more people scratching their own itch, doing the thing that they want to do, but then it lifts all boats, and all boats move faster. Right?
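For context on what "running PyTorch on ROCm" looks like in practice: ROCm builds of PyTorch reuse the familiar torch.cuda API, so a quick sanity check like the sketch below is usually all that's needed to confirm the AMD GPU is visible. This is a generic illustration, not tied to any specific Strix Halo or Windows build.

```python
# Quick check that a ROCm build of PyTorch sees the AMD GPU. ROCm builds reuse the
# torch.cuda namespace, so existing CUDA-targeted scripts usually run unmodified.
import torch

print("HIP version:", torch.version.hip)          # populated on ROCm builds, None on CUDA builds
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    print("Matmul ok:", (x @ x).shape)
```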
[00:18:33] Tobias Macey:
Another aspect of the ecosystem for AI hardware is that there are different load patterns and different requirements around the different stages of the life cycle, from pretraining to post-training to fine tuning to just running inference. And different consumers at different stages of that life cycle have a different relationship to the underlying hardware. As an end user who is just using an API for AI inference, I don't necessarily care what the actual hardware is under the covers; I just care that I get reasonable response times. Whereas if I'm somebody who is investing a lot into pretraining a substantial model, I'm going to be much more focused on what the hardware is, the software stack on top of it, etcetera. And then in the middle, if I'm a company running my own vLLM or SGLang instance to customize my inference, I'm going to be somewhere in between, where I've got a cloud provider and I just say, give me some GPUs. I'm wondering how you think about the ways that you're interacting with people along those different stages of that life cycle.
[00:19:46] Anush Elangovan:
That's a very good question, and you've segmented the market well. The thing for AMD is that we plan to reach every one of those three kinds of users. There will be different routes to market, but we'll get to the end user. And if those end users consume from an API, it's just gonna be, yeah, DeepSeek R1, the fastest and the most throughput, is on MI300 today. So for the end user, they're just gonna see their tokens per dollar, the cost per token, stay low, and that's the value provided. And, similarly, we're building out how you do distributed inference so you can get even better economies of scale by disaggregating prefill, decode, etcetera.
But then on the other side, we also help with, hey, you want GPU as a service, and you need an entire stack on top. We provide the reference architecture of that stack so that it's not like, hey, we'll give you a GPU and we don't know what to do on top of it, so you gotta take all the pieces, put them together, tie them together, and move it forward. And then the third one is obviously on-prem, or they have their own clusters, and they know exactly what layer of the software they need and they'll replace something. They'll have their own Kubernetes deployment layer. They'll have their own training framework that they've modified, or JAX. They're the super sophisticated frontier labs. Right? And I think if you look at the top 10 customers, those would fall in that bucket. And a vast majority of them do have AMD deployments.
So it's a growing phase, and I think it's just like what we did with EPYC. Right? It's just a matter of time before it's an option. And then the others would see a TCO advantage by just saying, hey, if I serve on MI300, I get more performance and it costs me less, so I'm just gonna serve the request on an MI300 backend. So that's how I look at it.
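For the middle tier of users running their own serving stack, here is a minimal sketch of the vLLM offline API on a ROCm-backed GPU. The model name and parallelism settings are illustrative assumptions, not a recommended configuration.

```python
# Minimal sketch of serving an open model with vLLM on AMD Instinct GPUs via ROCm.
# The model choice and parallelism settings are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # hypothetical model choice
    tensor_parallel_size=1,                             # raise to shard across more GPUs
    dtype="float16",
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what ROCm is in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same engine can be exposed as an OpenAI-compatible HTTP server for the API-only consumers at the top of the stack, which is how the "different routes to market" converge on the same hardware underneath.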
[00:21:53] Tobias Macey:
Another element of the software investment for powering these very numeric and parallelizable workloads epitomized by large language models is that there has been a lot of effort going into reducing the computational load for training and inference, either through different attention mechanisms or the work that's going on with mixture of experts, and also some of the newer architectural patterns where people are trying to move beyond transformers. I'm thinking in particular of Mamba, or the diffusion-based models, or the work that the liquid networks folks are doing. I'm wondering how that feeds back into the ways that you're thinking about the software stack that has to improve the efficiency or the capabilities of these different model architectures and the different compute patterns that get translated to the underlying hardware.
[00:22:52] Anush Elangovan:
Yeah. That's a very good question. The way I look at it is that it's like layered cheese. Right? You just have to invest in each layer to be robust and solid so that when there's a switch from transformers to Mamba or SSMs or other architectures, it's not a question of if, it's a question of when. When that happens, we already have the underlying infrastructure, the core compute disaggregation, and the way we network all of that compute. Those libraries and that infrastructure are robust enough for a switch at the top level, at the modeling level. But then on top, again, there will be another level of bringing it back to an API. In the end, the last mile for the end user is gonna be an API, and that's gonna be something like an OpenAI-style API for LLMs. That can also change and evolve over time. But there's a fan-out between the very low levels of the software stack and then, as you get to the middle layer, the paths of innovation that we explore.
And I can tell you, at AMD, from my view, all of those paths that you talked about, we are very well invested in, and we bring them up there. So the way I approach it is: how do you stay prepared to react, and how fast can you react? That's gonna be the currency for being relevant in this fast-moving AI world.
[00:24:27] Tobias Macey:
And then the other potential target for an end user is people who want to do their own local inference on a laptop using these open models, and they want to do it reasonably and affordably, without having to run their own power plant to power the cards. I'm wondering how you're thinking about that consumer-grade aspect of AI and the ways that people are able to run models on their local laptops, on their desktop machines, or on edge compute, bringing this into more of a personalized mode without having to be an AI researcher and understand everything needed to quantize the models and fine tune them to run effectively on the silicon in whatever device they have ready to hand.
[00:25:18] Anush Elangovan:
Yeah. So AMD has a very pervasive AI story. Right? We go all the way from embedded and automotive, like autopilot devices, up through laptops, like the Strix Halo laptop I was talking about, which has around 50 TOPS of compute on an NPU. Think of the NPU this way: when you can trade programmability for efficiency, you go for the NPU. When you want the programmability, you use the iGPU. But then you also have the CPU, which is your fallback. And all of those pieces of silicon are AMD built, in a laptop form factor. In fact, I have my Strix Halo laptop here, and it's got all of that tied together, and the software stack that runs everything on it is open source. It's not like, hey, install this specific CUDA version. If you wanna build it, you can take it, build it, and tinker with it. So it gives you that ability, and we see incredible innovation from the likes of llama.cpp and Ollama and the client side of inferencing, where they go in and make these big innovations, especially in quantization.
Going back to your original point, llama.cpp was the first to do things like Q8, Q6, Q5, where you could change only a few tensors and your perplexity still looks okay. So it almost gives you per-tensor fidelity: okay, this model can now run on this hardware, and the big lever is how much VRAM you have. So they map VRAM to the quantization and find a sweet spot for a model that can run locally. And I think increasingly we're gonna see algorithmic innovations too, where you have an actor-critic setup or student-teacher models. If I remember right, Llama 4 Behemoth was used in a similar way, where the very large model is the teacher, and your small model distills and learns from the teacher, but it's a smaller model. So there are innovations happening at each layer of the cheese: at the hardware level, the quantization level, the modeling level. Put all of them together, and increasingly you'll see a lot of local AI.
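The "map VRAM to a quantization level" exercise described here is simple arithmetic; the sketch below works through it. The bits-per-weight figures are approximate community values for common GGUF quantization types, and the fixed overhead allowance is an assumption for illustration.

```python
# Rough arithmetic for picking a quantization level that fits a given amount of VRAM,
# in the spirit of the llama.cpp quants discussed above. Values are approximate.
bits_per_weight = {"F16": 16, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

def estimate_vram_gb(params_billions: float, quant: str, overhead_gb: float = 1.5) -> float:
    """Weights plus a fixed allowance for KV cache and runtime buffers (assumed)."""
    weight_gb = params_billions * 1e9 * bits_per_weight[quant] / 8 / 1e9
    return weight_gb + overhead_gb

for quant in bits_per_weight:
    print(f"8B model at {quant}: ~{estimate_vram_gb(8, quant):.1f} GB")
```

On a device with unified CPU/GPU memory, like the laptops described above, the same arithmetic applies to the shared pool rather than to dedicated VRAM.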
[00:27:32] Tobias Macey:
And in the work that you're doing to build this software ecosystem on top of the core silicon capabilities, and to make it open, accessible, and flexible, what are some of the most interesting or innovative or unexpected ways that you're seeing that combination of AMD hardware and the ROCm ecosystem applied?
[00:27:54] Anush Elangovan:
Innovative ways that we're seeing, that we hadn't envisioned people using it in. Among the cool things I've seen: an AMD GPU, a normal PCIe card, stuck onto a RISC-V CPU. They built a little system with an AMD GPU that wasn't even designed to run on a RISC-V host, but because it's open, they could actually port the driver and get it to work. So they now have a solid graphics card running on a completely new CPU architecture, because of the open ecosystem that we have at AMD. There's also the Windows portability I mentioned: the community did a lot of the work to get Windows portability for ROCm, and now we have ROCm that just works on Windows. We have the likes of tinygrad, who are putting a lot of AMD cards together to build a cost-efficient petaflop machine.
That's also very exciting. And there's a lot of innovation in the application space with local AI, where you can run your coding assistant on your laptop and it's answering questions in real time. That's really cool to see. Things like that, and there's more coming every day. You just have to keep up with it.
[00:29:21] Tobias Macey:
And in terms of your own experience of working in this ecosystem, working very closely with the AMD hardware units and the end users and the software contributors, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:29:40] Anush Elangovan:
I think ROCm and AMD software have a long history, and they've evolved, so there's a perception hurdle that we had to get over. I literally am on Twitter listening for anyone complaining about AMD software or ROCm software, this is not working or that's not working, and I try to navigate that and get them the support they need. So it requires a little bit of re-educating, asking folks that may have formed an opinion on AMD software to see where we are now. And that's always a little bit of a thing, right, because they had an experience that was subpar at some point, and now you're asking for a second chance. Navigating that is interesting. And then there's just communicating what we are doing. Do people know about ROCm? Unlikely. Do people know what's the best GPU to serve DeepSeek on? Unlikely.
But that should be straightforward. It should be: the fastest way to serve DeepSeek is on AMD. That's a fact, and it's just about how we market that and get it out there. So, hopefully, we'll work on that and help people realize the value of what they can get with AMD hardware and software.
[00:30:52] Tobias Macey:
And talking about the speed of inference made me think about another element of the serving piece, which is that oftentimes you want to run multiple copies of the model if you're serving a number of different users simultaneously. That brings in questions around splitting the model across multiple GPU instances, or coordinating multiple GPU instances pooled together into a single batch of compute. And, obviously, AMD is being used in various supercomputing ecosystems. I'm wondering how you're thinking about the ease of pooling together multiple hardware units to manage that inference serving, or training.
[00:31:39] Anush Elangovan:
Very good question. But, you know, I would actually take it a level lower. AMD is the industry leader in chiplet design. Our competition is gonna do that in a year or two, where they actually get to chiplets. The MI300 today, the MI250s, and for a few generations now, we have built multiple chips in one big package. So we are already doing that at the chip level. Today's MI300 has something called CPX mode, which gives you eight chiplets with 24 gigs of HBM each that you can partition as independent GPUs, each with locality to the CPU. Physically, you turn off the connections and you have a GPU. And what that provides is the ability to build from fundamental building blocks, even at the SoC level. Then that goes up to the GPU level, and you have eight GPUs at the UBB level. Then you get to a multi-node rack level, and then from a rack to a cluster level. And if you build each of those from first principles of how you do communication and compute, you start building a very, very robust software architecture. There are challenges in how AMD did it in the past that we have evolved past over time. But today, going back to chiplet designs, you can actually partition each one of these GPUs, and we have instances of vLLM or SGLang running on each one. So you have eight instances of vLLM serving Llama 8B, for example, fully isolated. You get 8x the throughput just by parallelizing at the chiplet level. So it's very exciting for computer scientists and for folks that want to deal with operating system principles like resource allocation and how you're efficiently using compute and communication. AMD is built for that. And if you are in that space, we're obviously hiring. Drop me a note and blah blah blah. It's a lot of fun engineering to do there.
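As a sketch of the "eight isolated serving instances" idea, the snippet below launches one vLLM server per GPU partition by pinning each process to a device via HIP_VISIBLE_DEVICES. The partition indices, ports, and model name are assumptions for illustration; the exact partitioning workflow depends on the platform's own tooling.

```python
# Sketch of running one vLLM server per GPU partition, as described above.
# Partition indices, ports, and model are illustrative assumptions.
import os
import subprocess

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice
procs = []
for partition in range(8):
    # ROCm exposes each partition as its own device; select it per process.
    env = dict(os.environ, HIP_VISIBLE_DEVICES=str(partition))
    procs.append(subprocess.Popen(
        ["vllm", "serve", MODEL, "--port", str(8000 + partition)],
        env=env,
    ))

# Each server is fully isolated on its own partition; a simple round-robin load
# balancer in front of ports 8000-8007 aggregates the ~8x throughput.
for p in procs:
    p.wait()
```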
[00:33:47] Tobias Macey:
And given all of the conversation we've had around the benefits of AMD hardware and the ROCm ecosystem, what are the cases where one or the other, or both, are the wrong choice?
[00:33:59] Anush Elangovan:
If there is a wrong choice, let me know. I'll make it the right choice. But I think there are cases where, you know, I wouldn't say... See, the ROCm software should become invisible. It should just become magic. It should just be: there are hardware resources, and there are APIs to do something, and then you can build something on top. And what that something is, is like vLLM serving. And then on top of that, you build something else. So ROCm as an interface should just disappear. People should not have to think about it as a moat, because it's open source, and it should not be thought of as buggy, because it's just there. It's like ambient enablement of the compute and networking.
And then you get to the GPUs themselves. GPUs have their pluses and minuses. This goes back to classic computer science trade-offs: do you need programmability, or do you need specificity? Do you go general-purpose GPU, general-purpose CPU, or do you go for ASICs? You always have that trade-off. And the pace of innovation keeps GPUs in the sweet spot where that innovation will happen for the foreseeable future. And then, yes, if you want to serve some specific customer with something and you wanna build an ASIC, it takes you two years. If you don't mind that two years from now you still wanna be serving OPT 175B, fine, build an ASIC that serves OPT 175B. But it may be irrelevant. So then you try to make it a little more programmable, and then it's a little programmable, but it's not fully programmable.
Then it's a trade-off. You're not an ASIC, you're not a GPU, you're halfway in between. And the halfway point is always the one-off benchmark. You look at some of these ASIC vendors and they say, oh, we serve Llama 20x faster. But if you read the fine print, it's like, yeah, Llama has to be only 2K context length, while GPUs are serving million-token context lengths. So you're saying all your requests have to fit into a little envelope, and then we'll serve those envelopes very fast. Sure. If that's what's needed for your customer, great. But in general, it's classic general-purpose programmability, and there's a market for that, a huge market, and we'll be good for that. But that doesn't mean we don't do semi-custom or custom. All of the big consoles use AMD. So we are there for the entire journey, whether that's on the hardware side or the software side for AI.
[00:36:45] Tobias Macey:
As you continue to invest in the future direction and capabilities of the hardware layer and the software support that goes along with it, and as the AI ecosystem continues to stretch and strain those software components, what are the things you have planned for the near to medium term for your own work?
[00:37:06] Anush Elangovan:
Near to medium term, ROCm quality is really high on my mind. It should just be that nobody complains about ROCm quality. That's number one. It should be easily hackable and usable. Hobbyists and developers should end up there. Why? Because it's easy to use, and I can just hack it; if something doesn't work, I'll go ahead and fix it. That should be the mentality. Then there's developer outreach, where we want to get everyone to taste and explore and feel, oh, this is what it feels like to use ROCm, and then go use it. So those are the three that I'd focus on. But then, obviously, on the other side, I want to make sure customers are successful. That's where the rubber meets the road. If the customer is not successful, that's not good either.
[00:37:52] Tobias Macey:
Are there any other aspects of the ROCm ecosystem, the work that you're doing at AMD, or the role of AMD in the broader AI ecosystem that we didn't discuss yet that you'd like to cover before we close out the show?
[00:38:06] Anush Elangovan:
I think this is a good time to be in this space, and we are at an inflection point of hardware innovation, software innovation, how we touch people, and what it means for humanity at the top level. Because increasingly, you're gonna be AI assisted in some form. Composing an email to do something somewhere becomes you talking to your device or your chatbot to say, hey, write this up for me, and it's gonna give you most of the framework, and then you go and add the bells and whistles. So it's gonna creep up on you in a way that is gonna be profound. And five, ten years from now, people will assume that's how things are done.
So otherwise, from an AMD standpoint, hardware and software, it's just one step in front of the other, and we're just gonna keep executing and be there for this transition in the industry.
[00:39:05] Tobias Macey:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[00:39:24] Anush Elangovan:
I think it's a combination of all of them. And I think, when we look back after this transition, which can be over a longer arc of time, like twenty years, it's going to have required a combination of all three. People will have to retrain to use a new way of communicating. It's like email, but it's assisted, and it already knows what the other person asked, and you've already got a draft composed saying this, this, this, or this, and your job is to swipe left or swipe right on this draft or that draft. It may distill down to that. But then when it comes back to the fundamentals of compute, and the power to drive that compute and the AI services, it's gonna be another layer of learnings and development of new tools. The way I look at AI is that it's like electricity.
And right now we're like, oh, here's electricity, go do what you have to do. And there's a whole wide world of what we have since explored and done with electricity. We're at the same point with AI: just the start of the AI era, where we're like, oh, we've got AI, here are frontier models that can do this. And then, is it Cursor or Windsurf for coding? Is it the next version of Tanvar to do AI-based something? How is salesforce.com gonna work in an AI agentic world? That's up for someone to go innovate and figure out. So it's very exciting to be in this space.
[00:41:08] Tobias Macey:
Absolutely. Well, thank you very much for all of the time and effort you're putting into broadening the set of hardware that people can operate on and their ability to tinker with it and understand more about the end-to-end stack. It's definitely a very interesting problem space, and it's great to see that AMD is being so open with their work there. So I appreciate the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.
[00:41:35] Anush Elangovan:
Thank you for having me, Tobias. Thank you.
[00:41:42] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to AI Engineering Podcast
Interview with Anush Elangovan from AMD
AMD's Software Strategy in AI
Competitive Landscape: AMD vs NVIDIA
Transition and Adoption in AI Hardware
Quantization and Model Efficiency
Open Source and Platform Risk
AI Hardware Ecosystem and Consumer Interaction
Local AI and Consumer Grade Applications
Innovative Applications of AMD Hardware
Pooling Hardware for AI Inference
Future Directions for AMD and ROCm
ROCm Quality and Developer Outreach
AMD's Role in the AI Ecosystem