Summary
In this episode of the AI Engineering Podcast Jamie De Guerre, founding SVP of product at Together.ai, explores the role of open models in the AI economy. As a veteran of the AI industry, including his time leading product marketing for AI and machine learning at Apple, Jamie shares insights on the challenges and opportunities of operating open models at speed and scale. He delves into the importance of open source in AI, the evolution of the open model ecosystem, and how Together.ai's AI acceleration cloud is contributing to this movement with a focus on performance and efficiency.
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- Your host is Tobias Macey and today I'm interviewing Jamie de Guerre about the role of open models in the AI economy and how to operate them at speed and at scale
- Introduction
- How did you get involved in machine learning?
- Can you describe what Together AI is and the story behind it?
- What are the key goals of the company?
- The initial rounds of open models were largely driven by massive tech companies. How would you characterize the current state of the ecosystem that is driving the creation and evolution of open models?
- There was also a lot of argument about what "open source" and "open" means in the context of ML/AI models, and the different variations of licenses being attached to them (e.g. the Meta license for Llama models). What is the current state of the language used and understanding of the restrictions/freedoms afforded?
- What are the phases of organizational/technical evolution from initial use of open models through fine-tuning, to custom model development?
- Can you outline the technical challenges companies face when trying to train or run inference on large open models themselves?
- What factors should a company consider when deciding whether to fine-tune an existing open model versus attempting to train a specialized one from scratch?
- While Transformers dominate the LLM landscape, there's ongoing research into alternative architectures. Are you seeing significant interest or adoption of non-Transformer architectures for specific use cases?
- When might those other architectures be a better choice?
- While open models offer tremendous advantages like transparency, control, and cost-effectiveness, are there scenarios where relying solely on them might be disadvantageous?
- When might proprietary models or a hybrid approach still be the better choice for a specific problem?
- Building and scaling AI infrastructure is notoriously complex. What are the most significant technical or strategic challenges you've encountered at Together AI while enabling scalable access to open models for your users?
- What are the most interesting, innovative, or unexpected ways that you have seen open models/the TogetherAI platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on powering AI model training and inference?
- Where do you see the open model space heading in the next 1-2 years? Any specific trends or breakthroughs you anticipate?
Parting Question
- From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Together AI
- Fine Tuning
- Post-Training
- Salesforce Research
- Mistral
- Agentforce
- Llama Models
- RLHF == Reinforcement Learning from Human Feedback
- RLVR == Reinforcement Learning from Verifiable Rewards
- Test Time Compute
- Hugging Face
- RAG == Retrieval Augmented Generation
- Google Gemma
- Llama 4 Maverick
- Prompt Engineering
- vLLM
- SGLang
- Hazy Research lab
- State Space Models
- Hyena Model
- Mamba Architecture
- Diffusion Model Architecture
- Stable Diffusion
- Black Forest Labs Flux Model
- Nvidia Blackwell
- PyTorch
- Rust
- DeepSeek R1
- GGUF
- Pika Text To Video
[00:00:05] Tobias Macey:
Hello, and welcome to the AI Engineering podcast, your guide to the fast moving world of building scalable and maintainable AI systems. Your host is Tobias Macey, and today I'm interviewing Jamie De Guerre about the role of open models in the AI economy and how to operate them at speed and at scale. So, Jamie, can you start by introducing yourself?
[00:00:29] Jamie De Guerre:
Sure. Hi, Tobias. Thank you so much for having me on the podcast today. My name is Jamie De Guerre. I was founding SVP of product at Together dot ai, where I've been for about two and a half years now. I joined right around the founding of the company. And prior to Together, I spent ten years working in startups in sort of similar roles, leading product management, marketing, and field technical services. And then the second startup I was at was acquired by Apple, and I spent nine years at Apple, first leading the program management team for search inside of the engineering organization, and then leading product marketing for AI and machine learning across Apple for about five years before I came to Together.
[00:01:10] Tobias Macey:
And do you remember how you first got started working in the ML and AI space and why you've spent so much of your career there?
[00:01:17] Jamie De Guerre:
Yeah. You know, I first started working on machine learning in earnest when we worked on search. We'd done some experimentation with that before then at Cloudmark when we were doing email and SMS security applications. We were using some machine learning techniques in sort of our research, but we didn't actually use them much in production. But then at Topsy, we built a social search platform, and we used machine learning quite extensively for things like search relevance. And after Topsy was acquired by Apple, that continued and expanded. We used machine learning for lots of different parts of the search stack on the back end. And the search technology quickly expanded to be used across other applications at Apple, like helping with Siri and improving the understanding of your requests in Siri. And as I continued to work at Apple across AI and machine learning in other areas, I worked on features like predicting which app you wanted next on your iPhone based on your routines, and many others.
[00:02:21] Tobias Macey:
So bringing us now to your current position, your current focus, I'm wondering if you can just give a bit of an overview about what it is that you're building at Together dot ai and some of the story behind how it got started and the problems that you're trying to solve there.
[00:02:35] Jamie De Guerre:
Yeah. Absolutely. So at Together dot ai, we're really providing a full AI cloud stack, and we do that with a focus on performance and efficiency to help customers achieve the best cost performance, whether they're building a model, customizing a model through fine tuning or post training, or running a model for their production applications for inference. And so we call that the AI acceleration cloud because of our focus on speed and efficiency. And we do this for developers. We now have over 550,000 AI developers using our platform for these types of applications.
We also do this for foundation model labs. So not the, you know, OpenAI and Anthropic, but sort of a next tier down from that. We service a lot of leading foundation model labs, whether it be Salesforce Research, companies like Mistral, and many others. And we're starting to also now do this for large enterprises. As they reach scale with running generative AI in production, our performance and efficiency becomes a huge advantage. And so we have customers like Zoom, which uses our platform to train models that analyze conversations in a Zoom call and let an AI assistant chat about the history of those calls, and companies like Salesforce, which powers large parts of Agentforce using our platform. As for how we got started, we were started in June of twenty twenty two.
This was after GPT-3 had launched, but before ChatGPT had come out. And we have a very deep research heritage in our founding, and we found that a lot of the AI community was really exasperated with the fact that there was finally massive improvement happening in the quality of AI models, but suddenly it was being done in a closed way. It wasn't done in the open with sharing the techniques. OpenAI wasn't publishing how they had built GPT-3. And this was really frustrating for the AI community, because everyone had always, over the last fifty years of AI improvements, done so out in the open so that researchers could improve on past research. And the sort of feeling in a lot of the tech world, and the message, was that, you know, we would have one big AI model, and it would become artificial general intelligence. Only one model would ever do that, and that was the end. You know, that was the way it was gonna be. And we just thought this was such a fallacy. We thought that there would be many models from many organizations.
And importantly, if we could help to spur the open source movement around AI, a lot of those would be open source. And when you look at major technology platforms over the last, you know, twenty years, open source becomes a huge part of those. You know, the Internet runs on open source. All of the servers run on open source Linux, on open source web servers, on open source databases, and other things. And we thought that that would be really important for this new technology movement of AI to also have foundations in openness, so that we can understand how these models behave, how to build applications that deeply understand the technology under them and build for that appropriately, and, you know, so that society as well understands that. So that was what we felt the future should be, and we thought we could have a small part in, like, helping to shape that and spur that as a movement. And if you imagine that future where there's lots of great models from lots of sources, many of them are open source, to some extent, the models become a little bit more of a commodity. Obviously, you know, great closed source models will still have huge success, but organizations will have the option to choose great open source models as well. And if the models are becoming a little bit more of a commodity, where does the value accrue? And we think that because it is so much more expensive to build and run a generative AI application compared to building and running a database and web server application, a lot of that value will actually accrue to the infrastructure.
And so if we can be the best at making it more efficient to use the infrastructure and providing faster performance out of the same infrastructure, then a lot of that value could potentially accrue to us. And so that's really the kernel of how we started TogetherAI with this focus on providing an AI acceleration cloud that provides the best cost and performance for just about any model. And we now have, you know, over 200 leading open source models available through the platform.
[00:07:06] Tobias Macey:
On that note of open source and openness in the model ecosystem, there's been a lot of debate about what that means semantically, what it means from a legal and licensing perspective. And before we get too far into that, there's also the element of who is producing these open models, where the initial releases of open models were largely a reaction to the OpenAIs and Anthropics, where they have these frontier models that are proprietary and closed. You can only access them via the API. There's no real insight into how they're built, how they're operating, and you just have to rely on their infrastructure if you wanted to build on top of it, which carries with it a lot of platform risk. And so the initial set of models that were released as open source, again, delaying a little bit on the semantics of that argument, those were mostly coming from the big tech companies. So the Llama models from Meta were some of the initial ones. There were the Mistral models that came out as a response there as well. And then also in recent months, Google has been releasing some fairly popular open models. And I'm wondering, from that initial phase of two years ago when we were really starting to get these open models released and starting to gain some adoption, how has the ecosystem shifted, and what do you see as the distribution of who is producing those models compared to which ones are actually being used and leveraged?
[00:08:40] Jamie De Guerre:
Yeah. No. That's a great question. You know, I think what is so fascinating about what's happening today in the AI industry is there's innovation at many different levels. And traditionally, like, the way we thought about things, the way that the messaging came out of these big labs, was that all of the value of creating this intelligence came from massive pre training. And so pre training is really taking, like, all of the Internet's data or all of the private data in the world that you can collect and building a very, very large model, a large base model, from nothing, starting from random weights. And it is expensive. You know, the messaging from a lot of the big labs was that, you know, it costs billions of dollars to build one of these models.
And to some extent, you know, that's true of this pre training process. It really is very expensive. Probably not billions of dollars today, but it's certainly not something that just anyone can do. You need to have really large AI compute factories to be doing that kind of training. But then a lot of gains started to happen from post training techniques. Taking an already built pre trained model, a base model, fine tuning it, doing reinforcement learning from human feedback, reinforcement learning from verifiable outcomes, which is a newer technique, and lots of other techniques in the whole post training process to modify this big base model to do new things or to improve it in some way. And this achieved tremendous gains. It wasn't what the huge organizations wanted to focus on telling you was the way they were achieving those gains, because it's actually very, very inexpensive and easy to repeat and iterate on in lots of different ways. And then more recently, we're starting to use techniques like inference time compute or test time compute to get these models to do more reasoning and sort of think while they're outputting an answer. And this, again, doesn't really require large training processes to get a model to start to behave that way. And so I think what you've seen in the open source community is that there are a number of very well funded, quite large organizations that are investing in building really, really high quality and, you know, large pretrained models. This includes Meta with the Llama models. This includes large organizations like Google and Microsoft.
It also includes not for profit organizations that are well funded and helping to do this, and, increasingly, a number of startups, like Mistral, which was, you know, releasing state of the art models with a team of 30 people through real innovation in doing this really effectively. And so they create these large pre trained models just like the big labs do. But what's different is once that pre trained model gets released, the open community rapidly improves them through other techniques. And so you get literally thousands of different versions of the Llama models created by the open community and shared openly through platforms like Hugging Face.
And each one starts to be better in different ways, and each one enables the next one to learn from their learnings. And this has enabled both higher quality in new ways and new techniques that have enabled organizations to do this more efficiently and easily, which has been a huge advancement that comes from that open source movement that the big labs are now copying, things like LoRA fine tunes and other things.
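As a concrete illustration of the LoRA-style fine tunes mentioned above, the open source peft library wraps a base model with small trainable low-rank adapters along these lines. This is a minimal sketch, not anything discussed in the episode; the gpt2 base model and the hyperparameters are only placeholders.

```python
# pip install transformers peft
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Small stand-in base model; a real fine-tune would use a larger open model.
base = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA freezes the base weights and trains low-rank adapter matrices instead,
# which is what makes these fine-tunes cheap to produce and share.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"],
                    lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a tiny fraction of weights are trainable
```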
[00:12:08] Tobias Macey:
Delaying again the question about the semantics of open and open source, I think it's worth digging into your point of the proliferation of these different fine tuned models and distillations. The number of models has been growing faster and faster, which also leads to a lot of confusion and challenge as far as which model do I use for problem x, and the challenge of figuring out, based on what you're trying to achieve, how do I even start to sift through these different models and understand the implications of how they're going to perform on my particular problem set, versus, oh, there are too many options, I'm just going to go with one of the big model providers and use that, or maybe I start with the big model and do my own distillation, which then compounds the problem further. And I'm curious how you're seeing people address that challenge of the paradox of choice that we're faced with now.
[00:13:01] Jamie De Guerre:
Yeah. I think that that is super challenging, and it speaks to, you know, the biggest challenge in general of working in this space, which is just the pace of change. Things are moving so quickly with new models, new techniques, that it's very, very difficult to keep up and know exactly which techniques to use or which models to use or other things. And I think that one of the key areas we're gonna see improve in the coming couple of years, maybe the coming several months, is that increasingly the providers are gonna provide an AI system, an AI platform, instead of just a model. You already are seeing this today with OpenAI, for example.
Their latest model releases are models that are an AI system built to use data that they pull in through RAG, built to use tools as part of the tool chain of every request and, you know, compile code and look at the output of the code or access a mathematics calculator or other things. And I think that this trend will continue and will start to move to a point where you obfuscate out which model gets used. The AI system will actually choose the model for the task, both at a request level and on a level of helping an organization. Let's say, for example, that you are an insurance company, and you've got a model that is intended to interact with customers around insurance claims, and you've done some work to fine tune one model. Let's say you started with Google's Gemma model. You fine tuned that. That model will get used in production, but it will get monitored and evaluated in an automatic way on an ongoing basis. And as new models come out, let's say Meta just released the Maverick Llama 4 model, that fine tuning data will automatically be applied to the Maverick model. You'll have a new candidate model that is now available in this production AI system. A subset of requests will be sent to the new fine tuned model in sort of a shadow mode and evaluated. And then if it's achieving higher accuracy, either through prompting the administrator or developer running this system to approve starting to use it based on the evals, or eventually automatically, the AI system will actually switch the models. And so you'll start to have this system that is constantly creating and evaluating new candidate models for you based on your fine tuning data, based on your production traffic, based on new models that are coming up from the community, and just greenlighting the one that's best for the task, acting as a total platform and system as opposed to a specific model.
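To make the shadow-mode evaluation loop described above concrete, here is a minimal sketch of the control flow. It is an illustration only: the model names, the call_model and score helpers, the traffic fraction, and the promotion threshold are all hypothetical, not part of any published Together.ai API.

```python
import random

# Hypothetical stand-ins for a real inference client and a real eval metric.
def call_model(name: str, prompt: str) -> str:
    return f"[{name}] response to: {prompt}"

def score(prompt: str, completion: str) -> float:
    return random.random()  # placeholder for an actual accuracy/eval score

PRODUCTION_MODEL = "gemma-finetune-v1"            # placeholder names
CANDIDATE_MODEL = "llama-4-maverick-finetune-v1"
SHADOW_FRACTION = 0.05                            # mirror 5% of traffic
PROMOTE_THRESHOLD = 0.02                          # promote on a clear win

prod_scores, cand_scores = [], []

def handle_request(prompt: str) -> str:
    # The production model always serves the user-facing response.
    response = call_model(PRODUCTION_MODEL, prompt)
    prod_scores.append(score(prompt, response))

    # A slice of traffic is also sent to the candidate in shadow mode;
    # its output is scored but never returned to the user.
    if random.random() < SHADOW_FRACTION:
        shadow = call_model(CANDIDATE_MODEL, prompt)
        cand_scores.append(score(prompt, shadow))
    return response

def should_promote() -> bool:
    # Greenlight the candidate once enough shadow traffic shows a gain.
    if len(cand_scores) < 500:
        return False
    gain = (sum(cand_scores) / len(cand_scores)
            - sum(prod_scores) / len(prod_scores))
    return gain > PROMOTE_THRESHOLD
```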
[00:15:33] Tobias Macey:
And now exploring that complexity of the terminology being used around these open models, as far as the first set of them were very freely called open source. And then there was a lot of pushback saying, no, it's not actually open source, because all we have is the model and the weights. And maybe we have the code, maybe we don't. Most of them, you don't have the data. And so now there's been enough iteration that there is actually an OSI approved definition of open source in the context of these AI systems. And, again, there's a variety of levels of compliance with that as far as the models that are out there. I'm wondering how you're seeing the current language used around the idea of openness and the generalized understanding of what those different terms mean in the context of what types of rights and freedoms are being granted to the users of those models.
[00:16:26] Jamie De Guerre:
Yeah. I think it's a great point. I think that there's a lot of variety here, and, you know, this is not new to the open source community. The open source community has grappled with this since the very beginning in terms of, you know, different licenses, whether it was the MIT license or an Apache license or licenses that were not as legally permissive for different use cases. And there's a new layer of complexity with this when it comes to the AI models, as you're pointing out, because the model could be open weights, meaning that the weights are available for anyone to use with a certain license attached to it. And you can download them and you can use them on your own system. But the source code of how the model was created might not be available. And so those are sort of referred to as open weights models, but not open source, technically speaking. The other aspect is, like you mentioned, the data that the model is trained on. You know, a fully open source model release should really include something that you can reproduce.
And to reproduce it, you would need the code that was used to train it. You would need the resulting weights that were outputted, but you'd also need the data that was used to train the model. And so that's sort of an open data model. I think it's helpful that we create these distinctions. I think that it is helpful that different organizations do this at different levels. You know, some organizations do all three, which really helps the research community learn even more from the release of what's being put out. But I don't think that we should think about this in terms of, you know, if an organization doesn't release the source and does not release the data, then it's not a truly good open release or it's not helpful. That's definitely not the case. Just having an open weights model is tremendously helpful to the community, and it enables so much of an ecosystem to occur around that. Meta's Llama models are just open weights models. Their source code is not released. The training data is not released. But they still enable a tremendous ecosystem to build on top of those models and use those models. And so I think that all of these levels are valuable, and it's great for there to be a variety from the open community.
[00:18:34] Tobias Macey:
From the perspective of organizations who are investing in the adoption and use of AI for their business systems, whether that's internal or customer facing, what do you see as the typical phases of evolution from the initial set of, hey, I've played with OpenAI, this seems great. Oh, wait, I don't wanna take on the platform risk of some proprietary company deciding that they wanna yank out some feature from under me, so now I'm gonna go to these open models, up through to, hey, I've built this self evolving AI system. It's doing what you were saying before as far as the automatic fine tuning, automatic candidate generation, automatic shadowing. Obviously, there are a lot of steps and a lot of layers of complexity on that path.
And then even eventually to the point of, hey, I've got enough of my own source data, I'm going to build my own foundation model from scratch. And I'm wondering how you're seeing organizations tackle that evolution of complexity and sophistication on that path.
[00:19:34] Jamie De Guerre:
I think that there's kind of two vectors of this. One is on achieving model quality, and the other is on dealing with operational scale and performance and hosting. On achieving model quality, I think what you mentioned is very typical, where an organization will start with a basic prototype using a model like OpenAI or Anthropic's Claude. Very easy to get started. These models tend to be of the highest quality out of the box, and you can quickly prototype to see if the application makes sense and if AI can be leveraged for the task.
Once you see that it can be, a lot of these organizations run into challenges. One of the challenges is, as you mentioned, being beholden to a single vendor and single platform, and it could change under you. Another is the cost. You know, if you're starting with the biggest model, it's often the highest cost. Another is the performance and scalability and reliability. It's one thing to prototype for, you know, a thousand employees. It's another thing to roll it out to millions of consumers. And so all of these reasons often lead them to saying, okay, I want to invest in tuning an open source model to my application and be able to operate this with more control and ownership at higher scale and lower cost. And so they switch to open source models. And I think on the accuracy vector, that usually starts just with prompt engineering, in a similar way to how they probably started with the closed source model. Then the next phase is typically adding data to the prompt through RAG, retrieval augmented generation, so that you can imbue knowledge from your internal enterprise into the context that the model uses to respond to a request.
And in many cases, that is all you need to get a successful application. In many cases, organizations go further to do a combination of RAG and post training or fine tuning of the model, which can help to adjust the behavior of the model or also imbue some knowledge into the model at the time of the training. And we usually see, for the most advanced deployments, that using a combination of both is the most successful. I think that the next stage we'll start to see more and more is sort of a regular pipeline for that constant improvement. Today, that's kind of done through human methodologies of, like, research teams having a set of sprints where they retrain and experiment. But increasingly, we'll see the AI platforms help customers do that more and more easily. I think really quickly, the other vector is around the operational performance and scalability. And I think often we see organizations, when they first start to switch to open source, they'll self host on maybe their existing cloud provider, like AWS EC2 instances or something along those lines, using an open source inference system like, say, vLLM or SGLang, and operating it on their own. And this is a great way to get started using these open source models. But as they start to deploy into production, that's a typical time when they come to us and say, like, this was tougher to operate at scale than we expected, there's lots of issues that are challenging that we haven't developed approaches for yet, and the overall cost is really, really high. And by switching to a platform like Together.ai, we take care of all that operational burden for them and are usually able to give them better efficiency, where we reduce the amount of GPU compute needed by half or a third or sometimes even more, reducing that cost and improving that total cost of ownership.
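As a concrete example of the self-hosting starting point mentioned here, vLLM's offline API looks roughly like the sketch below. The model name is just an example, and the actual hardware requirements depend on the model you pick.

```python
# pip install vllm  (requires a CUDA-capable GPU)
from vllm import LLM, SamplingParams

# Example open model; any Hugging Face model supported by vLLM can be used.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize why open models matter."], params)

for out in outputs:
    # Each result carries the prompt and one or more generated completions.
    print(out.outputs[0].text)
```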
[00:23:04] Tobias Macey:
And as far as that transition point from, I have moved from the proprietary model, I'm now using open models, I've got my RAG workflow in place, and everything is working, to then saying, okay, now I'm going to take the next step into actually fine tuning my own models or even progressing further to building your own foundation model. What are some of the heuristics that you're seeing organizations apply as far as whether and when they actually want to take that next step beyond just being consumers of the existing off the shelf models and into actually investing in their own capacity and capability to generate new models, either from base models or from whole cloth?
[00:23:47] Jamie De Guerre:
Yeah. That's a good question. I mean, I think first is kind of an organization level, strategic level heuristic, not a technical one, which is sort of how strategically important is our investment in generative AI. And for a lot of organizations, this is a huge strategic imperative. You know, an insurance company may view it as existential that they become the leader at leveraging generative AI most effectively to help automate the process of making the best predictions on whether or not to insure someone, or the cost to insure someone, or a location to insure in or not, and other things. Because if they don't become the best at that quickly, their competitors will, and they will be at a tremendous competitive disadvantage and sort of lose the market. You know, that's kind of stated in an extreme way, but I think that that is a reality for some of the impact that generative AI is gonna have on so many industries.
And so for this sort of first heuristic, that question becomes, is this critically important to our organization, that we strategically invest in generative AI and start to create the muscle to be great at leveraging generative AI amongst our internal organization? And if it is that strategically important, they want to get good at doing that process. They want to own the result of their investment in generative AI and control it. They want to be able to repeat that process quickly when new models come out. And often that then leads to wanting to fine tune and post train a model, so that they own the resulting model and control it, and so that they have the muscle, the team, the staff, etcetera, to be able to repeat that process. I think the second heuristic is a technical one, which comes down to what type of change in model behavior we are trying to create. If the change is solely kind of information based or knowledge based, like this model on its own doesn't know about our products or our internal process documentation, then RAG is very well suited, obviously. You can pull that information in, as long as that information quantity is small enough that it can fit into the context of a request and you have a good enough search capability to find the right documents. You can pull that in before the model actually processes the request and imbue that knowledge at request time. But if the amount of that knowledge goes beyond what you can put into context, or you're trying to teach the model a new reasoning technique, or get the model to understand broadly how your whole industry functions or something of that nature. Let's say it's weather prediction, or, you know, insurance prediction could be a good one. Like, that's not just, you need to have the weather data for the past two years in this location. You actually want it to understand the impact of weather changes on insurability.
That's kind of imbuing new reasoning capability. Or you're just trying to change how the model behaves in terms of how it communicates and what it outputs. Sometimes that can be done with simple prompt engineering, but sometimes post training does it more effectively. And so this heuristic of what type of change you're trying to create in the model behavior becomes the other main reason for when to decide whether or not to do post training.
[00:27:08] Tobias Macey:
Another major avenue of conversation that's happening in both the open and proprietary space, although obviously more transparently in the open models, is the type of architecture that's being employed for the building of these models, particularly when it comes to inference time efficiency, where the transformers paper was what catapulted us into the current generation of generative AI and is the predominant architecture of most of the models that are in this space right now. But there has been a lot of conversation in recent months about alternative architectures that are better for either training compute or being able to run on other types of hardware that doesn't necessarily lock you into the NVIDIA ecosystem, as well as the efficiencies at inference time, or the ability to have a smaller number of weights that produce a more outsized capability compared to similarly weighted transformer models.
And I'm curious how you're seeing that start to take shape in terms of which models are being built, which models are being used, particularly for cases where efficiency of compute and cost are paramount for a given use case.
[00:28:28] Jamie De Guerre:
Yeah. I love this question, and it gets me excited, because I think this is a core area of research contribution that Together dot ai has made. One of our founders, Chris Ré, leads a leading lab at Stanford called Hazy Research, and a lot of these new techniques have come out of his lab at Hazy Research and some of his team. And I think it's such early days. You know, the transformer has achieved incredible things with the ability to do these huge models at scale and achieve tremendous quality in terms of the accuracy that they're able to achieve on tasks.
But I think it is early days, and we're gonna see more and more efficient architectures for models over time that get used to improve the compute time characteristics at either training or inference while achieving those same quality levels. One of the areas of research has been around using a different type of model architecture called state space models for these generative AI models. State space models weren't traditionally used for generative AI models, but there are techniques to use them. You know, transformers have a quadratic performance characteristic. State space models have a sub quadratic, near linear performance characteristic, which is a dramatic difference in the compute that is needed to do things like attention. And so there's model techniques like Hyena and Mamba that use state space models to enable much better performance characteristics and much longer context, and, we've shown, they've been able to achieve roughly on par quality and accuracy in the end model output. And increasingly, some of the big models are starting to use these techniques. You don't have to necessarily use that for the entire model. You can use a bit of a hybrid architecture, where you have parts of what the model does and its behavior using transformers and parts that use a Hyena or Mamba or state space model technique. Another technique that's come up more recently is actually from one of the advisors of Together dot ai, who has now started a new startup for using diffusion based model architectures for large language models. So text to image generation models like Stable Diffusion and Black Forest Labs' Flux models and other things are typically using diffusion based models. But what this research has shown is you can use a diffusion based technique to create an LLM, and it achieves dramatically better performance characteristics. And so I don't know what the future model architecture will be, but I suspect it won't be a traditional transformer. And we'll have new architectures that enable us to train more efficiently and run these models at inference time more efficiently, which is really exciting, because it then further lowers the cost and makes this technology even more accessible.
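To illustrate the scaling difference mentioned here: self-attention compares every token with every other token, so its cost grows quadratically with sequence length, while a linear state space layer carries a fixed-size state through a single pass over the sequence. The toy recurrence below ignores the gating and discretization tricks that Hyena and Mamba actually rely on; it is only meant to show the shape of the computation.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy linear state space layer: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    One pass over the sequence, so the cost grows linearly with length,
    versus the all-pairs token comparison in self-attention.
    """
    seq_len, _ = x.shape
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(seq_len):
        h = A @ h + B @ x[t]   # update the fixed-size hidden state
        ys.append(C @ h)       # read out this step's output
    return np.stack(ys)

# Tiny example with random parameters (illustration only).
rng = np.random.default_rng(0)
d_state, d_in, d_out, seq_len = 16, 8, 8, 128
y = ssm_scan(rng.normal(size=(seq_len, d_in)),
             0.1 * rng.normal(size=(d_state, d_state)),
             rng.normal(size=(d_state, d_in)),
             rng.normal(size=(d_out, d_state)))
print(y.shape)  # (128, 8)
```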
[00:31:15] Tobias Macey:
And then moving on to the complexities of inference time compute, obviously that's your whole business, and you have a lot of domain knowledge around that. You also mentioned that as part of the evolution of organizations moving from proprietary to open models, they will typically try to run their own inference using something like vLLM or SGLang or one of the other inference engines that are available. And I'm curious, what are some of the sharp corners, some of the challenges that organizations run into trying to self host inference time compute, and some of the reasons that they decide to offload that onto a platform like Together.ai?
[00:31:57] Jamie De Guerre:
Yeah. This is kind of one of those things where there's, like, a really clear tactical tip of the iceberg that's above the water level that you can see and expect, and then there's this massive rest of the iceberg that is under the surface and a little bit more fuzzy to describe. But above the surface, I think that these models, particularly when it's a large model, are very costly to run. They require significant amounts of, you know, high end GPU compute, and that is quite expensive to run and operate. You can see that even from the start when you're starting to build one of these applications. The second thing that's above the water level is that load can be unpredictable for these, as in any Internet scale service. And there's lots of knowledge in dealing with, you know, dynamic load in the operations space of these Internet applications. But you go from needing to deal with varying load characteristics on a CPU application, where the CPUs are commodities and incredibly cheap to elastically scale, to doing so on a GPU level application, where you can't really get on demand GPUs added to your cluster because on demand is always sold out, and there's no elastic compute for GPUs typically in these clouds. And it's a lot more challenging now to deal with these load characteristics and how you plan for the peaks. If your peak is eight times your normal load, are you paying for eight times the compute 100% of the time, which is awful and tremendously inefficient and expensive? And if you don't have other jobs to be done on those GPUs, how do you get more efficient use of that compute? So these are really hard and challenging things, but they're kind of expected and predictable as you start to think about some of this and as you start to run into these issues. I think what makes it so much more difficult is the things that are a little less expected, which is the pace of change, fundamentally, because the techniques to run these models, the techniques to run them efficiently, are changing so rapidly.
You know, there's a new model coming out every month, and so you're operating one model in production, and then your team that's doing the post training and such says, oh, we've switched. We've switched from this, you know, 8 billion parameter model; now a 23 billion parameter model from another company, with a completely different model architecture, is achieving way better gains in the application. And the user engagement went through the roof. There's four times the user requests when we experiment with this other model. So now we wanna roll this out to production. It's like, oh my goodness. Well, this is much more expensive and difficult to run. It doesn't work with the inference engine that we had, because it doesn't support this new architecture that this 23 billion parameter model has. And you said that the engagement went up by four times with users, so you want me to have four times the capacity on this model that's, you know, three times the size? That just explodes the complexity and the challenge. And then other techniques come in, like disaggregated serving, where, you know, the two main phases of inference are prefill and decoding, where you're loading in the input of the request and mapping it to the weights of the model during prefill, and then you're generating the response during the decode phase. And traditionally, this was all done on, like, a single node. But today, that's done on separate clusters that you scale independently with different numbers of GPUs.
And developing the system to be able to do that is challenging from a development perspective. And then from an operational perspective, it brings even more complexity as well. And so this is all, like, everything I just mentioned probably could have happened in the last, like, six months maybe. And so it moves very, very, very quickly. And so it's not only being able to plan for operation at a point in time of technology. It's planning for operation at this dynamically moving pace of AI and the AI space. And I think that that is the, you know, the rest of the iceberg that's under the water that makes it so much more challenging than you would expect when you kinda go in eyes open thinking about the expected challenges.
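For readers unfamiliar with the prefill/decode split described above, the sketch below makes it visible with a small Hugging Face model: prefill is one compute-heavy pass over the whole prompt that builds the KV cache, and decode reuses that cache to emit one token at a time. The gpt2 model is only a stand-in; real disaggregated serving runs these two phases on separate GPU pools.

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # small example model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Open models are", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one pass over the whole prompt, producing the KV cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    # Decode: one token at a time, reusing and extending the cache.
    for _ in range(20):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```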
[00:36:14] Tobias Macey:
And given the experience of the Together AI team and the requirements around being able to fulfill those use cases and scale those operational characteristics across multiple different users, with an even higher degree of variability in terms of usage and model selection, what are some of the most significant technical and strategic investments and challenges that you've had to overcome, particularly in the past six to twelve months, given the rate of change that we're dealing with?
[00:36:45] Jamie De Guerre:
Yeah. And I think that's a great lead-in, because I do think that the biggest challenge is this pace of change. And I mentioned some of the aspects of that pace of change in terms of new models and new techniques for the inference engines. There's other parts of that as well, though, like new infrastructure, the new versions of GPUs coming out from NVIDIA and potentially others, with new architectures and different sizes of memory that, you know, require totally new optimization and techniques for how you operate, new storage infrastructure with, you know, GPUDirect Storage being available and other things, and new techniques for how to leverage that storage across an inference cluster. The networking is changing rapidly.
And so I think that, you know, for me, as I'm not the engineering leader or research leader, I'm on the product side, there's an organizational challenge of how to build and prepare and have process for how we are setting ourselves up to be able to move the fastest. One of the things we've built recently that is actually interesting is, you know, we try and build kernels that provide the fastest performance. So, you know, an attention kernel that can do the attention calculations faster than NVIDIA's kernel, for example. But we've really been building this into our organization to not only build something that's faster, but build the ability to build it faster. So we wanna be able to make new kernels very quickly. And an example of this recently was, NVIDIA's been working on Blackwell for at least a year or more.
And internally at NVIDIA, they've obviously had access to the plans for Blackwell, instruction sets, and whatever other things there might be. We got our first Blackwell chips, and within a week of access to those, we released a new kernel that was outperforming the main kernel from NVIDIA for the attention calculation, not by a huge margin, by, like, 2% or 3%, but it was done in a week. And this is because we've developed this harness, this platform for building new kernels quickly. And so I think a huge part of what we've focused on at Together AI is not just the ability to make things fast, but the ability to adjust and move to something new really quickly, in our human processes and our harnesses and techniques that we have for doing it. Our inference engine is the same. You know, the inference engine from some other organizations is written in native code in C and C++, and it ekes out the most performance in many respects. But when a new architecture comes out for a totally new model, it takes the organization, like, a month or more to support it. We need to have support for a new model in production, like, day one when it comes out, usually within hours of when it comes out. So we built our inference engine mostly in PyTorch and some components in Rust, and built it to be able to move to new architectures really, really quickly when they come out. We had DeepSeek R1 running in production in twenty four hours, for example. And so this is a key part of, I would say, the challenge that we've really had to focus on at Together AI.
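For a sense of what a "faster attention kernel" is measured against, here is a naive PyTorch attention next to the built-in fused scaled_dot_product_attention; a hand-written kernel of the kind described here would be validated and benchmarked against references like these. The shapes are arbitrary examples.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # softmax(QK^T / sqrt(d)) V, materializing the full seq x seq score
    # matrix -- exactly the memory traffic that fused kernels avoid.
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Arbitrary example shape: (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(1, 8, 1024, 64) for _ in range(3))

ref = naive_attention(q, k, v)
fused = F.scaled_dot_product_attention(q, k, v)  # dispatches to a fused kernel

# The two paths should agree numerically; speed is what custom kernels compete on.
print(torch.allclose(ref, fused, atol=1e-4))
```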
[00:39:52] Tobias Macey:
The model architecture question is another interesting one too, because you see from all of these various tool providers, oh, hey, here's my new release, and the release notes say, hey, we added support for Google Gemma, or, you know, hey, we added support for DeepSeek R1. And then also looking at some of the desktop tools, so LM Studio is one that I use fairly regularly, and it says, oh, well, here's the GGUF encoded model, or, you know, an Ollama model that needs to be a certain distillation or a certain quantization to be able to run. And I'm curious how you're also seeing that impact the complexity of model selection and runtime selection, particularly for nonproduction use cases of, hey, I just wanna be able to use a local model to be able to chat with my documents on my laptop, or I wanna be able to deploy a model to a Kubernetes environment so that I can use that as a private copilot or something like that.
What are some of the main aspects of that model architecture that teams should be aware of and ways that they should get that familiarity to be able to understand what are the distinctions when they're doing that model selection?
[00:41:02] Jamie De Guerre:
Yeah. I think this speaks to just all the layers of the complexity, that it's not as simple as, you know, is it a transformer or is it a different model architecture? Even within something like transformers, there's a huge variation in the techniques that can be used and the architecture that the model might be using. And then beyond just the core model architecture, there's techniques that are applied to the model, like quantization, that make it more efficient to run but might have an impact on the quality of the model and the accuracy that you receive. There's techniques like distillation and many others. And, you know, I think that today, keeping on top of all of those is pretty necessary to build these applications well at scale.
And doing that yourself, you know, as an engineer or an engineering leader or even a researcher, is very challenging. So starting to work with an organization that can help you and support you on that, an organization that makes that their full focus and purpose in life and has built teams around it, can be immensely helpful. I think also that's why, again, I really feel that a lot of the evolution we're gonna see in the next year or more will be in these platforms becoming more and more of a system that will reduce how much of that complexity you have to deal with as an individual developer or researcher, so that it supports these variations and automatically helps you to optimize within them without having to go quite as deep into understanding and building for each of them.
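As one concrete example of the quantized formats that came up in the question, a GGUF model can be run locally with llama-cpp-python along these lines. The file path is a placeholder for whatever quantized build you have downloaded; the quality trade-off depends on the quantization level chosen.

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Placeholder path to a 4-bit (Q4_K_M) GGUF build of an open model.
llm = Llama(model_path="./models/example-8b-instruct.Q4_K_M.gguf", n_ctx=4096)

result = llm("Q: What does quantization trade away?\nA:",
             max_tokens=128, stop=["Q:"])
print(result["choices"][0]["text"])
```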
[00:42:50] Tobias Macey:
And in your experience of building and growing the Together.ai platform and working with your community of customers and the open model community, what are some of the most interesting or innovative or unexpected ways that you've seen either open models or Together.ai or that combination applied?
[00:43:11] Jamie De Guerre:
Great question. Yeah. I think that's one of the things that's so exciting about working at this level of the stack: the applications are kind of endless with this technology. So we have companies that are building biotech models for new drug research on our platform. We have health sciences applications using existing models to analyze ultrasounds or x rays through post training and fine tuning, achieving these incredible accuracy rates that are helping hospitals and doctors get dramatic efficiency and quality gains in their work. We have companies building text to video models like Pika, where you get these incredible consumer applications and really fun, silly kind of videos out of a simple prompt really quickly that are really engaging. And then I think finally, one of the biggest surprises, which now feels not surprising at all, but, you know, two years ago we wouldn't have expected it, is coding. You know, these models have just been tremendous at coding.
We have multiple companies that are building integrated development environments for customers that use our platform for running a lot of the inference for these coding applications. And the productivity that we're getting to for our engineers and the amount that these models can take care of for coding is just tremendous. And so all of these have been, you know, surprising and shocking over the last two years.
[00:44:39] Tobias Macey:
And in your own personal experience of working in this space, trying to stay up to date with everything that's happening, understand what direction to take the product, and how to manage the messaging around it to your potential customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:44:59] Jamie De Guerre:
Yeah. Great question. I think that one that I would say is, you know, no change is durable. You can be so excited about one improvement that's gonna be made or one model that's gonna be released that's gonna be the next great thing. But you have to expect that it's going to be so quickly competed with by something that is going to be even better, even if it's from, like, an open community and not another actual company or whatever else. But just start to change your frame of mind: it's really not about building one solid point in time release that is going to, you know, last for a year or something until the next release.
I think it has to be much more of a living, constant evolution with tons of these improvements happening, and you have to be very fluid in adjusting to what's happening from the industry and others in this space, to be nimble and adjust to that in your roadmap and your plans. And I think this is the same for organizations adopting generative AI for their own use. Don't think of what you're building as, you're gonna build a model that is tuned to your needs and achieving the accuracy you want, and you're done for the next year once it's in production for your customers. Really think of it as you're building a muscle. You're building this ability to be on the latest, constantly evolving and iterating. And, you know, you should be getting to the point where new iterations of your model are coming up very regularly, whatever that regularity is. Maybe it's a month, maybe it's a week, maybe it's an hour eventually, on an automated harness that's just getting evaled. I think that that is much more the way that you have to think about this space, because it's still early days, and there's constant innovation and improvement coming from all sides.
[00:46:53] Tobias Macey:
And are there any other aspects of the work that you're doing at Together.ai, the overall space of open models, inference time compute, the challenges of building and fine tuning these foundation models, or just anything else about the work that you're doing that we didn't discuss yet that you'd like to cover before we close out the show?
[00:47:19] Jamie De Guerre:
You know, I think the last thing I would just say is we're tremendously excited for what's happening in the open community. And the pace at which open source or open weights models have caught up to the biggest and best closed models is much faster than I thought it would happen. And we founded our company on the thesis that this would happen, so I was one of the most optimistic people that thought this would happen, but it's much faster than I thought. It has shocked us how quickly new models have come out that are really achieving the same quality as the leading closed source labs. And, you know, one of those releases, and it's shocking, right, is the DeepSeek R1 release. It was a tremendous gain above and beyond what had ever been achieved in an open source model, achieving the same as o1 or o3 in many respects on a lot of the accuracy measures. And within three months, you know, two other organizations released models at the same level as open source models as well. And so you have this democratization that's happening of the ability to provide these models and leverage them in new ways in the open community, which I think is a great direction for AI, for us to better understand these systems and be able to leverage them more deeply and know how to invest in applications for them in our organizations.
And it has shocked me, and it's really exciting, how quickly it's moving.
[00:48:35] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the Together AI team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[00:48:55] Jamie De Guerre:
The biggest gaps. I think that the biggest gaps are, you know, it goes back to a lot of what we talked about today. But so much of the way we work today is still being thought about in terms of building for one model and optimizing for one model. And with the pace of improvements to models and new iterations of models being so rapid, I think that is a gap that's gonna need to shift, where better tooling comes out to help you be able to constantly evaluate different versions of models and automatically tune multiple models, and get help from the AI system with figuring out which combination of models achieves the best outcome for your application. I think that is the next sort of iteration that's needed to make building and developing these applications on AI much more robust and easier to manage.
[00:49:51] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Together.ai and your thoughts and experiences in the open model ecosystem. It's definitely a very interesting and, obviously, very fast moving space, so I appreciate you taking the time to share your thoughts and opinions and expertise and help all of us figure out a little bit more about what we should be doing. So thank you for that, and I hope you enjoy the rest of your day.
[00:50:16] Jamie De Guerre:
Absolutely, Tobias. Thank you so much. Really great questions, and I really enjoyed this. Thank you.
[00:50:25] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the AI Engineering podcast, your guide to the fast-moving world of building scalable and maintainable AI systems. Your host is Tobias Macey, and today I'm interviewing Jamie De Guerre about the role of open models in the AI economy and how to operate them at speed and at scale. So, Jamie, can you start by introducing yourself?
[00:00:29] Jamie De Guerre:
Sure. Hi, Tobias. Thank you so much for having me on the podcast today. My name is Jamie De Guerre. I was founding SVP of product at Together.ai, where I've been for about two and a half years now. I joined right around the founding of the company. And prior to Together, I spent ten years working in startups in sort of similar roles, leading product management, marketing, and field technical services. The second startup I was at was acquired by Apple, and I spent nine years at Apple, first leading the program management team for search inside of the engineering organization, and then leading product marketing for AI and machine learning across Apple for about five years before I came to Together.
[00:01:10] Tobias Macey:
And do you remember how you first got started working in the ML and AI space and why you've spent so much of your career there?
[00:01:17] Jamie De Guerre:
Yeah. You know, I first started working on machine learning in earnest when we worked on search. We'd done some experimentation with that before then at Cloudmark when we were doing email and SMS security applications. We were using some machine learning techniques in sort of our research, but we didn't actually use them much in production. But then at Topsy, we built a social search platform, and we used machine learning quite extensively for things like search relevance. And after Topsy was acquired by Apple, that continued and expanded. We used machine learning for lots of different parts of the search stack on the back end. And the search technology quickly expanded to be used across other applications at Apple, like helping with Siri and improving the understanding of your requests in Siri. And as I continued to work at Apple across AI and machine learning in other areas, I worked on features like predicting which app you wanted next on your iPhone based on your routines, and many others.
[00:02:21] Tobias Macey:
So bringing us now to your current position, your current focus, I'm wondering if you can just give a bit of an overview about what it is that you're building at Together.ai and some of the story behind how it got started and the problems that you're trying to solve there.
[00:02:35] Jamie De Guerre:
Yeah. Absolutely. So at Together.ai, we're really providing a full AI cloud stack, and we do that with a focus on performance and efficiency to help customers achieve the best cost performance, whether they're building a model, customizing a model through fine tuning or post training, or running a model for their production applications for inference. And so we call that the AI acceleration cloud because of our focus on speed and efficiency. And we do this for developers. We now have over 550,000 AI developers using our platform for these types of applications.
We also do this for foundation model labs. So not the, you know, OpenAI and Anthropic tier, but sort of a next tier down from that. We service a lot of leading foundation model labs, whether it be Salesforce Research, companies like Mistral, and many others. And we're starting to also now do this for large enterprises. As they reach scale with running generative AI in production, our performance and efficiency becomes a huge advantage. And so we have customers like Zoom, which uses our platform to train models that analyze conversations in a Zoom call and enable an AI assistant to chat about the history of those conversations, and companies like Salesforce, which powers large parts of Agentforce using our platform. And as for how we got started, we were started in June of twenty twenty two.
This was after GPT-3 had launched, but before ChatGPT had come out. And we have a very deep research heritage in our founding, and we found that a lot of the AI community was really exasperated with the fact that there was finally massive improvement happening in the quality of AI models, but suddenly it was being done in a closed way. It wasn't done in the open with sharing the techniques. OpenAI wasn't publishing how they had built GPT-3. And this was really frustrating for the AI community because everyone had always, over the last fifty years of AI improvements, done so out in the open so that researchers could improve on past research. And the sort of feeling in a lot of the tech world, and the message, was that, you know, we would have one big AI model, and it would become artificial general intelligence. Only one model would ever do that, and that was the end. You know, that was the way it was gonna be. And we just thought this was such a fallacy. We thought that there would be many models from many organizations.
And importantly, if we could help to spur the open source movement around AI, a lot of those would be open source. And when you look at major technology platforms over the last, you know, twenty years, open source becomes a huge part of those. You know, the Internet runs on open source. All of the servers run on open source Linux, on open source web servers, on open source databases, and other things. And we thought that it would be really important for this new technology movement of AI to also have foundations in openness so that we can understand how these models behave, how to build applications that deeply understand the technology under them and build for that appropriately, and, you know, so that society as well understands that. So that was what we felt the future should be, and we thought we could have a small part in, like, helping to shape that and spur that as a movement. And if you imagine that future where there's lots of great models from lots of sources, and many of them are open source, to some extent the models become a little bit more of a commodity. Obviously, you know, great closed source models will still have huge success, but organizations will have the option to choose great open source models as well. And if the models are becoming a little bit more of a commodity, where does the value accrue? And we think that because it is so much more expensive to build and run a generative AI application compared to building and running a database and web server application, a lot of that value will actually accrue to the infrastructure.
And so if we can be the best at making it more efficient to use the infrastructure and providing faster performance out of the same infrastructure, then a lot of that value could potentially accrue to us. And so that's really the kernel of how we started TogetherAI with this focus on providing an AI acceleration cloud that provides the best cost and performance for just about any model. And we now have, you know, over 200 leading open source models available through the platform.
[00:07:06] Tobias Macey:
On that note of open source and openness in the model ecosystem, there's been a lot of debate about what that means semantically, what it means from a legal and licensing perspective. And before we get too far into that, there's also the element of who is producing these open models, where the initial releases of open models were largely a reaction to the OpenAIs and Anthropics, where they have these frontier models that are proprietary and closed. You can only access them via the API. There's no real insight into how they're built, how they're operating, and you just have to rely on their infrastructure if you want to build on top of it, which carries with it a lot of platform risk. And so the initial set of models that were released as open source, again, delaying a little bit on the semantics of that argument, those were mostly coming from the big tech companies. So the Llama models from Meta were some of the initial ones. There were the Mistral models that came out as a response there as well. And then also in recent months, Google has been releasing some fairly popular open models. And I'm wondering, from that initial phase of two years ago when we were really starting to get these open models released and starting to gain some adoption, how has the ecosystem shifted, and what do you see as the distribution of who is producing those models compared to which ones are actually being used and leveraged?
[00:08:40] Jamie De Guerre:
Yeah. No. That's a great question. You know, I think what is so fascinating about what's happening today in the AI industry is there's innovation at many different levels. And traditionally, like, the way we thought about things, the way that the messaging came out of these big labs, was that all of the value of creating this intelligence came from massive pre training. And so pre training is really taking, like, all of the Internet's data or all of the private data in the world that you can collect and building a very, very large model, a large base model, from nothing, starting from random weights. And it is expensive. You know, the messaging from a lot of the big labs was that, you know, it costs billions of dollars to build one of these models.
And to some extent, you know, that's true of this pre training process. It really is very expensive. Probably not billions of dollars today, but it's certainly not something that just anyone can do. You need to have really large AI compute factories to be doing that kind of training. But then a lot of gains started to happen from post training techniques. Taking an already built pre trained model, a base model, and fine tuning it, doing reinforcement learning from human feedback, reinforcement learning from verifiable outcomes, which is a newer technique, and lots of other techniques in the whole post training process to modify this big base model to do new things or to improve it in some way. And this achieved tremendous gains. It wasn't what the huge organizations wanted to focus on telling you is the way they're achieving those gains, because it's actually very, very inexpensive and easy to repeat and iterate in lots of different ways. And then more recently, we're starting to use techniques like inference time compute or test time compute to get these models to do more reasoning and sort of think while they're outputting an answer. And this, again, doesn't really require large training processes to get a model to start to behave that way. And so I think what you've seen in the open source community is that there are a number of very well funded, quite large organizations that are investing in building really, really high quality and, you know, large pretrained models. This includes Meta with the Llama models. This includes large organizations like Google and Microsoft.
It also includes not-for-profit organizations that are well funded and are helping to do this, and increasingly a number of startups, startups like Mistral, which was, you know, releasing state of the art models with a team of 30 people through real innovation at doing this really effectively. And so they create these large pre trained models just like the big labs do. But what's different is once that pre trained model gets released, the open community rapidly improves them through other techniques. And so you get literally thousands of different versions of the Llama models created by the open community and shared openly through platforms like Hugging Face.
And each one starts to be better in different ways, and each one enables the next one to learn from its learnings. And this has enabled both higher quality in new ways and new techniques that have enabled organizations to do this more efficiently and easily, which has been a huge advancement that comes from that open source movement and that the big labs are now copying, things like LoRA fine tunes and other things.
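As a rough illustration of what a technique like LoRA looks like in practice, here is a minimal PyTorch sketch: the pretrained weight matrix stays frozen, and only a small low-rank pair of matrices is trained and added on top. The layer sizes, ranks, and class name are illustrative assumptions, not taken from any particular library or from the models discussed here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the low-rank correction learned during fine tuning.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


# Usage: wrap a projection layer and train only the LoRA parameters.
layer = LoRALinear(nn.Linear(4096, 4096))
trainable = [p for p in layer.parameters() if p.requires_grad]  # just lora_A and lora_B
```

Because only the two small matrices are updated, the fine tune is cheap to train and to share, which is part of why these community variants proliferate so quickly.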
[00:12:08] Tobias Macey:
Delaying again the question about the semantics of open and open source, I think it's worth digging into your point of the proliferation of these different fine tuned models and distillations. The number of models has been growing faster and faster, which also leads to a lot of confusion and challenge as far as which model do I use for problem X, and the challenge of figuring out, based on what you're trying to achieve, how do I even start to sift through these different models and understand the implications of how they're going to perform on my particular problem set versus, oh, there are too many options, I'm just going to go with one of the big model providers and use that, or maybe I start with the big model and do my own distillation, which then compounds the problem further. And I'm curious how you're seeing people address that challenge of the paradox of choice that we're faced with now.
[00:13:01] Jamie De Guerre:
Yeah. I think that that is super challenging, and it speaks to kind of, you know, the biggest challenge in general working in this space, which is just the pace of change. Things are moving so quickly with new models, new techniques, that it's very, very difficult to keep up and know exactly which techniques to use or which models to use or other things. And I think that this is one of the key areas we're gonna see improve in the coming couple of years, maybe even the coming several months, which is that increasingly the providers are gonna provide an AI system, an AI platform, instead of just a model. You already are seeing this today with OpenAI, for example.
Their latest model releases are models that are an AI system built to use data that they pull in through RAG, built to use tools as part of the tool chain of every request and, you know, compile code and look at the output of the code or access a mathematics calculator or other things. And I think that this trend will continue and will start to move to a point where you abstract away which model gets used. The AI system will actually choose the model for the task, both at a request level and at the level of helping an organization. Let's say, for example, that you are an insurance company, and you've got a model that is intended to interact with customers around insurance claims, and you've done some work to fine tune one model. Let's say you started with Google's Gemma model. You fine tuned that. That model will get used in production, but it will get monitored and evaluated in an automatic way on an ongoing basis. And as new models come out, let's say Meta just released the Maverick Llama 4 model, that fine tuning data will automatically be applied to the Maverick model. You'll have a new candidate model that is now available in this production AI system. A subset of requests will be sent to the new fine tuned model in sort of a shadow mode and evaluated. And then if it's achieving higher accuracy, either through prompting the administrator or developer running this system to approve starting to use it based on the evals, or eventually automatically, the AI system will actually switch the models. And so you'll start to have this system that is constantly creating and evaluating new candidate models for you based on your fine tuning data, based on your production traffic, based on new models that are coming out from the community, and just greenlighting the one that's best for the task, acting as a total platform and system as opposed to a specific model.
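To make that shadow-mode idea a bit more concrete, here is a minimal, hypothetical sketch of the routing layer such a system might use: production traffic is always answered by the current model, a small fraction of requests is mirrored to a candidate, both outputs are scored by an evaluation callback, and the candidate is promoted once it clearly wins. The class name, thresholds, and the `evaluate` callback are all illustrative assumptions, not any vendor's API.

```python
import random

class ShadowRouter:
    """Serve production traffic while evaluating a candidate model in shadow mode."""

    def __init__(self, prod_model, candidate_model, shadow_rate=0.05, promote_margin=0.02):
        self.prod = prod_model
        self.candidate = candidate_model
        self.shadow_rate = shadow_rate          # fraction of requests mirrored to the candidate
        self.promote_margin = promote_margin    # how much better the candidate must score
        self.scores = {"prod": [], "candidate": []}

    def handle(self, request, evaluate):
        response = self.prod.generate(request)            # the user always gets the production answer
        if self.candidate and random.random() < self.shadow_rate:
            shadow = self.candidate.generate(request)     # mirrored call, never shown to the user
            self.scores["prod"].append(evaluate(request, response))
            self.scores["candidate"].append(evaluate(request, shadow))
            self._maybe_promote()
        return response

    def _maybe_promote(self):
        if len(self.scores["candidate"]) < 200:           # wait for enough shadow samples
            return
        avg = lambda xs: sum(xs) / len(xs)
        if avg(self.scores["candidate"]) > avg(self.scores["prod"]) + self.promote_margin:
            self.prod, self.candidate = self.candidate, None   # green-light the new model
```

In a real platform the promotion step would usually go through the approval flow described above rather than flipping automatically, but the control loop has the same shape.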
[00:15:33] Tobias Macey:
And now exploring that complexity of the terminology being used around these open models, as far as the first set of them were very freely called open source. And then there was a lot of pushback saying, no, it's not actually open source because all we have is the model and the weights. And maybe we have the code, maybe we don't. For most of them, you don't have the data. And so now there's been enough iteration that there is actually an OSI approved definition of open source in the context of these AI systems. And, again, there's a variety of levels of compliance with that as far as the models that are out there. I'm wondering how you're seeing the current language used around the idea of openness and the generalized understanding of what those different terms mean in the context of what types of rights and freedoms are being granted to the users of those models.
[00:16:26] Jamie De Guerre:
Yeah. I think it's a great point. I think that there's a lot of variety here, and, you know, this is not new to the open source community. The open source community has grappled with this since the very beginning in terms of, you know, different licenses, whether it was, you know, the MIT license or an Apache license or licenses that were not as legally permissive for different use cases. And there's a new layer of complexity with this when it comes to the AI models, as you're pointing out, because the model could be open weights, meaning that the weights are available for anyone to use with a certain license attached to it. And you can download them and you can use them on your own system. But the source code of how the model was created might not be available. And so those are sort of referred to as open weights models, but not open source, technically speaking. The other aspect is, like you mentioned, the data that the model is trained on. You know, a fully open source model release should really include something that you can reproduce.
And to reproduce it, you would need the code that was used to train it. You would need the resulting weights that were outputted, but you'd also need the data that was used to train the model. And so that's sort of an open data model. I think it's helpful that we create these distinctions. I think that it is helpful that different organizations do this at different levels. You know, some organizations do all three, which really helps the research community learn even more from the release of what's being put out. But I don't think that we should think about this in terms of, you know, if an organization doesn't release the source and does not release the data, then it's not a truly good open release and it's not helpful. That's definitely not the case. Just having an open weights model is tremendously helpful to the community, and it enables so much of an ecosystem to occur around that. Meta's Llama models are just open weights models. Their source code is not released. The training data is not released. But they still enable a tremendous ecosystem to build on top of those models and use those models. And so I think that all of these levels are valuable, and it's great for there to be a variety from the open community.
[00:18:34] Tobias Macey:
From the perspective of organizations who are investing in the adoption and use of AI for their business systems, whether that's internal or customer facing, what do you see as the typical phases of evolution? From the initial set of, hey, I've played with OpenAI, this seems great. Oh, wait, I don't wanna take on the platform risk of some proprietary company deciding that they wanna yank out some feature from under me, so now I'm gonna go to these open models. Up through to, hey, I've built this self evolving AI system. It's doing what you were saying before as far as the automatic fine tuning, automatic candidate generation, automatic shadowing. Obviously, there are a lot of steps and a lot of layers of complexity on that path.
And then even eventually to the point of, hey, I've got enough of my own source data, I'm going to build my own foundation model from scratch. And I'm wondering how you're seeing organizations tackle that evolution of complexity and sophistication on that path.
[00:19:34] Jamie De Guerre:
I think that there's kind of two vectors of this. One is on achieving model quality, and the other is on dealing with operational scale and performance and hosting. On the achieving model quality side, I think what you mentioned is very typical, where an organization will start with a basic prototype using a model like OpenAI's or Anthropic's Claude. Very easy to get started. These models tend to be of the highest quality out of the box, and you can quickly prototype to see if the application makes sense and if AI can be leveraged for the task.
Once you see that it can be, a lot of these organizations run into challenges. One of the challenges is, as you mentioned, being beholden to a single vendor and single platform, and it could change under you. Another is the cost. You know, if you're starting with the biggest model, it's often the highest cost. Another is the performance and scalability and reliability. It's one thing to prototype for, you know, a thousand employees. It's another thing to roll it out to millions of consumers. And so all of these reasons often lead them to saying, okay, I want to invest in tuning an open source model to my application and be able to operate this with more control and ownership at higher scale and lower cost. And so they switch to open source models. And I think on the accuracy vector, that usually starts just with prompt engineering, in a similar way to how they probably started with the closed source model. Then the next phase is typically adding data to the prompt through RAG, retrieval augmented generation, so that you can imbue knowledge from your internal enterprise into the context that the model uses to respond to a request.
And in many cases, that is all you need to kind of get a successful application. In many cases, organizations go further to do a combination of RAG and post training or fine tuning of the model, which can help to adjust the behavior of the model or also imbue some knowledge into the model at the time of the training. And we usually see, for the most advanced deployments, that using a combination of both is the most successful. I think that the next stage we'll start to see more and more is sort of a regular pipeline for that constant improvement. Today, that's kind of done through human methodologies of, like, research teams having a set of sprints where they retrain and experiment. But increasingly, we'll see the AI platforms help customers do that more and more easily. I think, really quickly, the other vector is around the operational performance and scalability. And I think often we see organizations, when they first start to switch to open source, they'll self host on maybe their existing cloud provider, like AWS EC2 instances or something along those lines, using an open source inference system like, say, vLLM or SGLang, and operating it on their own. And this is a great way to get started using these open source models. But as they start to deploy into production, that's a typical time when they come to us and say, like, this was tougher to operate at scale than we expected. There's lots of issues that are challenging that we haven't developed approaches for yet, and the overall cost is really, really high. And by switching to a platform like Together.ai, we take care of all that operational burden for them and are usually able to give them better efficiency, where we reduce the amount of GPU compute needed by half or a third or sometimes even more, reducing that cost and improving that total cost of ownership.
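For a sense of what that early self-hosting step looks like, here is a minimal sketch of serving an open-weights model with vLLM's offline API. The model name is only an example (and may require its own access terms), and this assumes a machine with a suitable GPU; the operational work described above, capacity planning, scaling, and keeping up with new architectures, sits well beyond a snippet like this.

```python
# pip install vllm  -- assumes a machine with a supported GPU
from vllm import LLM, SamplingParams

# Example open-weights checkpoint; swap in whichever model you are evaluating.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarize our refund policy for a customer."], params)
print(outputs[0].outputs[0].text)
```

vLLM also ships an OpenAI-compatible HTTP server for the same models, which is the mode most teams use once they move past a single notebook or script.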
[00:23:04] Tobias Macey:
And as far as that transition point, from I have moved from the proprietary model, I'm now using open models, I've got my RAG workflow in place, and everything is working, to then saying, okay, now I'm going to take the next step into actually fine tuning my own models, or even progressing further to building your own foundation model. What are some of the heuristics that you're seeing organizations apply as far as whether and when they actually want to take that next step beyond just being consumers of the existing off the shelf models and into actually investing in their own capacity and capability to generate new models, either from base models or from whole cloth?
[00:23:47] Jamie De Guerre:
Yeah. That's a good question. I mean, I think the first is kind of an organization-level, strategic-level heuristic, not a technical one, which is sort of, how strategically important is our investment in generative AI? And for a lot of organizations, this is a huge strategic imperative. You know, an insurance company may view it as existential that they become the leader at leveraging generative AI most effectively to help automate the process of making the best predictions on whether or not to insure someone, or the cost to insure someone, or a location to insure in or not, and other things. Because if they don't become the best at that quickly, their competitors will, and they will be at a tremendous competitive disadvantage and sort of lose the market. You know, that's kind of stated in an extreme way, but I think that that is a reality for some of the impact that generative AI is gonna have on so many industries.
And so for this sort of first heuristic, that question becomes, like, is this critically important to our organization, that we strategically invest in generative AI and start to create the muscle to be great at leveraging generative AI amongst our internal organization? And if it is that strategically important, they want to get good at doing that process. They want to own the result of their investment in generative AI and control it. They want to be able to repeat that process quickly when new models come out. And often that then leads to wanting to fine tune and post train a model, so that they own the resulting model and control it, and so that they have the muscle, the team, the staff, etcetera, to be able to repeat that process. I think the second heuristic is a technical one, which comes down to what type of change in model behavior we are trying to create. If the change is solely kind of information based or knowledge based, like this model on its own doesn't know about our products or our internal process documentation, then RAG is very well suited, obviously. You can pull that information in, as long as that information quantity is small enough that it can fit into the context of a request and you have a good enough search capability to find the right documents. You can pull that in before the model actually processes the request and imbue that knowledge at request time. But if the amount of that knowledge goes beyond sort of what you can put into context, or you're trying to teach the model a new reasoning technique, or get the model to kind of understand broadly how your whole industry functions or something of that nature. Let's say it's weather prediction, or, you know, insurance prediction could be a good one. Like, that's not just that you need to have the weather data for the past two years in this location. You actually want it to understand the impact of weather changes on insurability.
That's kind of imbuing new reasoning capability. Or maybe you're just trying to change how the model behaves in terms of how it communicates and how it outputs. Sometimes that can be done with simple prompt engineering, but sometimes post training does it more effectively. And so this sort of heuristic of what type of change you're trying to create in the model behavior becomes the other main reason for deciding whether or not to do post training.
[00:27:08] Tobias Macey:
Another major avenue of conversation that's happening in both the open and proprietary space, although obviously more transparently in the open models, is the type of architecture that's being employed for the building of these models, particularly when it comes to inference time efficiency, where the transformers paper was what catapulted us into the current generation of generative AI, and transformers are the predominant architecture of most of the models that are in this space right now. But there has been a lot of conversation in recent months about alternative architectures that are better for either training compute, or being able to run on other types of hardware that doesn't necessarily lock you into the NVIDIA ecosystem, as well as efficiencies at inference time or the ability to have a smaller number of weights that produce a more outsized capability compared to similarly weighted transformer models.
And I'm curious how you're seeing that start to take shape in terms of which models are being built, which models are being used, particularly for cases where efficiency of compute and cost are paramount for a given use case.
[00:28:28] Jamie De Guerre:
Yeah. I love this question, and it gets me excited because I think this is a core area of research contribution that Together.ai has made. One of our founders, Chris Ré, leads a leading lab at Stanford called Hazy Research. And a lot of these new techniques have come out of his lab at Hazy Research and some of his team. And I think it's such early days. You know, the transformer has achieved incredible things with the ability to do these huge models at scale and achieve tremendous quality in terms of the accuracy that they're able to achieve on tasks.
But I think it is early days, and we're gonna see more and more efficient architectures for models over time that get used to improve the compute time characteristics at either training or inference while achieving those same quality levels. One of the areas of research has been around using a different type of model architecture called state space models for these generative AI models. State space models weren't traditionally used for generative AI, but there are techniques to use a state space model instead. Transformers have a quadratic performance characteristic; state space models have a sub-quadratic, near linear performance characteristic, which is a dramatic difference in the compute that is needed to do things like attention. And so there are model techniques like Hyena and Mamba that use state space models to enable much better performance characteristics and much larger long context, and, we've shown, have been able to achieve roughly on-par quality and accuracy in the end model output. And increasingly, some of the big models are starting to use these techniques. You don't have to necessarily use that for the entire model. You can use a bit of a hybrid architecture where parts of what the model does use transformers and parts use a Hyena or Mamba or state space model technique. Another technique that's come up more recently is actually from one of the advisors of Together.ai, who has now started a new startup for using diffusion based model architectures for large language models. So text to image generation models, like Stable Diffusion and Black Forest Labs' Flux models and other things, typically use diffusion based models. But what this research has shown is you can use a diffusion based technique to create an LLM, and it achieves dramatically better performance characteristics. And so I don't know what the future model architecture will be, but I suspect it won't be a traditional transformer. And we'll have new architectures that enable us to train more efficiently and run these models at inference time more efficiently, which is really exciting because it then further lowers the cost and makes this technology even more accessible.
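To give a rough sense of why state space models change the cost profile, here is a toy sketch of the linear recurrence they are built on: each step updates a fixed-size hidden state, so generating a sequence costs roughly linear time in its length, versus the quadratic cost of full self-attention. The dimensions and matrices are purely illustrative and are not the Hyena or Mamba formulation.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy discrete state space model: h_t = A h_{t-1} + B x_t,  y_t = C h_t.

    The hidden state h has a fixed size, so the cost grows linearly with the
    sequence length, unlike full self-attention, which grows quadratically.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # one fixed-cost update per token
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

# Illustrative sizes: a scalar input sequence and a 16-dimensional hidden state.
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(16, 16))
B = rng.normal(size=16)
C = rng.normal(size=16)
y = ssm_scan(rng.normal(size=1024), A, B, C)
```

Production SSM layers add careful parameterization and hardware-aware scan implementations, but the fixed-size state is the reason the long-context cost curve looks so different from attention.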
[00:31:15] Tobias Macey:
And then moving on to the complexities of inference time compute. Obviously, that's your whole business, and you have a lot of domain knowledge around that. You also mentioned that, as part of the evolution of organizations moving from proprietary to open models, they will typically try to run their own inference using something like vLLM or SGLang or one of the other inference engines that are available. And I'm curious, what are some of the sharp corners, some of the challenges that organizations run into trying to self host inference time compute, and some of the reasons that they decide to offload that onto a platform like Together.ai?
[00:31:57] Jamie De Guerre:
Yeah. This is kinda one of those things where there's, like, a really clear tactical tip of the iceberg that's sort of above the water level that you can see and expect. And then there's this massive rest of the iceberg that is kind of under the surface and a little bit more fuzzy to describe. But above the surface, I think that these models, particularly when it's a large model, are very costly to run. They require significant amounts of, you know, high end GPU compute, and that is quite expensive to run and operate. You can see that even from the start when you're starting to build one of these applications. The second thing that's sort of above the water level is that load can be unpredictable, as in any Internet service, any Internet-scale service. And there's lots of knowledge in dealing with, you know, dynamic load in the operations space of these Internet applications. But I think that you go from needing to deal with varying load characteristics on a CPU application, where the CPUs are commodities and incredibly cheap to elastically scale, to doing so on a GPU level application, where you can't really get on demand GPUs added to your cluster because on demand is always sold out. And there's no elastic compute for GPUs typically in these clouds. And it's a lot more challenging to then deal with these load characteristics, how you plan for the peaks. And if your peak is eight times your normal load, are you paying for eight times the compute a hundred percent of the time, which is awful and tremendously inefficient and expensive? And so if you don't have other jobs to be done on those GPUs, how do you get more efficient use of that compute? So these are kind of, like, really hard and challenging things, but they're kind of expected and predictable as you start to think about some of this and as you start to run into these issues. I think what makes it so much more difficult is the things that are a little less expected, which is the pace of change, fundamentally, because the techniques to run these models, the techniques to run them efficiently, are changing so rapidly.
You know, there's a new model coming out every month, and so you're operating one model in production, and then your team that's doing the post training and such says, oh, we've switched. We've switched from this, you know, 8 billion parameter model; now a 23 billion parameter model from another company with a completely different model architecture is achieving way better gains in the application. And the user engagement went through the roof. There's four times the user requests when we experiment with this other model. So now we wanna roll this out to production. It's like, oh my goodness. Well, this is much more expensive and difficult to run. It doesn't work with the inference engine that we had yet because it doesn't support this new architecture that this 23 billion parameter model has. And you said that the engagement went up by four times with users, so you want me to have four times the capacity on this model that's, you know, three times the size? That just explodes the complexity and the challenge. And then other techniques come in, like disaggregated serving, where, you know, the two main phases of inference are prefill and decoding, where you're loading in the input of the request and mapping it to the weights of the model during prefill, and then you're generating the response during the decode phase. And, traditionally, this was all done on, like, a single node. But today, that's done on separate clusters that you scale independently with different numbers of GPUs.
And developing the system to be able to do that is challenging from a development perspective. And then from an operational perspective, it brings even more complexity as well. And so this is all, like, everything I just mentioned probably could have happened in the last, like, six months maybe. And so it moves very, very, very quickly. And so it's not only being able to plan for operation at a point in time of technology. It's planning for operation at this dynamically moving pace of AI and the AI space. And I think that that is the, you know, the rest of the iceberg that's under the water that makes it so much more challenging than you would expect when you kinda go in eyes open thinking about the expected challenges.
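A very simplified sketch of that prefill/decode split is below: prefill processes the whole prompt once and builds a cache, and decode then generates one token at a time against that cache. The toy model, function names, and "cache" here are all illustrative stand-ins; in a disaggregated deployment the two stages run on separately scaled GPU pools and hand the real KV cache off between them.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ToyModel:
    """Stand-in for a real LLM; here the 'KV cache' is just the token history."""
    vocab_size: int = 50

    def forward_prompt(self, prompt: List[int]) -> Tuple[List[int], List[int]]:
        # Prefill: process the whole prompt in one pass and build the cache.
        cache = list(prompt)
        logits = [(sum(cache) + v) % 7 for v in range(self.vocab_size)]
        return cache, logits

    def forward_one(self, token: int, cache: List[int]) -> Tuple[List[int], List[int]]:
        # Decode: extend the cache by one token and score the next token.
        cache = cache + [token]
        logits = [(sum(cache) + v) % 7 for v in range(self.vocab_size)]
        return cache, logits

def prefill(model: ToyModel, prompt: List[int]):
    return model.forward_prompt(prompt)

def decode(model: ToyModel, cache, logits, max_new_tokens: int) -> List[int]:
    generated = []
    for _ in range(max_new_tokens):
        token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        generated.append(token)
        cache, logits = model.forward_one(token, cache)
    return generated

# Prefill and decode could run on independently sized pools, with the cache
# transferred between them once the prompt has been processed.
model = ToyModel()
cache, logits = prefill(model, [3, 1, 4])
print(decode(model, cache, logits, max_new_tokens=5))
```

The operational point is that prefill is compute-heavy and decode is latency- and memory-bound, so scaling them separately lets each pool be sized for its own bottleneck.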
[00:36:14] Tobias Macey:
And with the experience of the Together AI team and the requirements around being able to fulfill those use cases and scale those operational characteristics across multiple different users, with an even higher degree of variability in terms of usage and model selection, what are some of the most significant technical and strategic investments and challenges that you've had to overcome, particularly in the past six to twelve months, given the rate of change that we're dealing with?
[00:36:45] Jamie De Guerre:
Yeah. And I think that that's a great lead-in, because I do think that the biggest challenge is this sort of pace of change. And I mentioned some of the aspects of that pace of change in terms of new models and new techniques for the inference engines. There's other parts of that as well, though, like new infrastructure: the new versions of GPUs coming out from NVIDIA and potentially others, with new architectures and different sizes of memory that, you know, require totally new optimization and techniques for how you operate, new storage infrastructure with, you know, GPU direct storage being available and other things, and new techniques for how to leverage that storage across an inference cluster. The networking is changing rapidly.
And so I think that, you know, for me, as I'm not the engineering leader or research leader, I'm on the product side, there's an organizational challenge to how to build and prepare and have process for how we are setting ourselves up to be able to move the fastest. One of the things we've built recently that's actually interesting is, you know, we try and build kernels that provide the fastest performance. So, you know, an attention kernel that can do the attention calculations faster than NVIDIA's kernel, for example. But we've really been building this into our organization to not only build something that's faster, but build the ability to build it faster. So we wanna be able to make new kernels very quickly. And an example of this recently was NVIDIA's been working on Blackwell for at least a year or more.
And internally at NVIDIA, they've obviously had access to the plans for Blackwell instruction sets and whatever other things there might be. We got our first Blackwell chips, and within a week of access to those, we released a new kernel that was outperforming the main kernel from NVIDIA for the attention calculation, not by a huge margin, by, like, 2% or 3%, but it was done in a week. And this is because we've developed this harness, this platform, for building new kernels quickly. And so I think a huge part of what we've focused on at Together AI is not just the ability to make things fast, but the ability to adjust and move to something new really quickly, in our human processes and in the harnesses and techniques that we have for doing it. Our inference engine is the same. You know, the inference engine from some other organizations is written in native code in C and C++, and it ekes out the most performance in many respects. But when a new architecture comes out for a totally new model, it takes the organization, like, a month or more to support it. We need to have support for a new model in production, like, day one when it comes out, usually within hours of when it comes out. So we built our inference engine mostly in PyTorch, with some components in Rust, and built it to be able to move to new architectures really, really quickly when they come out. We had DeepSeek R1 running in production in twenty four hours, for example. And so this is a key part of, I would say, the challenge that we've really had to focus on at Together AI.
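For readers less familiar with what an "attention kernel" buys you, the sketch below contrasts attention composed from separate PyTorch ops, which materializes large intermediate tensors, with the fused scaled_dot_product_attention call, which dispatches to an optimized kernel under the hood. It is only meant to show where kernel-level work slots in, not how any particular company's kernels are implemented.

```python
import math
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    """Attention built from separate ops: the full score matrix is materialized."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 1024, 64)   # (batch, heads, seq_len, head_dim)

out_naive = naive_attention(q, k, v)
out_fused = F.scaled_dot_product_attention(q, k, v)   # dispatches to a fused kernel

print(torch.allclose(out_naive, out_fused, atol=1e-4))  # same math, different execution
```

Writing a faster kernel means replacing that fused call with hand-tuned GPU code for a specific chip, which is why having a harness to produce new kernels quickly matters every time the hardware generation changes.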
[00:39:52] Tobias Macey:
The model architecture question is another interesting one too, because you see from all of these various tool providers, oh, hey, here's my new release, and the release notes are, hey, we added support for Google Gemma, or, you know, hey, we added support for DeepSeek R1. And then also looking at some of the desktop tools. So LM Studio is one that I use fairly regularly, and it says, oh, well, here's the GGUF-encoded model, or, you know, an Ollama model that needs to be a certain distillation or a certain quantization to be able to run. And I'm curious how you're also seeing that impact the complexity of model selection and runtime selection, particularly for nonproduction use cases of, hey, I just wanna be able to use a local model to be able to chat with my documents on my laptop, or I wanna be able to deploy a model to a Kubernetes environment so that I can use that as a private copilot or something like that.
What are some of the main aspects of that model architecture that teams should be aware of and ways that they should get that familiarity to be able to understand what are the distinctions when they're doing that model selection?
[00:41:02] Jamie De Guerre:
Yeah. I think this speaks to, like, just all the layers of the complexity, that it's not as simple as, you know, is it a transformer or is it a different model architecture? Even within something like transformers, there's a huge variation in the techniques that can be used and the architecture that the model might be using. And then beyond just the core model architecture, there are techniques that are applied to the model, like quantization, that make it more efficient to run but might have an impact on the quality of the model and the accuracy that you receive. There are techniques like distillation and many others. And, you know, I think that today, keeping on top of all of those is pretty necessary to build these applications well at scale.
And doing that yourself, you know, as an engineer or an engineering leader or even a researcher is very challenging. So starting to work with an organization that can help you and support you on that, an organization that makes that their full focus and purpose and has built teams around it, can be immensely helpful. I think also that's why, again, I really feel that a lot of the evolution we're gonna see in the next year or more will be in these platforms becoming more and more of a system that will reduce how much of that complexity you have to deal with as an individual developer or researcher, so that it supports these variations and automatically helps you to optimize within them without having to go quite as deep into understanding and building for each of them.
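Picking up the quantization point from a moment ago: here is a toy sketch of symmetric int8 weight quantization, just to show why it shrinks memory but can shave accuracy. Real serving stacks use far more careful schemes (per-channel scales, activation-aware methods, formats like GGUF's block quantization, and so on); the sizes below are arbitrary.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: store int8 weights plus one float scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(1024, 1024)).astype(np.float32)
q, scale = quantize_int8(w)

print("memory ratio:", q.nbytes / w.nbytes)                       # ~0.25: int8 vs float32
print("mean abs error:", np.abs(dequantize(q, scale) - w).mean()) # the accuracy cost
```

The trade-off is exactly the one described above: a quarter of the memory and faster memory-bound inference, paid for with a small, workload-dependent loss in fidelity that has to be measured with your own evals.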
[00:42:50] Tobias Macey:
And in your experience of building and growing the Together.ai platform and working with your community of customers and the open model community, what are some of the most interesting or innovative or unexpected ways that you've seen either open models or Together.ai or that combination applied?
[00:43:11] Jamie De Guerre:
Great question. Yeah. I think that's one of those things that's so exciting about working at this level of the stack: the applications are kind of endless with this technology. So we have companies that are building biotech models for new drug research on our platform. We have health sciences applications using existing models to analyze ultrasounds or x-rays through post training and fine tuning and achieving these incredible accuracy rates that are helping hospitals and doctors get dramatic efficiency and quality gains in their work. We have companies building text to video models, like Pika, where you get these incredible consumer applications and really fun, silly kinds of videos out of a simple prompt really quickly that are really engaging. And then I think finally, one of the biggest surprises, and now it feels not surprising at all, but, you know, two years ago we wouldn't have expected it, is coding. You know, these models have just been tremendous at coding.
We have multiple companies that are building integrated development environments for customers that use our platform for running a lot of the inference for these coding applications. And the productivity that we're getting to for our engineers and the amount that these models can take care of for coding is just tremendous. And so all of these have been, you know, surprising and shocking over the last two years.
[00:44:39] Tobias Macey:
And in your own personal experience of working in this space, trying to stay up to date with everything that's happening, understand what direction to take the product, and how to manage the messaging around it to your potential customers, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:44:59] Jamie De Guerre:
Yeah. Great question. I think that one that I would say is, you know, no change is durable. You can be so excited about one improvement that's gonna be made or one model that's gonna be released that's gonna be the next great thing. But you have to expect that it's going to be so quickly competed with by something that is going to be even better, even if it's from, like, the open community and not an actual other company or whatever else. But just start to change your frame of mind that it's really not about building one solid point in time release that is going to, you know, last for a year or something until the next release.
I think that it has to be much more of a living, constant evolution with tons of these improvements happening, and you have to be very fluid in adjusting to what's happening from the industry and others in this space, to be nimble and adjust to that in your roadmap and your plans. And I think this is the same for organizations adopting generative AI for their own use. Don't think of what you're building as, you're gonna build a model that is tuned to your needs and achieving the accuracy you want, and you're done for the next year and it's in production for your customers. Really think of it as you're building a muscle. You're building this ability to be on the latest, constantly evolving and iterating. And, you know, you should be getting to the point where new iterations of your model are coming out very regularly, whatever that regularity is. Maybe it's a month, maybe it's a week, maybe it's an hour eventually, on an automated harness that's just getting evaled. I think that that is much more the way that you have to think about this space, because it's still early days, and there's constant innovation and improvement coming from all sides.
[00:46:53] Tobias Macey:
And are there any other aspects of the work that you're doing at Together.ai, the overall space of open models, inference time compute, the challenges of building and fine tuning these foundation models, or just anything else about the work that you're doing that we didn't discuss yet that you'd like to cover before we close out the show?
[00:47:19] Jamie De Guerre:
You know, I think the last thing I would just say is we're tremendously excited for what's happening in the open community. And the pace at which open source or open weights models have caught up to the biggest and best closed models is much faster than I thought it would happen. And we founded our company on the thesis that this would happen. So I was one of the most optimistic people that thought this would happen, but it's much faster than I thought. It has shocked us how quickly new models have come out that are really achieving the same quality as the leading closed source labs. And, you know, one of those releases was shocking, right? The DeepSeek R1 release. It was a tremendous gain above and beyond what had ever been achieved in an open source model, achieving the same as o1 or o3 in many respects on a lot of the accuracy measures. And within three months, you know, two other organizations released models at the same level as open source models as well. And so you have this democratization that's happening of the ability to provide these models and leverage them in new ways in the open community, and I think that's a great direction for AI, for us to better understand these systems and be able to leverage them more deeply and know how to invest in applications for them in our organizations.
It has shocked me, and it's really exciting how quickly it's moving.
[00:48:35] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the Together AI team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[00:48:55] Jamie De Guerre:
The biggest gaps. I think that the biggest gaps, you know, go back to a lot of what we talked about today. So much of the way we work today is still being thought about in terms of building for one model and optimizing for one model. And with the pace of improvements to models and new iterations of models being so rapid, I think that that is a gap that's gonna need to shift, where better tooling comes out to help you constantly evaluate different versions of models, automatically tune multiple models, and get help from the AI system with figuring out which combination of models achieves the best outcome for your application. I think that that is the next sort of iteration that's needed to make building and developing these applications on AI much more robust and easier to manage.
[00:49:51] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Together.ai and your thoughts and experiences in the open model ecosystem. It's definitely a very interesting and, obviously, very fast moving space, so I appreciate you taking the time to share your thoughts and opinions and expertise and help all of us figure out a little bit more about what we should be doing. So thank you for that, and I hope you enjoy the rest of your day.
[00:50:16] Jamie De Guerre:
Absolutely, Tobias. Thank you so much. Really great questions, and I really enjoyed this.
[00:50:24] Tobias Macey:
Thank you.
[00:50:25] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to AI Engineering Podcast
Interview with Jamie De Guerre
Overview of Together.ai's Mission
The Importance of Open Models in AI
Evolution of Open Models Ecosystem
Challenges in Model Selection
Understanding Open Source in AI
Strategic Importance of Generative AI
Heuristics for Model Development
Exploring Model Architectures
Inference Time Compute Complexities
Technical Challenges and Innovations
Model Architecture and Selection
Innovative Applications of Open Models
Future Directions and Closing Remarks