In this episode of the AI Engineering Podcast Vinay Kumar, founder and CEO of Arya.ai and head of Lexsi Labs, talks about practical strategies for understanding and steering AI systems. He discusses the differences between interpretability and explainability, and why post-hoc methods can be misleading. Vinay shares his approach to tracing relevance through deep networks and LLMs using DL Backtrace, and how interpretability is evolving from an audit tool into a lever for alignment, enabling targeted pruning, fine-tuning, unlearning, and model compression. The conversation covers setting concrete alignment metrics, the gaps in current enterprise practices for complex models, and tailoring explainability artifacts for different stakeholders. Vinay also previews his team's "AlignTune" effort for neuron-level model editing and discusses emerging trends in AI risk, multi-modal complexity, and automated safety agents. He explores when and why teams should invest in interpretability and alignment, how to operationalize findings without overcomplicating evaluation, and the best practices for private, safer LLM endpoints in enterprises, aiming to make advanced AI not just accurate but also acceptable, auditable, and scalable.
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- When ML teams try to run complex workflows through traditional orchestration tools, they hit walls. Cash App discovered this with their fraud detection models - they needed flexible compute, isolated environments, and seamless data exchange between workflows, but their existing tools couldn't deliver. That's why Cash App relies on Prefect. Now their ML workflows run on whatever infrastructure each model needs across Google Cloud, AWS, and Databricks. Custom packages stay isolated. Model outputs flow seamlessly between workflows. Companies like Whoop and 1Password also trust Prefect for their critical workflows. But Prefect didn't stop there. They just launched FastMCP - production-ready infrastructure for AI tools. You get Prefect's orchestration plus instant OAuth, serverless scaling, and blazing-fast Python execution. Deploy your AI tools once, connect to Claude, Cursor, or any MCP client. No more building auth flows or managing servers. Prefect orchestrates your ML pipeline. FastMCP handles your AI tool infrastructure. See what Prefect and FastMCP can do for your AI workflows at aiengineeringpodcast.com/prefect today.
- Unlock the full potential of your AI workloads with a seamless and composable data infrastructure. Bruin is an open source framework that streamlines integration from the command line, allowing you to focus on what matters most - building intelligent systems. Write Python code for your business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. With native support for ML/AI workloads, Bruin empowers data teams to deliver faster, more reliable, and scalable AI solutions. Harness Bruin's connectors for hundreds of platforms, including popular machine learning frameworks like TensorFlow and PyTorch. Build end-to-end AI workflows that integrate seamlessly with your existing tech stack. Join the ranks of forward-thinking organizations that are revolutionizing their data engineering with Bruin. Get started today at aiengineeringpodcast.com/bruin, and for dbt Cloud customers, enjoy a $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Vinay Kumar about strategies and tactics for gaining insights into the decisions of your AI systems
Interview
- Introduction
- How did you get involved in machine learning?
- Can you start by giving a quick overview of what explainability means in the context of ML/AI?
- What are the predominant methods used to gain insight into the internal workings of ML/AI models?
- How does the size and modality of a model influence the technique and evaluation of methods used?
- What are the contexts in which a team would incorporate explainability into their workflow?
- How might explainability be used in a live system to provide guardrails or efficiency/accuracy improvements?
- What are the aspects of model alignment and explainability that are most challenging to implement?
- What are the supporting systems that are necessary to be able to effectively operationalize the collection and analysis of model reliability and alignment?
- "Trust", "Reliability", and "Alignment" are all words that seem obvious until you try to define them concretely. What are the ways that teams work through the creation of metrics and evaluation suites to gauge compliance with those goals?
- What are the most interesting, innovative, or unexpected ways that you have seen explainability methods used in AI systems?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on explainability/reliability at AryaXAI?
- When is evaluation of explainability overkill?
- What do you have planned for the future of AryaXAI and explainable AI?
Contact Info
Parting Question
- From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
- Lexsi Labs
- Arya.ai
- Deep Learning
- AlexNet
- DL Backtrace
- Gradient Boost
- SAE == Sparse AutoEncoder
- Shapley Values
- LRP == Layerwise Relevance Propagation
- IG == Integrated Gradients
- Circuit Discovery
- F1 Score
- LLM As A Judge
The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
Hello, and welcome to the AI Engineering podcast, your guide to the fast moving world of building scalable and maintainable AI systems. When ML teams try to run complex workflows through traditional orchestration tools, they hit walls. Cash App discovered this with their fraud detection models. They needed flexible compute, isolated environments, and seamless data exchange between workflows, but their existing tools couldn't deliver. That's why Cash App relies on Prefect. Now their ML workflows run on whatever infrastructure each model needs across Google Cloud, AWS, and Databricks. Custom packages stay isolated.
Model outputs flow seamlessly between workflows. Companies like Whoop and 1Password also trust Prefect for their critical workflows, but Prefect didn't stop there. They just launched FastMCP, production ready infrastructure for AI tools. You get Prefect's orchestration plus instant OAuth, serverless scaling, and blazing fast Python execution. Deploy your AI tools once. Connect to Claude, Cursor, or any MCP client. No more building auth flows or managing servers. Prefect orchestrates your ML pipeline. FastMCP handles your AI tool infrastructure.
See what Prefect and FastMCP can do for your AI workflows at aiengineeringpodcast.com/prefect today. Unlock the full potential of your AI workloads with a seamless and composable data infrastructure. Bruin is an open source framework that streamlines integration from the command line, allowing you to focus on what matters most, building intelligent systems. Write Python code for your business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. With native support for ML and AI workloads, Bruin empowers data teams to deliver faster, more reliable, and scalable AI solutions.
Harness Bruin's connectors for hundreds of platforms, including popular machine learning frameworks like TensorFlow and PyTorch. Build end to end AI workflows that integrate seamlessly with your existing tech stack. Join the ranks of forward thinking organizations that are revolutionizing their data engineering with Bruin. Get started today at aiengineeringpodcast.com/bruin. And for dbt Cloud customers, enjoy a $1,000 credit to migrate to Bruin Cloud. Your host is Tobias Macey, and today I'm interviewing Vinay Kumar about strategies and tactics for gaining insights into the decisions of your AI systems. So, Vinay, can you start by introducing yourself?
[00:02:34] Vinay Kumar:
Hey, Tobias. This is Vinay Kumar, as you introduced. Pleasure being here. I'm the founder and CEO of Arya.ai. We were one of the first deep learning startups, started in 2013. Currently, we have established a frontier AI lab called Lexsi Labs, which is in Paris, London, and Mumbai. We do a lot of work around AI interpretability and alignment, so we're looking forward to discussing more in this conversation.
[00:03:00] Tobias Macey:
And do you remember how you first got started working in the ML and AI space?
[00:03:04] Vinay Kumar:
Oh, yeah. So this is almost twelve years back now. I started off in 2013, and it was actually an extension of my research at that time. My problem statement back then was to build an AI system around continuous manufacturing, so I was exploring classic machine learning at that point. 2013 was the very initial phase of AI and deep learning, and I stumbled upon the deep learning papers of that time, like the AlexNet paper, which was getting a lot of buzz at that moment. That got us excited about getting started with AI as part of my academic thesis.
Later on, I was very excited about what this could expand into. So that's how I started Arya.ai along with my co-founder. Initially the problem statement was to build an intelligent assistant for researchers. We thought search was dumb at that time for researchers. If you remember, we were using things like Scopus or Google Scholar, which were simply keyword search; they weren't doing anything smart at that point. And we knew there are so many papers you have to go through before you can finalize a topic or the literature you would have to cover. I thought it would be really interesting to have some kind of AI assistant that had read all the papers and could become a kind of professor or assistant to me and help me do my research. So that's how we started, in the extremely early phases of deep learning and AI. It's been twelve years since then.
I only saw that problem getting solved about three years back, with ChatGPT, I guess. So time does evolve. Yeah.
[00:04:46] Tobias Macey:
And so digging now into some of the focus on the interpretability and analysis of the decision making process of these models, before we get too deep in the weeds there, I'm just wondering if you can give your definition of the term explainability and even more relevant these days, alignment in that context.
[00:05:07] Vinay Kumar:
Got it. Again, this is primarily from our learnings, which we're now very clear about. This was about four or five years back. We had been deploying complex AI models, in some cases using deep learning, for very mission-critical problem statements, like insurance underwriting and banking, for example. We had a discussion with the head of business at that time, probably in 2019 or 2020, where we had used deep learning to build an underwriting model, and we were showing that, you know, the model works really fantastically. Then there were a bunch of questions, for obvious reasons. One, they have to justify to the regulator how the system is making these decisions.
And then, also, for themselves, to get confidence in how exactly the model is able to predict. This is when we were using off-the-shelf, readily available methods like SHAP and LIME at that time. But we realized very quickly that none of these methods are good, nor accurate, nor true to the model. Meaning SHAP and LIME are essentially approximations that try to predict how the model is functioning. But, again, that is very much dependent on the hyperparameters, like what your baseline was and the number of iterations, which means you're not necessarily explaining the model. You are simply building some fake parallel model to try to give some kind of explainability.
It became very clear for us at that time. We were like, okay, this looks like a problem. Meaning, I'm confident I can build very complex systems, but if I cannot get them accepted by the end user, then there is no point. Right? I can build more and more complex systems which are more accurate, but if they're not interpretable, they will never scale, never reach their full potential. So we got started on that problem. At that time, alignment was not a big issue because the models were less complex and more well-defined. This was way back in 2020. In 2021, we came up with a technique called DL Backtrace, deep learning backtrace. It's a new method to interpret deep learning models. We filed a patent in 2021; it was not released in public at that time. What it does is trace the relevance of a prediction back through the entire network to your input. It doesn't require any baseline or any retraining for a specific model, and it works for any kind of deep learning architecture, including LLMs as well. So it was a good innovation and creation at that time.
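DL Backtrace itself is only described at a high level here (and was patented rather than published at the time), so as a rough illustration of what "tracing relevance back through the network to the input" means, here is a minimal sketch of an LRP-style epsilon rule on a tiny NumPy MLP. The network shapes and the epsilon rule are illustrative assumptions, not the DL Backtrace algorithm.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def lrp_epsilon(weights, biases, x, eps=1e-6):
    """Illustrative epsilon-rule relevance propagation for a small ReLU MLP.

    weights[i]: (in_dim, out_dim) array, biases[i]: (out_dim,) array.
    Returns one relevance score per input feature for the top predicted class.
    """
    # Forward pass, caching each layer's post-activation values.
    activations = [x]
    a = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b
        a = z if i == len(weights) - 1 else relu(z)   # no ReLU on the output logits
        activations.append(a)

    # Seed relevance with the winning output neuron only.
    relevance = np.zeros_like(activations[-1])
    top = int(np.argmax(activations[-1]))
    relevance[top] = activations[-1][top]

    # Redistribute relevance layer by layer, back to the input features.
    for W, b, a_prev in zip(reversed(weights), reversed(biases), reversed(activations[:-1])):
        z = a_prev @ W + b + eps        # stabilized pre-activations of this layer
        s = relevance / z               # relevance per unit of pre-activation
        relevance = a_prev * (s @ W.T)  # share flowing back to the previous layer
    return relevance

# Toy network: 4 input features -> 8 hidden units -> 3 classes.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 3))]
biases = [np.zeros(8), np.zeros(3)]
print(lrp_epsilon(weights, biases, rng.normal(size=4)))
```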
But later on, as models got bigger, larger, massive, going back to our thesis and goal: we want to build systems which are interpretable, acceptable, and safe to scale, because we work in very mission-critical, highly regulated industries. As these models got complicated and complex, we realized that guardrails or risk policies are never going to scale, or rather it's very hard to impose them on a model. Currently, if I want to make a system safe, people do things outside the model, like applying some kind of guardrail, some kind of rule or risk policy, for example. But there are now enough examples to show that these can be jailbroken, these can be tweaked, and this isn't scalable at the moment. So this is where things get really complex. We realized very early on, in 2021, 2022, that the models are getting complex, and if we have to scale them to very large volumes, interpretability is one part. The second is alignment. Alignment, defined very broadly, is the ability to make sure the model stays within the business requirements or the user requirements. It could be societal alignment, bias alignment, risk or safety alignment, any kind of alignment factor. So how do I align these models? That's pretty much what made us realize that these are the two problems that are very hard and quite important to solve if we have to make AI accepted and scalable to all sets of users, not just to a limited number of use cases, which is the current problem. I may have a very complex system, but if I don't solve these two things, it will still remain in a lab, still remain limited to a very small set of use cases. It will simply become a very underutilized capability or technology. So that's pretty much how this became the main theme for our lab: to get to, let's say, superintelligence or safe intelligence, we want to make sure these systems are interpretable and aligned when we get there.
[00:09:27] Tobias Macey:
And so you dug a little bit into the DL Backtrace library. You mentioned Shapley values. I'm wondering if you could talk a little bit more about the predominant methods that teams rely on for being able to gain insight into the ways that their models are operating, particularly as we move into a realm where these foundation models are the majority stake for interaction with AI capabilities. How does that change the ways that we think about what the model is doing and the types of measurements that we need to do to ensure that we're getting the desired outcome from those models?
[00:10:18] Vinay Kumar:
Interpretability versus explainability: interpretability is technically from a model point of view, and explainability is more from a user point of view. Meaning, how you explain a model to, let's say, a business user versus a regulator is a different story. So it all depends on who the user is, or at least it all starts, very contextually, with who the user is and what interpretability or explainability means to them. In fact, we released a paper a few months back arguing that the ability to standardize explainability and explainability evaluations is paramount to achieving AI governance. Or else what happens is people use these gaps in the model and in the technology to pass off their agenda and their point of view as the model's point of view. Meaning, the model could be biased and I can hide it today. Even if you use things like SHAP or LIME, I can do a scaffolding attack and show that the model is not biased. Right? And, likewise, the model could be racist.
The model could be doing very abnormal things, but I can hide it, because I am the one who is giving you the explainability and there is no way to benchmark it. It's very much hypothetical to the user. So these are exactly the problems we have seen with the current methods. Now, what are these methods? It's very much dependent on the modeling technique. For example, if you're using classic machine learning, let's say gradient boosting methods, they are intrinsically explainable to some extent, because I can look at the tree and explore the tree for each prediction, or at a global level for the entire model. So for classic machine learning systems, there are methods to do it. But again, if you use things like SHAP or LIME, people have realized they can be exploited; they're not good. And there are other methods like surrogate methods and contrastive explanations. A contrastive explanation doesn't actually explain the model; it only explains the boundary conditions of the model. So you have contrastive explanations, and then you have prototypes as explanations.
Again, that tries to get into how the model is working, but it does it in terms of grouped, categorized data, to tell you that this category of data is what made the model predict this outcome. And then you have more nuanced ones like Grad-CAM and integrated gradients, for example, when it comes to deep learning models, where you track or trace the activations or gradients and say, okay, this neuron was activated, and so on, which you can map back to the input to say these features are responsible for the classification. So the interpretability methods are very much dependent on the modeling technique. Let's narrow that to deep learning, because deep learning is now the common theme, even in LLMs. So I could use, let's say, SHAP or LIME even on a deep learning model, or I could use IG, Grad-CAM, and similar gradient-based methods. But again, the problem is how scalable they are when the model grows in size and complexity. So let's say I'm using something like an LLM: what method can I use to interpret it, and what am I interpreting in that case? People typically use activation maps to try to explain how the model is functioning for an LLM, because activation maps are generated during inference anyway, so it's easy to plot them and show that these are the layers or neurons which activated more when this prediction happened. But the problem is that this doesn't really tell you anything. These neurons were activated, but it doesn't say whether they're positively or negatively influencing the output. There is no contextuality. You just know that these neurons were activated. So, again, this is not enough to fully interpret the model's functioning. Then came SAEs, sparse autoencoders, which is something promoted and established by Anthropic's interpretability team, where you train a surrogate autoencoder model on these activation values. But you have to train the autoencoder separately, which means if your model is bigger, you will generate a humongous amount of data and you need a large amount of compute to train the SAE. Once trained, though, they are able to interpret how the model is functioning, which is what mechanistic interpretability is about: figuring out which neuron is working in what direction. They were able to use that information for various things, like model steering, for example.
Meaning, if I know which neurons are inducing this behavior, can I use that knowledge to steer the model outcome so that I get the desired behavior inside the model? So they tried to use SAEs, but the problem is scalability. And then there is circuit discovery as well, which has similar problems of scalability. So there are different methods, but all of them have some kind of gap in terms of scalability, truthfulness, or what they can actually do to explain the model. Nothing is entirely right, nothing is entirely wrong, except that a few things are absolute bull and, you know, gaslighting anyway. But some advanced methods, like SAEs, circuit discovery, and relevance-based methods like what we introduced, or LRP, for example, which is very close to us, are nearly doing the job well enough. Now the question is scalability, and whether a method is model-agnostic or model-specific, which is where it gets really interesting. Those are the areas where we do a lot of research and a ton of work: how do we scale it, what is required to scale it up, how computationally intensive is it, can it run in real time, meaning explainability in real time alongside inference, or is it going to take time, and how long would it take, those kinds of things.
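As a concrete picture of the sparse-autoencoder approach described above, here is a minimal, hypothetical sketch: an overcomplete autoencoder trained on a dump of cached activations with an L1 sparsity penalty on its hidden code. The layer width, expansion factor, penalty weight, and training loop are placeholders, not anyone's published recipe.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder whose sparse hidden units act as candidate "features"."""
    def __init__(self, d_act: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_act)

    def forward(self, acts):
        code = torch.relu(self.encoder(acts))
        return self.decoder(code), code

# Hypothetical setup: activations cached from one layer of an LLM's residual stream.
d_act, d_hidden, l1_weight = 768, 8 * 768, 1e-3
sae = SparseAutoencoder(d_act, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

cached_activations = torch.randn(4096, d_act)    # stand-in for a real activation dump

for step in range(100):                          # a real run needs far more data and steps
    batch = cached_activations[torch.randint(0, 4096, (256,))]
    recon, code = sae(batch)
    # Reconstruction loss plus sparsity penalty on the hidden code.
    loss = nn.functional.mse_loss(recon, batch) + l1_weight * code.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```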
[00:15:58] Tobias Macey:
And so the utility of explainability is obviously that it helps you understand more about whether the model is doing what you think it should or whether it is working in the way that you designed it to. And I'm wondering if you can talk to some of the stages of the overall model life cycle and serving life cycle where you would incorporate these workloads of doing the generation and analysis of these metrics to understand what is the model doing and how.
[00:16:28] Vinay Kumar:
Got it. So interpretability started off from a direction where we want to understand the model. For example, to begin with, let's say, okay, fine, I build a model, and the model works really well on absolute accuracy or performance metrics. But is that actually what is happening when the model is functioning? There is this classic example of classifying wolves versus dogs, I think, in one use case, where the model was performing really well. But later, when they applied some interpretability methods, they realized the model had learned not the wolf as the wolf class, but the features around the wolf, like snow, because most of the images labeled as wolf had snow as a common background. What the model did was: if there is snow and a bit of dog features in the image, it's a wolf. But it was nowhere categorizing and localizing what a wolf actually is. This is probably the best example to make a user understand why interpretability is important. So let's say you're using it for something very sensitive, like radiology image classification. When you feed it a lot of CT scan images, there could be common features across your classes, be it cancer versus non-cancer, or benign versus non-benign tumors. There could be a lot of features which are part of your benign class but not actually related to the tumor. When a human looks at it, they'll probably look at the lump size, the radius of the lump, how close it is to the tissue, how it is growing with the tissue, those kinds of factors. But when you are training a model, it has zero knowledge of how to approach that problem. You are building something purely based on your labeled examples. If you do not have interpretability as part of it, you may be making the model learn things which are actually not part of your problem, but somehow you are relying on it. Later on, that can create a huge number of false positives, which could be really problematic, very high-risk, high-value false positives that are disastrous in nature. So when it comes to mission-critical use cases, interpretability should be used as an audit method to ensure the models are really learning the things you want them to. The second thing we realized is that, if you have the right interpretability method, you can validate your model's conformity and confidence very well. I'll give you an example. Let's say you are doing the same image classification. You have fed in certain features, and what you realize is that the model, even though it is performing well, its confidence scores or conformity are not good enough. When you actually do the interpretability on it, and your interpretability values are quite low, that means the model has not properly learned anything; it simply learned a few things on the fly, and you are getting accuracy because of common distribution values. So you can also use interpretability as a method to understand whether the model is actually learning the necessary things and learning them properly or not.
These are the two things that have been happening for the past five years, if not for the past decade, since the evolution of deep learning. For the past five years, people have started giving importance to interpretability when they're using models in cases like this. But now interpretability is evolving into a critical utility in the alignment piece, which is the next evolution we believe in: a direction to achieve alignment. Today, the classic alignment approach is fine-tuning. You do SFT or some other kind of fine-tuning to try to align the model with the behavior you want. But we know there are many things that go wrong when you do fine-tuning. So the way our research currently stands, interpretability-led alignment asks: how can I use the information I can get about the model's functioning to achieve the alignment metrics I want? For example, let's say I want less bias in the model, and I don't want to fine-tune it. What we were able to test, and it seems to be working, is using interpretability in LLMs to figure out which neurons or which parameters are inducing or causing the model to have more biased outcomes, and then you can simply prune them. For example, say I want to remove gender bias in an LLM today. One way is to do some kind of fine-tuning: okay, here are my classes and so on. But you know that this often affects your overall model performance negatively as well. So one of the experiments we did used DLB, which is our interpretability technique, plus LRP as benchmarking criteria, and then IG. We used those three interpretability methods to figure out which layer and which part of the network was causing the bias, and then we pruned them, and we were able to see that the bias was reduced. The neurons that were responsible for causing the bias can be removed by simply pruning them out. If you do linear pruning, there can be negative effects; this is where you may have to do a bit of unstructured pruning. There is now a lot of new research happening around unstructured pruning, where you can identify the specific area responsible for a behavior and simply prune it. That's one example.
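A minimal sketch of the "identify the responsible units and prune them" step described above, under the assumption that you already have one relevance score per hidden unit (from DL Backtrace, LRP, IG, or whatever attribution method you trust) computed over a bias-probing dataset. The layer, the scores, and the top-k threshold here are purely illustrative.

```python
import torch
import torch.nn as nn

def prune_units_by_relevance(layer: nn.Linear, relevance: torch.Tensor, top_k: int):
    """Zero out the output units of `layer` that carry the most relevance for an
    unwanted behavior (relevance is assumed to come from an attribution pass)."""
    to_prune = torch.topk(relevance, top_k).indices
    with torch.no_grad():
        layer.weight[to_prune, :] = 0.0    # each row corresponds to one output unit
        if layer.bias is not None:
            layer.bias[to_prune] = 0.0
    return to_prune

# Hypothetical example: one MLP projection inside a larger model.
mlp = nn.Linear(1024, 4096)
bias_relevance = torch.rand(4096)          # stand-in for real per-unit attribution scores
pruned = prune_units_by_relevance(mlp, bias_relevance, top_k=64)
print(f"pruned {len(pruned)} units")
```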
And then on the fine-tuning side, the utility of interpretability has gotten to a stage where I can do something called safety-neuron-preserved fine-tuning. When you build and train your base model, you have aligned it a lot: you went through many cycles of fine-tuning and so on. Let's say your focus was the security of the model. Now, when you do any kind of SFT on top of it, it is known that most of your base-aligned factors will be lost, because you're updating the weights across the neurons in a uniform way. Instead, what we realized is that if you identify which neurons or which parts of the layers are important for the safety side of the network, you can actually freeze those weights, meaning do not alter them, and update only the remaining weights as required. You will still achieve near-similar performance, but you will preserve all the safety factors of the base model. You can extend that to a lot of new possibilities. There are a lot of new directions we have been exploring and experimenting with, and we have seen a lot of positives; the industry is also moving in that direction. So the utility of interpretability is not just about understanding what's happening inside the model. It can go beyond that: to audit the model, to validate its performance, or, soon enough, to align the entire model behavior as well.
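And a rough sketch of the safety-neuron-preserved fine-tuning idea: freeze the coordinates that a prior attribution pass flagged as safety-critical and let SFT update only the rest. The masks, the toy model, and the gradient-hook mechanism used for freezing are illustrative assumptions, not the lab's actual implementation.

```python
import torch
import torch.nn as nn

def freeze_safety_coordinates(model: nn.Module, safety_masks: dict):
    """safety_masks maps parameter names to boolean tensors (True = flagged as
    safety-critical by a prior attribution pass). Gradients at those coordinates
    are zeroed on every backward pass, so fine-tuning leaves them untouched."""
    for name, param in model.named_parameters():
        if name in safety_masks:
            mask = safety_masks[name].to(param.device)
            # The hook runs on the gradient; only non-safety coordinates stay trainable.
            param.register_hook(lambda grad, m=mask: grad * (~m))
    return model

# Hypothetical tiny model and mask (in practice: an LLM plus real attribution output).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
masks = {"0.weight": torch.rand(32, 16) > 0.9}   # ~10% of this layer marked safety-critical
freeze_safety_coordinates(model, masks)

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()   # the masked coordinates keep their original values
```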
[00:23:09] Tobias Macey:
And when you are running a system that depends on a machine learning model or one of these generative AI models, what are some of the ways that you can use that alignment and explainability metric to be able to act as a feedback loop to the actual system to be able to ensure that you are getting the desired performance or that the model is behaving in the way that you want it to be and not having any issues of bias or false statements or inaccuracies?
[00:23:42] Vinay Kumar:
Currently, we have seen better success with this as a post-hoc method, not as a training method. What I mean by that is, once the model is done, I can use it to audit, to prune, to validate. But I may not be able to use it as a feedback signal or as a loss function in my system. We tried that at the beginning of this year. We thought, can we build a new kind of loss function where I'm using interpretability as a metric? Can I come up with a new loss function I can use to train the model? It goes back to our thesis: if the model learned properly, then the relevance values, the interpretability values, will be really stable, or rather well distributed. Can I use that to create some kind of loss function when I'm training the model, to say, okay, you're learning the right things, can I reward you to learn more of the right things? That was one experiment we did, but we realized there is a problem in that construct. Even though we are generating these relevance values, sometimes there's a lot of noise rather than specific signal, so we couldn't quite make these loss functions focus on exactly the area we wanted. That was one problem we faced. More recently, and you will probably hear from our lab on this research soon, we have been trying the same thing in RL as well. For example, in GRPO, or whenever you are doing RL-based fine-tuning, the thought process was: can I make the model focus on the important part of the prompt, not the entirety of the prompt? We realized this can add certain value. But it's still very early to say whether I can expand this into the training process. As a post-hoc method, though, this is fantastic. So this is now a new arena where I can use interpretability to align. We have also done an experiment where we are able to do unlearning using circuit discovery; that's a new paper we are going to publish soon. I'm talking about a lot of these ideas; some of them have been published by other reference labs, but most of this is driven by what we do at our labs. So post-hoc, it's a fantastic new approach beyond simple fine-tuning. As part of training, it's still early; maybe we will get there. For RL, it could get there quicker. But a lot of it needs to be tested at scale to say whether it really works or not.
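The auxiliary-loss experiment described above (which they found too noisy in practice) might look roughly like the sketch below: the usual task loss plus a term rewarding attribution mass that lands on features marked as relevant. Every name here is hypothetical, and it assumes the relevance tensor is differentiable with respect to the model parameters.

```python
import torch
import torch.nn.functional as F

def relevance_regularized_loss(logits, targets, input_relevance, focus_mask, alpha=0.1):
    """Task loss plus a crude auxiliary term rewarding attribution mass that falls on
    the features/tokens marked in focus_mask (all names hypothetical)."""
    task_loss = F.cross_entropy(logits, targets)
    rel = input_relevance.abs()
    rel = rel / (rel.sum(dim=-1, keepdim=True) + 1e-8)   # normalize relevance per example
    on_target = (rel * focus_mask).sum(dim=-1)           # relevance mass on the "right" region
    return task_loss + alpha * (1.0 - on_target).mean()

# Toy shapes: batch of 4, 10 input features, 3 classes.
logits = torch.randn(4, 3, requires_grad=True)
targets = torch.randint(0, 3, (4,))
relevance = torch.randn(4, 10, requires_grad=True)       # would come from a differentiable attribution
mask = (torch.rand(4, 10) > 0.5).float()                 # features the model should rely on
loss = relevance_regularized_loss(logits, targets, relevance, mask)
loss.backward()
```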
[00:26:08] Tobias Macey:
One of the interesting aspects of these types of conversations is that we use a lot of language that seems intuitive to discuss these challenges and use cases such as alignment. We want to build trust in the model. We want the model to be reliable. But as you try to actually define concretely what metrics correlate with each of those concepts, you very quickly get into a philosophical debate and have to dig through a lot of the assumptions and ingrained complication of language. And I'm wondering if you can talk to some of the ways that teams work through that process of saying, okay. We want this model to be aligned with our overall objectives, but now how do we actually determine what are the metrics that correspond to that concept of alignment within these specific parameters, or how do I ensure that trust is appropriately recorded and adhered to given the specific use case that I actually have?
[00:27:06] Vinay Kumar:
So that's an extremely contextual problem, right? As you rightly said, alignment to me is very different from alignment to you. Unless we agree on certain common metrics that we track, it's an extremely hypothetical metric and a debatable ideology as well. We had a similar problem even in interpretability: how you explain the model to, let's say, a data scientist versus a researcher versus a regulator versus the business is very different. Even within the business, it varies person to person. I may want the explainability presented in a certain manner, whereas another person may want it in a different manner, because everybody understands this differently and everybody wants to correlate it with their own scheme of things. It has to get to a stage where there are common evaluation metrics, or else this will not scale or even evolve from the current condition. But luckily, I think we already have certain metrics and datasets that we are looking at as an industry. For example, let's say you want model alignment from a safety point of view. Then the question is, which part of safety are you talking about? Is it safety from jailbreaks, or safety from exploitation of the model, or safety from wrong factors like chemical and biological warfare? Broad definitions don't work. Three years back, alignment was a philosophical debate topic. Now it has to get factual and quantitative, or else it's not going to scale. On what metric would you say the model is aligned, if you don't specify what you mean by alignment? The problem definition has to be really clear, and the follow-up validation metrics also need to be clear. In our case, most of these are data-driven. When we say bias, we try to have datasets which contain the bias we can validate on, and then we say, okay, here is the metric we use, as simple as, let's say, an F1 score on a sensitive class as my bias metric. It can be very simplified and very easy to track to begin with. Once I do that fundamentally, then I can scale up: is it aligned to a business requirement, is it aligned to a regulator's requirements? All of this can evolve. So I would say clarity of definition, to the point that it's focused to the dot on the problem, and a clear, simplified evaluation are quite important. The approach to solving this is complex enough, so let's not try to overcomplicate the evaluation as well.
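As a concrete version of "F1 on a sensitive class as my bias metric", a minimal sketch: compute F1 separately per sensitive group on a bias-probing dataset and track the worst gap. The groups and labels below are toy placeholders.

```python
from sklearn.metrics import f1_score

def per_group_f1(y_true, y_pred, group_labels):
    """Report F1 per sensitive group and the largest gap between groups,
    as a simple, trackable bias metric of the kind described above."""
    scores = {}
    for group in set(group_labels):
        idx = [i for i, g in enumerate(group_labels) if g == group]
        scores[group] = f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx])
    gap = max(scores.values()) - min(scores.values())
    return scores, gap

# Toy example with a binary task and two hypothetical groups.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(per_group_f1(y_true, y_pred, groups))
```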
[00:29:34] Tobias Macey:
And so, for teams who are invested in actually collecting and acting upon these metrics and insights around explainability, alignment, and interpretability, what are some of the supporting systems that are necessary to operationalize and scale that information and effectively take action on it, beyond just the capability of putting a model into production either via an API or running your own inference server?
[00:30:01] Vinay Kumar:
I don't think there is maturity in the current process today. For example, for a very long time, people thought chain of thought, the reasoning trace, was interpretability for LLMs. It was so silly that people thought that was an interpretability method. Chain of thought is simply the prompt that goes in and the tokens that are generated; it never explains what's happening inside the model. Look at, similarly, what has scaled for explainability for a very simple statistical model, which is very easy. Say I have a linear regression, or a linear-regression-like model. It's quite easy for a user to understand: okay, these are the factors influencing the model, a and b, which come out as y, for example. It's very easy; I know what those factors are and how they can influence the model. I'm not just saying, here are the 10 features that go in, and calling that my explanation. That looks silly, right? If I say, okay, fine.
Age, gender, occupation, these are the 10 features I'm passing to the model. That's not explainability; that's just the model input. The same needs to happen for current systems as well. So if anybody says there is interpretability, you'd better explain what's happening inside the model, not what's going into the model. If what goes into the model counted as interpretability, then every system would have to act in the same manner. If not, then it's the model that is influencing the output, which means you have to understand the model behavior as your interpretability thesis. If you're not doing that, if you're only looking at the input, then you're not interpreting the model. You're simply doing pipeline explainability.
But this is not mature yet, because it's quite complex to do that for an LLM today. You have SAEs, which people have used to some extent, but nobody is looking at explaining an LLM as a critical enterprise requirement. It is becoming a requirement, though, because until now people were using LLMs in human-in-the-loop or human-augmentation systems. If you are going to use them for very complicated, real-life predictions, then what we used to do for statistical systems, where you explain each parameter and how it influences the output, you may have to do for these models too. Or else a regulator will ask the same question: you are doing that for this model, whereas for that model you are simply giving me a pipeline explanation; why can't you do a similar thing? So then there will be questions around regulation, and there will be questions around reproducibility.
There are questions around accountability, meaning who is accountable. If you're explaining the pipeline, then the accountability always goes to the person who builds the pipeline. But if the model is bad, the model builders will never take the onus, because they can build the model in any manner, and if explainability is not part of it, they will never be held to account for how the model is behaving. This is what is creating a lot of debate between the regulators and model manufacturers, particularly around GPAI, general-purpose AI models, where they are getting into debates with regulators saying that the systems are like this, what can we do, we can't explain all that stuff. But nevertheless, I think this is where it has to get to. In the enterprise, the pipelines are established for certain modeling techniques like statistical models and classic ML; there are very good pipelines to do interpretability, at least within large organizations, if not with the regulators. But it's not at all developed when it comes to very complex, very frontier models today, because this is very new from an industry standpoint. Slowly, it will get there. What's happening on the other end is that people are investing a lot in traces, which is observability, and in guardrail systems, which is a good start. Many people are investing there today because it's the easiest thing to get right: it's not full-fledged, it's not 100%, but it's at least the 10% you can do with less effort and very simple cost. People are able to do trace logging for an agent that is functioning, for example, and people are able to apply guardrails. It's surprising when people use agents without guardrails while applying guardrails to plain text generation; people should also apply guardrails to agents, to agent traces, and to the scratchpad as well. Slowly, that's where a lot of these products are now heading, to enable that for enterprises. That's why it's not a surprise to see that OpenAI has launched an agent-tracing, agent-building platform and Mistral has launched an agent platform. People know that they have to solve these problems, or else no matter how capable the frontier model is, it will never see its full potential.
[00:34:11] Tobias Macey:
Another interesting aspect of the collection and utility of these metrics as you're building one of these systems is that one of the major areas of focus right now is on building these evaluation harnesses for your model to make sure that as you change the prompt, as you change the data, that you're still getting the overall desired outcome. And I'm wondering how you're seeing teams take action on using some of these explainability and alignment metrics to then be a factor in the determination of whether their evaluation suite passes or fails and how that feeds back into the design of the prompts and the overall context engineering that they're putting into these systems?
[00:34:55] Vinay Kumar:
I don't think they are using any kind of input from interpretability or alignment as part of the evaluation criteria today. It's a brute-force method. Twenty years back, the way we used to evaluate any ML system was accuracy: you have a set of examples, you see how the model performs, whether it performs well enough on that accuracy or performance criterion, and if so, you say, okay, fine, the model is working okay. Which, as I said, is a great starting point, and it's what everybody is doing: they are creating these evaluation metrics, creating synthetic prompts, creating synthetic eval criteria to try to validate certain metrics. But going back to my example, a model can classify an image very well, your accuracy can be quite high, and it still doesn't necessarily mean it learned the actual features well. When things go wrong, they go wrong quite quickly and quite badly, which has happened to OpenAI many times when they release a frontier model: there are always jailbreaks, always this kind of stuff. But, as I said, interpretability and alignment are so new that making them a common metric across the pipeline is still a way off. They're currently specialized or restricted, maybe to fine-tuning from an alignment point of view, and to interpretability in some cases, specifically safety, for certain players like Anthropic. I'm naming Anthropic because they have visibly invested a lot in mechanistic interpretability and in using that as a method to do alignment and fine-tuning as well, whereas many other players have not, so much. So it's still a very under-evolved, or evolving, area. Current evals are about as good as what any person would do as a basic 101 exercise. I'm surprised it's only happening now; it should have happened from the day you productionized the system. But luckily it's happening. Somehow, I'm also a little surprised that this topic of evaluations as a tool gained so much attention, because for any AI person it's an obvious point. It should be the case. Why are you hyping it so much? If you haven't done it, then you've been doing it wrong since the beginning. This should be a basic 101 thing that happens from day zero.
[00:36:59] Tobias Macey:
And for teams who are setting up these interpretability and alignment metrics and they're trying to make that part of their core practice for how they actually design and implement these overall systems, what are some of the biggest challenges that they are encountering as far as being able to incorporate them, whether that is because of technical challenges or organizational buy in or just the integration of these metrics into existing systems, etcetera?
[00:37:29] Vinay Kumar:
Got it. I think the best way to explain this is to categorize the systems into, let's say, easy, medium complex, and complex. Easy is all your statistical models, which are intrinsically explainable. Then you have medium complex, which is, let's say, classic ML and older-generation deep learning. And complex is, let's say, LLMs, frontier models, those kinds of systems. For the easy and medium complex categories of AI models, there is a good amount of product maturity, and that's where people have been investing. Banks have invested a lot in these systems, particularly from an MRM, model risk management, point of view. It's very different when it comes to complex systems. As I said, for those, interpretability and alignment have never reached the AI engineering phase; they're still part of the core modeling R&D phase itself, which means only the frontier model builders are focusing on these problems, because they know these are important for aligning the model and making sure alignment is done right, and to release the model they have to have better systems, better methods and approaches, to align or integrate the models. That's why people like Anthropic use these approaches before, or as part of, their red-teaming exercise before they release the weights or the APIs of the model. So when it comes to complex systems, only the frontier companies, the frontier labs, are focusing on this; it's still quite new in that complex-algorithms area. Whereas for easy and medium complex AI systems, interpretability and alignment are very productized, many industries are looking to scale them as products, and they are easily achievable and consistently deliverable. For those systems there are commercial products in the market, and there will continue to be more effective commercial products scaling in the market. For complex systems, it's still in the R&D labs, particularly with the frontier companies.
[00:39:29] Tobias Macey:
And for teams that you're working with who are investing in this overall practice of trying to make sure that their models are properly aligned, that they are trustworthy, what are some of the either points of confusion or aspects of education that you have found to be either most lacking or most critical as you're working with those folks to make sure that they're actually getting up to speed and using these insights effectively?
[00:39:57] Vinay Kumar:
So most of our customers are primarily large financial institutions, regulated entities like banks and insurers, those kinds of players. Earlier, we were deploying decisioning systems using easy or medium complex AI systems, which enabled them to deploy those decisioning systems across multiple use cases with confidence. Because even if I use deep learning today, I can explain it, which means if I'm using deep learning for, let's say, an underwriting problem or a fraud detection problem, I can explain it in more detail, in the truthful manner that is required or regulated, which means they can upgrade from simplified systems to complex systems. For very complex systems like LLMs, most of our work is in doing this red teaming as part of building internal, private LLMs. These banks want to use LLMs which are more aligned, safe, and in their control. They want the endpoints and the weights within their environment. For them to be able to deliver this at scale, alignment and safety are quite important.
So those are the customers we work with, more and more. I think we are seeing more demand there, where people are now more interested in creating their own private LLM endpoints, because they know that open source has caught up with proprietary systems and you can actually match or even exceed the performance when you do certain things really well, like fine-tuning and so on. But they want to ensure that the open-source models they use have a version which is safer, more aligned, and more ready to use. Those are the kinds of customers we work with. In a way, we are launching a frontier model specific to an enterprise by applying these approaches, and many new experimental approaches as well. So the value-add is how well we can come up with new approaches and, at the same time, how well we can adapt to any new technique that comes up in these areas.
[00:41:49] Tobias Macey:
And in the context of building a frontier model for these types of use cases, how does that relate to patterns such as LLM as a judge, where you're using a model as the interpreter of whether the model under test is actually performing in the way that you want it to?
[00:42:05] Vinay Kumar:
That's a good evaluation tool, but it's neither an interpretability technique nor an alignment method. You can use it to scale up your evaluation criteria and approaches; we have no caveat there. We always take what works from the community in terms of how well we can evaluate. But when it comes to fundamentally fixing the problem: in evaluations you are finding a problem, right? You are not necessarily fixing it. In interpretability or alignment, you try to fix that problem by doing, let's say, safety-preserving fine-tuning, or unlearning, or pruning, all these new methods we are trying. We want to find a more robust way of fixing these fundamental challenges with the foundation models, while the evaluations and evals can always evolve from time to time anyway.
[00:42:48] Tobias Macey:
And as you have been working in this space and working with customers, what are some of the most interesting or innovative or unexpected ways that you've seen teams either collect or apply these various methods to understand the operation of their ML and AI systems?
[00:43:06] Vinay Kumar:
Yeah. So one of the common things we have seen, specifically in large or highly regulated organizations, is, as I said, even if we take the example of interpretability: interpretability leads to creating an explanation, which is very much user-specific. Which means when we were doing this for a large organization, one user would say, I also want this to be part of my explainability for a prediction, and another user, primarily a regulator, would say, no, no, I want this other thing to be part of it. So we created something called an explainability agent. It's nothing fancy; it's as simple as a common engineering thought process, where you say, fine, I'll collect multiple artifacts from the base systems. We used DLB, our technique, to do the interpretability, which creates a bunch of artifacts at the token level and for all the tokens in the answer as well. Then we combine that with multiple other methods, coupled with the model behavior at that time, meaning we track the model performance: was there any drift in the data, was there any deviation in the inference inputs?
We give that information as well. So what we essentially created is an exhaustive explainability agent, which takes a lot of information from our base engineering platform: the explainability output, the model telemetry, and the process telemetry, as a single input, to create a user-specific explainability outcome for each prediction. It's as simple as: when a user logs in, based on the user's background, the explainability changes, because you now have an agent which can convert that information for you into a very simplified form that you can understand and consume. These are very simple things from an engineering standpoint, but extremely useful to an end user, because there is no point in creating interpretability output that is extremely complex information; if a user doesn't understand that information, it's still gibberish to them. This makes it very easy for us to be not only technically grounded, but also to use the agent to bridge the technology gap, or the technical language gap, based on the end user. That seems to be working really well for us. So sometimes some of these simple things can scale up really well.
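A minimal sketch of what such an explainability agent could look like in code: gather the attribution artifacts, model telemetry, and process telemetry into one prompt, rendered differently per audience, and hand it to whatever LLM endpoint you use. The field names and the `call_llm` placeholder are assumptions, not AryaXAI's actual implementation.

```python
import json

def build_explanation_prompt(audience: str, attribution: dict, telemetry: dict) -> str:
    """Assemble one prompt for an 'explainability agent': raw interpretability
    artifacts plus model/process telemetry, rendered for a specific audience."""
    style = {
        "regulator": "Cite the exact features/tokens driving the decision and any drift flags.",
        "business": "Summarize the top drivers in plain language and state the confidence.",
        "data_scientist": "Include per-feature relevance values and telemetry verbatim.",
    }[audience]
    return (
        "You translate model interpretability output for a specific reader.\n"
        f"Audience: {audience}. Instructions: {style}\n"
        f"Attribution artifacts: {json.dumps(attribution)}\n"
        f"Model/process telemetry: {json.dumps(telemetry)}\n"
        "Write the explanation for this single prediction."
    )

# Hypothetical per-prediction artifacts.
prompt = build_explanation_prompt(
    audience="regulator",
    attribution={"top_features": {"income": 0.41, "tenure": 0.22}, "method": "DL Backtrace"},
    telemetry={"data_drift": False, "confidence": 0.87, "model_version": "uw-v3"},
)
# response = call_llm(prompt)   # placeholder: send to whatever LLM endpoint you use
print(prompt)
```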
[00:45:14] Tobias Macey:
And in your own experience of working in this space and doing research on appropriate methods and effective actions to take based on the metrics you collect for interpretability, alignment, and trust building, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:45:35] Vinay Kumar:
Got it. So when we started, our thesis was largely interpretability and interpretability-led alignment. What we have been doing in the last six months has been a lot of frontier work. We are using the observations from interpretability in multiple things, like safety-neuron fine-tuning, or using that information to do pruning to induce a behavior. We also used a similar approach to compress models. Typically, to compress a model, you do things like quantization or some kind of parameter-efficient fine-tuning, those kinds of approaches. What we realized is that sometimes you can simply prune neurons to zero: when you build a neural network, you build in a massive amount of overhead, and not all neurons are used when you train the model. You can simply run a task and know which neurons have actually been useful for executing that task. I don't think this maps to activation maps to that extent; in this case, we are able to understand it by calculating the relevance of each neuron or each layer, to figure out whether that neuron is actually relevant or not, and we simply prune them. In one example, where we published a paper, we were able to prune a simple CNN model by almost 40%, fewer layers and params, and still get near-similar performance, which means I can actually compress a model and still maintain the same accuracy, the same performance. Some of these results have been really interesting. We knew, theoretically, these things should be possible; in the last few months, this is what we have been executing and building on. What we are also trying to build as a tool, our internal project name is AlignTune, is primarily a CRISPR-style, gene-editing kind of tool, where you can not only identify the neurons you want for a behavior, but also do follow-on things with them: pruning as a strategy, fine-tuning as a strategy, unlearning as a strategy, or merging multiple models into one model as well. This is what we are currently working on as a massive project. It has not been done before in the current industry scheme of things. It will take some time for us to release it, but when we release it, it will probably be one of the frontier things to be done. You will have model editing, like gene editing, where you can induce behaviors and edit inside the model, not just outside with guardrails or evaluations.
And then you can do follow-on things with it to further refine that behavior more and more. That is what we are very excited about and hoping to deliver very soon.
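As a rough illustration of the relevance-guided pruning idea described above: score each neuron by how relevant it is to a task, then zero out the least relevant ones. DL Backtrace's actual relevance propagation is not reproduced here; this sketch substitutes a simple mean-absolute-activation proxy on a toy model, so treat it as the shape of the approach rather than the method itself.

```python
import torch
import torch.nn as nn

# Toy MLP standing in for a model we want to prune (hypothetical; not DL Backtrace).
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 4))
task_batch = torch.randn(256, 32)  # a small sample of task inputs

# 1. Score each hidden neuron with a crude relevance proxy: mean |activation| on the task.
#    (DL Backtrace propagates relevance through the network instead; this is only a stand-in.)
with torch.no_grad():
    hidden = torch.relu(model[0](task_batch))  # activations of the 128 hidden units
    relevance = hidden.abs().mean(dim=0)       # one score per neuron

# 2. Prune the bottom 20% of neurons by relevance: zero their incoming and outgoing weights.
k = int(0.2 * relevance.numel())
prune_idx = torch.topk(relevance, k, largest=False).indices
with torch.no_grad():
    model[0].weight[prune_idx] = 0.0      # incoming weights of low-relevance neurons
    model[0].bias[prune_idx] = 0.0
    model[2].weight[:, prune_idx] = 0.0   # their outgoing weights

# 3. Re-evaluate on the task; the claim in the conversation is that around 20% pruning
#    the performance stays close to the original model.
```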
[00:48:09] Tobias Macey:
And with that capability of identifying and pruning neurons that you don't care about for a particular use case, how do you see that impacting the overall efficiency and compute requirement of these models and being able to drive down the cost of operation?
[00:48:27] Vinay Kumar:
So, from whatever we have done so far, the results were very positive. If I reduce, let's say, the number of params inside my model by twenty percent, that can directly help me speed it up. Even a 10% efficiency gain over a billion calls is a massive value add from a bottom-line perspective. On overall performance, it depends on how much you want to prune. For example, we have seen that 20% is a very comfortable pruning threshold; somewhere between 15 and 20%, there is almost zero loss in performance. It also depends on your pruning method, as I said: are you doing structured pruning or unstructured pruning? It depends on that, and also on the quality of the samples. There are a lot of hyperparameters around it that will influence it, but you can easily do 20% pruning really well. If you want to go from 20% to, let's say, 50%, that's when things get really, really complex, because 50% pruning means you are removing 50% of your neurons or layers, which will eventually have unintended effects. Right? As you prune more, you will have more unintended effects, which you need to evaluate from many different angles to ensure you're pruning the right things. But as a strategy, this is quite scalable, we believe, and we've already demonstrated it on a base set of models. Now we are scaling that to larger models with different kinds of approaches. As we get there, the goal is to be able to prune really well and be really efficient, with zero loss in accuracy.
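For experimenting with the thresholds and the structured-versus-unstructured distinction mentioned here, PyTorch's standard torch.nn.utils.prune module is one readily available starting point. This is generic tooling, not the speaker's pipeline; the model and the evaluation step are placeholders.

```python
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder model; in practice this would be the network you want to compress.
base = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 4))

# Sweep the pruning ratios discussed in the conversation and compare the two styles.
for amount in (0.15, 0.20, 0.50):
    unstructured = copy.deepcopy(base)
    structured = copy.deepcopy(base)

    # Unstructured: drop individual low-magnitude weights anywhere in the layer.
    prune.l1_unstructured(unstructured[0], name="weight", amount=amount)

    # Structured: drop whole output neurons (rows of the weight matrix) by L2 norm.
    prune.ln_structured(structured[0], name="weight", amount=amount, n=2, dim=0)

    # A task-specific evaluate(...) would go here; per the conversation, expect near-zero
    # loss around 15-20% pruning and growing unintended effects as you approach 50%.
    print(amount,
          float((unstructured[0].weight == 0).float().mean()),
          float((structured[0].weight == 0).float().mean()))
```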
[00:49:52] Tobias Macey:
And for teams who are deciding whether and when to invest in this overall capability of evaluating the explainability and alignment of their models, what are the cases where you would either say it's overkill or advise against it, maybe because they're not far enough along in their overall maturity curve for AI capabilities?
[00:50:15] Vinay Kumar:
Got it. I think on a scale of importance, this has to be done at the foundation model level anyway, because you are going to distribute the model to any number of users, where you may or may not have control. In both the closed-weights and open-weights approaches, you do not have control over the end use case, which means it is fundamentally, 100% important to do this, which is why most of the frontier companies differentiate on alignment metrics. Right? Like their constitutions or alignment values, for example; they get differentiated on those. When the performance difference is in decimal percentage points, some of these things become really critical for an enterprise to look at. That's one of the reasons Anthropic is winning in the enterprise market. Right? Because they are very serious when it comes to model security and safety, whereas OpenAI is, in many cases, more liberal on some of these aspects. So when releasing foundation model weights, this is 100% important, and the bar is quite high. If you're using it for any sensitive or regulated use cases, this is also 100% important, because there it's not a choice, it's a mandate. You cannot change it. If you're not doing it, then you are violating the law; some regulator or some customer will sue, and then you are in a big mess. It's overkill where the use case is very liberal or linear. Meaning, if I'm using it as, say, a calling agent, something of that sort, it's okay not to do it to 100%. Maybe you will still do some of it to ensure there is no bias, meaning your calling agent isn't being racist, calling people and saying random things. But it all depends on where the use case sits on that curve. As the use case gets more sensitive, the importance of doing this gets closer to 100%. If the use case is less sensitive, the importance may be 20%. I would not call it zero, meaning you always have to do some alignment and interpretability. If not, then you would not need guardrail systems or eval systems either, which is not the case: when you're using some AI system, you will have some kind of guardrail, some kind of eval. Likewise, you use some kind of interpretability, some kind of alignment tactic. It may not be true to the model, it may not be scalable. But as the use case gets sensitive, complex, regulation-heavy, or involves loss-of-life kinds of scenarios, then it becomes 100% important that you are doing these things to their fullest potential, which is where there is a lot of enterprise value in solving these problems.
[00:52:27] Tobias Macey:
As you continue to invest in this overall space, what are some of the things you have planned for the near to medium term either in your own work or maybe some predictions or requests that you have for the overall industry?
[00:52:41] Vinay Kumar:
Yep. So I talked about the AlignTune project. Right? That's a massive R&D project that we have. We have segmented it into multiple subprojects, which you can see in our research papers. We have already published a few papers around pruning, where we are able to validate and demonstrate the approach well enough. Soon you will see things around pruning as a strategy to fix bias or to do unlearning; a few papers are going to come out very soon in that direction. Then we have something around fine-tuning, which is safety-neuron fine-tuning, where certain labs in China, for example the Shanghai lab and the Tsinghua lab, have done some work in that direction, and their observations looked very, very interesting as well. Soon enough, you will see more papers coming from that point of view too. So we divided AlignTune, as a big R&D effort, into five different R&D projects, which is where we are doing a lot of work. All of it will get consolidated into the tool when we release it. It will be an open source tool so that the community can use it, and can try more advanced methods in a very democratized manner. From a predictions point of view, I think this will now become super critical.
It's already become super critical, as I was saying, for frontier models. If you are releasing open source models, geographical regulations are now kicking in in certain places, like the new AI Act in Europe, which means if you're releasing any kind of GPAI, general-purpose AI models, you have to be compliant on certain things. Which means, again, it's not a luxury, it's a mandate. So I hope people will start investing in this direction more and more, in interpretability-led alignment. Whereas for alignment as an area, there is an enormous amount of attention today on the reinforcement learning side of things and the fine-tuning side of things, maybe more than there should be, but it is suddenly the big theme of R&D for many people. That's the case for us too; we have a separate RL team working on new kinds of RL techniques. I think we will see more new RL techniques that are more efficient for sure; every lab will be able to propose new methods, and we will ourselves as well, very quickly. On interpretability-led alignment, there are a few labs doing a very good job, like Anthropic and Goodfire. But for alignment as an area of commercialization, I believe we will start monetizing it in the next six to twelve months. The way you have seen things like guardrails and evals get commercial adoption, you will see a similar story for interpretability and alignment, as and when these can deliver stable, consistent, scalable value going forward, which is where the industry is probably headed as well. That's my hope. Otherwise, I think labs like Thinking Machines or the SSI lab are building on these themes anyway. Right? Which is what I was saying: foundation models differentiate on multiple things. Performance is too narrow a term right now, and usability is not where they are differentiated yet; for usability, solving these things is quite important. If I can say that I'm as good as, or maybe 1% below, OpenAI, but I'm really aligned, really safe, really sophisticated, people will have more preference for that, particularly on the enterprise side, even though you are 1% below the industry leader. If you're usable and acceptable, you will get more scale and more money in the business anyway.
[00:56:00] Tobias Macey:
Yeah. Are there any other aspects of the work that you're doing at Arya.ai, or this overall space of explainability and interpretability and alignment and trust building, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:56:14] Vinay Kumar:
I mean, one last thing that we have not touched on too much is AI risks. To a broad extent, risks can be part of the alignment story that we are building as well, like the safety-neuron fine-tuning and all that kind of stuff. But what is happening is that these models are not as simple as they were ten years back, when deep learning was already complex and hard to interpret, understand, or build upon. What we are seeing now is huge, huge complexity. In fact, most of the approaches that I described probably work really well on one modality. When you start combining modalities, let's say an any-to-any kind of model where you have both image and text coming in, that is still okay. But if you're building something like world models, and Fei-Fei's lab is working on a world model, Meta is working on one, Google is working on one, then the problems around interpretability and alignment become massive. I don't think we would have the capacity to solve those problems ourselves. This is where our next bet is: I think soon enough, maybe we ourselves or someone else will have an AI to do this rather than a human-driven approach, an AI that can identify a problem and fix the problem instead of us doing it. Then it's a matter of how much we can enable that AI to do this job. Right? A simplified analogy of similar scale is cybersecurity. You have a system, and now you have an agent, like OpenAI's security agent or Google's recently launched security agent, with a similar approach: the agent should figure out what gaps are there in the system, fix them, and continuously defend it. Right? Take the same thesis to AI. You have a frontier model, and the AI should be able to identify the gaps inside the frontier model or inside the AI solution, fix them, and defend them during inference. That's the scale we are talking about. If your blue pill is your foundation model, your red pill, the safety net, is not a piece of software; it's probably an AI that plays this role. I think it's important we build it quicker and sooner and catch up with the foundation models. So, yeah, that's the North Star vision for us in terms of getting there.
[00:58:22] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'm interested in your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[00:58:39] Vinay Kumar:
No, I think people are getting smarter anyway, one way or the other. Okay, I have a take that may be controversial as well: I'm not too favorable toward vibe coding. I mean, it's good for getting started, for sure, but it's not good if you want to excel at the craft. It builds a lot of redundancy, slash laziness, if that's the right word. Typically, you learn a lot more during debugging than during building; that's what I believe, in many cases. When people use things like vibe coding, the debugging becomes very sparse, which means your ability to debug becomes less and less, which is what many people have been complaining about as well. Right? It takes thirty minutes to build something through vibe coding and one whole day to figure out what went wrong. That's the case. But nevertheless, the information is so much out there, and the consumption is also getting smarter. Smart people will always find the smart tools to figure this out. It's a matter of building those smart tools, and the habit of using them more often and more rightly than ever is what will differentiate many people. Yep. So that should be the case.
[00:59:54] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your insights and expertise in this overall space of understanding how models work, the decisions that they make, and how to apply those findings to build better ML and AI systems. It's definitely a very interesting and, obviously, very important aspect of this ecosystem that we all find ourselves in now. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.
[01:00:14] Vinay Kumar:
Thank you, Tobias. Pleasure being here. Looking forward. Yeah.
[01:00:22] Tobias Macey:
Thank you for listening. Don't forget to check out our other shows. The Data Engineering podcast covers the latest on modern data management, and podcast.init covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@AIengineeringpodcast.com with your story.
Intro of guest and episode focus: explainability and alignment
Guest background: early deep learning and startup origins
Defining explainability vs. alignment in real deployments
DL Backtrace: a model‑true interpretability approach
Survey of interpretability methods and their limits
Where explainability fits in the model lifecycle
From interpretability to alignment: pruning and safety neuron ideas
Using explainability for feedback, RL, and post‑hoc controls
Making alignment measurable: metrics, datasets, governance
Operational realities: beyond prompts, toward model‑level insight
Evaluations today and gaps using interpretability in evals
Maturity by model class: statistical, classic ML, and frontier LLMs
Enterprise needs: private LLMs, red teaming, and safety
LLM‑as‑a‑judge vs. fixing foundations: unlearning, pruning
Explainability agents: tailoring outputs to users
Lessons from research: compression and the AlignTune vision
Efficiency gains from targeted pruning
When to invest: sensitivity, regulation, and risk
Roadmap and industry outlook: interpretability‑led alignment
AI risks, multimodality, and the case for an AI safety agent
Closing thoughts, gaps in tooling, and wrap‑up