

The relentless demand for artificial intelligence has collided with the reality of hardware scarcity. As compute requirements outpace GPU supply, maximizing software-level resource utilization has become a critical mandate. This urgency has recently sparked an explosion in AI-driven GPU kernel optimization. Over the last ~1.5 years, a surge of papers, open-source initiatives, and engineering blogs has demonstrated that AI agents can, in principle, write efficient, low-level CUDA and Triton code for isolated mathematical operations.
However, a critical question remains: do these isolated kernel victories actually translate into tangible, end-to-end model speedups?
Historically, the answer has been underwhelming. Most research focuses on single kernels or toy examples, and when these techniques are applied to real-world architectures, the overall impact often shrinks dramatically. For instance, Google's AlphaEvolve achieved a roughly 1% speedup in Gemini training. Bridging the gap between a fast isolated kernel and a fast production model is notoriously difficult.
Recently, our agentic AI compiler, yasp.compile, shattered this ceiling. Benchmarked against torch.compile, we achieved over a 3x speedup in prefill latency and nearly a 3x speedup in end-to-end text generation on IBM’s Granite Hybrid 4 model family (with up to 7x gains over raw eager execution). Crucially, this was not a synthetic benchmark: it was a fully functional model running directly from the official Hugging Face implementation.
How did we achieve this leap? The process begins with standard profiling. By tracing the official PyTorch implementation of Granite 4.0, we identified the dominant bottleneck: the GraniteMoeHybridMambaLayer (for a great introduction to Mamba and how it fits into modern LLM variants, see here). Instead of struggling with the complex control flow of the entire LLM, we extracted just this bottleneck layer as a standalone PyTorch nn.Module. We then dispatched this module—along with its input tensor shapes—to the yasp.compile agent.
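To make the profile-and-extract step concrete, here is a minimal sketch using `torch.profiler`. The `BottleneckLayer` below is a toy stand-in for the real `GraniteMoeHybridMambaLayer`, and the tensor shapes are illustrative, not the ones we actually profiled:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Toy stand-in for the extracted bottleneck layer; the real module is
# transformers' GraniteMoeHybridMambaLayer, pulled out as-is.
class BottleneckLayer(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return self.proj(x)

layer = BottleneckLayer()
x = torch.randn(2, 16, 64)  # (batch, seq_len, d_model): the shapes handed to the agent

# Profile the isolated layer to confirm where the time actually goes.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    layer(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

In practice, the same profiler run over the full model is what surfaces the dominant layer in the first place; extracting it then reduces the optimization problem to a single module plus its input shapes.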
The agent autonomously analyzed the computational graph and returned a highly optimized custom implementation (see how we did it here). We then patched this optimized layer back into the full Hugging Face Granite model, injecting it in place of the original implementation while copying over the untouched weights.
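Conceptually, the patching step is a recursive module swap that preserves the weights. The names here (`patch_layer`, `Original`, `Optimized`) are illustrative stand-ins, not our actual API:

```python
import torch
import torch.nn as nn

def patch_layer(module: nn.Module, target_type: type, make_replacement):
    """Recursively swap every module of exactly `target_type` for an
    optimized equivalent, copying the original weights over unchanged."""
    for name, child in list(module.named_children()):
        if type(child) is target_type:
            replacement = make_replacement(child)
            replacement.load_state_dict(child.state_dict())  # weights untouched
            setattr(module, name, replacement)
        else:
            patch_layer(child, target_type, make_replacement)

# Toy stand-ins for the original and agent-optimized layers.
class Original(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(8, 8)

    def forward(self, x):
        return self.lin(x)

class Optimized(Original):
    pass  # same math; in reality, a faster generated implementation

model = nn.Sequential(Original(), Original())
patch_layer(model, Original, lambda old: Optimized())
```

Because only the code changes and the state dict is copied verbatim, the patched model is numerically interchangeable with the original.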
This architecture highlights a major advantage of our SDK. Because yasp.compile operates on the mathematical graph of a standard nn.Module, it seamlessly traces through third-party libraries and optimizes the layer without ever needing access to the model's weights. This completely eliminates the need to ship prohibitively large files and ensures the privacy of your proprietary data and code.
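This weight-free workflow is possible because PyTorch can describe a module's computation without real parameter data. A hedged sketch using the `meta` device and `torch.fx` (`Layer` is a generic stand-in for any third-party module; our tracer may differ in the details):

```python
import torch
import torch.fx as fx

class Layer(torch.nn.Module):
    """Stand-in for a third-party layer such as a Mamba block."""
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(32, 32)

    def forward(self, x):
        return torch.relu(self.lin(x))

# Allocate parameters on the meta device: shapes and dtypes only, no data.
with torch.device("meta"):
    layer = Layer()

# Symbolically trace the mathematical graph; no weights ever leave the process.
traced = fx.symbolic_trace(layer)
print(traced.graph)
```

The traced graph carries everything an optimizer needs (operations, shapes, dataflow) while the trained weights stay wherever they already live.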
The end-to-end results on an NVIDIA H200 are striking, and crucially, the performance gains apply to different model sizes.
granite-4-h-350M: Prefill completed in 98.9ms—3.27x faster than torch.compile (and 7.0x over eager). Full generation achieved a 2.88x end-to-end speedup over torch.compile.
granite-4-h-1b: Prefill completed in 195.6ms—3.11x faster than torch.compile (and 6.4x over eager). Full generation achieved a 2.04x end-to-end speedup over torch.compile.
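For context on how latency numbers like these are collected: prefill is typically timed after warm-up iterations, with device synchronization so queued GPU work is fully counted. A minimal sketch of such a harness (ours may differ; this falls back to wall-clock timing off-GPU):

```python
import time
import torch

def median_latency_ms(fn, *args, warmup=3, iters=10):
    """Median latency of fn(*args) in milliseconds, after warm-up."""
    for _ in range(warmup):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain queued kernels before timing
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # include all launched GPU work
        times.append((time.perf_counter() - start) * 1000.0)
    return sorted(times)[len(times) // 2]

lat = median_latency_ms(torch.nn.Linear(256, 256), torch.randn(8, 256))
print(f"median latency: {lat:.3f} ms")
```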
Importantly, performance did not come at the cost of accuracy; both optimized models generate the exact same text outputs as their reference counterparts.
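The equivalence check is simple in spirit: run both models on the same inputs and require bitwise-identical outputs, which under greedy decoding guarantees identical generated text. A toy sketch with a hypothetical helper (the real check runs generation on the full models):

```python
import torch
import torch.nn as nn

def same_outputs(ref: nn.Module, opt: nn.Module, x: torch.Tensor) -> bool:
    """Bitwise-identical forward outputs; with greedy decoding this
    implies the two models generate the same text."""
    with torch.no_grad():
        return torch.equal(ref(x), opt(x))

# Toy demonstration: an "optimized" copy with identical weights.
ref = nn.Linear(8, 8)
opt = nn.Linear(8, 8)
opt.load_state_dict(ref.state_dict())
x = torch.randn(4, 8)
print(same_outputs(ref, opt, x))  # True: same weights, same math
```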
Ultimately, true optimization must be measured not in micro-benchmarks, but in the wall-clock time of production models. By marrying agentic code generation with seamless PyTorch integration, we are proving that autonomous compilers for today’s most complex ML workloads are feasible in principle.
How to get involved
If you’ve ever shipped an AI model thinking, “this should be faster, but there’s no time to tune it properly,” this program is built for you and your team.
Apply for early access, get hands-on with the Agentic AI Compiler, and influence where the roadmap goes next.