Japanese AI startup Sakana AI has introduced The AI CUDA Engineer, an agentic framework that automates the discovery and optimisation of CUDA kernels for improved GPU performance.
The company claims the framework can generate CUDA kernels that run 10 to 100 times faster than common PyTorch operations, and up to five times faster than existing CUDA kernels already used in production.
CUDA is NVIDIA’s low-level programming interface for running parallel computation directly on its GPUs; a CUDA kernel is a function that executes in parallel across thousands of GPU threads. Optimising kernels by hand requires significant expertise in GPU architecture. Sakana AI’s new system uses LLMs and evolutionary optimisation techniques to automate this process, making high-performance CUDA kernel development more accessible.
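To make that concrete, here is a minimal sketch of what a hand-written CUDA kernel looks like next to its PyTorch equivalent. This is illustrative code of ours, not Sakana AI’s: the kernel and function names are invented, and running it requires an NVIDIA GPU plus the nvcc toolchain, since torch.utils.cpp_extension.load_inline compiles the kernel at import time.

```python
import torch
from torch.utils.cpp_extension import load_inline

# Hand-written CUDA: a fused "alpha * x + y" kernel, one thread per element.
cuda_source = r"""
#include <torch/extension.h>

__global__ void scale_add_kernel(const float* x, const float* y,
                                 float* out, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) out[i] = alpha * x[i] + y[i];
}

torch::Tensor scale_add(torch::Tensor x, torch::Tensor y, double alpha) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover all n elements
    scale_add_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), y.data_ptr<float>(),
        out.data_ptr<float>(), static_cast<float>(alpha), n);
    return out;
}
"""

# Declaration only; load_inline generates the Python binding from it.
cpp_source = "torch::Tensor scale_add(torch::Tensor x, torch::Tensor y, double alpha);"

module = load_inline(name="scale_add_ext", cpp_sources=cpp_source,
                     cuda_sources=cuda_source, functions=["scale_add"])

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
# The custom kernel must match the plain PyTorch expression it replaces.
assert torch.allclose(module.scale_add(x, y, 2.0), 2.0 * x + y)
```

Writing such a kernel is the easy part; choosing thread counts, memory access patterns, and fusion opportunities so that it actually beats PyTorch is where the expertise lies, and that is the step Sakana AI is automating.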
“The coolest autonomous coding agent I’ve seen recently: use AI to write better CUDA kernels to accelerate AI. AutoML is so back!” said Jim Fan, senior research manager and lead of embodied AI at NVIDIA. He added that the most impactful use of compute is to improve the future productivity of that same compute.
According to Sakana AI, The AI CUDA Engineer converts standard PyTorch code into optimised CUDA kernels through a multi-stage pipeline. Initially, it translates PyTorch operations into CUDA kernels, often improving runtime without explicit tuning. The system then applies evolutionary optimisation, using strategies such as ‘crossover’ operations and an ‘innovation archive’ to refine performance.
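Sakana AI has not published the internals of that loop here, so the following Python sketch only illustrates the general shape of such an evolutionary search. The helpers llm_mutate, llm_crossover, and benchmark are hypothetical stand-ins for the LLM calls and the compile-verify-profile step.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Candidate:
    source: str        # CUDA kernel source code
    runtime_ms: float  # measured runtime; lower is better

def evolve(seed: str,
           llm_mutate: Callable[[str], str],             # hypothetical: LLM proposes a variant
           llm_crossover: Callable[[str, str], str],     # hypothetical: LLM merges two kernels
           benchmark: Callable[[str], Optional[float]],  # hypothetical: compile/verify/time; None = failed
           generations: int = 20,
           population: int = 8,
           archive_size: int = 32) -> Candidate:
    # The 'innovation archive' keeps the best verified kernels found so far,
    # so crossover can recombine ideas from earlier generations rather than
    # only the current population.
    archive: list[Candidate] = []
    pool = [seed]
    for _ in range(generations):
        children = [llm_mutate(src) for src in pool]
        if len(archive) >= 2:
            a, b = random.sample(archive, 2)
            children.append(llm_crossover(a.source, b.source))
        for src in children:
            t = benchmark(src)  # compile, check numerical correctness, time it
            if t is not None:   # keep only kernels that verify
                archive.append(Candidate(src, t))
        archive.sort(key=lambda c: c.runtime_ms)
        del archive[archive_size:]  # bound the archive
        pool = [c.source for c in archive[:population]] or pool
    if not archive:
        raise RuntimeError("no candidate kernel ever verified")
    return archive[0]  # fastest verified kernel found
```

The key design point is that selection pressure comes from measured runtime of kernels that have already passed a correctness check, so speed never trades off against validity.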
“Our approach is capable of efficiently fusing various kernel operations and can outperform several existing accelerated operations,” the company said. The framework builds on the company’s earlier research with The AI Scientist, which explored automating AI research. The AI CUDA Engineer extends this concept to kernel optimisation, using AI to enhance AI performance.
Sakana AI reported that The AI CUDA Engineer has successfully translated more than 230 of the 250 evaluated PyTorch operations. It has also generated over 30,000 CUDA kernels, of which more than 17,000 were verified for correctness; approximately half of those verified kernels outperform their native PyTorch implementations.
The company has made the dataset available under a CC-BY-4.0 licence on Hugging Face. It includes reference implementations, profiling data, and performance comparisons against native PyTorch runtimes.
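As a minimal sketch, the archive can be pulled with the Hugging Face datasets library; the dataset identifier below is our assumption, so verify the exact id and column names on the dataset page.

```python
from datasets import load_dataset

# Assumed dataset id; check the actual identifier on Hugging Face.
ds = load_dataset("SakanaAI/AI-CUDA-Engineer-Archive")
print(ds)  # inspect splits and columns (kernel code, profiling data, speedups)
```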
Sakana AI has also launched an interactive website where users can explore the dataset and leaderboard rankings of optimised kernels. The platform provides access to kernel code, performance metrics, and related optimisation experiments.