Japanese AI startup Sakana AI has introduced The AI CUDA Engineer, an agentic framework that automates the discovery and optimisation of CUDA kernels for improved GPU performance.
The company claims the framework can generate CUDA kernels that run 10 to 100 times faster than common PyTorch operations, and up to five times faster than existing CUDA kernels already used in production.
CUDA is NVIDIA’s low-level programming interface for running parallel computation directly on its GPUs; a CUDA kernel is a function that executes in parallel across thousands of GPU threads. Optimising kernels by hand requires significant expertise in GPU architecture. Sakana AI’s new system uses LLMs and evolutionary optimisation techniques to automate this process, making high-performance CUDA kernel development more accessible.
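To make that concrete, here is a minimal sketch of what a hand-written CUDA kernel looks like next to its PyTorch equivalent. This is illustrative code of ours, not Sakana AI’s: the kernel and function names are invented, and running it requires an NVIDIA GPU plus the nvcc toolchain, since torch.utils.cpp_extension.load_inline compiles the kernel at import time.

```python
import torch
from torch.utils.cpp_extension import load_inline

# Hand-written CUDA: a fused "alpha * x + y" kernel, one thread per element.
cuda_source = r"""
#include <torch/extension.h>

__global__ void scale_add_kernel(const float* x, const float* y,
                                 float* out, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) out[i] = alpha * x[i] + y[i];
}

torch::Tensor scale_add(torch::Tensor x, torch::Tensor y, double alpha) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover all n elements
    scale_add_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), y.data_ptr<float>(),
        out.data_ptr<float>(), static_cast<float>(alpha), n);
    return out;
}
"""

# Declaration only; load_inline generates the Python binding from it.
cpp_source = "torch::Tensor scale_add(torch::Tensor x, torch::Tensor y, double alpha);"

module = load_inline(name="scale_add_ext", cpp_sources=cpp_source,
                     cuda_sources=cuda_source, functions=["scale_add"])

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
# The custom kernel must match the plain PyTorch expression it replaces.
assert torch.allclose(module.scale_add(x, y, 2.0), 2.0 * x + y)
```

Writing such a kernel is the easy part; choosing thread counts, memory access patterns, and fusion opportunities so that it actually beats PyTorch is where the expertise lies, and that is the step Sakana AI is automating.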
“The coolest autonomous coding agent I’ve seen recently: use AI to write better CUDA kernels to accelerate AI. AutoML is so back!” said Jim Fan, senior research manager and lead of embodied AI at NVIDIA. He added that the most impactful use of compute is to improve the future productivity of that same compute.
According to Sakana AI, The AI CUDA Engineer converts standard PyTorch code into optimised CUDA kernels through a multi-stage pipeline. Initially, it translates PyTorch operations into CUDA kernels, often improving runtime without explicit tuning. The system then applies evolutionary optimisation, using strategies such as ‘crossover’ operations and an ‘innovation archive’ to refine performance.
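Sakana AI has not published the internals of that loop here, so the following Python sketch only illustrates the general shape of such an evolutionary search. The helpers llm_mutate, llm_crossover, and benchmark are hypothetical stand-ins for the LLM calls and the compile-verify-profile step.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Candidate:
    source: str        # CUDA kernel source code
    runtime_ms: float  # measured runtime; lower is better

def evolve(seed: str,
           llm_mutate: Callable[[str], str],             # hypothetical: LLM proposes a variant
           llm_crossover: Callable[[str, str], str],     # hypothetical: LLM merges two kernels
           benchmark: Callable[[str], Optional[float]],  # hypothetical: compile/verify/time; None = failed
           generations: int = 20,
           population: int = 8,
           archive_size: int = 32) -> Candidate:
    # The 'innovation archive' keeps the best verified kernels found so far,
    # so crossover can recombine ideas from earlier generations rather than
    # only the current population.
    archive: list[Candidate] = []
    pool = [seed]
    for _ in range(generations):
        children = [llm_mutate(src) for src in pool]
        if len(archive) >= 2:
            a, b = random.sample(archive, 2)
            children.append(llm_crossover(a.source, b.source))
        for src in children:
            t = benchmark(src)  # compile, check numerical correctness, time it
            if t is not None:   # keep only kernels that verify
                archive.append(Candidate(src, t))
        archive.sort(key=lambda c: c.runtime_ms)
        del archive[archive_size:]  # bound the archive
        pool = [c.source for c in archive[:population]] or pool
    if not archive:
        raise RuntimeError("no candidate kernel ever verified")
    return archive[0]  # fastest verified kernel found
```

The key design point is that selection pressure comes from measured runtime of kernels that have already passed a correctness check, so speed never trades off against validity.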
“Our approach is capable of efficiently fusing various kernel operations and can outperform several existing accelerated operations,” the company said. The framework builds on the company’s earlier research with The AI Scientist, which explored automating AI research. The AI CUDA Engineer extends this concept to kernel optimisation, using AI to enhance AI performance.
Sakana AI reported that The AI CUDA Engineer has successfully translated more than 230 of the 250 evaluated PyTorch operations. It has also generated over 30,000 CUDA kernels, of which more than 17,000 were verified for correctness; approximately half of those verified kernels outperform their native PyTorch implementations.
The company has made the dataset available under a CC-BY-4.0 licence on Hugging Face. It includes reference implementations, profiling data, and performance comparisons against native PyTorch runtimes.
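As a minimal sketch, the archive can be pulled with the Hugging Face datasets library; the dataset identifier below is our assumption, so verify the exact id and column names on the dataset page.

```python
from datasets import load_dataset

# Assumed dataset id; check the actual identifier on Hugging Face.
ds = load_dataset("SakanaAI/AI-CUDA-Engineer-Archive")
print(ds)  # inspect splits and columns (kernel code, profiling data, speedups)
```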
Sakana AI has also launched an interactive website where users can explore the dataset and leaderboard rankings of optimised kernels. The platform provides access to kernel code, performance metrics, and related optimisation experiments.