DeepSeek, a Chinese artificial intelligence (AI) lab backed by the hedge fund High-Flyer, has kicked off its “Open Source Week” by releasing FlashMLA, a decoding kernel designed for Nvidia’s Hopper GPUs. It is optimised for processing variable-length sequences and is already in production.
The kernel supports BF16 and features a paged KV cache with a block size of 64. On the H800 GPU, it achieves up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations.
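A paged KV cache works much like virtual memory: a sequence’s key/value entries are stored in fixed-size blocks (64 tokens each in FlashMLA) scattered across a physical pool, and a per-sequence block table maps logical positions to physical blocks. The following is a minimal NumPy sketch of that indexing scheme only; the pool size, head dimension, and helper name are hypothetical, and FlashMLA itself implements this as a CUDA kernel rather than in Python.

```python
import numpy as np

BLOCK_SIZE = 64          # FlashMLA's paged KV cache block size
NUM_PHYS_BLOCKS = 8      # hypothetical size of the physical block pool
HEAD_DIM = 4             # tiny head dimension for illustration

# Physical KV pool: NUM_PHYS_BLOCKS blocks of BLOCK_SIZE token slots each.
kv_pool = np.arange(NUM_PHYS_BLOCKS * BLOCK_SIZE * HEAD_DIM, dtype=np.float32)
kv_pool = kv_pool.reshape(NUM_PHYS_BLOCKS, BLOCK_SIZE, HEAD_DIM)

# Block table for one sequence: logical block i lives in physical
# block block_table[i]; placement need not be contiguous.
block_table = np.array([5, 2, 7])   # hypothetical allocation
seq_len = 150                        # occupies 3 blocks, the last one partial

def gather_kv(pos):
    """Fetch the KV vector for logical token position `pos` (illustrative helper)."""
    logical_block, offset = divmod(pos, BLOCK_SIZE)
    phys_block = block_table[logical_block]
    return kv_pool[phys_block, offset]

# Token 130 -> logical block 2, offset 2 -> physical block 7, slot 2.
vec = gather_kv(130)
```

Because blocks are allocated on demand, variable-length sequences waste at most one partially filled block each, which is one reason paging suits the variable-length decoding workloads FlashMLA targets.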
DeepSeek says FlashMLA drew inspiration from projects such as FlashAttention 2 and 3 and NVIDIA’s CUTLASS. The kernel is available on GitHub for exploration and use.
“Honored to share FlashMLA – our efficient MLA decoding kernel for Hopper GPUs, optimised for variable-length sequences and now in production,” the company said in a post on X.
The release of FlashMLA, available on GitHub, is expected to improve computational efficiency in AI applications, with potential knock-on effects in compute-heavy sectors such as cryptocurrency trading algorithms.
DeepSeek recently announced it is launching five open-source repositories starting this week. “We’re a tiny team (at) DeepSeek exploring AGI (Artificial General Intelligence). Starting next week, we’ll be open-sourcing five repos, sharing our small but sincere progress with full transparency,” it said on X.
Currently, it has a collection of 14 open-source models and repositories on Hugging Face.
Recently, it released its DeepSeek-R1 and DeepSeek-V3 models, which offer state-of-the-art performance while being trained and deployed at a fraction of the cost of competing models.