Finally got DeepSeek V4 Flash deployed now that llama.cpp finally supports it, gkd

yeyucca 2026-07-03 18:02 1

This is DeepSeek V4 Flash on non-Apple hardware (for Apple devices, antirez's Metal framework is already out).

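The model I used is still his, though: the IQ2_XXS 2-bit version, which needs 86.7 GB of VRAM. I'm running it on two 5090 cards with 32 GB each; luckily the server also has 512 GB of RAM.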
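A quick back-of-the-envelope check of those numbers, assuming the 86.7 GB figure is the model's total footprint and ignoring the KV cache and compute buffers (both are assumptions, not measurements). Two 32 GB cards cannot hold the weights on their own, so part of the model has to sit in system RAM:

```python
# Rough memory budget; the figures come from the post, the split itself is an assumption.
MODEL_GB = 86.7        # reported model footprint
VRAM_PER_CARD_GB = 32  # one RTX 5090
NUM_CARDS = 2
SYSTEM_RAM_GB = 512

total_vram_gb = VRAM_PER_CARD_GB * NUM_CARDS
# Whatever does not fit in VRAM has to live in system RAM.
spill_gb = max(0.0, MODEL_GB - total_vram_gb)

print(f"total VRAM: {total_vram_gb} GB")
print(f"must spill to RAM: at least {spill_gb:.1f} GB")
print(f"fits in RAM: {spill_gb <= SYSTEM_RAM_GB}")
```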

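llama.cpp only got the update a few days ago. When a model comes out, the framework still has to be adapted to it; many earlier models had 0-day support, but this time the wait was long.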

Make sure you use a llama.cpp release that includes this PR, i.e. no lower than **b9840 (haha, rigorous enough)**.
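llama.cpp release tags have the form `b<number>`, so the minimum-build rule can be checked mechanically. A minimal sketch: the helper name `build_at_least` is made up for illustration, and how you obtain the tag of your own build is left out.

```python
import re

MIN_BUILD = 9840  # from the post: use b9840 or newer

def build_at_least(tag: str, minimum: int = MIN_BUILD) -> bool:
    """Return True if a llama.cpp release tag such as 'b9840' is >= minimum."""
    match = re.fullmatch(r"b(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unexpected release tag: {tag!r}")
    return int(match.group(1)) >= minimum

print(build_at_least("b9840"))  # True
print(build_at_least("b9839"))  # False
```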

github.com/ggml-org/llama.cpp

DeepSeek V4 (#24162)

master ← am17an:dsv4

Merged 08:58AM - 29 Jun 26 UTC

am17an

+4698
-40

## Overview

This PR adds support for the deepseek-v4 models. The most novel part of this architecture is the compressed attention. There are two types:

1. CSA (Compressed Sparse Attn) - it is a variant of DSA (introduced in DeepseekV3.2), it operates on the same principle of the lightning indexer to get top-k tokens to attend to, except tokens are "compressed tokens". A compressed token in CSA is every 4 tokens compressed into 1. It maintains a window of the last 8 tokens and does this at every 4 token boundary

2. HCA (Heavily Compressed Attn) - it is like normal attention over compressed tokens plus SWA, the compression being large at 128 tokens.
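To get a feel for the two compression ratios, the sketch below counts how many compressed entries a context of a given length would produce. It assumes one compressed token is emitted per completed block (every 4 tokens for CSA, every 128 for HCA) and ignores the sliding-window entries and the top-k selection. It illustrates the ratios only and is not the PR's `comp_plan` logic.

```python
CSA_RATIO = 4    # CSA: every 4 tokens compressed into 1 (from the PR text)
HCA_RATIO = 128  # HCA: compression over 128 tokens (from the PR text)

def compressed_entries(n_tokens: int, ratio: int) -> int:
    """Number of completed compression blocks, assuming one entry per full block."""
    return n_tokens // ratio

for n_tokens in (2048, 65536):
    csa = compressed_entries(n_tokens, CSA_RATIO)
    hca = compressed_entries(n_tokens, HCA_RATIO)
    print(f"{n_tokens} tokens -> CSA: {csa}, HCA: {hca}")
```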

This PR handles this by creating compression plans (`comp_plan` in the code) which are created by the context and executed on the GPU. There are some extra writes to maintain graph topology for graph reuse.

These two caches are `llama_kv_cache` objects but they are always non-unified (i.e. stream aware). The slots are managed by the context.

Every layer also has a SWA cache, we use a `llama_kv_cache_iswa` wrapper for this to expose only the SWA. So attention is `[swa entries | compressed block entries]`

Perf on a DGX Spark:

`./build/bin/llama-bench -m DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf -dio 1 -fa 1 -ub 2048 -p 2048 -n 32`

```
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124546 MiB
| model | size | params | backend | ngl | n_ubatch | fa | dio | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | --: | --: | --------------: | -------------------: |
| deepseek4 ?B IQ2_XXS - 2.0625 bpw | 80.76 GiB | 284.33 B | CUDA | -1 | 2048 | 1 | 1 | pp2048 | 147.95 ± 3.93 |
| deepseek4 ?B IQ2_XXS - 2.0625 bpw | 80.76 GiB | 284.33 B | CUDA | -1 | 2048 | 1 | 1 | tg32 | 5.99 ± 0.51 |
```
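The table's numbers can be cross-checked with a little arithmetic. The sketch below converts the reported size from GiB to GB, derives the effective bits per weight from the size and parameter count, and estimates the wall-clock time of each test from the mean throughput. It is plain arithmetic on the values in the table; the ± spread is ignored.

```python
SIZE_GIB = 80.76                  # "size" column
PARAMS = 284.33e9                 # "params" column
PP_TOKENS, PP_TPS = 2048, 147.95  # pp2048 mean t/s
TG_TOKENS, TG_TPS = 32, 5.99      # tg32 mean t/s

size_bytes = SIZE_GIB * 2**30
size_gb = size_bytes / 1e9
bits_per_weight = size_bytes * 8 / PARAMS

print(f"size: {size_gb:.1f} GB")
print(f"effective bpw: {bits_per_weight:.2f}")
print(f"prompt time: {PP_TOKENS / PP_TPS:.1f} s")
print(f"generation time: {TG_TOKENS / TG_TPS:.1f} s")
```

The size works out to about 86.7 GB, which matches the figure quoted earlier in the post. The effective bits per weight land above the nominal 2.0625 of IQ2_XXS, presumably because the file name indicates that some tensors are kept at higher-precision types.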

## TODOs

Mainly performance improvements
- [ ] MTP
- [ ] `-sm tensor`
- [ ] Lightning indexer (#24231)
- [ ] HC/sinkhorn ops to reduce the number of graph nodes

## Credits

Thanks to @pwilkin for the correct chat template + debugging help
Thanks to @fairydreaming for his help in debugging + contributing fixes

## Additional information

## Requirements

- I have read and agree with the [contributing guidelines](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md)
- AI usage disclosure: YES, paired with both codex and claude.

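Oh, and remember: the framework currently does not support setting the KV cache to q8_0 for DeepSeek, so keep ctk at f16 at minimum. The KV cache already defaults to f16 for output quality, so leaving it untouched works too. Otherwise you will get garbage output; I nearly deleted the 80+ GB model I had downloaded ^-^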
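In llama.cpp the K and V cache types are set with `-ctk` / `-ctv` (long forms `--cache-type-k` / `--cache-type-v`). A minimal sketch of a launcher that refuses quantized cache types for this model: the model path is a placeholder, the `build_server_args` helper is made up for illustration, and reading "at least f16" as "f16 or f32" is an assumption.

```python
# Assumption: "at least f16" is read as f16 or f32.
ALLOWED_CACHE_TYPES = {"f16", "f32"}

def build_server_args(model_path: str, cache_type: str = "f16") -> list[str]:
    """Build a llama-server argument list, rejecting quantized KV cache types."""
    if cache_type not in ALLOWED_CACHE_TYPES:
        # Per the post: a q8_0 KV cache produces garbage output with this model.
        raise ValueError(f"KV cache type must be f16 or f32, got {cache_type!r}")
    return ["llama-server", "-m", model_path, "-ctk", cache_type, "-ctv", cache_type]

args = build_server_args("model.gguf")  # placeholder path
print(" ".join(args))
```

Since f16 is the default, omitting both flags has the same effect; spelling them out mainly guards against a copied command line that still carries `-ctk q8_0`.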

Latest replies (1)

牛就是牛 07-03 19:33

#1

How many folks on this forum can actually run this? Might be worth taking a headcount.

* Post source: Linux.do
