DeepSeek V4 (#24162)
master ← am17an:dsv4
Merged 08:58AM - 29 Jun 26 UTC
## Overview
This PR adds support for the deepseek-v4 models. The most novel part of this architecture is the compressed attention, which comes in two types:
1. CSA (Compressed Sparse Attn): a variant of DSA (introduced in DeepseekV3.2). It uses the same lightning-indexer principle to pick the top-k tokens to attend to, except that the tokens are "compressed tokens". In CSA, every 4 tokens are compressed into 1. It maintains a window of the last 8 tokens and performs the compression at every 4-token boundary.
2. HCA (Heavily Compressed Attn): like normal attention over compressed tokens plus SWA, with a much larger compression ratio of 128 tokens.
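The compression schedule described above can be sketched roughly as follows. This is only an illustration, not the PR's `comp_plan` structure: the `comp_entry` struct, the `build_plan` function, and the reading that each compressed token is built from the trailing `window` tokens are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical plan entry (not the PR's comp_plan layout).
struct comp_entry {
    int64_t src_begin; // first source token position (inclusive)
    int64_t src_end;   // one past the last source token position
    int64_t dst;       // index of the compressed token that is produced
};

// Walk the token positions [pos_begin, pos_end) of one sequence and emit one
// entry whenever a `ratio`-token boundary is crossed. Each entry reads the
// trailing `window` tokens (assumed: ratio = 4, window = 8 for CSA).
std::vector<comp_entry> build_plan(int64_t pos_begin, int64_t pos_end,
                                   int64_t ratio, int64_t window) {
    std::vector<comp_entry> plan;
    for (int64_t pos = pos_begin; pos < pos_end; ++pos) {
        if ((pos + 1) % ratio != 0) {
            continue; // not at a compression boundary yet
        }
        const int64_t end = pos + 1;
        plan.push_back({ std::max<int64_t>(0, end - window), end, end / ratio - 1 });
    }
    return plan;
}
```

With `ratio = 4` and `window = 8`, a batch covering positions 0 to 9 yields two entries, reading `[0, 4)` and `[0, 8)`. With `ratio = 128`, nothing is emitted until position 127 has been processed.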
This PR handles this by creating compression plans (`comp_plan` in the code), which are built by the context and executed on the GPU. There are some extra writes that keep the graph topology unchanged so that the graph can be reused.
The compressed caches for these two attention types are `llama_kv_cache` objects, but they are always non-unified (i.e. stream-aware). Their slots are managed by the context.
Every layer also has a SWA cache; we use a `llama_kv_cache_iswa` wrapper for this to expose only the SWA part. The attention input is therefore laid out as `[swa entries | compressed block entries]`.
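To give a feel for the `[swa entries | compressed block entries]` layout, here is a minimal sketch of how many entries of each kind a single query might see. The `kv_span` struct, the `kv_layout` function, and all of its parameters (SWA window size, `top_k`) are hypothetical names chosen for this example. It also assumes that only fully completed compressed blocks are visible and that CSA caps them at top-k.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical summary of the KV entries visible to one query token.
struct kv_span {
    int64_t n_swa;  // entries taken from the sliding-window (SWA) cache
    int64_t n_comp; // entries taken from the compressed cache
};

// pos          : position of the query token (0-based)
// n_swa_window : SWA window size (assumed parameter)
// ratio        : tokens per compressed token (4 for CSA, 128 for HCA)
// top_k        : > 0 caps the compressed entries (CSA-style selection),
//                0 means attend to all compressed entries (HCA-style)
kv_span kv_layout(int64_t pos, int64_t n_swa_window, int64_t ratio, int64_t top_k) {
    const int64_t n_seen   = pos + 1;         // tokens seen so far, query included
    const int64_t n_blocks = n_seen / ratio;  // fully completed compressed blocks
    kv_span s;
    s.n_swa  = std::min(n_seen, n_swa_window);
    s.n_comp = top_k > 0 ? std::min(n_blocks, top_k) : n_blocks;
    return s;
}
```

For example, at position 1023 with an assumed SWA window of 128, an HCA-style layer (`ratio = 128`, `top_k = 0`) would see 128 SWA entries followed by 8 compressed entries.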
Perf on a DGX Spark:
`./build/bin/llama-bench -m DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf -dio 1 -fa 1 -ub 2048 -p 2048 -n 32`
```
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124546 MiB
| model | size | params | backend | ngl | n_ubatch | fa | dio | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | --: | --: | --------------: | -------------------: |
| deepseek4 ?B IQ2_XXS - 2.0625 bpw | 80.76 GiB | 284.33 B | CUDA | -1 | 2048 | 1 | 1 | pp2048 | 147.95 ± 3.93 |
| deepseek4 ?B IQ2_XXS - 2.0625 bpw | 80.76 GiB | 284.33 B | CUDA | -1 | 2048 | 1 | 1 | tg32 | 5.99 ± 0.51 |
```
## TODOs
Mainly performance improvements
- [ ] MTP
- [ ] `-sm tensor`
- [ ] Lightning indexer (#24231)
- [ ] HC/sinkhorn ops to reduce the number of graph nodes
## Credits
Thanks to @pwilkin for the correct chat template + debugging help
Thanks to @fairydreaming for his help in debugging + contributing fixes
## Additional information
## Requirements
- I have read and agree with the [contributing guidelines](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md)
- AI usage disclosure: YES, paired with both codex and claude.