LLM Inference Scope

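Paper Radar

In-depth summaries of the latest papers on LLM inference acceleration, KV cache, attention, scheduling, parallelism, and deployment optimization.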

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

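Tsinghua University / Universidade de Lisboa / Instituto de Telecomunicações / Carnegie Mellon University / Sapienza University of Rome / University of Edinburgh / TransPerfect / ELLIS Unit Lisbon (university / research institute / industry) · 2026-05-18 · MoE / Sparse, Operators, Attention

DashAttention targets two pain points of hierarchical sparse attention methods such as NSA and InfLLMv2: the number of Top-K blocks is fixed, and the coarse sparse selection is not differentiable with respect to the subsequent dense softmax. In the first stage it uses adaptively sparse alpha-entmax to select a variable number of KV blocks per query, then uses that selection as a prior for the second-stage softmax, which makes the hierarchical attention end-to-end differentiable. The paper claims accuracy close to full attention at 75% sparsity and a better Pareto frontier than NSA/InfLLMv2 in the high-sparsity regime, …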
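To make the "variable number of KV blocks per query" idea concrete, here is a minimal sketch of the first-stage selection. It swaps in sparsemax, which is the alpha = 2 special case of alpha-entmax, because it has a simple closed form; the paper's actual alpha, block scoring, and second-stage prior are not reproduced here, and the function names are hypothetical.

```python
def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): projection of scores onto the simplex.

    Unlike softmax, entries below a data-dependent threshold become exactly 0.
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k  # threshold for a support of size k
        else:
            break
    return [max(s - tau, 0.0) for s in scores]


def select_blocks(block_scores):
    """Return the indices of KV blocks with non-zero weight, plus the weights.

    Assumption: block_scores holds one coarse query-to-block score per KV block.
    """
    weights = sparsemax(block_scores)
    selected = [i for i, w in enumerate(weights) if w > 0.0]
    return selected, weights


if __name__ == "__main__":
    # A peaked query keeps one block; a flat query keeps all of them.
    print(select_blocks([5.0, 0.0, 0.0, 0.0])[0])
    print(select_blocks([1.0, 1.0, 1.0, 1.0])[0])
```

The point of the sketch is only that the support size falls out of the scores themselves, so no fixed K has to be chosen in advance.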

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

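NVIDIA (industry) · 2026-05-18 · KV Cache, Serving, Parallelism

LongLive-2.0 is a complete training-to-inference infrastructure for long video generation rather than a single quantization trick. On the training side it uses sequence-parallel AR training and Balanced SP, pairing clean-history with noisy-target temporal chunks and combining them with SP-aware chunked VAE encoding; on the precision side it uses NVFP4 to cut memory and GEMM cost. On the inference side it runs W4A4 NVFP4, NVFP4 KV cache quantization, and asynchronous streaming VAE decoding on Blackwell, non-Bl…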

Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

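Shanghai Jiao Tong University / NVIDIA Research / University of Science and Technology of China / University of Chinese Academy of Sciences / NUS / University of Waterloo / HKUST / HKU / ZGCA (university / research institute / industry) · 2026-05-18 · KV Cache, Attention, Ascend

Incantation is not a conventional LLM inference paper; it is a multi-entity interactive video world model. It uses natural language as the action interface, applies text-conditioned control at a granularity of about 0.25 seconds per latent frame, and uses frame-local text cross-attention to support multi-entity actions and cross-entity transfer. On the systems side, ODE-initialized Self-Forcing distillation and a RoPE-decoupled sliding KV cache support real-time, long-horizon streaming. The paper reports cross-entity transfer of 89% versus 43%, out-of-vocabula…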

Latent Action Reparameterization for Efficient Agent Inference

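Université de Montréal / University of Sydney / Fudan University / Yale University / DeepWisdom / UIUC / Amazon Science / Stanford University / HKUST(GZ) / Mila (university / research institute / industry) · 2026-05-18 · Serving, Agent

LAR moves the question of agent inference cost from "how does the system schedule tokens" to "is the action space itself too long". It learns a compact latent action space in which one latent action corresponds to multiple steps of semantic behavior, compressing long strings of low-level textual actions into a shorter decision horizon. Unlike hand-written macros or hierarchical controllers, the latent actions are learned from agent trajectories and plug directly into the model's planning and execution. The abstract claims fewer action tokens and lower wall-clock inference time across several LLM agent benchmarks while maintaining or improving task…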

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

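Shanghai Jiao Tong University / Shandong University / Huazhong University of Science and Technology / University of Tokyo / HKUST / South China University of Technology / Shanghai AI Lab (university / research institute) · 2026-05-18 · KV Cache, Operators, Attention

Focused Forcing addresses KV cache compression in long-horizon AR video diffusion. It observes that different frames within the same generated chunk depend on different history frames, that a change in the relative temporal position of the same history frame changes its attention score, and that masking different heads degrades quality by different amounts. For each generated frame, the method combines the attention score with a diversity score over history frames to select history that is both more relevant and more discriminative, and allocates budget according to head importance. As a training-free…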

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

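HKUST / Xi'an Jiaotong University (university) · 2026-05-18 · KV Cache, Serving, Scheduling

KVDrive is a multi-tier KV cache management system for long-context LLMs that spans GPU HBM, host DRAM, and SSD. Rather than pursuing more aggressive sparse compression, it takes a systems view of cache placement, pipeline scheduling, and cross-tier data movement. Its core claim is that as context length and batch size keep growing, the bottleneck of offloading systems shifts from "which tokens matter" to "how KV moves between HBM, DRAM, and SSD, and how that movement overlaps with CPU/GPU compute". With its prototype, the paper reports that, at preserved accuracy, it reaches up to 1.74x… over SOTA systems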

Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

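Independent Researcher (independent researcher) · 2026-05-18 · KV Cache, Serving, Operators

This paper compares KV eviction policies under a single globally capped decode-time harness, and its conclusion is a useful engineering warning: the differences between many policies matter less than "structural protection". Without prompt-boundary protection, LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random, and others nearly collapse on 6 pure-transformer models; once 10% of the cache is reserved for boundary tokens, C=256 with 13% retention on LongBench recovers C=2048 refe…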
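The "reserve part of the cache for boundary tokens, then let any scorer fill the rest" pattern can be sketched as below. This is an illustration under stated assumptions, not the paper's harness: "boundary tokens" are taken to be the last tokens of the prompt, `scores` stands for whatever importance score a policy such as H2O or SnapKV would produce, and the function name is hypothetical.

```python
def evict_with_protection(scores, prompt_len, capacity, protect_frac=0.10):
    """Return the sorted token indices to keep under a global cache cap.

    scores       : per-token importance from any eviction policy (assumed given)
    prompt_len   : number of prompt tokens; the tokens just before this index
                   are treated as the protected prompt boundary (assumption)
    capacity     : global cap C on the number of cached tokens
    protect_frac : share of the cap reserved for boundary tokens
    """
    n = len(scores)
    if n <= capacity:
        return list(range(n))

    n_protected = min(int(capacity * protect_frac), prompt_len)
    protected = list(range(prompt_len - n_protected, prompt_len))
    protected_set = set(protected)

    # Everything else competes on score for the remaining slots.
    rest = [i for i in range(n) if i not in protected_set]
    rest.sort(key=lambda i: scores[i], reverse=True)
    keep = protected + rest[: capacity - n_protected]
    return sorted(keep)


if __name__ == "__main__":
    scores = [float(20 - i) for i in range(20)]
    scores[9] = -1.0  # the scorer dislikes the last prompt token
    print(evict_with_protection(scores, prompt_len=10, capacity=10))
    print(evict_with_protection(scores, prompt_len=10, capacity=10, protect_frac=0.0))
```

With protection enabled, token 9 survives even though the scorer ranks it last; with `protect_frac=0.0` it is evicted.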

Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

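University of Southern California / Florida International University (university) · 2026-05-18 · MoE / Sparse, Quantization, Operators

This paper uses LLM agents to build fuzzy cognitive maps automatically from text, then performs causal graph inference by mixing the FCM matrices of overlapping chunks and applying Bayesian de-chunking; the worked example models the Thucydides Trap narrative. It does involve sparse causal chunk matrices, but at its core it is a knowledge-graph and causal-modeling method, not work on LLM inference acceleration, KV cache, serving scheduling, or the inference cost of agent tasks.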

Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

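Virginia Tech / UC Berkeley / UIUC (university) · 2026-05-18 · Long Context

This paper is about the longitudinal safety of memory-equipped LLM agents, not inference acceleration. It introduces temporal memory contamination: after the same agent accumulates memory across multiple independent tasks, early memories affect its safety behavior on later, unrelated tasks. Methodologically, it uses a trigger-probe protocol to evaluate read-only memory snapshots at different prefix lengths, and a NullMemory counterfactual to identify memory-induced violations. The results show…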

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

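Together AI / University of Sydney / UIUC (industry / university) · 2026-05-18 · KV Cache, Serving, Quantization

OSCAR goes straight at the deployability of INT2 KV-cache quantization. A plain Hadamard rotation suppresses outliers, but at INT2 it collapses because it is not aligned with the covariance structure that attention actually consumes. OSCAR estimates an attention-aware covariance offline, produces a fixed rotation and clipping threshold, and pairs them with a custom INT2 attention kernel; it is compatible with paged KV-cache serving and fused kernel pipelines and can be plugged into SGLang/vLLM…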

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

UIUC / VinUniversity / VinUni-Illinois Smart Health Center / DFKI / IMPRS-IS / University of Stuttgart (university / research institute) · 2026-05-17 · Serving, MoE / Sparse, Attention

SparseSAM targets the latency and memory footprint of SAM's ViT image encoder, and stresses that compressing attention alone is not enough because the MLP remains dense. It proposes training-free structured sparsification: Stripe-Sort Attention uses a deterministic Z-order permutation to turn dense attention into a static, hardware-friendly sparse pattern, avoiding dynamic mask overhead; Residual-Consistency MLP lets only informative…
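For readers unfamiliar with Z-order, the sketch below shows how a deterministic Z-order (Morton) permutation of a 2D patch grid can be computed. It illustrates only the ordering itself; how SparseSAM groups the reordered tokens into stripes or sparse tiles is not shown, and the function names are hypothetical.

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order (Morton) code."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)       # x bits go to even positions
        code |= ((y >> b) & 1) << (2 * b + 1)   # y bits go to odd positions
    return code


def z_order_permutation(width, height):
    """Return row-major patch indices sorted by their Morton code.

    The permutation depends only on the grid shape, so it can be computed
    once and reused for every image of that shape.
    """
    coords = [(x, y) for y in range(height) for x in range(width)]
    return sorted(range(len(coords)), key=lambda i: morton2d(*coords[i]))


if __name__ == "__main__":
    # On a 4x4 grid, consecutive entries form 2x2 spatial neighbourhoods.
    print(z_order_permutation(4, 4))
```

Because spatial neighbours end up close together in the 1D order, a fixed banded or blocked attention pattern over the permuted sequence covers local 2D neighbourhoods without a per-input mask.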

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

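University of Chicago / Tensormesh Inc. / Samsung Semiconductor / Microsoft Research (university / industry) · 2026-05-17 · KV Cache, Serving, Speculative Decoding

VeriCache turns lossy KV cache compression into lossless decoding: the compressed KV produces draft tokens and the full KV performs verification, with the goal that the output is identical to full-KV-cache decoding while throughput stays close to that of the compression method. The key systems challenge is that the full KV is not resident on the GPU, so verification has to swap it in; the paper exploits the fact that compressed-KV decoding is HBM bandwidth-bound while the full-KV swap is PC…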
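The draft-then-verify control flow behind "lossy drafts, lossless output" can be sketched with two toy next-token functions. Everything here is an assumption for illustration: greedy decoding, `draft_next` standing in for decoding with compressed KV, `full_next` standing in for decoding with full KV, and a per-position verification loop where a real system would score the whole window in one batched pass.

```python
def verified_decode(draft_next, full_next, prompt, num_tokens, window=4):
    """Greedy draft-and-verify decoding.

    draft_next / full_next: callables mapping a token list to the next token.
    Every emitted token comes from full_next, so the output matches plain
    full-model greedy decoding no matter how wrong the drafts are.
    """
    out = list(prompt)
    target_len = len(prompt) + num_tokens
    while len(out) < target_len:
        # 1) Draft a window of tokens with the cheap (lossy) path.
        ctx = list(out)
        draft = []
        for _ in range(window):
            token = draft_next(ctx)
            draft.append(token)
            ctx.append(token)
        # 2) Verify: keep drafted tokens while they match the full path,
        #    and replace the first mismatch with the full path's token.
        for token in draft:
            expected = full_next(out)
            out.append(expected)
            if expected != token or len(out) == target_len:
                break
    return out[len(prompt):]


def plain_decode(full_next, prompt, num_tokens):
    """Reference: ordinary greedy decoding with the full path only."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(full_next(out))
    return out[len(prompt):]
```

The speedup in a real system comes from how many drafted tokens survive each verification pass; the sketch only demonstrates the equivalence guarantee.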

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

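Tsinghua University / JD.com (university / industry) · 2026-05-17 · KV Cache, Serving, MoE / Sparse

FastOCR targets the dense document visual tokens of OCR VLMs. Permanently dropping tokens during prefill works on natural images, but in OCR every visual token may correspond to a character or a piece of structure, so permanent eviction easily causes catastrophic accuracy loss. The paper observes that attention in document parsing is temporally sparse: each decode step attends to a small region, and the fixation moves gradually. FastOCR uses Focal-Guided Pruning to dynamically select the relevant vi… at focal layers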

TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks

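University of Cambridge / Imperial College London (university) · 2026-05-16 · KV Cache, Serving, Quantization

TriAxialKV addresses low-bit KV-cache quantization for agentic inference. It points out that agent context is not homogeneous text: token importance depends jointly on temporal recency, modality (text or image), and semantic role (user request, tool call, observation, reasoning). The method assigns each token a triaxial tag, calibrates sensitivity per tag, allocates INT2/INT4 bitwidths under a fixed memory budget, and implements an end-to-end system comprising calibration, mixed-precision quantization, memory management, and a fused Triton decode kernel. Qwen3-VL-32B-Thinking on OSWorld c…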
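One simple way to picture "allocate INT2/INT4 per tag under a fixed memory budget" is a greedy upgrade: start every tag at INT2 and promote the most sensitive tags to INT4 while the budget allows. This is a sketch of that idea only; the paper's actual allocation rule is not given in the summary, the tag names and sensitivity values are made up, and cost is counted as bits per token, ignoring head dimension and metadata.

```python
def allocate_bits(groups, budget_bits):
    """Greedy INT2/INT4 allocation per tag under a total bit budget.

    groups      : dict tag -> (num_tokens, sensitivity); sensitivity is assumed
                  to come from an offline calibration step
    budget_bits : total budget, in units of bits per token summed over tokens
    """
    bits = {tag: 2 for tag in groups}
    used = sum(2 * n for n, _ in groups.values())
    if used > budget_bits:
        raise ValueError("budget too small even for all-INT2")

    # Promote tags in order of decreasing sensitivity while they still fit.
    for tag in sorted(groups, key=lambda t: groups[t][1], reverse=True):
        num_tokens, _ = groups[tag]
        extra = 2 * num_tokens  # cost of going from 2 to 4 bits
        if used + extra <= budget_bits:
            bits[tag] = 4
            used += extra
    return bits


if __name__ == "__main__":
    groups = {
        "recent_user_request": (100, 0.9),   # hypothetical tags and scores
        "tool_call": (200, 0.5),
        "old_observation": (1000, 0.1),
    }
    print(allocate_bits(groups, budget_bits=3300))
```

In this toy run the small, sensitive tags get INT4 and the large, insensitive block of old observations stays at INT2.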

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

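Nanjing University / Alibaba (university / industry) · 2026-05-16 · KV Cache, MoE / Sparse, Attention

RTPurbo tries to convert a full-attention LLM into a sparse long-context model quickly. It rests on three observations: only a few attention heads truly need full long context; long-range retrieval is governed mainly by a low-dimensional subspace, so a 16-dimensional indexer suffices to retrieve relevant tokens; and the effective token budget is strongly query-dependent, so dynamic top-p fits better than fixed top-k. The method keeps the full KV cache for retrieval heads and introduces a lightweight token…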
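The difference between a fixed top-k and a dynamic top-p token budget is easy to see in a few lines. The sketch below selects the smallest set of tokens whose softmax mass reaches `p`; it is a generic illustration, with `scores` standing in for whatever the indexer produces, and is not RTPurbo's actual selection code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(scores, p=0.9):
    """Keep the fewest tokens whose cumulative attention mass is at least p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


if __name__ == "__main__":
    print(len(top_p_select([10.0] + [0.0] * 7)))  # peaked query: tiny budget
    print(len(top_p_select([0.0] * 8)))           # flat query: large budget
```

A fixed top-k would spend the same budget on both queries; top-p spends one token on the peaked query and all eight on the flat one.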

WOW-Seg: A Word-free Open World Segmentation Model

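NKIARI Shenzhen Futian / Nankai University / Sichuan Agricultural University / Peking University Shenzhen Graduate School (university / research institute) · 2026-05-16 · Quantization, Attention, Framework

WOW-Seg is an open-world segmentation model, not an inference acceleration paper. It uses Mask2Token to convert image masks into visual tokens aligned with the VLLM feature space, then uses a Cascade Attention Mask to decouple instances and reduce inter-instance interference. It also builds RR-7K, which covers 7,662 open-world region recognition classes. It reports an LVIS semantic similarity of 89.7 and a semantic IoU of 82.4, surpassing the previous SOTA with roughly one eighth of the parameters.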

GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources

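Shanghai Jiao Tong University / The Chinese University of Hong Kong, Shenzhen / Shenzhen MSU-BIT University (university) · 2026-05-16 · Serving, Scheduling, Operators

GoodServe targets heterogeneous-GPU serving for agentic LLM inference. Its optimization objective is not single-request latency but goodput, meaning requests that meet their end-to-end SLO. The system first estimates each request's output length and the serving status of each GPU, then routes with just-enough instance selection; at runtime it continuously monitors the SLO violation risk of active requests and triggers runtime migration to handle dynamic changes. The evaluation shows goodput improved by up to… relative to existing routing methods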
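One plausible reading of "just-enough instance selection" is: among the instances predicted to meet the deadline, pick the one with the least slack, so faster GPUs stay free for tighter requests. The sketch below implements that reading only; the instance fields, the linear finish-time estimate, and the fallback rule are all assumptions rather than GoodServe's actual policy.

```python
def predicted_finish(instance, est_output_tokens):
    """Very rough finish-time estimate: queueing delay plus decode time."""
    return instance["queue_delay_s"] + est_output_tokens / instance["tokens_per_s"]


def pick_instance(instances, est_output_tokens, deadline_s):
    """Route to the feasible instance with the smallest slack.

    instances: list of dicts with hypothetical keys
               'name', 'queue_delay_s', 'tokens_per_s'
    Falls back to the earliest-finishing instance when none meets the deadline.
    """
    feasible = []
    for inst in instances:
        finish = predicted_finish(inst, est_output_tokens)
        if finish <= deadline_s:
            feasible.append((deadline_s - finish, inst["name"]))
    if feasible:
        return min(feasible)[1]
    fastest = min(instances, key=lambda i: predicted_finish(i, est_output_tokens))
    return fastest["name"]


if __name__ == "__main__":
    fleet = [
        {"name": "fast-gpu", "queue_delay_s": 0.5, "tokens_per_s": 100.0},
        {"name": "slow-gpu", "queue_delay_s": 0.0, "tokens_per_s": 25.0},
    ]
    print(pick_instance(fleet, est_output_tokens=200, deadline_s=10.0))
    print(pick_instance(fleet, est_output_tokens=200, deadline_s=5.0))
```

A loose deadline goes to the slow GPU and a tight one to the fast GPU, which is the behaviour a goodput objective rewards and a latency objective would not.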

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

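Fudan University / Shanghai Innovation Institute / ByteDance (university / research institute / industry) · 2026-05-16 · KV Cache, Serving, Parallelism

PA-BDM targets block diffusion decoding for document recognition. Existing BDMs tie denoising and cache commitment to a fixed block, so parallelism within a block shrinks over time and the KV cache can be written only after the whole block finishes; in addition, the information flow of bidirectional denoising inside a block is inconsistent with autoregression across blocks. PA-BDM switches to prefix-to-suffix causal denoising and treats block size as an upper bound on candidates…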

SpaceMoE: Towards Orbital General Intelligence with Distributed Mixture-of-Experts Inference

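University of Hong Kong / Xidian University (university) · 2026-05-16 · Scheduling, MoE / Sparse, Quantization

SpaceMoE is a survey and paradigm article on distributed MoE inference over satellite networks. It discusses how in-orbit devices are constrained by memory, compute, energy, thermal limits, and dynamic topology, and argues that MoE's sparse expert activation could let large models run in a distributed fashion across a constellation. The article focuses on three problems, namely expert placement, expert selection, and hidden-state transmission/routing, and stresses that battery degradation, link variation, and thermal constraints reshape MoE scheduling.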

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

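Seoul National University (university) · 2026-05-16 · KV Cache, Serving, MoE / Sparse

CompactAttention deals specifically with sparse attention efficiency under chunked prefill. Conventional block-sparse methods are mostly designed for one-shot prefill; once the query length is capped by the chunk size, kernel efficiency drops, and fine-grained pattern search has to be repeated over the accumulated KV for every chunk, which is expensive. CompactAttention treats the 2D block-sparse mask as a KV-selection signal rather than directly as a sparse-kernel execu…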

Lever: Speculative LLM Inference on Smartphones

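Tsinghua University / Beihang University / University of Pittsburgh (university) · 2026-05-16 · Serving, Parallelism, Speculative Decoding

Lever brings speculative decoding to flash-backed LLM inference on smartphones. The target model is too large to stay resident in DRAM and must live in flash, but autoregressive decoding calls the target model repeatedly and incurs heavy I/O. Lever keeps a small draft model resident in DRAM and has the large target model verify multiple tokens at a time from flash; to cope with slow mobile I/O, limited parallel compute, and irregular speculation execution, it jointly optimizes drafting, verification, and ex…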

HexAGenT: Efficient Agentic LLM Serving via Workflow- and Heterogeneity-Aware Scheduling

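HKUST / WeBank / Wuhan University / Tsinghua University (university / industry) · 2026-05-15 · KV Cache, Serving, Scheduling

HexAGenT targets heterogeneous prefill-decode disaggregated serving for agentic LLM workflows. It models a request as a DAG that is revealed step by step at runtime; what the user perceives is workflow-level latency, not the latency of a single LLM call. The scheduler maintains an estimate of each workflow's standalone completion horizon, prioritizes ready calls by their risk of missing that horizon, and jointly chooses the prefill location, decode location, and local queue priority, explicitly accounting for KV-cach…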

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

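University of Central Florida (university) · 2026-05-15 · KV Cache, MoE / Sparse, Operators

ARL2 addresses the quadratic complexity of cross-frame softmax attention and the linear growth of the KV cache in autoregressive video diffusion. It splits self-attention into two branches: intra-frame softmax preserves spatial detail and local dependencies, while an inter-frame gated recurrent linear branch represents cross-frame memory with a fixed-size recurrent state. To keep noisy intermediate states from polluting memory, it updates the recur… only after the denoised pass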

Property-Guided LLM Program Synthesis for Planning

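Federal University of Rio Grande do Sul / University of Oxford / Linköping University (university) · 2026-05-15 · Long Context

This paper applies LLM program synthesis to PDDL planning. The key is not getting the model to generate longer code in one shot but upgrading the external evaluator from a "scorer" into a formal property checker. As soon as a candidate heuristic function violates a direct heuristic property, the verifier stops and returns a concrete counterexample, and in the next round the LLM repairs only that failing state instead of blindly sampling in bulk. The experiments cover the 10 planning domains of the IPC 2023 Learning Track plus an OOD test set; the paper reports generating on average 7x fewer programs than the best prior generative method, candidate-evaluation compute lower by several orders of magnitude, and more tasks that need no search to…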

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

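Institute of Computing Technology, Chinese Academy of Sciences / University of Chinese Academy of Sciences / China University of Mining and Technology (Beijing) / ETH Zurich / CUNY (research institute / university) · 2026-05-15 · Operators, Attention, Long Context

Echo-Forcing addresses KV memory entanglement in interactive long video generation: if stable background anchors, recent dynamics, and recallable history share one cache policy, a prompt switch produces contamination from the old background, sluggish response, and long-range scene forgetting. It splits memory into three layers, Hierarchical Temporal Memory, Scene Recall Frames, and Difference-aware Memory Decay: the first maintains rolling early anchors with drift-gated compression, the middle layer compresses historical scenes into retrievable, spatially structured KV, and the last decays conflicting tokens according to the difference between old and new scenes. VB…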

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

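The Chinese University of Hong Kong / City University of Hong Kong / Tencent / SMU / Gaoling School of Artificial Intelligence, Renmin University of China (university / industry) · 2026-05-15 · KV Cache, Attention, Ascend

This paper offers a recipe for moving block attention from a "hand-crafted chunking trick" to a generalizable long-context mechanism. The authors first build SemanticSeg: more than 30,000 semantically segmented samples covering 16 categories such as books, code, web text, and conversations, with lengths from 2K to 32K, and use a lightweight segmenter to cut text automatically into relatively self-contained blocks. They then distill a block-attention student from a frozen full-attention teacher, adding block sink tokens, block dropout, and token-level loss weighting to mitigate…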

GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction

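Shanghai Jiao Tong University (university) · 2026-05-15 · KV Cache, Attention

GHOST targets the linear KV cache growth of VGGT-style streaming 3D reconstruction models and proposes online token eviction driven by geometric signals. Instead of relying only on attention scores or fixed anchor frames, it uses the 3D geometry the model itself outputs (depth, points, camera) for hierarchical importance scoring: the upper level keeps key frames and key tokens by geometric contribution, the lower level uses cosine similarity for layer-wise budget allocation, and a privilege mechanism protects special tokens. On benchmarks such as 7-Scenes, NRGBD, and Bonn, the paper…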

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

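Xiamen University / Alibaba (university / industry) · 2026-05-15 · KV Cache, Serving, Scheduling

FashionChameleon targets interactive garment-swap video for e-commerce and content creation, where the user can switch garments during generation while body motion and identity stay consistent. It first trains a teacher on single-reference garments, then trains a student with streaming distillation, and finally introduces Training-Free KV Cache Rescheduling at inference time, comprising garment KV refresh, historical KV withdraw, and reference KV disentangle, so that tokens for the new garment take over quickly while historical KV that conflicts with the old garment is withdrawn. The paper reports single GP…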

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

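Huawei (industry) · 2026-05-15 · Serving, MoE / Sparse, Quantization

ElasticDiT is a hardware-adaptive DiT for on-device high-resolution image generation. It turns one model into an elastic weight bank: Dynamic VAE Routing adjusts the latent resolution, Sparse-Depth Pruning adjusts the DiT block depth, and Unified Weight Co-Optimization ensures the max and lite paths share parameters. On the efficiency side it adds Shift Sparse Block Attention, with an average sparsity of 84.16%, and a Tiny DWT-Distilled VAE that is claimed to reach SD3-level reconstruction at 1/8 of the compute of a standard VAE. Results…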

DealMaTe: Multi-Dimensional Material Transfer via Diffusion Transformer

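Tsinghua University / Peng Cheng Laboratory / National Cheng Kung University / Great Bay University (university / research institute) · 2026-05-15 · Serving, Quantization, Operators

DealMaTe does material transfer, but it includes a clear multi-condition attention optimization on the inference side. The model uses neither text prompts nor a reference network; instead it takes three kinds of 3D shader conditions (depth, normal, lighting) and injects geometry, normal, and lighting information through Multi-Dim 3D Shader LoRA. Within attention, Shader Causal Mutual Attention manages the interaction among conditions, and the K/V of the condition features are computed only once at the initial diffusion step and reused from cache in later denoising steps. In the quantitative metrics, Dea…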

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

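City University of Hong Kong / Huawei (university / industry) · 2026-05-15 · Speculative Decoding, Operators

PSD attacks two inference bottlenecks of diffusion LLMs at once: spatially, a single forward pass selects multiple high-confidence positions to unmask; temporally, it constructs speculative drafts of multiple depths from the same prediction without extra model calls. Batched verification with hierarchical acceptance then keeps the deepest draft that is consistent with the updated prediction. It recasts the "multi-step denoising" of diffusion decoding as a combination of spatial parallelism and temporal speculation; across three dLLMs on reasoning and code generation it reaches up to 5.5x tokens per forward pa…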

Measuring Maximum Activations in Open Large Language Models

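Shanghai Jiao Tong University / Baidu / Nankai University (university / industry) · 2026-05-15 · MoE / Sparse, Quantization, Operators

This is a foundational measurement study for quantization and stable inference: on a 5,000-sample multi-domain corpus, the authors uniformly hook the embedding, hidden states, attention, MLP/MoE, SwiGLU gate, and final norm of 27 modern open-source checkpoints from 8 model families and measure activation maxima. The conclusion matters: at comparable parameter scale, the global maximum activation can span nearly 4 orders of magnitude; Qwen3.5 and many MoE checkpoints mostly fall in 1e2 to 1e3, whereas Gemma3-27B-it can reach about 7e5. The authors also note that across families and generations the values do not follow a simple monotonic…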

STS: Efficient Sparse Attention with Speculative Token Sparsity

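Hong Kong University of Science and Technology / UC Berkeley (university) · 2026-05-15 · Speculative Decoding, MoE / Sparse, Attention

STS extends the draft model in speculative decoding from "only generating draft tokens" to "providing a sparsity prior for target attention". The core observation is that the tokens and heads a small draft model deems important are predictive of the important tokens for the large target model; the draft attention scores can therefore be reused directly to build a token-and-head-wise sparsity mask that prunes the target LLM's attention computation. The paper targets million-token agentic applications and reports that on Narrativ…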

KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

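UIUC / AMD (university / industry) · 2026-05-14 · KV Cache, Attention, Ascend

KVCapsule targets the KV cache bloat caused by visual tokens in VLMs. The authors first analyze how visual and text tokens differ in spatial structure and information density, and point out that many LLM KV compression methods fail when transferred directly to VLMs. KVCapsule freezes the pretrained VLM backbone and leaves the attention computation modules unmodified; instead it adds lightweight external compression/reconstruction components that compress vision tokens in a structure-aware way. Across multiple VLMs and benchmarks, at 60% compre…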

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

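AWS / Google / University of Minnesota (industry / university) · 2026-05-14 · Parallelism, MoE / Sparse, Operators

DualKV addresses shared-prompt duplication in RL post-training rollouts. GRPO/DAPO typically sample N responses for the same prompt; when N≥16 and P≥8K, standard FlashAttention replicates the P prompt tokens N times in both forward and backward, and the long prompt becomes the dominant cost of the policy update. DualKV proves that under a decoder-only causal mask the prompt representation is identical across responses, and therefore takes the micro-batch from N(P+R) repac…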

HoloMotion-1 Technical Report

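Horizon Robotics (industry) · 2026-05-14 · KV Cache, MoE / Sparse

HoloMotion-1 is Horizon Robotics' motion foundation model for humanoid robots. Its main line is not LLMs, but it carries systems signals for real-time control inference. It trains on a mix of motions reconstructed from large-scale video plus MoCap and internal data; the policy is a sparsely activated MoE Transformer and uses KV-cache inference in closed-loop control, so model capacity grows while each control step activates only some experts. The paper reports that MoE Transformer + KV cache improves inference efficiency by up to 4x; sequence-level PPO trains per motion segment rather than per timestep, long cl…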

From I/O to Code with Discovery Agent

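Peking University / Alibaba Tongyi Lab / Wuhan University / Renmin University of China / NUS / Shanghai Jiao Tong University (university / industry) · 2026-05-14 · Operators, Ascend

DIO-Agent studies IO2Code: recovering a program from input-output behavior alone, which is closer to reverse-engineering a black-box API or legacy system than NL2Code is. The method places the LLM inside an evolutionary search as the mutation operator; execution error signals feed into the next round's prompt, and the Transformation Priority Premise constrains the search to escalate step by step from constants to variables, conditionals, and loops, avoiding complex but overfitted programs at the outset. The authors also build IO2CodeBench, covering tasks of multiple difficulty levels; the experiments claim that DIO-Agent beats traditional program-by-exam… across different LLMs and difficulty levels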

An Interpretable Latency Model for Speculative Decoding in LLM Serving

MIT / Red Hat AI (university / industry) · 2026-05-14 · Serving, Speculative Decoding, MoE / Sparse

This paper gives speculative decoding an interpretable latency model: it uses Little's Law to derive the effective batch size from the request rate, then splits prefill, drafting, and verification into load-independent and load-dependent demand. It takes measurements with vLLM across verifier/drafter sizes, prefill/decode lengths, request rates, draft lengths, and acceptance probabilities, and explains why SD…
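The Little's Law step can be made concrete with a toy closed form. Little's Law says the average number of requests in the system equals arrival rate times average time in system, B = λ·W. If, as a simplifying assumption that is not taken from the paper, per-request latency is linear in the batch size, W(B) = fixed + per_req·B, then B = λ·(fixed + per_req·B) solves to B = λ·fixed / (1 − λ·per_req). The names below are hypothetical.

```python
def effective_batch(rate_rps, fixed_s, per_req_s):
    """Effective batch size from Little's Law under a linear latency model.

    rate_rps  : request arrival rate (requests per second)
    fixed_s   : load-independent latency per request (seconds)
    per_req_s : extra latency added per concurrent request (seconds),
                i.e. the load-dependent part (assumed linear)
    """
    utilization = rate_rps * per_req_s
    if utilization >= 1.0:
        raise ValueError("unstable: load-dependent demand exceeds capacity")
    return rate_rps * fixed_s / (1.0 - utilization)


def latency_at(batch, fixed_s, per_req_s):
    """Latency implied by the same linear model at a given batch size."""
    return fixed_s + per_req_s * batch


if __name__ == "__main__":
    b = effective_batch(rate_rps=10.0, fixed_s=0.5, per_req_s=0.02)
    w = latency_at(b, fixed_s=0.5, per_req_s=0.02)
    # Consistency check: Little's Law should hold, rate * latency == batch.
    print(b, w, 10.0 * w)
```

Even this toy version shows the qualitative effect: as the request rate rises, the effective batch and therefore the load-dependent share of latency grow nonlinearly.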

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

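Peking University (university) · 2026-05-14 · KV Cache, Serving, Parallelism

GQLA targets the hardware lock-in of MLA: on H100, the MLA in DeepSeek-V2/V3 sits close to the roofline thanks to latent KV compression, but the trained weights expose only the MQA-absorb decoding path. On cards with a different compute-to-bandwidth ratio, such as H20, this path becomes compute-bound; head-axis tensor parallelism also has to replicate the latent KV, and MTP yields no benefit. GQLA keeps two algebraically equivalent paths on the same trained weights: MQA-absorb on H100, and GQA + MTP with a per-group expanded cache on H20; it recommends…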

Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

Jie Jiang et al. (author team) · 2026-05-14 · Speculative Decoding

PPOW changes the drafter's training objective in speculative decoding from token-level imitation to window-level performance. The reason is that the gain from SD is determined by the contiguous accepted prefix within a speculative window, so a single hard-to-draft token invalidates every token after it. PPOW uses a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence…
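The claim that one hard-to-draft token wastes the rest of the window can be checked with a small calculation. Assuming, for illustration only, that each position i is accepted independently with probability p_i, the expected accepted prefix length is the sum over k of p_1·…·p_k. The sketch below compares two windows with the same average acceptance probability.

```python
def accepted_prefix_len(accepted_flags):
    """Length of the contiguous accepted prefix in one speculative window."""
    n = 0
    for ok in accepted_flags:
        if not ok:
            break
        n += 1
    return n


def expected_accepted(probs):
    """Expected accepted-prefix length under independent per-position acceptance.

    probs[i] is the (assumed) probability that drafted token i is accepted,
    given that all earlier drafted tokens were accepted.
    """
    expected, survive = 0.0, 1.0
    for p in probs:
        survive *= p          # probability the prefix reaches this position
        expected += survive
    return expected


if __name__ == "__main__":
    uniform = [0.7, 0.7, 0.7, 0.7]       # mean acceptance 0.7
    one_hard = [0.9, 0.1, 0.9, 0.9]      # same mean, one hard token
    print(expected_accepted(uniform), expected_accepted(one_hard))
```

Both windows have the same mean token-level acceptance, yet the window with a single hard token yields a noticeably shorter expected prefix, which is the gap a window-level objective is meant to close.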

HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

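Jonathan Cederlund et al. (author team) · 2026-05-14 · KV Cache, Serving, Attention

HeatKV targets KV-cache memory in Visual Autoregressive image generation: VAR can need gigabytes of KV per image. With a small calibration set it ranks, per head, the attention scores over historical scales and produces a static head-specific pruning schedule for a given memory budget. On Infinity-2B it reaches a 2x higher compression ratio than existing KV compression methods while keeping similar or better image fidelity, prompt alignmen…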

XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

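Gemini Stiftung (research institute) · 2026-05-14 · Serving, MoE / Sparse, Quantization

XFP is a quality-target-driven LLM weight quantizer: the operator sets a lower bound on per-channel cosine similarity, and the system automatically picks the codebook size, outlier budget, and packing, with no need for a Hessian, a calibration set, or a hand-chosen bit-width. Each matrix is split into a sparse fp16 outlier residual and a dense sub-byte codebook index, and V2/V2a share a fused decode kernel. Qwen3.5-122B-A10B on RTX PRO 6000 Bl…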

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

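Xuzhe Zheng et al. (author team) · 2026-05-14 · MoE / Sparse, Operators, Attention

HASTE is a training-free sparse attention accelerator for video DiTs. The authors point out that mask prediction in online top-p sparse attention has its own cost and that sparsity thresholds vary widely across heads. They therefore apply Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns each head its own top-p threshold. On Wan2.1-1.3B and 14B, as a plug-in for XAttention/SVG2, at 7…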

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

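Westlake University AGI Lab / UC Merced / StepFun (university / industry) · 2026-05-14 · KV Cache, Operators, Attention

Head Forcing is a head-aware KV policy for AR video diffusion: the authors classify attention heads as local, anchor, or memory. Local and anchor heads keep only the necessary tokens, memory heads use hierarchical memory with episodic updates to preserve long-range consistency, and head-wise RoPE re-encoding keeps positions within the pretrained range. It is training-free, extends generation from 5 seconds to the minute scale, and supports multi-prompt interactive synthesis. It is not mainline for LLM inference, but it offers a "per-he…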

BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

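Juntong Wu et al. (author team) · 2026-05-14 · Serving, Scheduling, MoE / Sparse

BEAM addresses the redundant expert activation caused by fixed Top-K routing in MoE. It uses a trainable binary mask for token-adaptive expert selection, a straight-through estimator to make the binary gate trainable, and an auxiliary regularizer to control sparsity; it also provides a custom CUDA kernel and integrates with vLLM. The experiments show it retains more than 98% of the original model's performance while cutting MoE layer FLOPs by up to 85%, with up to 2.5x faster decode and up to 1.4x higher throughput. It is well worth a deep read for MoE inference, but it is not purely…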

EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

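Zhiye Song et al. (author team) · 2026-05-14 · Serving, Parallelism, MoE / Sparse

EnergyLens is an energy modeling and configuration exploration tool for multi-GPU LLM inference: an einsum-like interface describes the model, fusion, parallelism, and compute-communication overlap, and MoE load imbalance and communication energy are modeled explicitly. It is validated on TP/EP configurations of Llama3 and Qwen3-MoE, with a multi-GPU prefill/decode energy MAPE of 9.25%-13.19% and an SM allocation MAPE of 12.… for Megatron-style overlap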

PreFT: Prefill-only finetuning for efficient inference

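Stanford University / Tilde Research (university / research institute) · 2026-05-14 · Serving, Operators, Framework

PreFT targets the decode throughput bottleneck of multi-user PEFT serving: adapter cost is fairly easy to amortize over the batched tokens of prefill, but during single-token decode, multiple adapters drag throughput down. It applies LoRA/ReFT adapters only to prefill tokens and drops them during decode, letting personalization information flow into subsequent generation through the hidden/KV states. In a vLLM implementation, Llama 3.1 70B serving 512 adapters simultaneously reaches 1.9x the throughput of conventional PEFT; on SFT, lo…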

Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

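LayerScale, Inc. (industry) · 2026-05-13 · KV Cache, Serving, Scheduling

Stateful Transformers change streaming scenarios from request-driven to data-driven: a session persistently maintains its KV cache and advances state incrementally as new data arrives, moving the O(n) prefill of each query off the critical path so that query latency becomes roughly O(|q|). Flash Queries go further and use the gaps between data arrivals to pre-evaluate registered questions; a multi-tenant scheduler uses cell-budget admission and prefix-aware grouped prefill so that multiple stateful sessions can coexist. Market data stream benchmark…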

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

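Institute of Computing Technology, Chinese Academy of Sciences / University of Chinese Academy of Sciences / Shanghai Jiao Tong University (research institute / university) · 2026-05-13 · KV Cache, Serving, Parallelism

KVServe targets disaggregated LLM serving: PD separation or KV state disaggregation turns KV into an explicit payload transmitted across network or storage boundaries, and a fixed compression strategy easily breaks down when the workload, bandwidth, or SLO/quality budget changes. It composes multiple KV compression methods into a modular strategy space, uses a Bayesian Profiling Engine to search for the 3D Pareto set and cut offline search cost by 50x, and then uses an online contr… built from a latency model plus a lightweight bandit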

FSCM: Frequency-Enhanced Spatial-Spectral Coupled Mamba for Infrared Hyperspectral Image Colorization

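Nanjing University of Science and Technology / University of Science and Technology Beijing / North University of China (university) · 2026-05-13 · MoE / Sparse, Operators, Attention

FSCM is about infrared hyperspectral image colorization, not LLM inference. It models the task as spatial-spectral-frequency coupled generation: the generator cascades FSBs, each combining a visual state-space/Mamba branch, a frequency-domain enhancement module (wavelet + Fourier gating), and multi-domain fusion, and an online semantic segmentation loss constrains the structure of road scenes. The experiments use metrics such as PSNR, SSIM, NIQE, and UIQI and compare against pix2pix, ToDayGAN, LKAT-GAN, DDGAN, MUGAN, MornGAN, VOS, and others; the paper claims better visual quality and semantic fidelity.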

SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

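Vladislav Makarov et al. (author team) · 2026-05-13 · Serving, Framework

SceneGraphVLM is application-level VLM acceleration: it serializes scene graphs in the token-efficient TOON format, reducing long text output and irrelevant objects/relations. Training starts with SFT and then applies RL with a hallucination-aware reward that balances relation coverage and precision and penalizes unsupported objects. In video, the previous frame's graph can serve as lightweight context, with no tracking or post-processing needed. On PSG, PVSG, and Action Genome, a small VLM with vLLM decoding takes about…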

Z-Order Transformer for Feed-Forward Gaussian Splatting

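The University of Hong Kong / Futurewei (university / industry) · 2026-05-13 · Serving, MoE / Sparse, Attention

Z-Order Transformer is a paper on making feed-forward Gaussian Splatting real-time; it is not about LLM serving. It uses Z-order to arrange an unstructured Gaussian set into a spatially contiguous sequence, so Transformer sparse attention captures spatial and semantic relations more easily, while adaptively pruning redundant Gaussian primitives and predicting attributes in a single forward pass. The abstract mentions only fewer primitives and fast high-quality novel view synthesis and gives no concrete latency or quality numbers…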

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

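Vage Egiazarian et al. (author team) · 2026-05-12 · Quantization

Grid Games studies multi-grid selection for microscaled 4-bit quantization: MXFP4/NVFP4 normally share one scale per group and use a fixed floating-point grid, whereas the authors let each group choose the better fit among two or more 4-bit grids and mark the choice with an extra bit in the scale. In theory, PO2 grids help clearly for small groups and the advantage vanishes for large groups; in practice the paper provides formats such as PO2(NF4), MPO2, PO2(Split87), and SFP4, which consistently improve over single-grid FP4 in PTQ of standard open-source models and in Llama-like pretraining…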
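The per-group "pick the better of several 4-bit grids" step can be sketched as follows. The first grid uses the FP4 E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6); the second, a uniform 0..7 grid, is a made-up stand-in and not one of the paper's formats. The scale is a plain float absmax scale here, whereas real MXFP4/NVFP4 scales are themselves low-precision; that simplification is an assumption of this sketch.

```python
import math

# Magnitude levels; sign is handled separately.
GRIDS = {
    "fp4_e2m1": [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
    "uniform": [float(v) for v in range(8)],  # hypothetical alternative grid
}


def quantize_group(values, levels):
    """Absmax-scale a group onto `levels` and return the dequantized values."""
    amax = max(abs(v) for v in values)
    scale = amax / levels[-1] if amax > 0 else 1.0
    out = []
    for v in values:
        nearest = min(levels, key=lambda lvl: abs(abs(v) / scale - lvl))
        out.append(math.copysign(nearest * scale, v))
    return out


def pick_grid(values):
    """Return (grid_name, squared_error, dequantized) for the best-fitting grid."""
    best = None
    for name, levels in GRIDS.items():
        deq = quantize_group(values, levels)
        err = sum((a - b) ** 2 for a, b in zip(values, deq))
        if best is None or err < best[1]:
            best = (name, err, deq)
    return best


if __name__ == "__main__":
    print(pick_grid([0.5, 1.0, 1.5, -6.0])[0])          # outlier-heavy group
    print(pick_grid([1.0, 2.0, 3.0, 4.0, 5.0, -7.0])[0])  # evenly spread group
```

In a packed format, the chosen grid name would be stored as the extra selector bit alongside the group scale.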

BFLA: Block-Filtered Long-Context Attention Mechanism

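Chong Wu et al. (author team) · 2026-05-12 · KV Cache, Serving, MoE / Sparse

BFLA is training-free sparse attention for long-context prefill: it first compresses Q/K into coarse blocks, estimates the important KV blocks from block-level softmax mass, and then expands them onto the Triton tile grid, where a fused sparse prefill kernel skips low-value tiles while keeping exact token attention inside the selected tiles; a local band, sink, and speculative rescue guard against information loss. On Gemma 4, Llama 3.1, and Qwen 3.5/3.6, the paper replaces full-at…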

AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

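Shanghai Jiao Tong University / Ant Group (university / industry) · 2026-05-12 · KV Cache, Serving, MoE / Sparse

AB-Sparse targets a practical problem in block sparse attention: attention heads differ in their sensitivity to block granularity, so a uniform block size costs some heads too much accuracy. It is a training-free algorithm-system co-design: it assigns an adaptive block size per head, uses lossless block centroid quantization to offset the extra metadata and memory overhead, and executes with a variable-block custom GPU kernel. In experiments it improves by up to 5.43%… over a fixed block sparse baseline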

Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization

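Peipei Xu et al. (author team) · 2026-05-12 · Framework

DIPS fine-tunes a 7B LLM into an amortized Pareto-front generator for constrained bi-objective convex optimization: given a problem in text, it directly outputs a set of feasible continuous decision vectors, avoiding repeated scalarization or evolutionary search for each instance. To fit autoregressive modeling, it uses compact discretization, Numerically Grounded Token Initialization, and a three-stage curriculum. Across five problem classes, normalized hypervolume reaches 95.… of the reference front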

Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

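Institute of Information Engineering, Chinese Academy of Sciences / Anyang Normal University / Nankai University (research institute / university) · 2026-05-12 · MoE / Sparse, Framework

Chronicles-OCR is a cross-temporal perception benchmark for ancient Chinese scripts, not an inference acceleration paper. The dataset covers the evolutionary trajectory of seven Chinese script styles with 2,800 strictly balanced images, on media ranging from oracle bones to paper calligraphy. Its tasks include cross-period character spotting, fine-grained ancient character recognition, ancient text parsing, and script classification, with the goal of isolating visual perception from semantic reasoning. It is valuable for VLM robustness research but proposes no technique for reducing tokens, KV, attention, or serving latency. This site rates it low priority; read it only if you follow OCR/VLM evaluation or applications to historical Chinese texts.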

Learning Subspace-Preserving Sparse Attention Graphs from Heterogeneous Multiview Data

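Jie Chen et al. (author team) · 2026-05-12 · Serving, MoE / Sparse, Attention

SAGL studies sparse attention graph learning for heterogeneous multiview data, using bilinear attention factorization, dynamic sparsity gating, and α-entmax sparse projection to learn a subspace-preserving graph. It matches the sparse attention keyword, but its goal is unsupervised transfer and multiview representation, not attention kernels or KV cache optimization for LLM inference. For this site it is recommended only as low…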

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

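Yihao Wang et al. (author team) · 2026-05-12 · Serving, MoE / Sparse, Agent

MedMemoryBench is a benchmark for medical agent memory, not an inference acceleration method. It synthesizes clinically grounded long-term disease trajectories through human-AI collaboration, with about 2,000 sessions and 16,000 turns, and proposes evaluate-while-constructing streaming evaluation to simulate memory that keeps accumulating in production. It also systematically studies memory saturation, where retrieval and reasoning robustness degrade as information keeps flowing in. Its value for this site lies in evaluation criteria for agent memory, KV, and external memory systems, especially noise, forgetting, and safety in long-running sessions; however, it offers nothing that reduces tokens, KV, or s…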

Training-Inference Consistent Segmented Execution for Long-Context LLMs

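Xianpeng Shang et al. (author team) · 2026-05-12 · Serving, Operators, Attention

This paper addresses the semantic mismatch between long-context training and inference: many methods train with full-context attention and only at inference time split the input into segments or a bounded context, so state transitions and gradient semantics do not match. The authors run both training and inference as segment-level forward passes; during training, gradients flow only through the KV state carried over from the previous segment, while the forward pass may still let different heads access earlier KV. The result comes close to full attention on long-context benchmarks, and at 128K the peak prefill memory compared with full-context FlashAtt…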

Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

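Baidu (industry) · 2026-05-12 · Serving, Scheduling, Operators

Ada-MK is an industrial decode MegaKernel: in online advertising, latency budgets are at the millisecond level, each decoded token triggers thousands of kernel launches, and launch overhead can account for 14.6% of end-to-end time. Ada-MK observes that under a fixed deployment configuration the best execution path can be determined at compile time. It uses a three-dimensional shared-memory constraint model and K-dimension splitting to cut peak shared memory by 50%, then searches offline over an MLIR DAG to freeze the path and remove runtime branches; finally it is embedded in TensorRT-LLM as a plug-in, pr…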

ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference

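Han Meng et al. (author team) · 2026-05-11 · Serving, Scheduling, Parallelism

ChunkFlow focuses on layerwise offloading for DiTs: when per-GPU compute is small, or when prefetch competes with all-reduce/all-to-all for PCIe on PCIe-only nodes, the usual assumption that prefetch can hide behind compute breaks down. It uses a first-order model to decide whether prefetch can be hidden, then builds a chunk-granular offload runtime that adaptively yields when communication arrives and uses chunk size to trade memory against latency smoothly. On two H100 PCIe GPUs with Ulysses sequence parallelism and three DiTs, compared with SG…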

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

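Stanford University / NVIDIA (university / industry) · 2026-05-11 · Serving, Scheduling, Parallelism

Sieve discusses dynamic scheduling for MoE + HBM-PIM: modern MoEs have more and more experts with fewer activated per token, so expert traffic is bimodal. Hot experts have high arithmetic intensity while long-tail experts see only a few tokens, and static PIM offload rules break down. Based on the runtime token-to-expert distribution, Sieve partitions expert execution between GPU and PIM while accounting for interconnect, memory bandwidth, and GPU/PIM throughput, and it overlaps GPU compute, PIM compute, and cross-device communication. Ramulator 2.0 cyc…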

Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack

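Prathamesh Vasudeo Naik et al. (author team) · 2026-05-11 · KV Cache, Serving, Scheduling

This paper treats AML and anti-fraud scenarios as a dedicated LLM serving workload rather than ordinary chat traffic: prompt prefixes are long, the policy and risk taxonomy are reusable, and outputs are short and structured. The stack combines vLLM-style tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, batching by adapter and prompt length, sleep/wake, speculative decoding, and optional prefill/decode disaggregation. On a public synthetic AML workload, throughput goes from 612…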

Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing

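University of Maryland / New York University (university) · 2026-05-11 · KV Cache, Serving, Scheduling

GRIEF fuzzes the concurrent state of LLM serving as a security boundary rather than testing only model output. It generates timed multi-request traces covering the combined state of KV cache, batching, prefix sharing, speculative decoding, adapters, and multi-tenant scheduling; lightweight oracles detect crashes, hangs, performance pathologies, and silent output corruption, and log-prob replay confirms reproducibility. Early on it found 15 vulnerabilities in vLLM/SGLang, 10 of which were…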

Compute Where it Counts: Self Optimizing Language Models

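Yash Akhauri et al. (author team) · 2026-05-11 · MoE / Sparse, Quantization, Attention

SOL does token-level dynamic compute allocation: instead of uniform quantization or pruning, it freezes the LLM and adds a lightweight policy network that reads the hidden state and chooses an efficiency action at every decode step, jointly controlling attention sparsity, MLP structured activation pruning, and activation bit-width. Training uses GRPO-style counterfactual schedules: with the token sequence fixed, it samples different comput…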

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

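Tsinghua University / ByteDance / Approaching AI / Alibaba Cloud / JD.com / Georgia Tech (university / industry) · 2026-05-11 · Serving, Scheduling, Parallelism

EEP addresses fault recovery for wide expert-parallel MoE serving: conventional EP fixes rank membership at initialization, so a rank failure simultaneously breaks the communicator, the expert placement, and the routing metadata inside CUDA graphs, and usually the whole instance has to be restarted. EEP turns membership into mutable runtime state, separately repairs peer reachability and lost expert coverage, and, when a repaired rank returns, avoids having healthy ranks recapture CUDA…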

Reconfigurable Computing Challenge: Real-Time Graph Neural Networks for Online Event Selection in Big Science

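Karlsruhe Institute of Technology (university) · 2026-05-11 · Serving, Operators, Ascend

This is not an LLM inference paper but a demonstration of real-time GNN deployment in trigger systems for big science; its value lies in its methodology for operator fusion, mapping, spatial parallelism, and kernel-level optimization. The authors deploy a dynamic GNN onto the FPGA fabric + AI Engine tiles of an AMD Versal VCK190, reaching a throughput of 2.94M events/s and an end-to-end latency of 7.15us; compared with an FPGA-only baseline, throughput rises by 53%, DSP utilization drops from 99% to 19%, and AI Engine tile utilization is 29%. For readers of this site it merits only low-priority tracking: the heterogeneous operator partitioning and real-time visual verification are worth borrowing, but it should not take up…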

Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

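Peking University / ByteDance Seed (university / industry) · 2026-05-11 · Serving, Parallelism, Speculative Decoding

SPEX treats reward-guided search in Tree-of-Thought as a parallelization problem: conventional ToT has to wait for the reward before expanding branches, which creates a synchronization barrier. SPEX does three things: intra-query speculative path selection predicts high-potential branches and expands them early, inter-query budget allocation dynamically distributes speculative resources across requests, and adaptive early termination prunes deep, redundant branches. It is implemented on SGLang and achieves 1.2-3x sp… across multiple ToT algorithms and LLMs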