#LLM Optimization
8 articles with this tag
TokenPilot: Reining in LLM Context Costs
TokenPilot offers a dual-granularity context management framework, slashing LLM inference costs by up to 87% while preserving performance.
Compute Once: Unlocking AI Agent Efficiency
A radical proposal to precompute LLM KV caches, slashing inference costs by up to 50x and enabling a new compute-efficient AI agent paradigm.
Unlocking Ultra-Long Context for LLMs
MiniMax Sparse Attention pushes LLMs past the context window barrier, enabling contexts of millions of tokens with significant compute reduction and practical speedups.
MobileMoE LLMs Redefine On-Device AI
MobileMoE sets new performance and efficiency benchmarks for sub-billion-parameter LLMs running on smartphones.

Faster LLMs by Reshaping Sparsity
Sakana AI and NVIDIA unveil a new method that reshapes sparsity in LLMs to boost GPU efficiency, achieving over 20% speedups.
LLM Reasoning Fix: LPSR
Latent Phase-Shift Rollback (LPSR) corrects LLM reasoning errors at inference time without fine-tuning, boosting accuracy and efficiency.
Prism: Symbolic Superoptimization for Tensors
Prism, a novel symbolic superoptimizer, uses sGraphs to represent tensor program families, achieving significant speedups and reduced optimization time for LLM workloads.
Beyond Token Count: Semantic Compression for LLMs
Researchers recast LLM reasoning as lossy compression using the Conditional Information Bottleneck (CIB), employing semantic surprisal for efficient token pruning.