#Inference Optimization

4 articles with this tag

FlashRT: Execution State for Latency-First AI

FlashRT targets on-device AI serving with execution-state capsules, enabling sub-millisecond state restoration and significant time-to-first-token (TTFT) speedups for latency-critical applications.

about 1 month ago

Technology

Cloudflare Unweights LLMs by 22%

Cloudflare's 'Unweight' system shrinks LLMs by up to 22% using lossless compression, improving inference speed and efficiency.

3 months ago

AI Research

LLM Adaptation Without Retraining

In-Place Test-Time Training enables LLMs to adapt to new data at inference without retraining, enhancing performance and paving the way for continual learning.

3 months ago

AI Research

GPT-OSS-Puzzle-88B: Faster AI, Same Brains

GPT-OSS-Puzzle-88B delivers substantial inference speedups for large language models without sacrificing accuracy, using techniques such as MoE pruning and window attention.

5 months ago