LIBRISTO
LIBROAMANTO
mandatory
Become part of a community of book lovers from all over the world and get access to a whole bunch of benefits. Create an account for free
0
Free delivery for purchases over 69.99 €
DPD courier 5.99 Bpost point 7.99 Bpost 7.49 DPD point 3.49 GLS courier 4.49

Free delivery for orders over 69.99 euro.

AI Inference Optimization Engineering

Quantization, Speculative Decoding, and Hardware-Specific LLM Deployment

Language EnglishEnglish
Book Paperback
Book AI Inference Optimization Engineering ChatVariety Team
Libristo code: 52770465
Publishers Independently published, June 2026
Slash LLM Deployment Costs and LatencyDeploying Large Language Models (LLMs) in production is a mass... Full description
? points 28 b Coming soon Coming soon New New
11.41
Expected in stock Expected 07. 06. 2026

30-day return policy

Slash LLM Deployment Costs and Latency

Deploying Large Language Models (LLMs) in production is a massive economic and engineering hurdle. AI Inference Optimization Engineering is your comprehensive, hands-on guide to mastering the full stack of modern LLM optimization techniques. From memory-bandwidth solutions to hardware-specific compilation, this book bridges the gap between research-level models and enterprise-grade execution.

What you will master inside this book:
  • Hardware-Aware Optimization: Dive deep into KV cache mechanics, autoregressive decoding, and GPU memory hierarchies to eliminate latency bottlenecks.
  • State-of-the-Art Quantization: Apply GPTQ, AWQ, and GGUF compression algorithms to scale down massive neural networks without sacrificing model accuracy.
  • Advanced Acceleration Methods: Implement speculative decoding with draft models (like Medusa and Eagle), PagedAttention, and FlashAttention to boost throughput by 2-3x.
  • Production-Grade Serving: Build ultra-low-latency deployment infrastructures using vLLM, Triton Inference Server, and continuous batching.
  • Cross-Platform Deployment: Optimize models for specific target hardware, including NVIDIA H100 (TensorRT-LLM), Apple Silicon (llama.cpp/Metal), and Qualcomm mobile/edge accelerators.

Whether you are an ML infrastructure engineer, an AI platform architect, or a technical leader looking to scale LLMs cost-effectively, this book provides the production-ready code, equations, and architectural patterns you need to build hyper-efficient AI pipelines.

Actress & Polyglot
EWA KASP for
Play video
Ewa Kasp
Libristo has the largest selection of foreign-language books. That’s why I buy my books there.

About the book

Full name AI Inference Optimization Engineering
Language English
Binding Book - Paperback
Date of issue 2026
Number of pages 96
EAN 9798199720021
Libristo code 52770465
Weight 142
Dimensions 152 x 229 x 5
Give this book today
It's easy
1 Add to cart and choose Deliver as present at the checkout 2 We'll send you a voucher 3 The book will arrive at the recipient's address

Login

Log in to your account. Don't have a Libristo account? Create one now!

 
mandatory
mandatory

Don’t have an account? Discover the benefits of having a Libristo account!

With a Libristo account, you'll have everything under control.

Create a Libristo account