惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

酷 壳 – CoolShell
酷 壳 – CoolShell
H
Hacker News: Front Page
P
Palo Alto Networks Blog
T
ThreatConnect
Apple Machine Learning Research
Apple Machine Learning Research
博客园_首页
T
True Tiger Recordings
P
Privacy & Cybersecurity Law Blog
B
Blog
IT之家
IT之家
Last Week in AI
Last Week in AI
F
Full Disclosure
Hacker News: Ask HN
Hacker News: Ask HN
C
Comments on: Blog
Microsoft Azure Blog
Microsoft Azure Blog
C
Cybersecurity and Infrastructure Security Agency CISA
Microsoft Security Blog
Microsoft Security Blog
博客园 - 【当耐特】
N
News and Events Feed by Topic
NISL@THU
NISL@THU
腾讯CDC
雷峰网
雷峰网
Security Latest
Security Latest
李成银的技术随笔
M
Microsoft Research Blog - Microsoft Research
L
LangChain Blog
L
Lohrmann on Cybersecurity
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
C
Check Point Blog
Y
Y Combinator Blog
Recent Announcements
Recent Announcements
博客园 - Franky
N
News | PayPal Newsroom
V
V2EX
A
About on SuperTechFans
The Register - Security
The Register - Security
月光博客
月光博客
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Google Online Security Blog
Google Online Security Blog
MyScale Blog
MyScale Blog
Cisco Talos Blog
Cisco Talos Blog
Vercel News
Vercel News
WordPress大学
WordPress大学
C
Cyber Attacks, Cyber Crime and Cyber Security
The Hacker News
The Hacker News
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
爱范儿
爱范儿
A
Arctic Wolf
L
LINUX DO - 最新话题
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More

MarkTechPost

Sakana AI Introduces KAME: A Tandem Speech-to-Speech Architecture That Injects LLM Knowledge in Real Time What is Tokenization Drift and How to Fix It? Mistral AI Launches Remote Agents in Vibe and Mistral Medium 3.5 with 77.6% SWE-Bench Verified Score Build a Multi-Agent AI Workflow for Biological Network Modeling, Protein Interactions, Metabolism, and Cell Signaling Simulation A Coding Implementation to Parsing, Analyzing, Visualizing, and Fine-Tuning Agent Reasoning Traces Using the lambda/hermes-agent-reasoning-traces Dataset A New NVIDIA Research Shows Speculative Decoding in NeMo RL Achieves 1.8× Rollout Generation Speedup at 8B and Projects 2.5× End-to-End Speedup at 235B A Coding Implementation of End-to-End Brain Decoding from MEG Signals Using NeuralSet and Deep Learning for Predicting Linguistic Features Meta Introduces Autodata: An Agentic Framework That Turns AI Models into Autonomous Data Scientists for High-Quality Training Data Creation A Coding Guide on LLM Post Training with TRL from Supervised Fine Tuning to DPO and GRPO Reasoning Qwen AI Releases Qwen-Scope: An Open-Source Sparse AutoEncoders (SAE) Suite That Turns LLM Internal Features into Practical Development Tools A Coding Deep Dive into Agentic UI, Generative UI, State Synchronization, and Interrupt-Driven Approval Flows Moonshot AI Open-Sources FlashKDA: CUTLASS Kernels for Kimi Delta Attention with Variable-Length Batching and H20 Benchmarks Microsoft Research’s World-R1 Uses Flow-GRPO and 3D-Aware Rewards to Inject Geometric Consistency Into Wan 2.1 Without Architectural Changes A Coding Implementation on Pyright Type Checking Covering Generics, Protocols, Strict Mode, Type Narrowing, and Modern Python Typing IBM Releases Two Granite Speech 4.1 2B Models: Autoregressive ASR with Translation and Non-Autoregressive Editing for Fast Inference Cursor Introduces a TypeScript SDK for Building Programmatic Coding Agents With Sandboxed Cloud VMs, Subagents, Hooks, and Token-Based Pricing Top 10 KV Cache Compression Techniques for LLM Inference: Reducing Memory Overhead Across Eviction, Quantization, and Low-Rank Methods Qwen Team Releases FlashQLA: a High-Performance Linear Attention Kernel Library That Achieves Up to 3× Speedup on NVIDIA Hopper GPUs Step by Step Guide to Build a Complete PII Detection and Redaction Pipeline with OpenAI Privacy Filter Meta FAIR Releases NeuralSet: A Python Package for Neuro-AI That Supports fMRI, M/EEG, Spikes, and HuggingFace Embeddings smol-audio: A Colab-Friendly Notebook Collection for Fine-Tuning Whisper, Parakeet, Voxtral, Granite Speech, and Audio Flamingo 3 A Coding Implementation on Document Parsing Benchmarking with LlamaIndex ParseBench Using Python, Hugging Face, and Evaluation Metrics Poolside AI Introduces Laguna XS.2 and M.1: Agentic Coding Models Reaching 68.2% and 72.5% on SWE-bench Verified How to Build Traceable and Evaluated LLM Workflows Using Promptflow, Prompty, and OpenAI OpenAI Releases Privacy Filter: A 1.5B-Parameter Open-Source PII Redaction Model with 50M Active Parameters Top 10 Physical AI Models Powering Real-World Robots in 2026 How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control Meet Talkie-1930: A 13B Open-Weight LLM Trained on Pre-1931 English Text for Historical Reasoning and Generalization Research Build a Reinforcement Learning Powered Agent that Learns to Retrieve Relevant Long-Term Memories for Accurate LLM Question Answering OpenMOSS Releases MOSS-Audio: An Open-Source Foundation Model for Speech, Sound, Music, and Time-Aware Audio Reasoning Meta AI Releases Sapiens2: A High-Resolution Human-Centric Vision Model for Pose, Segmentation, Normals, Pointmap, and Albedo The LoRA Assumption That Breaks in Production How to Build a Fully Searchable AI Knowledge Base with OpenKB, OpenRouter, and Llama How to Build Smarter Multilingual Text Wrapping with BudouX Through Parsing, HTML Rendering, Model Introspection, and Toy Training Top 7 Benchmarks That Actually Matter for Agentic Reasoning in Large Language Models RAG Without Vectors: How PageIndex Retrieves by Reasoning A Coding Tutorial on Datashader on Rendering Massive Datasets with High-Performance Python Visual Analytics xAI Launches grok-voice-think-fast-1.0: Topping τ-voice Bench at 67.3%, Outperforming Gemini, GPT Realtime, and More A Coding Implementation on kvcached for Elastic KV Cache Memory, Bursty LLM Serving, and Multi-Model GPU Sharing Google DeepMind Introduces Vision Banana: An Instruction-Tuned Image Generator That Beats SAM 3 on Segmentation and Depth Anything V3 on Metric Depth Estimation Meet GitNexus: An Open-Source MCP-Native Knowledge Graph Engine That Gives Claude Code and Cursor Full Codebase Structural Awareness A Coding Implementation on Deepgram Python SDK for Transcription, Text-to-Speech, Async Audio Processing, and Text Intelligence A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation DeepSeek AI Releases DeepSeek-V4: Compressed Sparse Attention and Heavily Compressed Attention Enable One-Million-Token Contexts Google DeepMind Introduces Decoupled DiLoCo: An Asynchronous Training Architecture Achieving 88% Goodput Under High Hardware Failure Rates Mend Releases AI Security Governance Framework: Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model Mend.io Releases AI Security Governance Framework Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model OpenAI Releases GPT-5.5, a Fully Retrained Agentic Model That Scores 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval A Coding Tutorial on OpenMythos on Recurrent-Depth Transformers with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing Google Cloud AI Research Introduces ReasoningBank: A Memory Framework that Distills Reasoning Strategies from Agent Successes and Failures Xiaomi Releases MiMo-V2.5-Pro and MiMo-V2.5: Matching Frontier Model Benchmarks at Significantly Lower Token Cost How to Design a Production-Grade CAMEL Multi-Agent System with Planning, Tool Use, Self-Consistency, and Critique-Driven Refinement Alibaba Qwen Team Releases Qwen3.6-27B: A Dense Open-Weight Model Outperforming 397B MoE on Agentic Coding Benchmarks A Detailed Implementation on Equinox with JAX Native Modules, Filtered Transforms, Stateful Layers, and End-to-End Training Workflows Next Leap to Harness Engineering: JiuwenClaw Pioneers ‘Coordination Engineering’ Photon Releases Spectrum: An Open-Source TypeScript Framework that Deploys AI Agents Directly to iMessage, WhatsApp, and Telegram OpenAI Open-Sources Euphony: A Browser-Based Visualization Tool for Harmony Chat Data and Codex Session Logs Hugging Face Releases ml-intern: An Open-Source AI Agent that Automates the LLM Post-Training Workflow A Coding Implementation to Build a Conditional Bayesian Hyperparameter Optimization Pipeline with Hyperopt, TPE, and Early Stopping Google Introduces Simula: A Reasoning-First Framework for Generating Controllable, Scalable Synthetic Datasets Across Specialized AI Domains A Coding Implementation on Qwen 3.6-35B-A3B Covering Multimodal Inference, Thinking Control, Tool Calling, MoE Routing, RAG, and Session Persistence Moonshot AI Releases Kimi K2.6 with Long-Horizon Coding, Agent Swarm Scaling to 300 Sub-Agents and 4,000 Coordinated Steps A Coding Implementation on Microsoft’s Phi-4-Mini for Quantized Inference Reasoning Tool Use RAG and LoRA Fine-Tuning OpenAI Scales Trusted Access for Cyber Defense With GPT-5.4-Cyber: a Fine-Tuned Model Built for Verified Security Defenders Moonshot AI and Tsinghua Researchers Propose PrfaaS: A Cross-Datacenter KVCache Architecture that Rethinks How LLMs are Served at Scale Meet OpenMythos: An Open-Source PyTorch Reconstruction of Claude Mythos Where 770M Parameters Match a 1.3B Transformer How TabPFN Leverages In-Context Learning to Achieve Superior Accuracy on Tabular Datasets Compared to Random Forest and CatBoost A Coding Implementation to Build an AI-Powered File Type Detection and Security Analysis Pipeline with Magika and OpenAI NVIDIA Releases Ising: the First Open Quantum AI Model Family for Hybrid Quantum-Classical Systems xAI Launches Standalone Grok Speech-to-Text and Text-to-Speech APIs, Targeting Enterprise Voice Developers A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG A Coding Guide for Property-Based Testing Using Hypothesis with Stateful, Differential, and Metamorphic Test Design Anthropic Releases Claude Opus 4.7: A Major Upgrade for Agentic Coding, High-Resolution Vision, and Long-Horizon Autonomous Tasks Google AI Releases Auto-Diagnose: An Large Language Model LLM-Based System to Diagnose Integration Test Failures at Scale A End-to-End Coding Guide to Running OpenAI GPT-OSS Open-Weight Models with Advanced Inference Workflows Top 19 AI Red Teaming Tools (2026): Secure Your ML Models A Coding Guide to Build a Production-Grade Background Task Processing System Using Huey with SQLite, Scheduling, Retries, Pipelines, and Concurrency Control Qwen Team Open-Sources Qwen3.6-35B-A3B: A Sparse MoE Vision-Language Model with 3B Active Parameters and Agentic Coding Capabilities OpenAI Launches GPT-Rosalind: Its First Life Sciences AI Model Built to Accelerate Drug Discovery and Genomics Research Building Transformer-Based NQS for Frustrated Spin Systems with NetKet UCSD and Together AI Research Introduces Parcae: A Stable Architecture for Looped Language Models That Achieves the Quality of a Transformer Twice the Size How to Build a Universal Long-Term Memory Layer for AI Agents Using Mem0 and OpenAI A Coding Implementation to Build Multi-Agent AI Systems with SmolAgents Using Code Execution, Tool Calling, and Dynamic Orchestration A Technical Deep Dive into the Essential Stages of Modern Large Language Model Training, Alignment, and Deployment Google AI Launches Gemini 3.1 Flash TTS: A New Benchmark in Expressive and Controllable AI Voice Google DeepMind Releases Gemini Robotics-ER 1.6: Bringing Enhanced Embodied Reasoning and Instrument Reading to Physical AI Google Launches ‘Skills’ in Chrome: Turning Reusable AI Prompts into One-Click Browser Workflows A Coding Implementation of Crawl4AI for Web Crawling, Markdown Generation, JavaScript Execution, and LLM-Based Structured Extraction TinyFish AI Releases Full Web Infrastructure Platform for AI Agents: Search, Fetch, Browser, and Agent Under One API Key NVIDIA and the University of Maryland Researchers Released Audio Flamingo Next (AF-Next): A Super Powerful and Open Large Audio-Language Model A Hands-On Coding Tutorial for Microsoft VibeVoice Covering Speaker-Aware ASR, Real-Time TTS, and Speech-to-Speech Pipelines Meta AI and KAUST Researchers Propose Neural Computers That Fold Computation, Memory, and I/O Into One Learned Model A Coding Implementation of MolmoAct for Depth-Aware Spatial Reasoning, Visual Trajectory Tracing, and Robotic Action Prediction MiniMax Just Open Sourced MiniMax M2.7: A Self-Evolving Agent Model that Scores 56.22% on SWE-Pro and 57.0% on Terminal Bench 2 Liquid AI Releases LFM2.5-VL-450M: a 450M-Parameter Vision-Language Model with Bounding Box Prediction, Multilingual Support, and Sub-250ms Edge Inference Researchers from MIT, NVIDIA, and Zhejiang University Propose TriAttention: A KV Cache Compression Method That Matches Full Attention at 2.5× Higher Throughput How to Build a Secure Local-First Agent Runtime with OpenClaw Gateway, Skills, and Controlled Tool Execution How Knowledge Distillation Compresses Ensemble Intelligence into a Single Deployable AI Model Alibaba’s Tongyi Lab Releases VimRAG: a Multimodal RAG Framework that Uses a Memory Graph to Navigate Massive Visual Contexts A Coding Guide to Markerless 3D Human Kinematics with Pose2Sim, RTMPose, and OpenSim
An End-to-End Coding Guide to NVIDIA KVPress for Long-Context LLM Inference, KV Cache Compression, and Memory-Efficient Generation
2026-04-10 · via MarkTechPost

In this tutorial, we take a detailed, practical approach to exploring NVIDIA’s KVPress and understanding how it can make long-context language model inference more efficient. We begin by setting up the full environment, installing the required libraries, loading a compact Instruct model, and preparing a simple workflow that runs in Colab while still demonstrating the real value of KV cache compression. As we move through implementation, we create a synthetic long-context corpus, define targeted extraction questions, and run multiple inference experiments to directly compare standard generation with different KVPress strategies. At the end of the tutorial, we will have built a stronger intuition for how long-context optimization works in practice, how different press methods affect performance, and how this kind of workflow can be adapted for real-world retrieval, document analysis, and memory-sensitive LLM applications.

import os, sys, subprocess, textwrap, time, gc, json, math, random, warnings, inspect
warnings.filterwarnings("ignore")


def run(cmd):
   print("\n[RUN]", " ".join(cmd))
   subprocess.check_call(cmd)


run([sys.executable, "-m", "pip", "install", "-q", "--upgrade", "pip"])
run([sys.executable, "-m", "pip", "install", "-q", "torch", "transformers", "accelerate", "bitsandbytes", "sentencepiece", "kvpress==0.4.0"])


try:
   from google.colab import userdata
   hf_token = userdata.get("HF_TOKEN")
except Exception:
   hf_token = os.environ.get("HF_TOKEN", "")


if not hf_token:
   try:
       import getpass
       hf_token = getpass.getpass("Enter your Hugging Face token (leave empty if model is public and accessible): ").strip()
   except Exception:
       hf_token = ""


if hf_token:
   os.environ["HF_TOKEN"] = hf_token
   os.environ["HUGGINGFACEHUB_API_TOKEN"] = hf_token


import torch
import transformers
import kvpress


from transformers import pipeline, BitsAndBytesConfig
from kvpress import ExpectedAttentionPress, KnormPress


print("Python:", sys.version.split()[0])
print("Torch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
   print("GPU:", torch.cuda.get_device_name(0))


MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"
MAX_NEW_TOKENS = 96
SEED = 42
random.seed(SEED)
torch.manual_seed(SEED)

We set up the Colab environment and install all required libraries to run the KVPress workflow successfully. We securely collect the Hugging Face token, set environment variables, and import the core modules needed for model loading, pipeline execution, and compression experiments. We also print the runtime and hardware details so we clearly understand the setup in which we perform the tutorial.

if torch.cuda.is_available():
   torch.cuda.empty_cache()
   quantization_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_compute_dtype=torch.float16,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_use_double_quant=True,
   )
   pipe = pipeline(
       "kv-press-text-generation",
       model=MODEL_ID,
       device_map="auto",
       token=hf_token if hf_token else None,
       model_kwargs={
           "quantization_config": quantization_config,
           "attn_implementation": "sdpa",
       },
   )
else:
   pipe = pipeline(
       "kv-press-text-generation",
       model=MODEL_ID,
       device_map="auto",
       torch_dtype=torch.float32,
       token=hf_token if hf_token else None,
       model_kwargs={
           "attn_implementation": "sdpa",
       },
   )


def cuda_mem():
   if not torch.cuda.is_available():
       return {"allocated_gb": None, "reserved_gb": None, "peak_gb": None}
   return {
       "allocated_gb": round(torch.cuda.memory_allocated() / 1024**3, 3),
       "reserved_gb": round(torch.cuda.memory_reserved() / 1024**3, 3),
       "peak_gb": round(torch.cuda.max_memory_allocated() / 1024**3, 3),
   }


def reset_peak():
   if torch.cuda.is_available():
       torch.cuda.reset_peak_memory_stats()


def extract_answer(x):
   if isinstance(x, list) and len(x) > 0:
       x = x[0]
   if isinstance(x, dict):
       for k in ["answer", "generated_text", "text", "output_text"]:
           if k in x:
               return x[k]
       return json.dumps(x, indent=2, ensure_ascii=False)
   return str(x)


def generate_once(context, question, press=None, label="run"):
   gc.collect()
   if torch.cuda.is_available():
       torch.cuda.empty_cache()
   reset_peak()
   start = time.time()
   out = pipe(
       context,
       question=question,
       press=press,
       max_new_tokens=MAX_NEW_TOKENS,
       do_sample=False,
       temperature=None,
       return_full_text=False,
   )
   elapsed = time.time() - start
   answer = extract_answer(out)
   stats = cuda_mem()
   result = {
       "label": label,
       "elapsed_sec": round(elapsed, 2),
       "allocated_gb": stats["allocated_gb"],
       "reserved_gb": stats["reserved_gb"],
       "peak_gb": stats["peak_gb"],
       "answer": answer.strip(),
   }
   return result

We initialize the kv-press-text-generation pipeline and configure it differently depending on whether GPU support is available. We define the helper functions that measure CUDA memory usage, reset peak memory, extract answers from model outputs, and run a single generation pass cleanly. This part provides the reusable execution logic that powers the rest of the tutorial and enables us to compare baseline inference with KV cache compression.

company_records = [
   {"company": "Arcturus Dynamics", "hq": "Bengaluru", "founded": 2017, "focus": "warehouse robotics"},
   {"company": "BlueMesa Energy", "hq": "Muscat", "founded": 2014, "focus": "grid analytics"},
   {"company": "CinderPeak Health", "hq": "Pune", "founded": 2019, "focus": "clinical imaging AI"},
   {"company": "DeltaForge Marine", "hq": "Kochi", "founded": 2012, "focus": "autonomous vessel telemetry"},
   {"company": "EonCircuit Labs", "hq": "Hyderabad", "founded": 2020, "focus": "edge silicon tooling"},
   {"company": "Frostline Aero", "hq": "Jaipur", "founded": 2016, "focus": "drone inspection"},
]


needle_facts = [
   "PROJECT NEEDLE 1: The internal codename for the confidential pilot program is SAFFRON-17.",
   "PROJECT NEEDLE 2: The audit escalation owner is Meera Vashisht.",
   "PROJECT NEEDLE 3: The approved deployment region for the first production rollout is Oman North.",
   "PROJECT NEEDLE 4: The emergency rollback phrase is amber lantern.",
   "PROJECT NEEDLE 5: The signed commercial start date is 17 September 2026.",
]


background_block = """
Long-context systems often contain repeated operational notes, historical records, policy sections, and noisy retrieval artifacts.
The goal of this demo is to create a realistically long prompt where only a few details matter for downstream answering.
KV cache compression reduces memory usage by pruning cached key-value pairs while preserving answer quality.
"""


policy_block = """
Operational policy summary:
1. Safety overrides throughput when sensor confidence falls below threshold.
2. Logs should preserve region, timestamp, device class, and operator approval state.
3. Field trials may contain duplicated annexes, OCR-style artifacts, and repeated compliance summaries.
4. A good long-context model must ignore irrelevant repetition and retrieve the specific details that matter.
"""


records_text = []
for i in range(120):
   rec = company_records[i % len(company_records)]
   records_text.append(
       f"Record {i+1}: {rec['company']} is headquartered in {rec['hq']}, founded in {rec['founded']}, and focuses on {rec['focus']}. "
       f"Quarterly memo {i+1}: retention remained stable, operator training progressed, and the compliance appendix was reattached for review."
   )


needle_insert_positions = {18, 41, 73, 96, 111}
full_corpus = []
for i, para in enumerate(records_text):
   full_corpus.append(background_block.strip())
   full_corpus.append(policy_block.strip())
   full_corpus.append(para)
   if i in needle_insert_positions:
       full_corpus.append(needle_facts[len([x for x in needle_insert_positions if x <= i]) - 1])

We create a synthetic long-context dataset to test the KVPress system in a controlled yet realistic way. We define company records, insert important hidden facts at different positions, and mix them with repeated background and policy blocks, making the prompt long and noisy. This helps us simulate the context in which memory-efficient inference matters and the model must retrieve only the truly relevant details.

context = "\n\n".join(full_corpus)


question = textwrap.dedent("""
Answer using only the provided context.
Give a compact JSON object with exactly these keys:
commercial_start_date
deployment_region
audit_owner
rollback_phrase
pilot_codename
""").strip()


print("\nContext characters:", len(context))
print("Approx words:", len(context.split()))


experiments = []


baseline = generate_once(context, question, press=None, label="baseline_no_compression")
experiments.append(baseline)


presses = [
   ("expected_attention_0.7", ExpectedAttentionPress(compression_ratio=0.7)),
   ("expected_attention_0.5", ExpectedAttentionPress(compression_ratio=0.5)),
   ("knorm_0.5", KnormPress(compression_ratio=0.5)),
]


for label, press in presses:
   try:
       result = generate_once(context, question, press=press, label=label)
       experiments.append(result)
   except Exception as e:
       experiments.append({
           "label": label,
           "elapsed_sec": None,
           "allocated_gb": None,
           "reserved_gb": None,
           "peak_gb": None,
           "answer": f"FAILED: {type(e).__name__}: {e}"
       })


try:
   from kvpress import DecodingPress
   sig = inspect.signature(DecodingPress)
   kwargs = {"base_press": KnormPress()}
   if "compression_interval" in sig.parameters:
       kwargs["compression_interval"] = 10
   elif "compression_steps" in sig.parameters:
       kwargs["compression_steps"] = 10
   if "target_size" in sig.parameters:
       kwargs["target_size"] = 512
   elif "token_buffer_size" in sig.parameters:
       kwargs["token_buffer_size"] = 512
   if "hidden_states_buffer_size" in sig.parameters:
       kwargs["hidden_states_buffer_size"] = 0
   decoding_press = DecodingPress(**kwargs)
   decoding_result = generate_once(context, question, press=decoding_press, label="decoding_knorm")
   experiments.append(decoding_result)
except Exception as e:
   experiments.append({
       "label": "decoding_knorm",
       "elapsed_sec": None,
       "allocated_gb": None,
       "reserved_gb": None,
       "peak_gb": None,
       "answer": f"SKIPPED_OR_FAILED: {type(e).__name__}: {e}"
   })

We assemble the final context, define the structured extraction question, and launch the core set of inference experiments. We first run the baseline without compression, then apply multiple press strategies to observe how different compression ratios affect the results. We also conduct a decoding-oriented compression experiment, which extends the tutorial beyond prefilling and provides a broader view of the KVPress framework.

print("\n" + "=" * 120)
print("RESULTS")
print("=" * 120)


for r in experiments:
   print(f"\n[{r['label']}]")
   print("elapsed_sec:", r["elapsed_sec"])
   print("allocated_gb:", r["allocated_gb"])
   print("reserved_gb:", r["reserved_gb"])
   print("peak_gb:", r["peak_gb"])
   print("answer:")
   print(r["answer"])


print("\n" + "=" * 120)
print("SIMPLE SUMMARY")
print("=" * 120)


def safe_float(x):
   try:
       return float(x)
   except Exception:
       return None


base_peak = safe_float(baseline["peak_gb"]) if baseline.get("peak_gb") is not None else None
base_time = safe_float(baseline["elapsed_sec"]) if baseline.get("elapsed_sec") is not None else None


for r in experiments[1:]:
   peak = safe_float(r["peak_gb"])
   t = safe_float(r["elapsed_sec"])
   peak_delta = None if base_peak is None or peak is None else round(base_peak - peak, 3)
   time_delta = None if base_time is None or t is None else round(base_time - t, 2)
   print({
       "label": r["label"],
       "peak_gb_saved_vs_baseline": peak_delta,
       "time_sec_saved_vs_baseline": time_delta,
       "answer_preview": r["answer"][:180].replace("\n", " ")
   })


print("\n" + "=" * 120)
print("OPTIONAL NEXT STEPS")
print("=" * 120)
print("1. Swap MODEL_ID to a stronger long-context instruct model that fits your GPU.")
print("2. Increase context length by duplicating records_text more times.")
print("3. Try other presses from kvpress, such as SnapKVPress, StreamingLLMPress, QFilterPress, or ChunkKVPress.")
print("4. Replace the synthetic corpus with your own long PDF/text chunks and keep the same evaluation loop.")

We print all experiment outputs in a readable format and summarize the runtime and memory differences relative to the baseline. We calculate simple comparison metrics to quickly see how much memory or time each compression strategy saves. We then conclude with suggested next steps to extend the tutorial to stronger models, longer contexts, additional press methods, and real-world document workloads.

In conclusion, we developed a strong practical understanding of how NVIDIA’s KVPress can be used to optimize long-context inference in a realistic Colab-based setting. We did more than simply run a model: we built an end-to-end workflow that installs the framework, loads the pipeline correctly, constructs a meaningful long-context input, applies multiple compression presses, and evaluates the results in terms of answer quality, runtime, and memory behavior. By comparing baseline generation with compressed KV-cache generation, we clearly saw the trade-offs involved. We gained useful intuition about when these methods can help reduce resource pressure without severely harming output fidelity. We also explored the framework’s flexibility by testing different press configurations and including an optional decoding-oriented compression path, providing a broader view of how KVPress can be used beyond a single static example.


Check out the Codes and Notebook hereAlso, feel free to follow us on Twitter and don’t forget to join our 120k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us