Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

Efficient, VRAM-Constrained xLM Inference on Clients

Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark

Internet of Everything in the 6G Era: Paradigms, Enablers, Potentials and Future Directions

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

A Survey on Split Learning for LLM Fine-Tuning: Models, Systems, and Privacy Optimizations

ITAS: A Multi-Agent Architecture for LLM-Based Intelligent Tutoring

Latency and Cost of Multi-Agent Intelligent Tutoring at Scale

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost

CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training

A Taxonomy and Resolution Strategy for Client-Level Disagreements in Federated Learning

Usable Agent Discovery for Decentralized AI Systems

Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers

Data-Free Contribution Estimation in Federated Learning using Gradient von Neumann Entropy

Shard the Gradient, Scale the Model: Serverless Federated Aggregation via Gradient Partitioning

Promoting Simple Agents: Ensemble Methods for Event-Log Prediction

GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA

AGNT2: Autonomous Agent Economies on Interaction-Optimized Layer 2 Infrastructure

FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels

Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

A Delta-Aware Orchestration Framework for Scalable Multi-Agent Edge Computing

Federated Learning over Blockchain-Enabled Cloud Infrastructure

Optimal Routing for Federated Learning over Dynamic Satellite Networks: Tractable or Not?

Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers

Preserving Clusters in Error-Bounded Lossy Compression of Particle Data

Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

UCCL-Zip: Lossless Compression Supercharged GPU Communication

Training Time Prediction for Mixed Precision-based Distributed Training

Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure

Breaking the Training Barrier of Billion-Parameter Universal Machine Learning Interatomic Potentials

A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States

DataCenterGym: A Physics-Grounded Simulator for Multi-Objective Data Center Scheduling

Optimizing Stochastic Gradient Push under Broadcast Communications

Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines

ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Cooperate to Compete: Strategic Data Generation and Incentivization Framework for Coopetitive Cross-Silo Federated Learning

Exploiting Correlations in Federated Learning: Opportunities and Practical Limitations

ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

AgileLog: A Forkable Shared Log for Agents on Data Streams

Secure and Privacy-Preserving Vertical Federated Learning

Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel

CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead

Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

NimbusGuard: A Novel Framework for Proactive Kubernetes Autoscaling Using Deep Q-Networks

Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge

Rebooting Microreboot: Architectural Support for Safe, Parallel Recovery in Microservice Systems

A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs

SMART: When is it Actually Worth Expanding a Speculative Tree?

ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge-Cloud Speculative LLM Serving

OpenCLAW-P2P v7.0-P2PCLAW: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review v7.0 -- Mathematical Corrections & Ecosystem Developments Edition

DarwinNet: An Evolutionary Network Architecture for Agent-Driven Protocol Synthesis

RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models

Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection

HearthNet: Edge Multi-Agent Orchestration for Smart Homes

Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference

Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

Duration-Informed Workload Scheduler

Domain-Adaptive Model Merging Across Disconnected Modes

Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs

veScale-FSDP: Flexible and High-Performance FSDP at Scale

AEG: A Baremetal Framework for AI Acceleration via Direct Hardware Access in Heterogeneous Accelerators

ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving

Emergent Social Structures in Autonomous AI Agent Networks: A Metadata Analysis of 626 Agents on the Pilot Protocol

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Para-B&B: Load-Balanced Deterministic Parallelization of Solving MIP

Rashomon Sets and Model Multiplicity in Federated Learning

Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers

Scalable Explainability-as-a-Service (XaaS) for Edge AI Systems

NPU Design for Diffusion Language Model Inference

PRAXIS: Integrating Program Analysis with Observability for Root-Cause Analysis

BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs

Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving

SHARe-KAN: Post-Training Vector Quantization for Cache-Resident KAN Inference

Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks

Power to the Clients: Federated Learning in a Dictatorship Setting

From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill

Speculative Actions: A Lossless Framework for Faster Agentic Systems

InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training

DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling

HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

Reliable Microservice Tail Latency Prediction via Decoupled Dual-Stream Learning and Gradient Modulation

On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

FedRef: Bayesian Fine-Tuning using a Reference Model to Mitigate Catastrophic Forgetting for Heterogeneous Federated Learning

Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

RESFL: An Uncertainty-Aware Framework for Responsible Federated Learning by Balancing Privacy, Fairness and Utility

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment

Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training

cs.DC updates on arXiv.org