GitHub - sergey-automation/TurboPrefill-VLM-Validation: Validation of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill using VLM workloads

Validation of the applicability of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill to Vision Language Models (VLMs).

TurboPrefill cut the waiting time before answer generation nearly in half: from 9.0 s to 4.6 s.

Example Input

Validation Task

Question:

What is happening in this image? Describe the animals, their approximate number, activity, environment, and colors. Which animal appears to be the leader of the group, and what five visual clues made you reach that conclusion? Use no more than 50 words.

Example answer:

Eight giraffes are walking across a grassy wetland near a river. The animals are light brown with darker patches. The leading giraffe appears to guide the group. Clues: front position, direction of movement, spacing, head orientation, and group alignment.

Key Result

Validation on Vision Language Models demonstrates that Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill can significantly reduce user waiting time before answer generation without changing model weights, architecture, quantization, prompts, or inference mathematics.

The observed improvement was achieved solely through changes in execution scheduling during the prefill stage.

Test Configuration

Parameter	Value
Model	Qwen2.5-VL-72B-Instruct-Q4_K_M
Task	Vision-language question answering
Input	Single Full HD image (1920×1080)
GPUs	4× RTX 5060 Ti 16 GB
UBatch size	128
Split mode	Layer

Result

Metric	Baseline	TurboPrefill
Waiting time before the response started	9.0 s	4.6 s
Prefill throughput	303 tok/s	604 tok/s
Generation throughput	8.6 tok/s	8.6 tok/s

TurboPrefill nearly halved the waiting time before the model started responding, while leaving answer generation speed unchanged.

Future Relevance

Validation on NVIDIA Pascal GPUs also demonstrated an approximately 2.2× reduction in prefill latency, suggesting that this optimization opportunity is not tied to a particular class of hardware and will likely remain relevant for future GPU generations.

Background

The original scheduling mechanism was proposed in:

[RFC][PoC] Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill

ggml-org/llama.cpp#24219

The original proof-of-concept implementation is available at:

https://github.com/sergey-automation/TurboPrefill

Purpose

This repository validates the applicability of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill to Vision Language Models.

The objective is not to introduce a new scheduling mechanism, but to demonstrate that the original mechanism is applicable beyond text-only LLM workloads.

Validation Implementation

Reference implementation branch:

https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support

Scope of the Current Implementation

The original TurboPrefill PoC intentionally used a conservative dispatcher and left some eligible workloads on the standard llama.cpp execution path.

The current validation implementation enables additional workloads that are still within the original concept of Intra-Prompt Pipeline Scheduling for Multi-GPU Prefill, but were not enabled in the first PoC.

Additional workloads currently enabled for the TurboPrefill execution path:

Execution of Text LLM workloads.
Execution of Vision Language Model (VLM) workloads.
Execution of multiple concurrent requests in multi-user server mode, provided that requests from different users are not mixed within the same TurboPrefill batch.

Status

Work in progress.

Implementation files, scripts, input samples, and benchmark logs are published in this repository.

Experimental work in progress.

The reported results are based on the current prototype implementation. Text-model validation has been completed successfully. VLM support is still under active investigation, and additional correctness validation is required before drawing final conclusions.

Repository Structure

files/ — modified llama.cpp source files used for the validation branch.
scripts/ — scripts used to run the VLM server and resolution tests.
resolution_samples/ — input images used for validation.
benchmarks/ — raw benchmark reports and server logs.

Reproducing the Validation

1. Obtain the reference implementation

The validation was performed using the following reference implementation branch:

https://github.com/sergey-automation/llama.cpp/tree/turboprefill-vlm-support

git clone https://github.com/sergey-automation/llama.cpp.git
cd llama.cpp
git checkout turboprefill-vlm-support

2. Build llama.cpp

Build the reference implementation.

2.1 Download the Qwen2.5-VL-72B GGUF model

The validation uses the following model files:

Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf
mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf

Create the expected model directory:

mkdir -p /workspace/models/Qwen2.5-VL-72B
cd /workspace/models/Qwen2.5-VL-72B

Download the main model:

wget -c --content-disposition \
"https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf"

Download the multimodal projector:

wget -c --content-disposition \
"https://huggingface.co/ggml-org/Qwen2.5-VL-72B-Instruct-GGUF/resolve/main/mmproj-Qwen2.5-VL-72B-Instruct-Q8_0.gguf"

Check the files:

ls -lh /workspace/models/Qwen2.5-VL-72B

3. Baseline measurement

Start the VLM server with TurboPrefill disabled:

TURBOPREFILL=0 ./run_vlm_server.sh

Run the benchmark:

python3 run_vlm_resolution.py

4. TurboPrefill measurement

Start the VLM server with TurboPrefill enabled:

TURBOPREFILL=1 ./run_vlm_server.sh