From 9 Tiles to 900: Scaling Computer Vision Pipelines

The scale wall

A computer vision pipeline that works on one image at one resolution isn't a pipeline. It's a prototype. The moment you move beyond controlled inputs, you hit the reality of production images: a 4K video frame, a satellite capture, a whole-slide pathology image, a high-resolution document scan. These images don't fit in a single model call. They're too large, too detailed, and too information-dense for one inference pass to handle well.

So you tile. You divide the image into a grid of regions and run inference on each region independently. A 3×3 grid means 9 inference calls. An 8×8 grid means 64. A whole-slide pathology image at diagnostic resolution? Tens of thousands of tiles.

The orchestration problem scales directly with the image.

And as that tile count grows, so do the failure modes. Nine concurrent inference calls might all succeed. Sixty-four concurrent calls will occasionally hit a throttle limit or a timeout. At hundreds of tiles, partial failures aren't edge cases. They're expected. You don't just need orchestration for your CV pipeline. You need orchestration that scales with your image.

The pattern you already use

Tiled inference isn't a niche technique. It's the industry standard for any image that exceeds a model's input constraints. SAHI (Slicing Aided Hyper Inference) is a widely used open-source library on GitHub. It partitions images into overlapping slices, runs detection on each slice, and stitches results together. Digital pathology pipelines routinely tile gigapixel whole-slide images into thousands of patches for parallel inference. Satellite imagery processing architectures on AWS involve the same core pattern: tile, infer in parallel, aggregate.

The pattern is well-established. What's missing is the orchestration layer that makes it durable at scale. SAHI runs on a single machine. Production pathology pipelines require custom coordinator services, worker pools, and explicit failure handling infrastructure. Everyone builds the same glue differently.

AWS Lambda durable functions introduce an operation called context.map() that maps directly onto this pattern. It fans out an array of items as independent concurrent invocations, each independently checkpointed, with a configurable concurrency cap. One failed tile retries only that tile, not the entire image. The same line of code handles 9 tiles or 900.

What I built

In this post, I walk through an image analysis pipeline I built using durable functions to demonstrate this pattern concretely. The application accepts an image and divides it into an N×N grid of regions. It runs concurrent Amazon Bedrock inferences across the grid, synthesizes the results into a scene description with per-object bounding boxes, and streams progress to a real-time dashboard via WebSocket.

The request flow:

Upload: The browser requests a presigned S3 URL and uploads the image directly to Amazon S3.
Trigger: The browser calls the analyze endpoint. An API Lambda fires the durable pipeline asynchronously and returns AWS AppSync connection details.
Subscribe: The browser opens a WebSocket to AppSync Events and subscribes to the pipeline's execution channel.
Pipeline: A single durable function executes four checkpointed steps: preprocess, analyze (fan-out), synthesize, and store.
Dashboard: Results stream to a shared display as each tile completes, with Jarvis-style bounding box overlays on detected objects.

The entire backend is two Lambda functions: one API handler and one durable pipeline function. No queue infrastructure. No separate orchestration service. No worker pool management.

Walking through the pipeline

Take a look at the pipeline handler. The entire orchestration reads as sequential code: four steps, top to bottom.

export const handler = withDurableExecution(
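  async (event: AnalysisPipelineEvent, context: DurableContext) => {
    // imageFormat and ch (the AppSync Events channel) are assumed to be derived
    // from the event earlier in the full handler; that setup is elided here.

    // Step 1: preprocess - moderate + build region grid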
    const preprocessed = await context.step('preprocess', async () => {
      const gridSize = Number(event.gridSize ?? 3);
      const imageBase64 = await fetchImageBase64(event);
      await moderateImage(imageBase64, imageFormat);
      return { regions: buildRegions(gridSize) };
    });

    // Step 2: context.map - parallel region inference
    const mapResults = await context.map(
      'analyze-regions',
      preprocessed.regions,
      async (ctx: DurableContext, region: ImageRegion, index: number) => {
        return await ctx.step(`analyze-region-${index}`, async () => {
          const imageBase64 = await fetchImageBase64(event);
          const finding = await analyzeRegion(imageBase64, imageFormat, region);
          await publish(ch, [{ type: 'region', index, status: 'done', finding }]);
          return {
            regionIndex: finding.regionIndex,
            regionLabel: finding.regionLabel,
            analysis: finding.analysis.slice(0, 500),
            detectedObjects: (finding.detectedObjects ?? []).slice(0, 8),
          };
        });
      },
      { maxConcurrency: 5 },
    );

    const successfulFindings = mapResults.succeeded()
      .map(item => item.result as RegionFinding);

    // Step 3: synthesize
    const synthesis = await context.step('synthesize', () =>
      synthesizeFindings(successfulFindings)
    );

    // Step 4: store
    const stored = await context.step('store', async () => {
      // Persist to DynamoDB + publish dashboard event via AppSync
    });
  }
);

I'll walk through each step and what it does for you at scale.

Step 1: Preprocess

The first step handles content moderation and builds the region grid. The grid size is a parameter. Set it to 3 for a 3×3 grid (9 regions) or 8 for an 8×8 grid (64 regions). The grid size is a function of the image: larger or more complex images benefit from finer-grained tiling.

The durable runtime checkpoints this step. If the Lambda function dies after preprocessing completes, replay skips directly to step 2. The moderation check and grid computation don't repeat.
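To make the grid concrete, here is a minimal sketch of what a grid builder could look like. It is a hypothetical stand-in for buildRegions(), not the implementation from the repo: the GridRegion shape, the label format, and the choice of fractional coordinates are all assumptions made for illustration.

```typescript
// Hypothetical region shape; the real ImageRegion type in the repo may differ.
interface GridRegion {
  index: number;  // row-major position in the grid
  label: string;  // human-readable name, e.g. "r0c2"
  x: number;      // left edge as a fraction of image width (0..1)
  y: number;      // top edge as a fraction of image height (0..1)
  width: number;  // region width as a fraction of image width
  height: number; // region height as a fraction of image height
}

// Builds an N×N grid of equally sized, non-overlapping regions.
function buildRegionGrid(gridSize: number): GridRegion[] {
  if (!Number.isInteger(gridSize) || gridSize < 1) {
    throw new RangeError(`gridSize must be a positive integer, got ${gridSize}`);
  }
  const cell = 1 / gridSize;
  const regions: GridRegion[] = [];
  for (let row = 0; row < gridSize; row++) {
    for (let col = 0; col < gridSize; col++) {
      regions.push({
        index: row * gridSize + col,
        label: `r${row}c${col}`,
        x: col * cell,
        y: row * cell,
        width: cell,
        height: cell,
      });
    }
  }
  return regions;
}
```

Fractional coordinates keep each region descriptor tiny and independent of the image's pixel dimensions, which suits a step result that has to be checkpointed.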

Step 2: context.map(), the tiled inference step

This is the core of the pattern. context.map() takes the array of regions from step 1 and fans them out as independent concurrent invocations. Each region gets its own checkpointed step. Each invocation fetches the image independently, runs inference against Bedrock, and returns findings for that region.

const mapResults = await context.map(
  'analyze-regions',
  preprocessed.regions,
  async (ctx: DurableContext, region: ImageRegion, index: number) => {
    return await ctx.step(`analyze-region-${index}`, async () => {
      const imageBase64 = await fetchImageBase64(event);
      const finding = await analyzeRegion(imageBase64, imageFormat, region);
      return { /* region findings */ };
    });
  },
  { maxConcurrency: 5 },
);

Three things to notice here.

First, maxConcurrency: 5 caps how many tiles process simultaneously. For the demo I set this to 5. In production, you'd match this to your Bedrock throughput quota: 20, 50, or higher depending on your provisioned capacity.

Second, each tile re-fetches the image from S3 rather than receiving it as input. Image bytes are too large for checkpoint storage, so each tile must be self-contained.

Third, each tile's result is independently checkpointed. If tile 6 of 9 fails, every tile that already succeeded keeps its result. Only tile 6 retries.
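For intuition about the first and third points, here is a plain TypeScript sketch of a bounded-concurrency map that records each item's outcome separately. It is not the durable runtime and does no checkpointing or retrying; mapWithConcurrency and the Settled type are hypothetical names used only to illustrate the fan-out semantics.

```typescript
// Per-item outcome: one failure never discards the other results.
type Settled<T> =
  | { index: number; status: 'succeeded'; result: T }
  | { index: number; status: 'failed'; error: unknown };

async function mapWithConcurrency<I, O>(
  items: readonly I[],
  worker: (item: I, index: number) => Promise<O>,
  maxConcurrency: number,
): Promise<Settled<O>[]> {
  const results: Settled<O>[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane pulls items until none are left, so at most
  // `maxConcurrency` workers are in flight at any time.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        const result = await worker(items[index] as I, index);
        results[index] = { index, status: 'succeeded', result };
      } catch (error) {
        results[index] = { index, status: 'failed', error };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(maxConcurrency, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

What the durable version adds on top of this shape is persistence: each item's outcome survives a function restart, so a replay does not redo work that already finished.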

The model invocation itself uses the Amazon Bedrock Converse API:

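import { BedrockRuntimeClient, ConverseCommand, ImageFormat } from '@aws-sdk/client-bedrock-runtime';

// Assumed setup for this excerpt: the full source defines its own client and model ID.
const client = new BedrockRuntimeClient({});
const MODEL_ID = process.env.MODEL_ID ?? 'amazon.nova-lite-v1:0';

// Sends one image plus a text prompt to the model and returns the text reply.
export async function invokeNova(
  prompt: string,
  imageBase64: string,
  imageFormat: ImageFormat
): Promise<string> {
  const response = await client.send(new ConverseCommand({
    modelId: MODEL_ID,
    messages: [{
      role: 'user',
      content: [
        { image: { format: imageFormat, source: { bytes: new Uint8Array(Buffer.from(imageBase64, 'base64')) } } },
        { text: prompt }
      ]
    }],
    inferenceConfig: { maxTokens: 512 }
  }));
  // Every level of the response is optional, so fall back to an empty string
  // to honor the declared Promise<string> return type.
  return response.output?.message?.content?.[0]?.text ?? '';
}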

I'm using Amazon Nova Lite for the demo because it's fast and cost-effective for concurrent vision calls. However, the model is a pluggable parameter. You can swap to Anthropic Claude for more nuanced reasoning on the synthesis step, route to an Amazon SageMaker endpoint for a custom-trained detection model, or use different models for different steps entirely.

The orchestration pattern doesn't change. Only the inference call changes.

Step 3: Synthesize

After the map operation completes, all successful region findings are available as an array. The synthesize step aggregates them into a coherent scene description with overall object detection results and computer vision insights.

const successfulFindings = mapResults.succeeded()
  .map(item => item.result as RegionFinding);

const synthesis = await context.step('synthesize', () =>
  synthesizeFindings(successfulFindings)
);

Model selection becomes a scaling lever at this step. The tiled inference step runs N times concurrently, so you want it fast and cheap. The synthesis step runs once and needs to reason across all findings. You might want a more capable model here. Same orchestration code, different model routing per step based on the complexity of the task.

Step 4: Store

The final step persists the analysis result to Amazon DynamoDB and publishes a dashboard event through AppSync. Because this runs inside a checkpointed step, a failure here doesn't repeat the expensive inference steps. Only the storage operation retries.

Scale mechanics: what happens as N grows

The pipeline I've shown works with a 3×3 grid: 9 tiles, 9 inference calls. What happens when you need 64 tiles? Or 400? The code doesn't change. But the architecture decisions I made become increasingly important.

Image size drives tile count

The grid size is a parameter. A 3×3 grid works for a demo image. A high-resolution satellite capture might need an 8×8 grid. A whole-slide pathology image at diagnostic resolution might need a 20×20 grid or larger.

The buildRegions() function generates the grid based on that parameter. The context.map() call processes whatever array it receives. From the orchestration's perspective, 9 regions and 400 regions are the same operation at different scales.
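If you'd rather derive the grid size than pick it by hand, one option is to work backward from a target tile size. The sketch below is an assumption layered on top of the post: the demo takes gridSize as an input, and both gridSizeFor and the 1,024-pixel target are hypothetical.

```typescript
// Picks the smallest N such that an N×N grid keeps every tile's
// longest side at or under maxTileSidePx.
function gridSizeFor(widthPx: number, heightPx: number, maxTileSidePx: number): number {
  if (widthPx <= 0 || heightPx <= 0 || maxTileSidePx <= 0) {
    throw new RangeError('dimensions and tile size must be positive');
  }
  const longestSide = Math.max(widthPx, heightPx);
  return Math.max(1, Math.ceil(longestSide / maxTileSidePx));
}

// Total inference calls for a square grid.
function tileCount(gridSize: number): number {
  return gridSize * gridSize;
}
```

With a 1,024-pixel target, a 3840×2160 frame gives a 4×4 grid (16 tiles), and a 20,000×20,000 capture gives a 20×20 grid (400 tiles).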

Concurrency cap matches your throughput

The maxConcurrency option controls how many tiles process simultaneously. Set it to 5 for a demo running against on-demand Bedrock. Set it to 50 for a production workload with provisioned throughput. Set it to 200 for a batch job with a high-throughput SageMaker endpoint. The durable runtime manages the fan-out and concurrency without you building a queue or a semaphore.
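A rough way to reason about the cap is to count "waves" of tiles. This sketch assumes every tile takes about the same time and ignores throttling, retries, and cold starts, so treat it as a back-of-the-envelope model rather than a prediction. The function name and the 4-second per-tile latency in the example are illustrative assumptions.

```typescript
// Estimates wall-clock time when tiles run in waves of `maxConcurrency`.
function estimateWallClockSeconds(
  tileCount: number,
  maxConcurrency: number,
  secondsPerTile: number,
): number {
  if (tileCount < 0 || maxConcurrency < 1 || secondsPerTile < 0) {
    throw new RangeError('invalid arguments');
  }
  const waves = Math.ceil(tileCount / maxConcurrency);
  return waves * secondsPerTile;
}
```

At 4 seconds per tile, 400 tiles with a cap of 5 is 80 waves, or about 320 seconds. Raise the cap to 50 and it drops to 8 waves, or about 32 seconds, provided your model throughput can actually absorb 50 concurrent calls.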

The 256 KB checkpoint limit enforces clean architecture

Durable function checkpoints have a 256 KB size limit per step result. This means you cannot pass image bytes through a checkpoint. They're too large. Each tile re-fetches the image from S3 independently.

At 9 tiles, this feels like an overhead you'd rather avoid. At 400 tiles, it's the only sane architecture. You want each tile to be a self-contained unit that reads its input, runs inference, and returns a small result object. The checkpoint limit enforces this discipline from day one.
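A cheap guard is to measure a step result before returning it. The sketch below assumes the limit applies to the JSON-serialized result and reads 256 KB as 256 × 1024 bytes; the runtime's actual serialization and accounting may differ, and the helper names are hypothetical.

```typescript
const CHECKPOINT_LIMIT_BYTES = 256 * 1024; // limit as stated above; interpretation assumed

// UTF-8 size of a value once serialized as JSON.
function serializedSizeBytes(value: unknown): number {
  const json = JSON.stringify(value) ?? '';
  return new TextEncoder().encode(json).length;
}

// True when the serialized value is within the assumed checkpoint limit.
function fitsInCheckpoint(value: unknown, limitBytes: number = CHECKPOINT_LIMIT_BYTES): boolean {
  return serializedSizeBytes(value) <= limitBytes;
}
```

A trimmed finding (a 500-character analysis and up to 8 detected objects, as in the handler) sits far below the limit, while even a modest base64-encoded image blows through it.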

For higher tile counts, you can eliminate the per-tile S3 API calls entirely by mounting your image bucket with Amazon S3 Files. With S3 Files, the Lambda function reads the image directly from the local filesystem. No GetObject calls, no SDK overhead, no presigning. The image is a file path. At 9 tiles the difference is negligible. At 400 concurrent tiles each making a GetObject call, filesystem access becomes a meaningful optimization.

Partial failure at scale

At 9 tiles, one failure is an annoyance. You might tolerate restarting all 9. At 64 tiles, restarting all 64 because tile 47 hit a timeout is a waste of compute, time, and money. At 400 tiles, it's unacceptable. The mapResults object gives you fine-grained failure handling:

const successfulFindings = mapResults.succeeded()
  .map(item => item.result as RegionFinding);

if (mapResults.failureCount > 0) {
  mapResults.failed().forEach(item =>
    context.logger.error('Region failed', { index: item.index, error: String(item.error) })
  );
}

Successful tiles keep their checkpointed results. Failed tiles can be logged, retried independently, or excluded from the synthesis. The pipeline degrades gracefully rather than failing catastrophically.
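Graceful degradation still needs a floor: a synthesis built from a handful of surviving tiles can be misleading. One possible policy, sketched below, is to proceed only when the failure ratio stays under a threshold. The demo doesn't do this (it synthesizes whatever succeeded), and shouldSynthesize and the 5% threshold in the example are assumptions.

```typescript
// Decides whether enough tiles succeeded to make synthesis worthwhile.
function shouldSynthesize(
  succeededCount: number,
  failedCount: number,
  maxFailureRatio: number,
): boolean {
  const total = succeededCount + failedCount;
  if (total === 0) return false; // nothing was processed
  return failedCount / total <= maxFailureRatio;
}
```

With a 5% threshold, 2 failures out of 64 tiles (about 3%) passes, while 1 failure out of 9 (about 11%) does not. The two counts could come from the length of mapResults.succeeded() and from mapResults.failureCount.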

Model selection as a scaling lever

As tile count grows, cost per inference call matters more. With 9 tiles, using a capable (expensive) model for each tile is reasonable. With 400 tiles, you want the cheapest model that produces acceptable results for the per-tile work, and reserve the capable model for the single synthesis step. The orchestration code stays identical. You change a model ID parameter, not the pipeline structure.
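Per-step routing can be as small as a lookup table. In this sketch the step names, the modelFor helper, and the specific model IDs are placeholders: check which model IDs or inference profiles are available in your account and Region before using them.

```typescript
type PipelineStep = 'analyze-region' | 'synthesize';

// Placeholder defaults: a fast, inexpensive model for the N per-tile calls
// and a more capable model for the single synthesis call.
const MODEL_BY_STEP: Record<PipelineStep, string> = {
  'analyze-region': 'amazon.nova-lite-v1:0',
  'synthesize': 'anthropic.claude-3-5-sonnet-20240620-v1:0',
};

// Returns the model ID for a step, letting callers override per run.
function modelFor(
  step: PipelineStep,
  overrides: Partial<Record<PipelineStep, string>> = {},
): string {
  return overrides[step] ?? MODEL_BY_STEP[step];
}
```

Because the lookup is keyed by step rather than hard-coded in each call, moving the per-tile work to a cheaper model is a one-line configuration change.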

Real-time observability at scale

Every tile publishes its completion status through AWS AppSync Events:

await publish(ch, [{ type: 'region', index, status: 'done', finding }]);

At 9 tiles, this produces a satisfying progress indicator. Users watch regions light up on a dashboard as inference completes. At 64 tiles, real-time observability becomes essential rather than nice-to-have. Without per-tile status events, a 64-tile pipeline is a black box that either succeeds after two minutes or fails with no indication of where it stalled.

The dashboard in this demo subscribes to the pipeline's execution channel and renders results as they arrive. Each tile's bounding box detections overlay onto the original image in real time. At scale, this pattern gives operators visibility into pipeline health without polling: which tiles completed, which are in progress, which failed.
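On the consuming side, a dashboard can fold the event stream into a progress summary. The sketch below is not the demo's frontend code. Only the 'done' status appears in the publish call above; the 'running' and 'failed' statuses, the Progress shape, and summarizeProgress are assumptions.

```typescript
type RegionStatus = 'running' | 'done' | 'failed'; // only 'done' is shown in the post

interface RegionEvent {
  type: 'region';
  index: number;
  status: RegionStatus;
}

interface Progress {
  total: number;
  done: number;
  failed: number;
  running: number;
  pending: number; // tiles with no event yet
}

// Folds events into counts; the latest event per tile index wins.
function summarizeProgress(total: number, events: readonly RegionEvent[]): Progress {
  const latest = new Map<number, RegionStatus>();
  events.forEach((event) => latest.set(event.index, event.status));

  const progress: Progress = { total, done: 0, failed: 0, running: 0, pending: total - latest.size };
  latest.forEach((status) => {
    progress[status] += 1;
  });
  return progress;
}
```

Keeping only the latest status per tile makes the summary tolerant of duplicate events, which matters if a retried tile publishes more than once.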

Get started

The complete source, including deploy instructions, frontend setup, and teardown, is available on GitHub: image-analysis-orchestration.

To experiment with scale, change the gridSize parameter when triggering the pipeline. Start with 3 (9 tiles). Try 5 (25 tiles). Push to 8 (64 tiles) and watch how the same code handles increased concurrency with checkpointed resilience.

Tiled inference is already your pattern. If you're working with images that don't fit in one model call (and at production resolution, most interesting images don't), you're already tiling, processing in parallel, and aggregating results. With durable functions, you get checkpointed, resilient orchestration for that pattern without building separate infrastructure. The context.map() call that handles 9 tiles handles 900. Your orchestration scales with your image.

This isn't a toy demo. It's the skeleton of production batch inference.
