AI Models in Containers with RamaLama
piotr.minkowski · 2026-03-05 · via Piotr's TechBlog

This article explains how to run AI models locally in containers with RamaLama and integrate a sample Java application with them. RamaLama brings AI inferencing to the container world of Podman, Docker, and Kubernetes. It automatically selects a container image optimized for your system’s GPUs and then uses a container engine, such as Podman or Docker, to pull that image and prepare everything for running, handling all dependencies and performance tweaks for you. If you want a hassle-free way to run AI models from multiple sources, using the runtime that fits your hardware, all within containers, and with seamless integration with your existing workflows, RamaLama is a good choice. Let’s see how it works in practice!

You can find other articles about AI and Java on my blog. For example, if you are interested in how to use Ollama to serve models for Spring AI applications, you can read the following article.

Source Code

Source code will not play a key role in this article. Nevertheless, feel free to use my source code if you’d like to try it out yourself. To do that, clone my sample GitHub repository. It contains the sample Spring Boot application we will use to interact with AI models that RamaLama runs in containers. You can find that application in the spring-ai-openai-compatibility directory. After that, just follow my instructions.

Install RamaLama

You can install RamaLama on Linux or macOS with the following command:

curl -fsSL https://ramalama.ai/install.sh | bash

ShellSession

The script above uses Homebrew to install RamaLama on macOS. Alternatively, you can download the self-contained macOS installer that includes Python and all dependencies. You can find the latest .pkg installer on the releases page.

Finally, you can verify the version of the previously installed tool.

$ ramalama version
  ramalama version 0.17.1

ShellSession

Install and Configure Podman

RamaLama runs models through a container engine. I use Podman here, although Docker, which I also use frequently, works as an alternative. For Podman, I suggest installing Podman Desktop first. You can download it here. After installation, launch the Podman Desktop GUI and go to the “Settings” section to create a new Podman machine. Then, choose LibKrun as the default provider. This enables GPU acceleration for containers running on macOS. The resulting virtual machine is managed by krunkit and libkrun, a lightweight virtual machine manager (VMM) based on Apple’s low-level Hypervisor Framework. You can find a detailed explanation and performance analysis in the following article.

[Image: ramalama-containers-podman]

After creation, the virtual machine should start up. You can check its status in Podman Desktop as shown below.

Then run the following command to verify that Podman works.

$ podman version
  Client:        Podman Engine
  Version:       5.7.1
  API Version:   5.7.1
  Go Version:    go1.25.5
  Git Commit:    f845d14e941889ba4c071f35233d09b29d363c75
  Built:         Wed Dec 10 15:53:41 2025
  Build Origin:  pkginstaller
  OS/Arch:       darwin/arm64

ShellSession

Run a Model with RamaLama

RamaLama supports multiple AI model registries, including OCI container registries, Ollama, and HuggingFace. It defaults to the Ollama registry transport. Let’s assume we want to run the tinyllama model from the Ollama registry. To do that, execute the following command:

ramalama run tinyllama

ShellSession

By default, RamaLama tries to run the model inside the quay.io/ramalama/ramalama:latest container. The container exposes an OpenAI-compatible API on port 8080.

$ podman ps
CONTAINER ID  IMAGE                             COMMAND               CREATED        STATUS        PORTS                   NAMES
d36129d1f326  quay.io/ramalama/ramalama:latest  llama-server --ho...  5 minutes ago  Up 5 minutes  0.0.0.0:8080->8080/tcp  ramalama-OikvMye7v9

ShellSession
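To make the OpenAI-compatible API more concrete, here is a minimal Java sketch that sends a single chat request to the container with the JDK HTTP client. It assumes the server is reachable at http://localhost:8080 and accepts the standard OpenAI path /v1/chat/completions. The class name, the hand-rolled JSON payload, and the sample question are illustrative assumptions, not part of RamaLama or the sample repository.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch: call the OpenAI-compatible endpoint of a locally running model container.
class ChatCompletionSketch {

    // Assumption: the container from the listing above is published on localhost:8080.
    static final String BASE_URL = "http://localhost:8080";

    // Builds a minimal chat-completions payload; escapes only backslashes and double quotes.
    static String requestBody(String model, String userMessage) {
        String escaped = userMessage.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"model\":\"" + model + "\",\"messages\":[{\"role\":\"user\",\"content\":\""
                + escaped + "\"}]}";
    }

    // Creates the POST request against the standard OpenAI chat-completions path.
    static HttpRequest buildRequest(String model, String userMessage) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/v1/chat/completions"))
                .timeout(Duration.ofSeconds(60))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(requestBody(model, userMessage)))
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        try {
            HttpResponse<String> response = client.send(
                    buildRequest("tinyllama", "What is the capital of Poland?"),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println("HTTP " + response.statusCode());
            System.out.println(response.body());
        } catch (IOException e) {
            // Reached when nothing is listening on BASE_URL.
            System.out.println("Could not reach the model server: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The response body is raw JSON. In a real application, a JSON library or a client such as Spring AI would handle serialization and parsing.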

You can interact with the model using the ramalama chat command as shown below:

ramalama chat "What's the date today?"

ShellSession

If you want to change a default model registry, for example, to HuggingFace, use the RAMALAMA_TRANSPORT environment variable.

export RAMALAMA_TRANSPORT=huggingface

ShellSession

Then you can run any GGUF model from HuggingFace.

ramalama run unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF

ShellSession

Instead of the podman command to list running models, you can use the ramalama command. Of course, you can run several models at once. In this case, RamaLama will make them available externally using different ports.

$ ramalama ps
CONTAINER ID  IMAGE                             COMMAND               CREATED         STATUS         PORTS                   NAMES
49eddec5f2ef  quay.io/ramalama/ramalama:latest  llama-server --ho...  3 minutes ago   Up 3 minutes   0.0.0.0:8080->8080/tcp  ramalama-NWPXNpQDBt
023d05c70666  quay.io/ramalama/ramalama:latest  llama-server --ho...  39 seconds ago  Up 39 seconds  0.0.0.0:8086->8086/tcp  ramalama-ogwLbXQNDt

ShellSession
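Since each model ends up on its own host port, a client needs to know which port belongs to which container. The following Java sketch extracts the published host ports from output shaped like the listing above. It is only an illustration: the class name and the regular expression are assumptions based on the PORTS column format shown here, not an official RamaLama API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: find the host ports published by running model containers.
class RamalamaPortsSketch {

    // Matches mappings such as "0.0.0.0:8086->8086/tcp" and captures the host port.
    private static final Pattern PORT_MAPPING = Pattern.compile(":(\\d+)->\\d+/tcp");

    // Returns the host ports in the order in which they appear in the output.
    static List<Integer> hostPorts(String psOutput) {
        List<Integer> ports = new ArrayList<>();
        Matcher matcher = PORT_MAPPING.matcher(psOutput);
        while (matcher.find()) {
            ports.add(Integer.parseInt(matcher.group(1)));
        }
        return ports;
    }

    public static void main(String[] args) {
        // Two rows copied from the listing above (columns shortened for readability).
        String psOutput =
                "49eddec5f2ef  quay.io/ramalama/ramalama:latest  0.0.0.0:8080->8080/tcp  ramalama-NWPXNpQDBt\n"
              + "023d05c70666  quay.io/ramalama/ramalama:latest  0.0.0.0:8086->8086/tcp  ramalama-ogwLbXQNDt\n";
        for (int port : hostPorts(psOutput)) {
            System.out.println("http://localhost:" + port);
        }
    }
}
```

Each printed URL can then serve as the base URL for a separate client, one per model.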

Integrate Spring AI with Models on RamaLama

To test various models with RamaLama, I created a very simple Spring Boot application. It uses Spring AI together with the OpenAI module for integration with models running in RamaLama containers. Below is a list of dependencies for this application.

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-starter-model-openai</artifactId>
        </dependency>
    </dependencies>

XML

The application exposes a single REST endpoint GET /simple/{country}. Depending on the parameter, it asks about the capital city of a given country and requests a brief history of that city.

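import java.util.Map;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.SimpleLoggerAdvisor;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/simple")
public class SimpleController {

    private final ChatClient chatClient;

    // The auto-configured builder already points at the configured OpenAI-compatible endpoint.
    public SimpleController(ChatClient.Builder chatClientBuilder) {
        this.chatClient = chatClientBuilder
                // Logs requests and responses when the advisor logger is set to DEBUG.
                .defaultAdvisors(SimpleLoggerAdvisor.builder().build())
                .build();
    }

    @GetMapping("/{country}")
    public String ping(@PathVariable String country) {
        // {country} is replaced with the path variable before the prompt is sent.
        PromptTemplate pt = new PromptTemplate("""
                What's the capital of {country}?
                Describe the history of that city briefly.
        """);

        return chatClient.prompt(pt.create(Map.of("country", country)))
                .call()
                .content();
    }
}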

Java
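To illustrate what happens to the {country} placeholder before the prompt reaches the model, here is a deliberately simplified Java sketch of placeholder substitution. It is not how Spring AI renders templates internally; the class and method names are hypothetical and only show the idea of filling a template from a map of variables.

```java
import java.util.Map;

// Sketch: a simplified stand-in for prompt template rendering.
class PromptRenderingSketch {

    // Replaces every {name} placeholder with the matching value from the map.
    static String render(String template, Map<String, String> variables) {
        String result = template;
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            result = result.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String template = "What's the capital of {country}?\n"
                + "Describe the history of that city briefly.";
        System.out.println(render(template, Map.of("country", "Germany")));
    }
}
```

Calling the endpoint with Germany as the path variable therefore results in a prompt that asks about the capital of Germany.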

Some of the properties below are optional, such as the model name and the increased Spring AI logging level. The app communicates with the model served by RamaLama at http://localhost:8080. Since Spring Boot also listens on port 8080 by default, it is best to change the server port to 9080 to avoid a conflict. The value of the API key, on the other hand, is irrelevant. You just need to set it to something other than null so that Spring AI will accept it.

spring.ai.openai.api-key = ${OPENAI_API_KEY:dummy}
spring.ai.openai.chat.base-url = http://localhost:8080
spring.ai.openai.chat.options.model = tinyllama

logging.level.org.springframework.ai.chat.client.advisor = DEBUG

server.port = 9080

Plaintext

Then, run the app with the following command:

mvn spring-boot:run

Plaintext

Finally, you can call our test REST endpoint for different values of the country parameter.

curl http://localhost:9080/simple/Germany
curl http://localhost:9080/simple/France
curl http://localhost:9080/simple/Italy

Plaintext
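The same calls can be made from Java, which is handy when a country name contains characters that need encoding in a URL path. The sketch below assumes the sample application is running on port 9080 as configured above; the class and method names are illustrative only.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Sketch: call GET /simple/{country} for several countries.
class CountryEndpointSketch {

    // The multi-argument URI constructor percent-encodes characters that are
    // illegal in a path, such as the space in "United Kingdom".
    static URI endpointFor(String country) {
        try {
            return new URI("http", null, "localhost", 9080, "/simple/" + country, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid country: " + country, e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        for (String country : List.of("Germany", "France", "Italy")) {
            HttpRequest request = HttpRequest.newBuilder(endpointFor(country)).GET().build();
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(country + " -> HTTP " + response.statusCode());
                System.out.println(response.body());
            } catch (IOException e) {
                // Reached when the sample application is not running.
                System.out.println("Could not reach the application: " + e.getMessage());
            }
        }
    }
}
```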

The following diagram illustrates GPU usage during our test calls.

[Image: ramalama-containers-gpu]

The last registry supported by RamaLama that I would like to discuss in this article is the registry of ready-made images containing selected AI models. At the moment, it offers slightly more than 20 images with popular models such as gpt-oss, gemma3, qwen, and llama. You can view the full list of available images with models on this webpage.

[Image: ramalama-containers-oci-models]

To run the container with a specific image, you must add the rlcr:// prefix to the model name. For example, you can pull and run the gemma-3-1b-it model as shown below.

[Image: ramalama-containers-gemma-image]
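The role of the prefix can be pictured with a small Java sketch that splits a model reference into a transport and a model name, falling back to a default transport when no prefix is given. This is only a mental model, not RamaLama's actual resolution logic. The class name is hypothetical, and example-model is a placeholder rather than a real image name.

```java
// Sketch: interpret "transport://name" model references.
class ModelReferenceSketch {

    // Returns {transport, name}; uses the default transport when no prefix is present.
    static String[] parse(String reference, String defaultTransport) {
        int separator = reference.indexOf("://");
        if (separator < 0) {
            return new String[] {defaultTransport, reference};
        }
        return new String[] {
            reference.substring(0, separator),
            reference.substring(separator + 3)
        };
    }

    public static void main(String[] args) {
        // The article states that Ollama is the default transport and that the
        // RAMALAMA_TRANSPORT environment variable overrides it.
        String env = System.getenv("RAMALAMA_TRANSPORT");
        String defaultTransport = (env == null || env.isBlank()) ? "ollama" : env;

        // "example-model" is a hypothetical name used only for illustration.
        for (String reference : new String[] {"tinyllama", "rlcr://example-model"}) {
            String[] parts = parse(reference, defaultTransport);
            System.out.println(parts[0] + " -> " + parts[1]);
        }
    }
}
```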

Then, you don’t even have to restart our sample application if you have already stopped the previously tested models. Here’s a fragment of Gemma’s answer about the Spanish capital.

Use RamaLama to Run Containers with AI Models in Kubernetes

We can use RamaLama to run AI models inside containers on Kubernetes, either on CPU or GPU nodes. However, in this section, I would like to use GPU acceleration on macOS, as I did earlier when running models in Podman. We will try this solution on Minikube. Krunkit is a macOS virtualization tool optimized for GPU-accelerated virtual machines and AI workloads. In the first step, we must install it on macOS using Homebrew:

$ brew tap slp/krunkit
$ brew install krunkit

Plaintext

To use the krunkit driver, we must also install vmnet-helper. Its installation command downloads the latest release from GitHub and installs it to /opt/vmnet-helper. After installing both tools, we can create a Minikube cluster. It is best to use the following command, which increases the default resources allocated to the Minikube machine.

minikube start --memory='16gb' --cpus='8' --driver krunkit --disk-size 50000mb

Plaintext

Then, we must install the Kubernetes generic device plugin. It enables allocating generic Linux devices, such as serial devices or video cameras, to Kubernetes Pods. In our case, it will allow us to assign the GPU device to Pods running AI models. The plugin is installed as a DaemonSet. The following configuration exposes /dev/dri as the squat.ai/dri resource and allows it to be allocated up to 4 times concurrently.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/name: generic-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: generic-device-plugin
  template:
    metadata:
      labels:
        app.kubernetes.io/name: generic-device-plugin
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
      containers:
      - image: squat/generic-device-plugin
        args:
        - --device
        - |
          name: dri
          groups:
          - count: 4
            paths:
            - path: /dev/dri
        name: generic-device-plugin
        resources:
          requests:
            cpu: 50m
            memory: 10Mi
          limits:
            cpu: 50m
            memory: 20Mi
        ports:
        - containerPort: 8080
          name: http
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
  updateStrategy:
    type: RollingUpdate

YAML

As before, we could use a ready-made image from the RamaLama registry. However, if we want to use the GPU support provided by the generic-device-plugin, we should mount the model as a volume to the ramalama container. First, let’s download the gemma-3-1b model from HuggingFace.

$ cd ~/models
$ curl -L -o gemma-3-1b-it-q4_0.gguf "https://huggingface.co/google/gemma-3-1b-it-qat-q4_0-gguf/resolve/main/gemma-3-1b-it-q4_0.gguf?download=true"

ShellSession

Then, we can mount the model from the ~/models directory to Minikube with the following command:

minikube mount ~/models:/mnt/models

ShellSession

The following Deployment uses the quay.io/ramalama/ramalama:latest image and mounts the gemma-3-1b-it-q4_0.gguf model from the /mnt/models directory into that container. The model is launched internally using llama-server. The container with the model requests one squat.ai/dri device out of the 4 allowed by the plugin configuration (squat.ai/dri: "1").

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma
  template:
    metadata:
      labels:
        app: gemma
      name: gemma
    spec:
      containers:
        - name: llama-server
          image: quay.io/ramalama/ramalama:latest
          command: [
            llama-server,
            --host, "0.0.0.0",
            --port, "8080",
            --model, /mnt/models/gemma-3-1b-it-q4_0.gguf,
            --alias, "gemma",
            --ctx-size, "4096",
            --temp, "0.7",
            --cache-reuse, "256",
            -ngl, "999",
            --threads, "8",
            --no-warmup,
            --log-colors, auto,
          ]
          resources:
            limits:
              squat.ai/dri: "1"
          volumeMounts:
            - name: models
              mountPath: /mnt/models
      volumes:
        - name: models
          hostPath:
            path: /mnt/models
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: gemma
  name: gemma
spec:
  selector:
    app: gemma
  ports:
    - name: http
      port: 8080
  type: ClusterIP

YAML

Let’s verify if the pod is running after it was deployed:

$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
gemma-5466d666f7-d4wnv   1/1     Running   0          27s

ShellSession

Then, you can expose the gemma Service outside Minikube using port-forward on port 8080. Your application can keep running as before. Try repeating the exercise we did earlier and send a few test requests to it. Check the GPU usage graph to verify that the GPU is used while the requests are processed.

kubectl port-forward svc/gemma 8080:8080

ShellSession
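Before sending test requests through the forwarded port, it can help to wait until the model server has finished loading. The Java sketch below polls a health endpoint until it returns HTTP 200. It assumes that llama-server exposes GET /health on the forwarded port 8080; the class name, attempt count, and delay are illustrative choices.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.IntSupplier;

// Sketch: wait for the forwarded model server to become ready.
class ReadinessSketch {

    // Polls the probe until it reports HTTP 200 or the attempts run out.
    static boolean waitUntilReady(IntSupplier statusProbe, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (statusProbe.getAsInt() == 200) {
                return true;
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    // Returns the HTTP status of GET {baseUrl}/health, or -1 when the connection fails.
    static int healthStatus(String baseUrl) {
        try {
            HttpRequest request =
                    HttpRequest.newBuilder(URI.create(baseUrl + "/health")).GET().build();
            return HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.discarding())
                    .statusCode();
        } catch (IOException e) {
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        // Assumption: kubectl port-forward is running and forwards local port 8080.
        boolean ready = waitUntilReady(() -> healthStatus("http://localhost:8080"), 5, 1000);
        System.out.println(ready ? "Model server is ready" : "Model server is not reachable");
    }
}
```

Passing the probe as an IntSupplier keeps the retry loop independent of the HTTP call, so it can be exercised without a running cluster.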

Conclusion

RamaLama makes running AI models simple, reproducible, and container-native. It allows us to use various model registries, such as Ollama and HuggingFace, as well as ready-made images from various OCI registries. It also provides an easy path to run AI models on Podman, Docker, and even Kubernetes. Thanks to RamaLama, I was able to leverage Apple Silicon GPUs when running AI models in containers.