Notes on running Ray with Ascend NPU support through KubeRay
二叉树上的我 · 2025-05-23 · via 二叉树的博客

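This post records the steps for running Ray on Kubernetes with Ascend NPUs: building the Ray image, installing the KubeRay Operator, defining the RayCluster, and verifying resources. The environment is aarch64, and the image must carry the files that the host's Ascend driver needs at runtime.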

Check NPU usage

Before deploying, confirm that no other process is occupying the NPU.

Ray workers start more reliably once the target card is confirmed idle.
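
One way to script this check, as a rough sketch: scan /proc on the host for processes that hold the device node open. It assumes that a process using the card keeps /dev/davinci0 (the device node mounted later in raycluster.yaml) open, which is an assumption and not something this post confirms. find_device_holders is a hypothetical helper name, and seeing other users' processes generally requires root.

```python
import os


def find_device_holders(device="/dev/davinci0", proc_root="/proc"):
    """Return sorted PIDs whose open file descriptors point at `device`."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs available on this system
    holders = set()
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were scanning
            if target == device:
                holders.add(int(entry))
    return sorted(holders)


if __name__ == "__main__":
    pids = find_device_holders()
    if pids:
        print(f"/dev/davinci0 is held by PIDs: {pids}")
    else:
        print("no process holds /dev/davinci0")
```

An empty result only means no visible process has the node open; it does not replace the vendor's own tooling.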

Build the Ray image

Place the host's driver-related files in the build directory beforehand:

  • dcmi
  • npu-smi
  • common
  • driver
  • ascend_install.info
  • vnpu.cfg
  • version.info
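
A missing entry only surfaces when docker build reaches the corresponding COPY step, so it can help to check the build directory first. The following is a minimal sketch; missing_from_context is a hypothetical helper, and it only tests that each name exists, not that the contents match the host driver.

```python
from pathlib import Path

# File and directory names from the list above.
REQUIRED = ["dcmi", "npu-smi", "common", "driver",
            "ascend_install.info", "vnpu.cfg", "version.info"]


def missing_from_context(context_dir="."):
    """Return the required names that do not exist under context_dir."""
    base = Path(context_dir)
    return [name for name in REQUIRED if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_from_context()
    if missing:
        print("missing from build context: " + ", ".join(missing))
    else:
        print("build context is complete")
```

Run it from the directory that holds the Dockerfile.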

Example Dockerfile:

FROM docker.spiritlhl.news/rayproject/ray:2.40.0.160e35-py310-aarch64

USER root

# Chinese locale and CJK fonts
RUN apt-get update && apt-get install -y \
    language-pack-zh-hans \
    fonts-noto-cjk \
    fonts-wqy-microhei \
    && locale-gen zh_CN.UTF-8 && \
    update-locale LANG=zh_CN.UTF-8

ENV LANG=zh_CN.UTF-8
ENV LANGUAGE=zh_CN:zh
ENV LC_ALL=zh_CN.UTF-8

# X11 and font libraries plus common command-line tools
RUN apt-get update && apt-get install -y \
    libxrender1 \
    libxtst6 \
    libxi6 \
    libxext6 \
    libfontconfig1 \
    libfreetype6 \
    libx11-6 \
    git \
    vim \
    nano \
    sudo \
    python3-pip

# Ray client extras, pinned to 2.40.0 to match the base image tag
RUN pip install "ray[client]==2.40.0" --index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Put the Ascend driver libraries on the dynamic linker search path
ENV LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/common:$LD_LIBRARY_PATH
ENV LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/driver:$LD_LIBRARY_PATH

# Driver files taken from the host and placed in the build directory
COPY dcmi /usr/local/dcmi
COPY npu-smi /usr/local/bin/npu-smi
COPY common /usr/local/Ascend/driver/lib64/common
COPY driver /usr/local/Ascend/driver/lib64/driver
COPY ascend_install.info /etc/ascend_install.info
COPY vnpu.cfg /etc/vnpu.cfg
COPY version.info /usr/local/Ascend/driver/version.info

RUN chmod +x /usr/local/bin/npu-smi

ENV USER_ID=0
ENV GROUP_ID=0
# Delete any /etc/passwd line containing "app::"
RUN sed -i '/app::/d' /etc/passwd

WORKDIR /home/ubuntu/
ENV HOME=/home/ubuntu/

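docker.spiritlhl.news is only a registry mirror; removing that prefix gives the original Docker Hub image address.

Build the image and push it to the internal registry: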

docker build -t deploy.bocloud.k8s:40443/public/ray:2.40.0-py310-aarch64-npu .
docker push deploy.bocloud.k8s:40443/public/ray:2.40.0-py310-aarch64-npu

Install KubeRay

git clone https://github.com/ray-project/kuberay.git
cd kuberay/ray-operator/config/default

Confirm that kustomization.yaml exists in the current directory, then install:

kubectl create -k .
kubectl get pods -n default

Create the RayCluster

Create raycluster.yaml:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-npu
  namespace: default
spec:
  rayVersion: "2.40.0"
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: deploy.bocloud.k8s:40443/public/ray:2.40.0-py310-aarch64-npu
            imagePullPolicy: IfNotPresent
            env:
              - name: ASCEND_VISIBLE_DEVICES
                value: "0"
            ports:
              - containerPort: 6379
              - containerPort: 8265
              - containerPort: 10001
            volumeMounts:
              - name: dev-davinci0
                mountPath: /dev/davinci0
              - name: dev-davinci-manager
                mountPath: /dev/davinci_manager
              - name: dev-devmm
                mountPath: /dev/devmm_svm
              - name: dev-hdc
                mountPath: /dev/hisi_hdc
              - name: ascend-root
                mountPath: /usr/local/Ascend
            securityContext:
              privileged: true
        volumes:
          - name: dev-davinci0
            hostPath:
              path: /dev/davinci0
              type: CharDevice
          - name: dev-davinci-manager
            hostPath:
              path: /dev/davinci_manager
              type: CharDevice
          - name: dev-devmm
            hostPath:
              path: /dev/devmm_svm
              type: CharDevice
          - name: dev-hdc
            hostPath:
              path: /dev/hisi_hdc
              type: CharDevice
          - name: ascend-root
            hostPath:
              path: /usr/local/Ascend
              type: Directory
  workerGroupSpecs:
    - groupName: worker-group
      replicas: 2
      minReplicas: 1
      maxReplicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: deploy.bocloud.k8s:40443/public/ray:2.40.0-py310-aarch64-npu
              imagePullPolicy: IfNotPresent
              env:
                - name: ASCEND_VISIBLE_DEVICES
                  value: "0"
              volumeMounts:
                - name: dev-davinci0
                  mountPath: /dev/davinci0
                - name: dev-davinci-manager
                  mountPath: /dev/davinci_manager
                - name: dev-devmm
                  mountPath: /dev/devmm_svm
                - name: dev-hdc
                  mountPath: /dev/hisi_hdc
                - name: ascend-root
                  mountPath: /usr/local/Ascend
                - name: dcmi-root
                  mountPath: /usr/local/dcmi
              securityContext:
                privileged: true
          volumes:
            - name: dev-davinci0
              hostPath:
                path: /dev/davinci0
                type: CharDevice
            - name: dev-davinci-manager
              hostPath:
                path: /dev/davinci_manager
                type: CharDevice
            - name: dev-devmm
              hostPath:
                path: /dev/devmm_svm
                type: CharDevice
            - name: dev-hdc
              hostPath:
                path: /dev/hisi_hdc
                type: CharDevice
            - name: ascend-root
              hostPath:
                path: /usr/local/Ascend
                type: Directory
            - name: dcmi-root
              hostPath:
                path: /usr/local/dcmi
                type: Directory
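
The head and worker specs repeat the same hostPath entries, and each volumeMounts name has to match a volumes name. As a sketch of how those pairs could be generated instead of copied by hand, the helper below builds both lists from a list of card indices. npu_volumes is a hypothetical helper, and the assumption that card N appears as /dev/davinciN is extrapolated from /dev/davinci0; only index 0 is used in this post.

```python
import json

# Shared device nodes from raycluster.yaml above: (volume name, host path).
SHARED_DEVICES = [
    ("dev-davinci-manager", "/dev/davinci_manager"),
    ("dev-devmm", "/dev/devmm_svm"),
    ("dev-hdc", "/dev/hisi_hdc"),
]


def npu_volumes(card_ids):
    """Build matching (volumeMounts, volumes) lists for the given card indices."""
    # Assumption: card N is exposed as /dev/davinciN, following /dev/davinci0.
    devices = [(f"dev-davinci{i}", f"/dev/davinci{i}") for i in card_ids]
    devices += SHARED_DEVICES
    mounts = [{"name": n, "mountPath": p} for n, p in devices]
    volumes = [{"name": n, "hostPath": {"path": p, "type": "CharDevice"}}
               for n, p in devices]
    # The driver directory is mounted as a directory, not a device node.
    mounts.append({"name": "ascend-root", "mountPath": "/usr/local/Ascend"})
    volumes.append({"name": "ascend-root",
                    "hostPath": {"path": "/usr/local/Ascend", "type": "Directory"}})
    return mounts, volumes


if __name__ == "__main__":
    mounts, volumes = npu_volumes([0])
    # JSON is also valid YAML, so the output can be pasted into the manifest.
    print(json.dumps({"volumeMounts": mounts, "volumes": volumes}, indent=2))
```

In the manifest, volumeMounts belongs under the container and volumes under the Pod spec. The worker group's extra dcmi-root entry is not generated here and would still be added by hand.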

Apply the configuration:

kubectl apply -f raycluster.yaml
kubectl get rayclusters
kubectl get pods -n default -l ray.io/cluster=raycluster-npu

Verify Ray cluster resources

See the official Ray quick start for Kubernetes:

https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/raycluster-quick-start.html

Get the head Pod:

export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
echo "$HEAD_POD"

Print the cluster resources:

kubectl exec -it "$HEAD_POD" -- python -c "import pprint; import ray; ray.init(); pprint.pprint(ray.cluster_resources(), sort_dicts=True)"

The output should look similar to:

{'CPU': 96.0,
 'NPU': 2.0,
 'memory': 84300556493.0,
 'node:10.250.0.45': 1.0,
 'node:__internal_head__': 1.0,
 'object_store_memory': 40414524211.0}
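
Reading the dictionary by eye works for a one-off check. For a repeatable check, the same ray.init() and ray.cluster_resources() calls can be wrapped in a small script that fails when NPU is absent. This is a sketch only: missing_resources and main are hypothetical names, and the threshold of 1.0 NPU is an illustrative assumption.

```python
def missing_resources(resources, expected):
    """Return {name: (wanted, found)} for resources below the expected amount."""
    return {
        name: (wanted, resources.get(name, 0.0))
        for name, wanted in expected.items()
        if resources.get(name, 0.0) < wanted
    }


def main():
    # Imported here so the helper above can be used without Ray installed.
    import ray

    ray.init()  # same call as the one-liner above
    gaps = missing_resources(ray.cluster_resources(), {"NPU": 1.0})
    if gaps:
        raise SystemExit(f"resources below expectation: {gaps}")
    print("NPU resource is registered")
```

To use it, copy the script into the head Pod and call main(), for example from python -c after importing the file.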

If NPU is missing from the output, first check the device mounts, ASCEND_VISIBLE_DEVICES, and whether the host's /usr/local/Ascend and /usr/local/dcmi match what the container uses.
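
The first two checks, plus the presence of the two directories, can be scripted from inside a Pod. A minimal sketch, assuming the device paths and mount points from raycluster.yaml; diagnose and is_char_device are hypothetical helpers, and comparing directory contents against the host is not covered.

```python
import os
import stat

# Paths taken from the volumeMounts in raycluster.yaml.
DEVICE_NODES = ["/dev/davinci0", "/dev/davinci_manager",
                "/dev/devmm_svm", "/dev/hisi_hdc"]
DRIVER_DIRS = ["/usr/local/Ascend", "/usr/local/dcmi"]


def is_char_device(path):
    """True if path exists and is a character device node."""
    try:
        return stat.S_ISCHR(os.stat(path).st_mode)
    except OSError:
        return False


def diagnose(env=os.environ, devices=DEVICE_NODES, dirs=DRIVER_DIRS):
    """Return a list of problems found; an empty list means every check passed."""
    problems = [f"{p} is not a character device"
                for p in devices if not is_char_device(p)]
    if not env.get("ASCEND_VISIBLE_DEVICES"):
        problems.append("ASCEND_VISIBLE_DEVICES is unset or empty")
    problems += [f"{d} is not a directory" for d in dirs if not os.path.isdir(d)]
    return problems


if __name__ == "__main__":
    for line in diagnose() or ["all checks passed"]:
        print(line)
```

Note that the head container in raycluster.yaml does not mount /usr/local/dcmi as a volume, so on the head Pod that directory comes from the image.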

Clean up the Ray cluster

kubectl delete -f raycluster.yaml