openFuyao InferNex AI Inference Integrated Deployment on 310P (300I Pro): Environment Issues and Fixes
2026-04-13 · via Oskyla 烹茶室

This article was first published on 🌱 煎茶. Please credit the source when reposting.

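AI Inference Integrated Deployment (InferNex) is an end-to-end integrated deployment solution designed to optimize AI inference services in cloud-native environments. Built on the Kubernetes Gateway API Inference Extension (GIE) and the mainstream LLM technology stack, it uses a Helm Chart to integrate its core acceleration modules: an open-source gateway, intelligent routing, a high-performance inference backend, global KVCache management, an autoscaling decision framework, and an inference observability stack. It provides a complete acceleration path from request ingress and dynamic routing through inference execution to resource management and monitoring, aiming to raise inference throughput and lower TTFT/TPOT latency for a one-stop AI service deployment experience.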

Related documentation:

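Officially, only the 910 has been validated, and I only have a single 310P on hand. In theory it can be made to run, but it needs a series of modifications. This post records the problems encountered during deployment and how each was resolved.

After deployment, several pods never came up: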

NAMESPACE                     NAME                                                          READY   STATUS      RESTARTS       AGE  
ai-inference                  vllm-pd-2p1d-01-decode-54cc4c7579-5h62w                       0/1     Pending     0              5d18h  
ai-inference                  vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg                       0/1     Init:0/2    0              5d  
ai-inference                  vllm-pd-2p1d-01-prefill-5c546dbcc-thmkd                       0/1     Pending     0              5d18h  
ai-inference                  vllm-pd-2p1d-01-prefill-fd68f87cf-jjdlc                       0/1     Pending     0              5d  
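
On a busier cluster, the stuck pods can be picked out of the full listing with a small filter. This is a minimal sketch, assuming the default column layout of `kubectl get pods -A` shown above; the function name `not_ready_pods` is hypothetical and not part of openFuyao or kubectl.

```shell
#!/bin/sh
# not_ready_pods: read `kubectl get pods -A` output on stdin and print
# "namespace/name status" for every pod whose READY column is not n/n.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk '$1 != "NAMESPACE" && NF >= 4 {
    split($3, r, "/")                       # READY looks like "0/1"
    if (r[1] != r[2] && $4 != "Completed")  # skip finished jobs
      print $1 "/" $2, $4
  }'
}

# Usage (requires cluster access):
#   kubectl get pods -A | not_ready_pods
```

Each reported pod can then be passed to `kubectl describe pod` to read its events, which is what the following sections do.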

hccn issue

It looks like hccn_tool cannot be found on the host; see the FailedMount event at the end of the output:

[root@master1 ~]# kubectl  -n ai-inference describe pod vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg 
Name:             vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg
Namespace:        ai-inference
Priority:         0
Service Account:  default
Node:             master1/10.17.30.131
Start Time:       Tue, 07 Apr 2026 09:31:33 +0800
Labels:           app.kubernetes.io/instance=infernex-vllm-pd-2p1d-01
                  app.kubernetes.io/name=inference-backend
                  openfuyao.com/dpSize=1
                  openfuyao.com/engine=vllm
                  openfuyao.com/model=qwen-qwen3-8b
                  openfuyao.com/pdGroupID=qwen3-8b-pd-01
                  openfuyao.com/pdRole=decode
                  openfuyao.com/ppSize=1
                  openfuyao.com/tpSize=1
                  pod-template-hash=6cd64bc69c
Annotations:      checksum/config: 476b32f01fc96ff2896aee7fce288cd2b58cdb2ac825d1a22518798806847a2c
                  huawei.com/AscendReal: Ascend310P-0
                  huawei.com/kltDev: Ascend310P-0
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/vllm-pd-2p1d-01-decode-6cd64bc69c
Init Containers:
  mooncake-config-init:
    Container ID:  
    Image:         hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      set -e
      CONFIG_PATH="/app/mooncake.json"
      mkdir -p "$(dirname "$CONFIG_PATH")"
      cat > /tmp/mooncake_config.tpl << 'EOF'
        local_hostname: "$POD_IP"
        metadata_server: "redis://redis-service:6379"
        master_server_address: "mooncake-master-service:30089"
        device_name: ""
        protocol: "ascend"
        global_segment_size: 42949672960
        use_ascend_direct: true
        
      EOF
      POD_IP_VALUE="${POD_IP:-0.0.0.0}"
      sed "s/\$POD_IP/${POD_IP_VALUE}/g" /tmp/mooncake_config.tpl | yq eval - -o=json > "$CONFIG_PATH"
      
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      POD_NAME:  vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg (v1:metadata.name)
      POD_IP:     (v1:status.podIP)
    Mounts:
      /app from mooncake-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
  huggingface-download:
    Container ID:  
    Image:         cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      hf
      download
      Qwen/Qwen3-8B
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
    Mounts:
      /root/.cache from rootcache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
Containers:
  decode-engine:
    Container ID:  
    Image:         hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0
    Image ID:      
    Port:          8000/TCP (decode-port)
    Host Port:     0/TCP (decode-port)
    Command:
      /bin/bash
      -c
    Args:
      # PHYSICAL_DEVICES stands for the physical devices assigned to the container, use for vllm ascend 0.10.x
      export PHYSICAL_DEVICES=$(ls /dev/davinci* 2>/dev/null | grep -o '[0-9]\+' | sort -n | paste -sd',' -)
      
      # start vllm service
      vllm serve Qwen/Qwen3-8B \
        --served-model-name Qwen/Qwen3-8B \
        --trust-remote-code \
        --no-enable-prefix-caching \
        --port 8000 \
        --tensor-parallel-size 1 \
        --max-model-len 10000 \
        --max-num-batched-tokens 40960 \
        --data-parallel-size 1 \
        --pipeline-parallel-size 1 \
        --gpu-memory-utilization 0.8 \
        --kv-transfer-config '{"engine_id":"'$POD_NAME'","kv_connector":"MultiConnector","kv_connector_extra_config":{"connectors":[{"kv_buffer_device":"npu","kv_connector":"MooncakeConnectorV1","kv_connector_extra_config":{"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2},"use_ascend_direct":true},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"},{"kv_buffer_device":"npu","kv_connector":"AscendStoreConnector","kv_connector_extra_config":{"backend":"mooncake","decode":{"dp_size":1,"tp_size":1},"lookup_rpc_port":"0","prefill":{"dp_size":1,"tp_size":2}},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"}],"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2}},"kv_port":"20001","kv_rank":1,"kv_role":"kv_consumer"}'
      
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                    8
      huawei.com/Ascend310P:  1
      memory:                 64Gi
    Requests:
      cpu:                    4
      huawei.com/Ascend310P:  1
      memory:                 32Gi
    Liveness:                 http-get http://:decode-port/health delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:                http-get http://:decode-port/v1/models delay=0s timeout=5s period=10s #success=1 #failure=3
    Startup:                  http-get http://:decode-port/v1/models delay=30s timeout=5s period=30s #success=1 #failure=60
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
      POD_NAME:              vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg (v1:metadata.name)
      POD_IP:                 (v1:status.podIP)
    Mounts:
      /app from mooncake-config (ro)
      /dev/shm from shm (rw)
      /etc/ascend_install.info from installinfo (rw)
      /etc/hccn.conf from hccnconf (rw)
      /root/.cache from rootcache (rw)
      /usr/bin/hccn_tool from hccntool (rw)
      /usr/local/Ascend/driver/lib64 from lib64 (rw)
      /usr/local/Ascend/driver/version.info from version (rw)
      /usr/local/bin/npu-smi from npusmi (rw)
      /usr/local/dcmi from dcmi (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  mooncake-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  shm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  24Gi
  dcmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/dcmi
    HostPathType:  
  npusmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/bin/npu-smi
    HostPathType:  File
  lib64:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/lib64
    HostPathType:  
  version:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/version.info
    HostPathType:  File
  installinfo:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ascend_install.info
    HostPathType:  File
  hccntool:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/bin/hccn_tool
    HostPathType:  File
  hccnconf:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/hccn.conf
    HostPathType:  File
  rootcache:
    Type:          HostPath (bare host directory volume)
    Path:          /home/llm_cache
    HostPathType:  
  kube-api-access-jswfw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  5d                      default-scheduler  0/1 nodes are available: 1 Insufficient memory. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FailedScheduling  5d (x2 over 5d)         default-scheduler  0/1 nodes are available: 1 Insufficient memory. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FailedScheduling  9m32s                   default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint(s). no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  9m20s (x24 over 9m29s)  default-scheduler  0/1 nodes are available: 1 Insufficient huawei.com/Ascend310P. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled         8m58s                   default-scheduler  Successfully assigned ai-inference/vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg to master1
  Warning  FailedMount       44s (x12 over 8m58s)    kubelet            MountVolume.SetUp failed for volume "hccntool" : hostPath type check failed: /usr/bin/hccn_tool is not a file

Temporary workaround (an empty placeholder file, just enough to pass the hostPath type check):

touch /usr/bin/hccn_tool
chmod +x /usr/bin/hccn_tool
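
The pod mounts several hostPath volumes with `HostPathType: File`, and any of them can trigger the same FailedMount event if the path is missing on the node. A minimal sketch for checking them up front follows; the helper name `missing_hostpath_files` is hypothetical, and `[ -f ]` only approximates the kubelet's own type check.

```shell
#!/bin/sh
# missing_hostpath_files: print every given path that is not a regular file.
# A hostPath volume of type "File" fails to mount when its path is missing
# or is not a file.
missing_hostpath_files() {
  for p in "$@"; do
    [ -f "$p" ] || printf '%s\n' "$p"
  done
}

# Paths taken from the "HostPathType: File" volumes in the describe output.
# Run this on the node; no output means all of them exist.
missing_hostpath_files \
  /usr/local/bin/npu-smi \
  /usr/local/Ascend/driver/version.info \
  /etc/ascend_install.info \
  /usr/bin/hccn_tool \
  /etc/hccn.conf
```

Any path it prints would need the same kind of placeholder, or better, the real file from the driver installation.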

huggingface-download fails

[root@master1 ~]# kubectl  -n ai-inference describe pod vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp 
Name:             vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp
Namespace:        ai-inference
Priority:         0
Service Account:  default
Node:             master1/10.17.30.131
Start Time:       Tue, 07 Apr 2026 09:45:19 +0800
Labels:           app.kubernetes.io/instance=infernex-vllm-pd-2p1d-01
                  app.kubernetes.io/name=inference-backend
                  openfuyao.com/dpSize=1
                  openfuyao.com/engine=vllm
                  openfuyao.com/model=qwen-qwen3-8b
                  openfuyao.com/pdGroupID=qwen3-8b-pd-01
                  openfuyao.com/pdRole=decode
                  openfuyao.com/ppSize=1
                  openfuyao.com/tpSize=1
                  pod-template-hash=6cd64bc69c
Annotations:      checksum/config: 476b32f01fc96ff2896aee7fce288cd2b58cdb2ac825d1a22518798806847a2c
                  cni.projectcalico.org/containerID: 32b3384131b69054ca45acc8afe5e272b0ca681ea6d0611b3fec7316e3532e80
                  cni.projectcalico.org/podIP: 192.168.137.155/32
                  cni.projectcalico.org/podIPs: 192.168.137.155/32
                  huawei.com/AscendReal: Ascend310P-0
                  huawei.com/kltDev: Ascend310P-0
Status:           Pending
IP:               192.168.137.155
IPs:
  IP:           192.168.137.155
Controlled By:  ReplicaSet/vllm-pd-2p1d-01-decode-6cd64bc69c
Init Containers:
  mooncake-config-init:
    Container ID:  containerd://4ede488dd17e33f3a980aee6fa4eac3093ad6366c3854fe85436f81f6e1df7bb
    Image:         hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
    Image ID:      hub.oepkgs.net/openfuyao/mikefarah/yq@sha256:4facc66fdcc785ec961ef7f2185f53f862f462eefe1d50c2eb311c2bb26823e3
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      set -e
      CONFIG_PATH="/app/mooncake.json"
      mkdir -p "$(dirname "$CONFIG_PATH")"
      cat > /tmp/mooncake_config.tpl << 'EOF'
        local_hostname: "$POD_IP"
        metadata_server: "redis://redis-service:6379"
        master_server_address: "mooncake-master-service:30089"
        device_name: ""
        protocol: "ascend"
        global_segment_size: 42949672960
        use_ascend_direct: true
        
      EOF
      POD_IP_VALUE="${POD_IP:-0.0.0.0}"
      sed "s/\$POD_IP/${POD_IP_VALUE}/g" /tmp/mooncake_config.tpl | yq eval - -o=json > "$CONFIG_PATH"
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 07 Apr 2026 09:45:20 +0800
      Finished:     Tue, 07 Apr 2026 09:45:20 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      POD_NAME:  vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp (v1:metadata.name)
      POD_IP:     (v1:status.podIP)
    Mounts:
      /app from mooncake-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
  huggingface-download:
    Container ID:  containerd://84e15b7d9e2f7382181e309c1558174ec48e58ad0ae14f92ae0dfff284da76e5
    Image:         cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
    Image ID:      cr.openfuyao.cn/openfuyao/huggingface-download@sha256:ac86348b5e6934a020c21c4f0ebf81b520194ba8e549f1847ecc7521b82d9a8d
    Port:          <none>
    Host Port:     <none>
    Command:
      hf
      download
      Qwen/Qwen3-8B
    State:          Running
      Started:      Tue, 07 Apr 2026 09:47:33 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 07 Apr 2026 09:45:20 +0800
      Finished:     Tue, 07 Apr 2026 09:47:32 +0800
    Ready:          False
    Restart Count:  1
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
    Mounts:
      /root/.cache from rootcache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
Containers:
  decode-engine:
    Container ID:  
    Image:         hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0
    Image ID:      
    Port:          8000/TCP (decode-port)
    Host Port:     0/TCP (decode-port)
    Command:
      /bin/bash
      -c
    Args:
      # PHYSICAL_DEVICES stands for the physical devices assigned to the container, use for vllm ascend 0.10.x
      export PHYSICAL_DEVICES=$(ls /dev/davinci* 2>/dev/null | grep -o '[0-9]\+' | sort -n | paste -sd',' -)
      
      # start vllm service
      vllm serve Qwen/Qwen3-8B \
        --served-model-name Qwen/Qwen3-8B \
        --trust-remote-code \
        --no-enable-prefix-caching \
        --port 8000 \
        --tensor-parallel-size 1 \
        --max-model-len 10000 \
        --max-num-batched-tokens 40960 \
        --data-parallel-size 1 \
        --pipeline-parallel-size 1 \
        --gpu-memory-utilization 0.8 \
        --kv-transfer-config '{"engine_id":"'$POD_NAME'","kv_connector":"MultiConnector","kv_connector_extra_config":{"connectors":[{"kv_buffer_device":"npu","kv_connector":"MooncakeConnectorV1","kv_connector_extra_config":{"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2},"use_ascend_direct":true},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"},{"kv_buffer_device":"npu","kv_connector":"AscendStoreConnector","kv_connector_extra_config":{"backend":"mooncake","decode":{"dp_size":1,"tp_size":1},"lookup_rpc_port":"0","prefill":{"dp_size":1,"tp_size":2}},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"}],"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2}},"kv_port":"20001","kv_rank":1,"kv_role":"kv_consumer"}'
      
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                    8
      huawei.com/Ascend310P:  1
      memory:                 64Gi
    Requests:
      cpu:                    4
      huawei.com/Ascend310P:  1
      memory:                 32Gi
    Liveness:                 http-get http://:decode-port/health delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:                http-get http://:decode-port/v1/models delay=0s timeout=5s period=10s #success=1 #failure=3
    Startup:                  http-get http://:decode-port/v1/models delay=30s timeout=5s period=30s #success=1 #failure=60
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
      POD_NAME:              vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp (v1:metadata.name)
      POD_IP:                 (v1:status.podIP)
    Mounts:
      /app from mooncake-config (ro)
      /dev/shm from shm (rw)
      /etc/ascend_install.info from installinfo (rw)
      /etc/hccn.conf from hccnconf (rw)
      /root/.cache from rootcache (rw)
      /usr/bin/hccn_tool from hccntool (rw)
      /usr/local/Ascend/driver/lib64 from lib64 (rw)
      /usr/local/Ascend/driver/version.info from version (rw)
      /usr/local/bin/npu-smi from npusmi (rw)
      /usr/local/dcmi from dcmi (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  mooncake-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  shm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  24Gi
  dcmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/dcmi
    HostPathType:  
  npusmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/bin/npu-smi
    HostPathType:  File
  lib64:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/lib64
    HostPathType:  
  version:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/version.info
    HostPathType:  File
  installinfo:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ascend_install.info
    HostPathType:  File
  hccntool:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/bin/hccn_tool
    HostPathType:  File
  hccnconf:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/hccn.conf
    HostPathType:  File
  rootcache:
    Type:          HostPath (bare host directory volume)
    Path:          /home/llm_cache
    HostPathType:  
  kube-api-access-9jvbd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  3m20s               default-scheduler  0/1 nodes are available: 1 Insufficient huawei.com/Ascend310P. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Normal   Scheduled         2m15s               default-scheduler  Successfully assigned ai-inference/vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp to master1
  Normal   Pulled            2m14s               kubelet            Container image "hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1" already present on machine
  Normal   Created           2m14s               kubelet            Created container: mooncake-config-init
  Normal   Started           2m14s               kubelet            Started container mooncake-config-init
  Normal   Pulled            1s (x2 over 2m14s)  kubelet            Container image "cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2" already present on machine
  Normal   Created           1s (x2 over 2m14s)  kubelet            Created container: huggingface-download
  Normal   Started           1s (x2 over 2m14s)  kubelet            Started container huggingface-download

Check the error logs:

# Logs of the current attempt
kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download

# Logs of the previous, failed attempt
kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download --previous

[root@master1 ~]# kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download
[root@master1 ~]# 
[root@master1 ~]# kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download --previous
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 250, in handle_request
    resp = self._pool.handle_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py", line 256, in handle_request
    raise exc from None
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py", line 236, in handle_request
    response = connection.handle_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 101, in handle_request
    raise exc
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 78, in handle_request
    stream = self._connect(request)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 124, in _connect
    stream = self._network_backend.connect_tcp(**kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_backends/sync.py", line 207, in connect_tcp
    with map_exceptions(exc_map):
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 240, in snapshot_download
    repo_info = api.repo_info(repo_id=repo_id, repo_type=repo_type, revision=revision)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3285, in repo_info
    return method(
           ^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3020, in model_info
    r = get_session().get(path, headers=headers, timeout=timeout, params=params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1053, in get
    return self.request(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 825, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 914, in send
    response = self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 942, in _send_handling_auth
    response = self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 979, in _send_handling_redirects
    response = self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1014, in _send_single_request
    response = transport.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 249, in handle_request
    with map_httpcore_exceptions():
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/hf", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/hf.py", line 113, in main
    app()
  File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1152, in __call__
    raise e
  File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1135, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/typer/core.py", line 795, in main
    return _main(
           ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/typer/core.py", line 188, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1514, in wrapper
    return callback(**use_params)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/download.py", line 224, in download
    _print_result(run_download())
                  ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/download.py", line 185, in run_download
    return snapshot_download(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 324, in snapshot_download
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: Got: ConnectError: [Errno 101] Network is unreachable
An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

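This confirms a network problem: the node cannot reach HuggingFace (Network is unreachable), and there is no local cache either.

Fix: switch to a mirror hosted in China (recommended)

Add an environment variable to the huggingface-download init container of the Deployment: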

env:
  - name: HF_ENDPOINT
    value: "https://hf-mirror.com"

Ideally add the same variable to decode-engine as well; otherwise it fails with the same error.
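Before redeploying, it can help to confirm from the affected node (or a debug pod) which endpoint is actually reachable. The following is a minimal sketch using only the Python standard library. The function name `endpoint_reachable` is hypothetical, and the sketch deliberately ignores proxy environment variables so that it tests the direct network path.

```python
import urllib.error
import urllib.request


def endpoint_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if any HTTP response comes back from `url`."""
    # An empty ProxyHandler disables proxies taken from the environment,
    # so this checks the direct route from the current host.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so the network path works.
        return True
    except OSError:
        # URLError is a subclass of OSError: DNS failure, connection refused,
        # network unreachable, timeout, and similar errors all land here.
        return False
```

Calling `endpoint_reachable("https://huggingface.co")` and `endpoint_reachable("https://hf-mirror.com")` on the node would show which of the two is usable before changing `HF_ENDPOINT`.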

The logs may show no output at all:

kubectl -n ai-inference logs deployments/vllm-pd-2p1d-01-decode huggingface-download  -f

In that case, just watch the size of the llm cache directory; it keeps changing while the download is in progress:

$ watch -n 2 -d 'du -sh /home/llm_cache/'
426M    /home/llm_cache/
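The same progress check can be scripted when `watch` is not convenient. This is a rough sketch, not an exact equivalent of `du -sh`: it sums apparent file sizes rather than disk blocks, and it skips symlinks so that linked files are not counted twice. The function name `dir_size_bytes` is made up for illustration.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of regular files under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                # Files can vanish or be renamed while a download is running.
                continue
    return total
```

Sampling `dir_size_bytes("/home/llm_cache/")` twice, a few seconds apart, and comparing the two values gives the same signal as the `watch` command above.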
[root@master1 ~]# kubectl  -n ai-inference logs  vllm-pd-2p1d-01-decode-7d487c49cd-qw89v
Defaulted container "decode-engine" out of: decode-engine, mooncake-config-init (init), huggingface-download (init)
...
INFO 04-07 05:15:19 [__init__.py:217] Platform plugin ascend is activated
(EngineCore_DP0 pid=94) INFO 04-07 05:15:33 [ascend_config.py:55] Linear layer sharding enabled with config: None. Note: This feature works optimally with FLASHCOMM2 and DSA-CP enabled; using it without these features may result in significant performance degradation.
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] EngineCore failed to start.
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] Traceback (most recent call last):
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 59, in run_engine_core
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     super().__init__(
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self._init_executor()
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self.driver_worker.init_worker(all_kwargs=[kwargs])
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/worker/worker_base.py", line 313, in init_worker
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self.worker = worker_class(**kwargs)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]                   ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 116, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     check_ascend_device_type()
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 708, in check_ascend_device_type
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     assert _ascend_device_type == cur_device_type, f"Current device type: {cur_device_type} does not match the installed version's device type: {_ascend_device_type}, please check your installation package."
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] AssertionError: Current device type: AscendDeviceType._310P does not match the installed version's device type: AscendDeviceType.A2, please check your installation package.
(EngineCore_DP0 pid=94) Process EngineCore_DP0:
(EngineCore_DP0 pid=94) Traceback (most recent call last):
(EngineCore_DP0 pid=94)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=94)     self.run()
(EngineCore_DP0 pid=94)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=94)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 72, in run_engine_core
(EngineCore_DP0 pid=94)     raise e
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 59, in run_engine_core
(EngineCore_DP0 pid=94)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=94)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=94)     super().__init__(
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=94)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=94)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=94)     self._init_executor()
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_DP0 pid=94)     self.driver_worker.init_worker(all_kwargs=[kwargs])
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/worker/worker_base.py", line 313, in init_worker
(EngineCore_DP0 pid=94)     self.worker = worker_class(**kwargs)
(EngineCore_DP0 pid=94)                   ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 116, in __init__
(EngineCore_DP0 pid=94)     check_ascend_device_type()
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 708, in check_ascend_device_type
(EngineCore_DP0 pid=94)     assert _ascend_device_type == cur_device_type, f"Current device type: {cur_device_type} does not match the installed version's device type: {_ascend_device_type}, please check your installation package."
(EngineCore_DP0 pid=94)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) AssertionError: Current device type: AscendDeviceType._310P does not match the installed version's device type: AscendDeviceType.A2, please check your installation package.

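Investigation showed that the hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0 image is built for the 910; the official documentation states this.

The 310P should use an image with the 310p suffix. After searching the image registry, I replaced it with quay.io/ascend/vllm-ascend:v0.18.0rc1-310p-openeuler and tried again.

Comparing sha256 digests shows that the hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0 image is an exact copy of quay.io/ascend/vllm-ascend:v0.13.0; the digests are identical.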
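When several pods or images need checking, the mismatch can be pulled out of the log text instead of reading the whole traceback. The sketch below assumes the message keeps the wording seen in the AssertionError above; the regular expression and the function name `find_device_mismatch` are illustrative and not part of vllm-ascend.

```python
import re
from typing import Optional, Tuple

# Matches the rendered AssertionError line, not the f-string source line
# that also appears in the traceback.
MISMATCH_RE = re.compile(
    r"Current device type: AscendDeviceType\.(?P<current>\w+) does not match "
    r"the installed version's device type: AscendDeviceType\.(?P<installed>\w+)"
)


def find_device_mismatch(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (current_device, image_device) if the log reports a mismatch."""
    match = MISMATCH_RE.search(log_text)
    if match is None:
        return None
    return match.group("current"), match.group("installed")
```

Applied to the log above, this reports that the hardware is `_310P` while the image was built for `A2`, which is the cue to switch to a 310p-suffixed image.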

310p vllm-ascend error

However, after switching to an image with the 310p suffix, startup fails with a different error:

Every 1.0s: kubectl -n ai-inference get pod                                                            master1: Thu Apr  9 08:36:06 2026

NAME                                          READY   STATUS             RESTARTS        AGE
cache-indexer-deployment-65d5b449f6-x9l46     1/1     Running            0               17h
inference-gateway-istio-5f9b7d78f6-7kbrw      1/1     Running            26 (15h ago)    17h
infernex-epp-5cc456bd-4vvmv                   1/1     Running            0               17h
mooncake-master-deployment-74cc5666b7-fr4fq   1/1     Running            0               17h
redis-server-deployment-67566b9765-m66lc      1/1     Running            0               17h
vllm-pd-2p1d-01-decode-7687ccb7b-vg98n        0/1     CrashLoopBackOff   161 (31s ago)   16h
vllm-pd-2p1d-01-prefill-66f7564d7f-tdd84      0/1     Pending            0               43h
vllm-pd-2p1d-01-prefill-fd68f87cf-mhtdk       0/1     Pending            0               40h
vllm-pd-2p1d-01-proxy-7ff4f59865-h8xbw        1/1     Running            0               17h




────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(APIServer pid=1)   File "/vllm-workspace/vllm-ascend/vllm_ascend/distributed/kv_transfer/ascend_multi_connector.py", line 5, in <module>
(APIServer pid=1)     from vllm_ascend.distributed.kv_transfer.kv_p2p.mooncake_layerwise_connector import MooncakeLayerwiseConnector
(APIServer pid=1)   File "/vllm-workspace/vllm-ascend/vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py", line 25, in <module>
(APIServer pid=1)     from mooncake.engine import TransferEngine  # type: ignore
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ModuleNotFoundError: No module named 'mooncake'
(APIServer pid=1) [ERROR] 2026-04-08-08:58:03 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
[root@master1 ~]# kubectl  -n ai-inference describe pod vllm-pd-2p1d-01-decode-7687ccb7b-vg98n | grep -i image:
    Image:         hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
    Image:         cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
    Image:         quay.io/ascend/vllm-ascend:main-310p

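For this issue, openFuyao provided the following guidance:

@tl.s Troubleshooting the InferNex deployment on 310P:

The vllm-ascend 310P image does not include mooncake, so KV cache transfer between prefill and decode is not supported.
https://github.com/vllm-project/vllm-ascend/blob/main/Dockerfile.310p

Aggregated-mode deployment is recommended; see the InferNex aggregated-mode example. Remove the `inference-backend.services[0].kvTransferConfig` item when deploying so that no mooncake functionality is used:
https://gitcode.com/openFuyao/InferNex/blob/0.22.2/examples/vllm-aggregated-random-values.yaml

vllm-ascend online inference documentation for 310P:
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu

Inference on 310P has not been verified yet. You can try v0.13.0 or v0.18.0rc1; the official vLLM documentation covers both versions.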
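The suggested removal of `inference-backend.services[0].kvTransferConfig` can be sketched on an already-loaded values structure. This is only an illustration: it assumes the values file has been parsed into a plain Python dict (for example by a YAML library), and the function name `drop_kv_transfer_config` and the sample content are hypothetical.

```python
import copy


def drop_kv_transfer_config(values: dict) -> dict:
    """Return a copy of `values` without services[0].kvTransferConfig."""
    result = copy.deepcopy(values)
    services = result.get("inference-backend", {}).get("services", [])
    if services:
        # pop() with a default is a no-op when the key is already absent.
        services[0].pop("kvTransferConfig", None)
    return result


# Hypothetical stand-in for the parsed values file; real files carry more keys.
sample_values = {
    "inference-backend": {
        "services": [
            {"name": "example", "kvTransferConfig": {"placeholder": True}},
        ]
    }
}
cleaned = drop_kv_transfer_config(sample_values)
```

Working on a deep copy keeps the original structure untouched, which makes it easy to diff the two before writing the result back.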

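Wrong NPU count in aggregated mode

The environment has only one 310P card, but two are requested by default. Two places need changing: the requested resource count (set to 1) and the vLLM startup argument tensor_parallel_size (set to 1).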

resources:
          limits:
            cpu: "8"
            huawei.com/Ascend310P: "1"
            memory: 64Gi
          requests:
            cpu: "4"
            huawei.com/Ascend310P: "1"
            memory: 32Gi
...
  # start vllm service
          vllm serve Qwen/Qwen3-8B \
            --served-model-name Qwen/Qwen3-8B \
            --trust-remote-code \
            --enable-prefix-caching \
            --port 8000 \
            --tensor-parallel-size 1 \
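Because the two settings must agree, a small consistency check can catch the mismatch before the pod is scheduled. The sketch below assumes no pipeline or data parallelism, so the number of NPUs equals `--tensor-parallel-size`; the helper names are hypothetical and the inputs are plain Python structures rather than a parsed manifest.

```python
from typing import Dict, List

NPU_RESOURCE = "huawei.com/Ascend310P"


def parse_tensor_parallel_size(args: List[str]) -> int:
    """Read --tensor-parallel-size from a vllm argument list (vLLM defaults to 1)."""
    for i, token in enumerate(args):
        if token == "--tensor-parallel-size" and i + 1 < len(args):
            return int(args[i + 1])
        if token.startswith("--tensor-parallel-size="):
            return int(token.split("=", 1)[1])
    return 1


def npu_request_matches(limits: Dict[str, str], args: List[str]) -> bool:
    """True if the NPU count in the resource limits equals the tensor parallel size."""
    requested = int(limits.get(NPU_RESOURCE, "0"))
    return requested == parse_tensor_parallel_size(args)
```

With the limits and arguments shown above (one `huawei.com/Ascend310P` and `--tensor-parallel-size 1`), the check passes; with the default request of two cards it would fail.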

bf16 data type error

(EngineCore pid=35) [PID: 35] 2026-04-09-08:27:33.217.113 AclNN_Parameter_Error(EZ1001): Tensor self not implemented for DT_BFLOAT16, should be in dtype support list [DT_FLOAT,DT_FLOAT16,DT_INT8,DT_INT16,DT_INT32,DT_INT64,DT_UINT8,DT_BOOL,DT_DOUBLE,].
(EngineCore pid=35) 
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1)     return runner.run(wrapper())
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-09-08:27:49 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

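According to an AI assistant (not yet verified): the Ascend 310P chip does not support the bfloat16 (bf16) data type, but vLLM uses torch.ones(…, dtype=torch.bfloat16) when initializing the rotary embedding (RoPE), which causes the ACL operator to fail. The only floating-point types supported by the 310P operator library are float32 and float16; bf16 is not among them.

Forcing float16 with the --dtype half argument resolves it: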

# Force float16 instead of bfloat16
vllm serve Qwen/Qwen3-8B \
  --dtype half \
  ...other arguments
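The rejected dtype and the supported list can be read straight from the AclNN error line, which makes the choice of `--dtype` less of a guess. The following sketch assumes the error keeps the wording shown in the log above; the regular expression and function names are illustrative only.

```python
import re
from typing import List, Optional, Tuple

DTYPE_ERROR_RE = re.compile(
    r"not implemented for (?P<rejected>DT_\w+), "
    r"should be in dtype support list \[(?P<supported>[^\]]*)\]"
)


def parse_dtype_error(line: str) -> Optional[Tuple[str, List[str]]]:
    """Return (rejected_dtype, supported_dtypes) from an AclNN parameter error."""
    match = DTYPE_ERROR_RE.search(line)
    if match is None:
        return None
    # The list in the log ends with a trailing comma, so drop empty items.
    supported = [item for item in match.group("supported").split(",") if item]
    return match.group("rejected"), supported


def suggest_dtype_flag(line: str) -> Optional[str]:
    """Suggest '--dtype half' when bf16 is rejected but fp16 is supported."""
    parsed = parse_dtype_error(line)
    if parsed is None:
        return None
    rejected, supported = parsed
    if rejected == "DT_BFLOAT16" and "DT_FLOAT16" in supported:
        return "--dtype half"
    return None
```

For the error above, the rejected type is `DT_BFLOAT16` and `DT_FLOAT16` is in the supported list, which is consistent with the `--dtype half` workaround.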

npu_dynamic_quant operator error

With the issues above resolved, the pod runs for longer:

(EngineCore pid=35) INFO 04-09 08:43:11 [weight_utils.py:574] Time spent downloading weights for Qwen/Qwen3-8B: 1.077223 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:08<00:32,  8.07s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:16<00:24,  8.33s/it]

But it still ends in an error; the traceback is as follows:

(EngineCore pid=35) INFO 04-09 08:47:55 [default_loader.py:384] Loading weights took 35.12 seconds
(EngineCore pid=35) INFO 04-09 08:47:57 [model_runner_v1.py:2589] Loading model weights took 17.6043 GB
.(EngineCore pid=35) INFO 04-09 08:48:11 [backends.py:988] Using cache directory: /root/.cache/vllm/torch_compile_cache/4de24ceb58/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=35) INFO 04-09 08:48:11 [backends.py:1048] Dynamo bytecode transform time: 13.07 s
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] EngineCore failed to start.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     super().__init__(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.model_runner.profile_run()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     super().profile_run()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return super()._dummy_run(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     outputs = self._model_forward(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 322, in forward
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 597, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     output = TorchCompileWithNoGuardsWrapper.__call__(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 182, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._call_with_optional_nvtx_range(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return callable_fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 845, in compile_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2196, in _call_user_compiler
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     raise BackendCompilerFailed(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2171, in _call_user_compiler
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     compiled_fn = compiler_fn(gm, example_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     compiled_gm = compiler_fn(gm, example_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/__init__.py", line 2437, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self.compiler_fn(model_, inputs_, **self.kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwds)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 1063, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.configure_post_pass()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 847, in configure_post_pass
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.pass_manager.configure(self.vllm_config)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/graph_fusion_pass_manager.py", line 55, in configure
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.passes.append(AddRMSNormQuantFusionPass(config))
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 493, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     AddRMSNormDynamicQuantPattern(vllm_config, eps=eps).register(self.pattern_match_passes)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/base_pattern.py", line 49, in register
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     pm.register_replacement(pattern_fn, replacement_fn, example_inputs, pm.fwd_only, pm_pass)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1552, in register_replacement
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     pattern, gm = gen_pattern_and_search_gm(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwds)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1760, in gen_pattern_and_search_gm
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     search_gm = trace_fn(search_fn, flat_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 2115, in fwd_only
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2429, in wrapped
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return make_fx_tracer.trace(f, *args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2356, in trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._trace_inner(f, *args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2318, in _trace_inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     t = dispatch_trace(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         ^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_compile.py", line 53, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return disable_fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1303, in dispatch_trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/_symbolic_trace.py", line 868, in trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     (self.create_arg(fn(*args)),),
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                      ^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1361, in wrapped
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     out = f(*tensors)  # type:ignore[call-arg]
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]           ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 300, in pattern
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     quantized_output = torch.ops.npu.npu_dynamic_quant(out0)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1409, in __torch_function__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_stats.py", line 28, in wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1534, in __torch_dispatch__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return proxy_call(self, func, self.pre_dispatch, args, kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 994, in proxy_call
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     out = func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]           ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 841, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0xfffec83c17d0>' raised:
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] RuntimeError: npu_dynamic_quant:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:82 NPU function error: call aclnnDynamicQuantV2 failed, error code is 561103
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] [ERROR] 2026-04-09-08:48:12 (PID:35, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] [PID: 35] 2026-04-09-08:48:12.623.243 AclNN_Parameter_Error(EZ1001): DynamicQuant launch kernel failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         TraceBack (most recent call last):
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         Tiling failed
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         Tiling Failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         Kernel GetWorkspace failed. opType: 21
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         DynamicQuant launch kernel failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] 
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] 
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] 
(EngineCore pid=35) Process EngineCore:
(EngineCore pid=35) Traceback (most recent call last):
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=35)     self.run()
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore pid=35)     self._target(*self._args, **self._kwargs)
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=35)     raise e
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35)     super().__init__(
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35)     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35)     self.model_runner.profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35)     super().profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35)     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35)                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35)     return super()._dummy_run(
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35)     outputs = self._model_forward(
(EngineCore pid=35)               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 322, in forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 597, in __call__
(EngineCore pid=35)     output = TorchCompileWithNoGuardsWrapper.__call__(
(EngineCore pid=35)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 182, in __call__
(EngineCore pid=35)     return self._call_with_optional_nvtx_range(
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=35)     return callable_fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 845, in compile_wrapper
(EngineCore pid=35)     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
(EngineCore pid=35)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2196, in _call_user_compiler
(EngineCore pid=35)     raise BackendCompilerFailed(
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2171, in _call_user_compiler
(EngineCore pid=35)     compiled_fn = compiler_fn(gm, example_inputs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
(EngineCore pid=35)     compiled_gm = compiler_fn(gm, example_inputs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/__init__.py", line 2437, in __call__
(EngineCore pid=35)     return self.compiler_fn(model_, inputs_, **self.kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35)     return func(*args, **kwds)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 1063, in __call__
(EngineCore pid=35)     self.configure_post_pass()
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 847, in configure_post_pass
(EngineCore pid=35)     self.pass_manager.configure(self.vllm_config)
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/graph_fusion_pass_manager.py", line 55, in configure
(EngineCore pid=35)     self.passes.append(AddRMSNormQuantFusionPass(config))
(EngineCore pid=35)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 493, in __init__
(EngineCore pid=35)     AddRMSNormDynamicQuantPattern(vllm_config, eps=eps).register(self.pattern_match_passes)
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/base_pattern.py", line 49, in register
(EngineCore pid=35)     pm.register_replacement(pattern_fn, replacement_fn, example_inputs, pm.fwd_only, pm_pass)
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1552, in register_replacement
(EngineCore pid=35)     pattern, gm = gen_pattern_and_search_gm(
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35)     return func(*args, **kwds)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1760, in gen_pattern_and_search_gm
(EngineCore pid=35)     search_gm = trace_fn(search_fn, flat_inputs)
(EngineCore pid=35)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 2115, in fwd_only
(EngineCore pid=35)     gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(EngineCore pid=35)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2429, in wrapped
(EngineCore pid=35)     return make_fx_tracer.trace(f, *args)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2356, in trace
(EngineCore pid=35)     return self._trace_inner(f, *args)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2318, in _trace_inner
(EngineCore pid=35)     t = dispatch_trace(
(EngineCore pid=35)         ^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_compile.py", line 53, in inner
(EngineCore pid=35)     return disable_fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35)     return fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1303, in dispatch_trace
(EngineCore pid=35)     graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
(EngineCore pid=35)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35)     return fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/_symbolic_trace.py", line 868, in trace
(EngineCore pid=35)     (self.create_arg(fn(*args)),),
(EngineCore pid=35)                      ^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1361, in wrapped
(EngineCore pid=35)     out = f(*tensors)  # type:ignore[call-arg]
(EngineCore pid=35)           ^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 300, in pattern
(EngineCore pid=35)     quantized_output = torch.ops.npu.npu_dynamic_quant(out0)
(EngineCore pid=35)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1409, in __torch_function__
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_stats.py", line 28, in wrapper
(EngineCore pid=35)     return fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1534, in __torch_dispatch__
(EngineCore pid=35)     return proxy_call(self, func, self.pre_dispatch, args, kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 994, in proxy_call
(EngineCore pid=35)     out = func(*args, **kwargs)
(EngineCore pid=35)           ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 841, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0xfffec83c17d0>' raised:
(EngineCore pid=35) RuntimeError: npu_dynamic_quant:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:82 NPU function error: call aclnnDynamicQuantV2 failed, error code is 561103
(EngineCore pid=35) [ERROR] 2026-04-09-08:48:12 (PID:35, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore pid=35) [PID: 35] 2026-04-09-08:48:12.623.243 AclNN_Parameter_Error(EZ1001): DynamicQuant launch kernel failed.
(EngineCore pid=35)         TraceBack (most recent call last):
(EngineCore pid=35)         Tiling failed
(EngineCore pid=35)         Tiling Failed.
(EngineCore pid=35)         Kernel GetWorkspace failed. opType: 21
(EngineCore pid=35)         DynamicQuant launch kernel failed.
(EngineCore pid=35) 
(EngineCore pid=35) 
(EngineCore pid=35) Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore pid=35) 
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1)     return runner.run(wrapper())
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-09-08:48:30 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

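According to an AI assistant, the root cause is that the norm_quant fusion pass invokes the npu_dynamic_quant operator on the 310P during compilation, and the 310P does not support this dynamic quantization operator (or the current CANN version is incompatible), so tiling fails. The traceback is consistent with this: the failure occurs while AddRMSNormDynamicQuantPattern is being registered, when the pattern function is traced with real tensors and calls torch.ops.npu.npu_dynamic_quant, which fails inside aclnnDynamicQuantV2 with error code 561103.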

The openFuyao community replied as follows:

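@tl.s I looked at the error log. The 310 card most likely does not support the DynamicQuantV2 operator that vllm-ascend enables by default. Try adding the startup options --enforce-eager and --no-quant. In InferNex, vLLM startup arguments that are not exposed directly by default can be added under inference-backend.services[0].pd.prefill/decode.extraArgs. For example: extraArgs: - "--enforce-eager" - "--no-quant"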

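If that still does not work, try switching to a different model and deploying it by following the example in the official 310P documentation, for instance Qwen2.5-7B-Instruct. https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu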
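
The extraArgs suggestion in the reply can be sketched as a YAML values fragment. This is only an illustration: it assumes the dotted path inference-backend.services[0].pd.prefill/decode.extraArgs maps one-to-one onto nested YAML keys, it omits every other key a real InferNex values file would contain, and --no-quant is copied from the reply as-is rather than verified against the vLLM CLI.

```yaml
# Sketch only: the key nesting is inferred from the path quoted in the reply,
# not taken from the InferNex chart itself.
inference-backend:
  services:
    # services[0]
    - pd:
        prefill:
          extraArgs:
            - "--enforce-eager"   # disables torch.compile and graph capture
            - "--no-quant"        # as suggested in the reply; unverified flag
        decode:
          extraArgs:
            - "--enforce-eager"
            - "--no-quant"
```

One way to confirm that an argument actually reached vLLM is the "non-default args" line printed at startup. In the later run shown under OOM, that line lists 'enforce_eager': True.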

OOM

[root@master1 fuyao-26.3-rc3]# kubectl  -n ai-inference logs deployments/vllm-pd-2p1d-01 -f 
Defaulted container "aggregated-engine" out of: aggregated-engine, huggingface-download (init)



INFO 04-13 06:21:29 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:21:29 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:21:29 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:21:29 [__init__.py:239] Platform plugin ascend is activated
INFO 04-13 06:21:42 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.netloader.netloader.ModelNetLoaderElastic'>` with load format `netloader`
INFO 04-13 06:21:42 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.rfork.rfork_loader.RForkModelLoader'>` with load format `rfork`
WARNING 04-13 06:21:44 [__init__.py:80] The quantization method 'ascend' already exists and will be overwritten by the quantization config <class 'vllm_ascend._310p.quantization.modelslim_config.AscendModelSlimConfig310'>.
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297] 
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.0
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-7B-Instruct
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297] 
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-7B-Instruct', 'model': 'Qwen/Qwen2.5-7B-Instruct', 'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 4096, 'enforce_eager': True, 'served_model_name': ['Qwen/Qwen2.5-7B-Instruct'], 'block_size': 128, 'gpu_memory_utilization': 0.8, 'enable_prefix_caching': True, 'max_num_batched_tokens': 40960, 'kv_events_config': KVEventsConfig(enable_kv_cache_events=True, publisher='zmq', endpoint='tcp://*:5557', replay_endpoint=None, buffer_steps=10000, hwm=100000, max_queue_size=100000, topic='kv-events')}
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_ADDR
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_PORT_HTTP_API
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_PROTO
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_HOST
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_USE_V1
(APIServer pid=1) INFO 04-13 06:22:18 [model.py:533] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1) WARNING 04-13 06:22:18 [model.py:1920] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 04-13 06:22:18 [model.py:1582] Using max model len 4096
(APIServer pid=1) INFO 04-13 06:22:18 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=40960.
(APIServer pid=1) INFO 04-13 06:22:18 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 04-13 06:22:18 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 04-13 06:22:18 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 04-13 06:22:18 [vllm.py:964] Cudagraph is disabled under eager mode
(APIServer pid=1) WARNING 04-13 06:22:26 [platform.py:749] Parameter '--disable-cascade-attn' is a GPU-specific feature. Resetting to False for Ascend.
(APIServer pid=1) WARNING 04-13 06:22:26 [platform.py:838] Ignored parameter 'disable_flashinfer_prefill'. This is a GPU-specific feature not supported on Ascend. Resetting to False.
(APIServer pid=1) INFO 04-13 06:22:26 [ascend_config.py:425] Dynamic EPLB is False
(APIServer pid=1) INFO 04-13 06:22:26 [ascend_config.py:426] The number of redundant experts is 0
(APIServer pid=1) INFO 04-13 06:22:26 [platform.py:297] Compilation disabled, using eager mode by default
(APIServer pid=1) INFO 04-13 06:22:26 [platform.py:502] Set PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
(APIServer pid=1) INFO 04-13 06:22:26 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
INFO 04-13 06:22:53 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:22:53 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:22:53 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:22:53 [__init__.py:239] Platform plugin ascend is activated
(EngineCore pid=35) INFO 04-13 06:23:03 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.netloader.netloader.ModelNetLoaderElastic'>` with load format `netloader`
(EngineCore pid=35) INFO 04-13 06:23:03 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.rfork.rfork_loader.RForkModelLoader'>` with load format `rfork`
(EngineCore pid=35) INFO 04-13 06:23:03 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'vllm_ascend.compilation.compiler_interface.AscendCompiler', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [40960], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=35) WARNING 04-13 06:23:08 [camem.py:66] Failed to import vllm_ascend_C:/vllm-workspace/vllm-ascend/vllm_ascend/vllm_ascend_C.cpython-311-aarch64-linux-gnu.so: undefined symbol: _ZN9pp_matmul17GetPpMatmulTilingERKNS_10MatMulInfoERKNS_12HardwareInfoERjRNS_18PpMatmulTilingDataE. Sleep mode will be disabled. 
(EngineCore pid=35) INFO 04-13 06:23:08 [ascend_config.py:425] Dynamic EPLB is False
(EngineCore pid=35) INFO 04-13 06:23:08 [ascend_config.py:426] The number of redundant experts is 0
INFO 04-13 06:23:22 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:23:22 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:23:22 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:23:22 [__init__.py:239] Platform plugin ascend is activated
....(EngineCore pid=35) INFO 04-13 06:24:35 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.137.164:42089 backend=hccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=35) INFO 04-13 06:24:36 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=35) WARNING 04-13 06:24:36 [worker.py:306] Bind cpus failed in rank0: Can not get running npu info. Skip binding cpu.
(EngineCore pid=35) INFO 04-13 06:24:37 [model_runner_v1.py:2562] Starting to load model Qwen/Qwen2.5-7B-Instruct...
(EngineCore pid=35) INFO 04-13 06:24:56 [weight_utils.py:574] Time spent downloading weights for Qwen/Qwen2.5-7B-Instruct: 4.228922 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:10<00:30, 10.12s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:20<00:20, 10.32s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:30<00:10, 10.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:40<00:00,  9.99s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:40<00:00, 10.08s/it]
(EngineCore pid=35) 
(EngineCore pid=35) INFO 04-13 06:25:46 [default_loader.py:384] Loading weights took 40.53 seconds
(EngineCore pid=35) INFO 04-13 06:25:48 [model_runner_v1.py:2589] Loading model weights took 16.2391 GB
.(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] EngineCore failed to start.
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     super().__init__(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     self.model_runner.profile_run()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     super().profile_run()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return super()._dummy_run(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     outputs = self._model_forward(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 583, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 439, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self.forward(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 444, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 311, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states = self.mlp(hidden_states)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                     ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 114, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     gate_up, _ = self.gate_up_proj(x)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                  ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/linear.py", line 215, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return super().forward(input_)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 582, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 228, in apply
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return dispatch_unquantized_gemm()(layer, x, layer.weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 55, in default_unquantized_gemm
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return torch.ops.vllm.unquantized_gemm(x, weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return super().__torch_function__(func, types, args, kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 27, in unquantized_gemm
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return torch.nn.functional.linear(x, weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] RuntimeError: NPU out of memory. Tried to allocate 2.89 GiB (NPU 0; 21.02 GiB total capacity; 17.36 GiB already allocated; 17.36 GiB current active; 2.15 GiB free; 17.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
(EngineCore pid=35) Process EngineCore:
(EngineCore pid=35) Traceback (most recent call last):
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=35)     self.run()
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore pid=35)     self._target(*self._args, **self._kwargs)
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=35)     raise e
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35)     super().__init__(
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35)     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35)     self.model_runner.profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35)     super().profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35)     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35)                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35)     return super()._dummy_run(
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35)     outputs = self._model_forward(
(EngineCore pid=35)               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 583, in forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 439, in __call__
(EngineCore pid=35)     return self.forward(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 444, in forward
(EngineCore pid=35)     hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore pid=35)                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 311, in forward
(EngineCore pid=35)     hidden_states = self.mlp(hidden_states)
(EngineCore pid=35)                     ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 114, in forward
(EngineCore pid=35)     gate_up, _ = self.gate_up_proj(x)
(EngineCore pid=35)                  ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/linear.py", line 215, in forward
(EngineCore pid=35)     return super().forward(input_)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 582, in forward
(EngineCore pid=35)     output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore pid=35)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 228, in apply
(EngineCore pid=35)     return dispatch_unquantized_gemm()(layer, x, layer.weight, bias)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 55, in default_unquantized_gemm
(EngineCore pid=35)     return torch.ops.vllm.unquantized_gemm(x, weight, bias)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore pid=35)     return super().__torch_function__(func, types, args, kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 27, in unquantized_gemm
(EngineCore pid=35)     return torch.nn.functional.linear(x, weight, bias)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) RuntimeError: NPU out of memory. Tried to allocate 2.89 GiB (NPU 0; 21.02 GiB total capacity; 17.36 GiB already allocated; 17.36 GiB current active; 2.15 GiB free; 17.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1)     return runner.run(wrapper())
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-13-06:26:05 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
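
The numbers in the OOM line are easier to read once they are pulled apart. The following is a minimal sketch that parses the message copied from the traceback above; the helper name `parse_oom` is made up for illustration. It shows that the request exceeds the free memory, and that reserved and allocated memory are almost equal, so the `max_split_size_mb` hint about fragmentation probably does not apply here.

```python
import re

# The OOM line copied from the EngineCore traceback (trailing hint omitted).
OOM_LINE = (
    "RuntimeError: NPU out of memory. Tried to allocate 2.89 GiB "
    "(NPU 0; 21.02 GiB total capacity; 17.36 GiB already allocated; "
    "17.36 GiB current active; 2.15 GiB free; "
    "17.38 GiB reserved in total by PyTorch)"
)


def parse_oom(line: str) -> dict[str, float]:
    """Extract the GiB figures from a PyTorch-style OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {key: float(re.search(rx, line).group(1)) for key, rx in patterns.items()}


stats = parse_oom(OOM_LINE)
# How much more memory the failed allocation needed than was free.
shortfall = stats["requested"] - stats["free"]
# Memory held by the caching allocator but not handed out to tensors.
slack = stats["reserved"] - stats["allocated"]
print(f"shortfall: {shortfall:.2f} GiB")                # shortfall: 0.74 GiB
print(f"reserved but unallocated: {slack:.2f} GiB")     # reserved but unallocated: 0.02 GiB
```

With only about 0.02 GiB of slack, the card looks genuinely full rather than fragmented, which points toward a smaller model or more cards.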

According to the official documentation, a single 310P card seems able to run only a 0.6B model:

Run the following script to start the vLLM server on NPU (Qwen3-0.6B:1 card, Qwen2.5-7B-Instruct:2 cards, Pangu-Pro-MoE-72B: 8 cards)

https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu

It also needs a few extra parameters:

vllm serve Qwen/Qwen3-0.6B \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --enforce-eager \
    --dtype float16
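
A back-of-the-envelope estimate helps explain the card counts quoted from the documentation. The sketch below assumes two bytes per parameter (matching `--dtype float16`) and uses nominal parameter counts read off the model names, which is an approximation; the 21.02 GiB capacity is the figure reported in the OOM message. It covers weights only, so KV cache, activations and runtime overhead come on top.

```python
GIB = 1024 ** 3
BYTES_PER_PARAM_FP16 = 2      # matches --dtype float16
CARD_CAPACITY_GIB = 21.02     # "total capacity" reported in the OOM message

# Assumption: nominal parameter counts taken from the model names,
# not exact figures from the model cards.
NOMINAL_PARAMS = {
    "Qwen3-0.6B": 0.6e9,
    "Qwen2.5-7B-Instruct": 7e9,
}


def fp16_weight_gib(num_params: float) -> float:
    """Approximate size of the FP16 weights in GiB."""
    return num_params * BYTES_PER_PARAM_FP16 / GIB


for name, params in NOMINAL_PARAMS.items():
    weights = fp16_weight_gib(params)
    share = weights / CARD_CAPACITY_GIB
    print(f"{name}: ~{weights:.1f} GiB of weights, {share:.0%} of one card")
```

Under these assumptions the 0.6B model needs roughly 1.1 GiB for weights, while a 7B model needs roughly 13 GiB, well over half of one card before any KV cache is allocated.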

Helm upgrade errors

Every helm upgrade has to go through the following rounds before it succeeds:

[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 
Error: UPGRADE FAILED: cannot patch "vllm-pd-2p1d-01" with kind Deployment: Deployment.apps "vllm-pd-2p1d-01" is invalid: spec.selector: Invalid value: {"matchLabels":{"app.kubernetes.io/instance":"infernex-vllm-pd-2p1d-01","app.kubernetes.io/name":"inference-backend","openfuyao.com/dpSize":"1","openfuyao.com/engine":"vllm","openfuyao.com/model":"qwen-qwen3-0.6b","openfuyao.com/pdRole":"aggregate","openfuyao.com/tpSize":"2"}}: field is immutable
[root@master1 fuyao-26.3-rc3]# kubectl -n ai-inference delete deployments.apps vllm-pd-2p1d-01 
deployment.apps "vllm-pd-2p1d-01" deleted from ai-inference namespace
[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 
Error: UPGRADE FAILED: post-upgrade hooks failed: warning: Hook post-upgrade infernex/charts/pd-orchestrator/charts/resourcescalinggroup/templates/webhook-wait-hook.yaml failed: 1 error occurred:
        * jobs.batch "infernex-resourcescalinggroup-wait-webhook" is forbidden: unable to create new content in namespace scaling-system because it is being terminated


[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 
Error: UPGRADE FAILED: failed to create resource: namespaces "scaling-system" not found
[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 


Release "infernex" has been upgraded. Happy Helming!
NAME: infernex
LAST DEPLOYED: Mon Apr 13 14:33:47 2026
NAMESPACE: ai-inference
STATUS: deployed
REVISION: 27
TEST SUITE: None
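
The manual rounds above can be scripted. This is a rough sketch rather than a fix for the chart: it repeats the same `helm upgrade` command, deletes the Deployment named in the "field is immutable" error, pauses while a namespace is terminating, and otherwise just retries. The function names, the round limit and the pause length are assumptions made for illustration. Note that deleting the Deployment interrupts the running pods. The selector in the error includes labels such as `openfuyao.com/tpSize`, so changing those values appears to be what triggers the immutable-selector failure.

```python
import re
import subprocess
import time
from typing import Callable

HELM_CMD = ["helm", "upgrade", "--install", "-n", "ai-inference",
            "infernex", "./infernex", "-f", "infernex.values"]

Runner = Callable[[list[str]], tuple[int, str]]


def shell_runner(cmd: list[str]) -> tuple[int, str]:
    """Run a command and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def upgrade_with_retries(run: Runner = shell_runner, max_rounds: int = 5,
                         pause_s: float = 10.0) -> int:
    """Repeat `helm upgrade`, reacting to the errors seen in the transcript.

    Returns the number of rounds used; raises RuntimeError when giving up.
    """
    for round_no in range(1, max_rounds + 1):
        code, output = run(HELM_CMD)
        if code == 0:
            return round_no
        immutable = re.search(
            r'cannot patch "([^"]+)" with kind Deployment.*field is immutable',
            output, re.DOTALL)
        if immutable:
            # The selector cannot be patched in place, so delete the
            # Deployment and let the next round recreate it.
            run(["kubectl", "-n", "ai-inference", "delete", "deployment",
                 immutable.group(1), "--ignore-not-found"])
        elif "being terminated" in output:
            # A namespace is still terminating; give it time to disappear.
            time.sleep(pause_s)
        # Any other failure (for example 'namespaces "scaling-system" not
        # found') is simply retried, as in the manual session.
    raise RuntimeError(f"helm upgrade still failing after {max_rounds} rounds")
```

Passing the runner in as a parameter keeps the retry logic testable without a cluster; calling `upgrade_with_retries()` with no arguments runs the real commands.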

Istio HTTPRoute error

[root@master1 fuyao-26.3-rc3]# kubectl  -n ai-inference describe httproutes.gateway.networking.k8s.io qwen-qwen3-0.6b-httproute 
Name:         qwen-qwen3-0.6b-httproute
Namespace:    ai-inference
Labels:       app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=infernex-epp
              app.kubernetes.io/version=0.21.0
Annotations:  meta.helm.sh/release-name: infernex
              meta.helm.sh/release-namespace: ai-inference
API Version:  gateway.networking.k8s.io/v1
Kind:         HTTPRoute
Metadata:
  Creation Timestamp:  2026-04-13T06:31:19Z
  Generation:          1
  Resource Version:    3664436
  UID:                 21dca6e8-483a-4b03-8a78-82559c45a7e3
Spec:
  Parent Refs:
    Group:  gateway.networking.k8s.io
    Kind:   Gateway
    Name:   inference-gateway
  Rules:
    Backend Refs:
      Group:   inference.networking.k8s.io
      Kind:    InferencePool
      Name:    qwen-qwen3-0.6b
      Weight:  1
    Matches:
      Path:
        Type:   PathPrefix
        Value:  /
    Timeouts:
      Request:  300s
Status:
  Parents:
    Conditions:
      Last Transition Time:  2026-04-13T06:31:19Z
      Message:               Route was valid
      Observed Generation:   1
      Reason:                Accepted
      Status:                True
      Type:                  Accepted
      Last Transition Time:  2026-04-13T06:31:19Z
      Message:               InferencePool.Name invalid; the name of the InferencePool must be used, not the hostname.
      Observed Generation:   1
      Reason:                InvalidDestination
      Status:                False
      Type:                  ResolvedRefs
    Controller Name:         istio.io/gateway-controller
    Parent Ref:
      Group:  gateway.networking.k8s.io
      Kind:   Gateway
      Name:   inference-gateway
Events:       <none>
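
The `describe` output above is long, but the only failing part is the `ResolvedRefs` condition. Below is a small sketch that picks out every condition whose status is not `True`. The JSON is a trimmed stand-in for what `kubectl -n ai-inference get httproute qwen-qwen3-0.6b-httproute -o json` would return, filled in with the values shown above, and the helper name `failing_conditions` is made up for illustration.

```python
import json

# Trimmed stand-in for the HTTPRoute status, using the values from the
# describe output. Only the fields the helper reads are included.
ROUTE_JSON = """
{
  "status": {
    "parents": [
      {
        "controllerName": "istio.io/gateway-controller",
        "parentRef": {"name": "inference-gateway"},
        "conditions": [
          {"type": "Accepted", "status": "True", "reason": "Accepted",
           "message": "Route was valid"},
          {"type": "ResolvedRefs", "status": "False",
           "reason": "InvalidDestination",
           "message": "InferencePool.Name invalid; the name of the InferencePool must be used, not the hostname."}
        ]
      }
    ]
  }
}
"""


def failing_conditions(route: dict) -> list[tuple[str, str, str, str]]:
    """Return (gateway, type, reason, message) for every condition not True."""
    failures = []
    for parent in route.get("status", {}).get("parents", []):
        gateway = parent.get("parentRef", {}).get("name", "?")
        for cond in parent.get("conditions", []):
            if cond.get("status") != "True":
                failures.append(
                    (gateway, cond["type"], cond["reason"], cond["message"]))
    return failures


for gateway, ctype, reason, message in failing_conditions(json.loads(ROUTE_JSON)):
    print(f"{gateway}: {ctype}={reason}: {message}")
```

The route is accepted by the gateway but its backend reference does not resolve. One unverified guess is that the dot in the pool name `qwen-qwen3-0.6b` makes Istio treat it as a hostname; nothing in this post confirms that.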

Summary

  • The Helm chart requests two 310P cards by default; this has to be changed by hand
resources:
  limits:
    cpu: "8"
    huawei.com/Ascend310P: "1"
    memory: 64Gi
  requests:
    cpu: "4"
    huawei.com/Ascend310P: "1"
    memory: 32Gi
  • Hugging Face downloads need a mirror inside China configured by hand
  - name: HF_ENDPOINT
    value: https://hf-mirror.com
  • The vLLM startup parameters need to be adjusted by hand
vllm serve Qwen/Qwen3-0.6B \
  --served-model-name Qwen/Qwen3-0.6B \
  --trust-remote-code \
  --enable-prefix-caching \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --enforce-eager \
  --dtype float16 \
  --max-num-batched-tokens 40960 \
  --data-parallel-size 1 \
  --gpu-memory-utilization 0.8 \
  --block-size 128 \
  --kv-events-config '{"enable_kv_cache_events": true, "publisher":"zmq", "topic":"kv-events"}'
  • The default Service is of type ClusterIP and is exposed through an Istio Gateway, which is not working properly yet
  • The vLLM pod log is flooded with the following line:
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
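
Until the source of the `/metrics` polling is tracked down, the noise can at least be filtered out when reading logs. A minimal sketch follows, with a made-up helper name; it drops only successful `GET /metrics` access-log lines and keeps everything else.

```python
from typing import Iterable, Iterator


def drop_metrics_polls(lines: Iterable[str]) -> Iterator[str]:
    """Yield log lines, skipping successful access-log entries for GET /metrics."""
    for line in lines:
        if '"GET /metrics HTTP/1.1" 200' in line:
            continue
        yield line


# Two lines taken from the logs in this post: one poll, one real error.
SAMPLE = [
    '(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK\n',
    '(APIServer pid=1) RuntimeError: Engine core initialization failed. '
    'See root cause above. Failed core proc(s): {}\n',
]

for kept in drop_metrics_polls(SAMPLE):
    print(kept, end="")
```

To filter a live stream, the same function can be given `sys.stdin` as its input and fed from `kubectl logs -f` through a pipe.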

Refs