Using Kubernetes
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
Alternatively, you can also deploy vLLM to Kubernetes using any of the following projects:
- Helm
- InftyAI/llmaz
- KServe
- kubernetes-sigs/lws
- meta-llama/llama-stack
- substratusai/kubeai
- vllm-project/aibrix
- vllm-project/production-stack
Deployment with CPU
Note: The use of CPU here is for demonstration and testing purposes only; its performance is not comparable to GPU.
First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:
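The Secret manifest below reads the token from an HF_TOKEN shell variable and places it under `data:`, so Kubernetes expects a base64-encoded value. A minimal sketch of preparing that variable before applying the manifest (the token string shown is a placeholder, not a real credential):

```bash
# Base64-encode a Hugging Face access token so it can be placed under `data:`.
# Replace the placeholder below with your own token.
export HF_TOKEN=$(echo -n "hf_xxxxxxxxxxxxxxxx" | base64)
```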
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
data:
  token: ${HF_TOKEN}
EOF
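Before moving on, it can help to confirm that both objects were created (the names below match the manifest above):

```bash
# The PVC may remain Pending until a pod first mounts it, depending on the StorageClass.
kubectl get pvc vllm-models
kubectl get secret hf-token-secret
```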
Next, start the vLLM server as a Kubernetes Deployment and Service:
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.2-1B-Instruct"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: llama-storage
          mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
EOF
We can verify from the logs that the vLLM server started successfully (downloading the model may take a few minutes):
kubectl logs -l app.kubernetes.io/name=vllm
...
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
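Once the server is up, a quick way to exercise the OpenAI-compatible API from outside the cluster is to port-forward the Service created above; a sketch assuming the vllm-server Service and the Llama-3.2-1B-Instruct model from the manifest:

```bash
# Forward local port 8000 to the vllm-server Service, then issue a test completion.
kubectl port-forward svc/vllm-server 8000:8000 &
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```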
Deployment with GPU
**Pre-requisite:** Ensure that you have a running Kubernetes environment with GPU support.
- Create a PVC, Secret and Deployment for vLLM
The PVC is used to store the model cache and is optional; you can use hostPath or other storage options instead (see the sketch after the PVC manifest below).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-7b
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: default
  volumeMode: Filesystem
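As noted above, a hostPath volume can stand in for the PVC on single-node or test clusters. A minimal sketch of the replacement volume entry for the Deployment further below (the path is an example assumption, and the cache then lives only on that node):

```yaml
# Replace the `cache-volume` PVC reference in the Deployment's `volumes:` section with:
- name: cache-volume
  hostPath:
    path: /data/huggingface-cache   # example path on the node; adjust as needed
    type: DirectoryOrCreate
```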
The Secret is only required for accessing gated models; skip this step if you are not using gated models.
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
  namespace: default
type: Opaque
stringData:
  token: "REPLACE_WITH_TOKEN"
Next, create a Deployment for vLLM to run the model server. The following example deploys the Mistral-7B-Instruct-v0.3 model.
Here are two examples, one using NVIDIA GPU and one using AMD GPU.
NVIDIA GPU:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: mistral-7b
      # vLLM needs to access the host's shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: mistral-7b
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 20G
            nvidia.com/gpu: "1"
          requests:
            cpu: "2"
            memory: 6G
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 5
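After applying the Deployment, it can be useful to confirm that the pod was scheduled onto a GPU node and has finished pulling the image and the model. A sketch using standard kubectl commands and the labels from the manifest above:

```bash
# Wait for the rollout to complete, then check pod placement and server logs.
kubectl rollout status deployment/mistral-7b -n default
kubectl get pods -n default -l app=mistral-7b -o wide
kubectl logs -n default -l app=mistral-7b --tail=20
```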
AMD GPU:
If you are using AMD ROCm GPUs such as MI300X, you can refer to the deployment.yaml below.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
      # PVC
      - name: cache-volume
        persistentVolumeClaim:
          claimName: mistral-7b
      # vLLM needs to access the host's shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "8Gi"
      hostNetwork: true
      hostIPC: true
      containers:
      - name: mistral-7b
        image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
        securityContext:
          seccompProfile:
            type: Unconfined
          runAsGroup: 44
          capabilities:
            add:
            - SYS_PTRACE
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 20G
            amd.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 6G
            amd.com/gpu: "1"
        volumeMounts:
        - name: cache-volume
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm
You can find the full example with steps and sample YAML files at ROCm/k8s-device-plugin.
- Create a Kubernetes Service for vLLM
Next, create a Kubernetes Service to expose the mistral-7b Deployment:
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
  namespace: default
spec:
  ports:
  - name: http-mistral-7b
    port: 80
    protocol: TCP
    targetPort: 8000
  # The label selector should match the deployment labels & it is useful for prefix caching feature
  selector:
    app: mistral-7b
  sessionAffinity: None
  type: ClusterIP
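A quick way to check that the Service's selector actually matches the Deployment's pods is to look at its endpoints; an empty endpoint list usually indicates a label mismatch:

```bash
# The ENDPOINTS column should list the pod IP(s) backing the Service.
kubectl get service mistral-7b -n default
kubectl get endpoints mistral-7b -n default
```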
- Deploy and Test
Apply the Deployment and Service configurations using kubectl apply:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
Then run the following curl command to test whether the deployment succeeded:
curl http://mistral-7b.default.svc.cluster.local/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
If you receive a valid response, vLLM has been deployed successfully on Kubernetes.
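Note that the URL above uses the cluster-internal DNS name, so the curl command must run from inside the cluster (for example, from another pod). To test from your workstation instead, one option is to port-forward the Service; a sketch assuming the Service definition above (port 80 mapped to targetPort 8000):

```bash
# Forward local port 8080 to the Service's port 80, then query the API locally.
kubectl port-forward svc/mistral-7b 8080:80 -n default &
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'
```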
Conclusion
Deploying vLLM with Kubernetes allows for efficient scaling and management of machine learning models that leverage GPU resources. By following the steps above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster.
If you encounter any issues or have suggestions for improvement, feel free to contribute to the documentation.