
Helm


A Helm chart for deploying vLLM on Kubernetes.

Helm is the package manager for Kubernetes. It helps you deploy vLLM on k8s and automates the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture to multiple namespaces with different configurations by overriding variable values.

This guide walks you through deploying vLLM with Helm, including the necessary prerequisites, the steps to install the chart, and documentation of the architecture and the values.yaml configuration file.

Prerequisites

Before you begin, ensure that you have the following:

  • A running Kubernetes cluster

  • The NVIDIA Kubernetes device plugin (k8s-device-plugin), available at NVIDIA/k8s-device-plugin

  • GPU resources available in your cluster

  • S3 storage containing the model to be deployed

Installing the Helm Chart

To install the chart with the release name test-vllm, run:

helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . \
  -f values.yaml \
  --set secrets.s3endpoint=$ACCESS_POINT \
  --set secrets.s3bucketname=$BUCKET \
  --set secrets.s3accesskeyid=$ACCESS_KEY \
  --set secrets.s3accesskey=$SECRET_KEY
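The same overrides passed via --set above can instead live in a values file. A minimal sketch, using the secrets.* key names from the command (the values shown are illustrative placeholders, not real credentials):

```yaml
# Hypothetical overrides file, e.g. my-values.yaml; pass it with
#   helm upgrade ... -f values.yaml -f my-values.yaml
# Key names mirror the --set flags used in the install command.
secrets:
  s3endpoint: "https://s3.example.com"   # illustrative endpoint
  s3bucketname: "my-model-bucket"        # illustrative bucket name
  s3accesskeyid: "AKIA..."               # placeholder; keep out of version control
  s3accesskey: "..."                     # placeholder; keep out of version control
```

Note that credentials committed to a values file end up in the release manifest, so for real deployments the --set form or an external secrets manager is usually safer.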

Uninstalling the Helm Chart

To uninstall the test-vllm deployment, run:

helm uninstall test-vllm --namespace=ns-vllm

This command removes all Kubernetes components associated with the chart (including persistent volumes) and deletes the release.

Architecture

(Architecture diagram)

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| autoscaling | object | {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} | Autoscaling configuration |
| autoscaling.enabled | bool | false | Enable autoscaling |
| autoscaling.maxReplicas | int | 100 | Maximum replicas |
| autoscaling.minReplicas | int | 1 | Minimum replicas |
| autoscaling.targetCPUUtilizationPercentage | int | 80 | Target CPU utilization for autoscaling |
| configs | object | {} | Configmap |
| containerPort | int | 8000 | Container port |
| customObjects | list | [] | Custom Objects configuration |
| deploymentStrategy | object | {} | Deployment strategy configuration |
| externalConfigs | list | [] | External configuration |
| extraContainers | list | [] | Additional containers configuration |
| extraInit | object | {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m","awsEc2MetadataDisabled":true} | Additional configuration for the init container |
| extraInit.pvcStorage | string | "50Gi" | Storage size of the s3 |
| extraInit.s3modelpath | string | "relative_s3_model_path/opt-125m" | Path of the model on the s3 which hosts model weights and config files |
| extraInit.awsEc2MetadataDisabled | boolean | true | Disables the use of the Amazon EC2 instance metadata service |
| extraPorts | list | [] | Additional ports configuration |
| gpuModels | list | ["TYPE_GPU_USED"] | Type of gpu used |
| image | object | {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"} | Image configuration |
| image.command | list | ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"] | Container launch command |
| image.repository | string | "vllm/vllm-openai" | Image repository |
| image.tag | string | "latest" | Image tag |
| livenessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10} | Liveness probe configuration |
| livenessProbe.failureThreshold | int | 3 | Number of consecutive probe failures after which Kubernetes considers the overall check failed: the container is not alive |
| livenessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the kubelet HTTP request to the server |
| livenessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
| livenessProbe.httpGet.port | int | 8000 | Name or number of the container port the server listens on |
| livenessProbe.initialDelaySeconds | int | 15 | Number of seconds after the container has started before the liveness probe is initiated |
| livenessProbe.periodSeconds | int | 10 | How often (in seconds) to perform the liveness probe |
| maxUnavailablePodDisruptionBudget | string | "" | Disruption Budget configuration |
| readinessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5} | Readiness probe configuration |
| readinessProbe.failureThreshold | int | 3 | Number of consecutive probe failures after which Kubernetes considers the overall check failed: the container is not ready |
| readinessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the kubelet HTTP request to the server |
| readinessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
| readinessProbe.httpGet.port | int | 8000 | Name or number of the container port the server listens on |
| readinessProbe.initialDelaySeconds | int | 5 | Number of seconds after the container has started before the readiness probe is initiated |
| readinessProbe.periodSeconds | int | 5 | How often (in seconds) to perform the readiness probe |
| replicaCount | int | 1 | Number of replicas |
| resources | object | {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}} | Resource configuration |
| resources.limits."nvidia.com/gpu" | int | 1 | Number of GPUs used |
| resources.limits.cpu | int | 4 | Number of CPUs |
| resources.limits.memory | string | "16Gi" | CPU memory configuration |
| resources.requests."nvidia.com/gpu" | int | 1 | Number of GPUs used |
| resources.requests.cpu | int | 4 | Number of CPUs |
| resources.requests.memory | string | "16Gi" | CPU memory configuration |
| secrets | object | {} | Secrets configuration |
| serviceName | string | | Service name |
| servicePort | int | 80 | Service port |
| labels.environment | string | test | Environment name |
| labels.release | string | test | |
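As an example of overriding the values above, a fragment that enables autoscaling and pins a GPU type might look like the following. This is a sketch built only from the keys in the table; the GPU model string is an illustrative stand-in for TYPE_GPU_USED, and the replica bounds are arbitrary choices for a small cluster:

```yaml
# Sketch of a values override using keys from the table above.
replicaCount: 1
autoscaling:
  enabled: true                     # table default is false
  minReplicas: 1
  maxReplicas: 4                    # lowered from the default of 100
  targetCPUUtilizationPercentage: 80
gpuModels:
  - "NVIDIA-A100-SXM4-80GB"         # illustrative value for TYPE_GPU_USED
resources:
  limits:
    nvidia.com/gpu: 1               # one GPU per replica, as in the default
```

Applying it with helm upgrade --install ... -f values.yaml -f overrides.yaml lets the base values.yaml stay untouched while per-environment files carry only the differences.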