Version: Next

Helm

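A Helm chart for deploying vLLM on Kubernetes.

Helm is a package manager for Kubernetes. It helps you deploy vLLM on k8s and automates the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture to multiple namespaces with different configurations by overriding variable values.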
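The per-namespace overrides described above can be sketched as follows. This is only an illustration under stated assumptions: the namespaces ns-vllm-dev and ns-vllm-prod and the replica counts are hypothetical, and the function prints the helm commands instead of running them. replicaCount is one of the chart's documented values.

```shell
# Print one install command per namespace, overriding replicaCount each time.
# The namespaces and replica counts are hypothetical examples.
print_install_commands() {
  for pair in "ns-vllm-dev:1" "ns-vllm-prod:3"; do
    namespace=${pair%%:*}   # text before the colon
    replicas=${pair##*:}    # text after the colon
    printf '%s\n' "helm upgrade --install --create-namespace --namespace=$namespace test-vllm . -f values.yaml --set replicaCount=$replicas"
  done
}

print_install_commands
```

Printing the commands first makes it easy to review what would change in each namespace before running anything against a cluster.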

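This guide walks you through deploying vLLM with Helm, including the necessary prerequisites, the steps for installing the Helm chart, and documentation on the architecture and the values.yaml file.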

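Prerequisites

Before you begin, make sure you have the following:

A running Kubernetes cluster
NVIDIA Kubernetes Device Plugin (k8s-device-plugin): available at NVIDIA/k8s-device-plugin
Available GPU resources in your cluster
S3 storage containing the model to deploy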
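A quick way to check some of these prerequisites from a workstation might look like the sketch below. It assumes kubectl and helm are the client tools in use and that the device plugin advertises GPUs under the nvidia.com/gpu resource name; the have_tool helper is a hypothetical name introduced here for illustration.

```shell
# Return success when a command is available on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

for tool in kubectl helm; do
  if have_tool "$tool"; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done

# With cluster access, list the GPUs each node reports as allocatable.
# Nodes that expose no nvidia.com/gpu resource show <none> in the GPU column.
if have_tool kubectl; then
  kubectl get nodes --request-timeout=5s \
    '-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
    || echo "could not query the cluster" >&2
fi
```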

Installing the chart

To install the chart with the release name test-vllm, run:

helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
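The install command expands four shell variables, and an unset one would silently pass an empty secret to the chart. A minimal guard, assuming a POSIX shell, could look like this; require_vars is a hypothetical helper, not part of Helm or the chart.

```shell
# Fail early if any variable named in the arguments is unset or empty.
require_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# These are the four variables the install command expands.
if require_vars ACCESS_POINT BUCKET ACCESS_KEY SECRET_KEY; then
  echo "all S3 variables are set"
else
  echo "export the missing variables before running helm" >&2
fi
```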

Uninstalling the chart

To uninstall the test-vllm deployment, run:

helm uninstall test-vllm --namespace=ns-vllm

The command removes all the Kubernetes components associated with the chart, including persistent volumes, and deletes the release.

Architecture

Values

Key	Type	Default	Description
autoscaling	object	{"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}	Autoscaling configuration
autoscaling.enabled	bool	false	Enable autoscaling
autoscaling.maxReplicas	int	100	Maximum replicas
autoscaling.minReplicas	int	1	Minimum replicas
autoscaling.targetCPUUtilizationPercentage	int	80	Target CPU utilization for autoscaling
configs	object	{}	Configmap
containerPort	int	8000	Container port
customObjects	list	[]	Custom Objects configuration
deploymentStrategy	object	{}	Deployment strategy configuration
externalConfigs	list	[]	External configuration
extraContainers	list	[]	Additional containers configuration
extraInit	object	{"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m","awsEc2MetadataDisabled":true}	Additional configuration for the init container
extraInit.pvcStorage	string	"50Gi"	Storage size of the S3
extraInit.s3modelpath	string	"relative_s3_model_path/opt-125m"	Path of the model on the S3 that hosts model weights and config files
extraInit.awsEc2MetadataDisabled	bool	true	Disables the use of the Amazon EC2 instance metadata service
extraPorts	list	[]	Additional ports configuration
gpuModels	list	["TYPE_GPU_USED"]	Type of GPU used
image	object	{"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}	Image configuration
image.command	list	["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]	Container launch command
image.repository	string	"vllm/vllm-openai"	Image repository
image.tag	string	"latest"	Image tag
livenessProbe	object	{"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}	Liveness probe configuration
livenessProbe.failureThreshold	int	3	Number of consecutive probe failures after which Kubernetes considers the overall check failed: the container is not alive
livenessProbe.httpGet	object	{"path":"/health","port":8000}	Configuration of the kubelet HTTP request on the server
livenessProbe.httpGet.path	string	"/health"	Path to access on the HTTP server
livenessProbe.httpGet.port	int	8000	Name or number of the port to access on the container, on which the server is listening
livenessProbe.initialDelaySeconds	int	15	Number of seconds after the container has started before liveness probe is initiated
livenessProbe.periodSeconds	int	10	How often (in seconds) to perform the liveness probe
maxUnavailablePodDisruptionBudget	string	""	Disruption budget configuration
readinessProbe	object	{"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}	Readiness probe configuration
readinessProbe.failureThreshold	int	3	Number of consecutive probe failures after which Kubernetes considers the overall check failed: the container is not ready
readinessProbe.httpGet	object	{"path":"/health","port":8000}	Configuration of the kubelet HTTP request on the server
readinessProbe.httpGet.path	string	"/health"	Path to access on the HTTP server
readinessProbe.httpGet.port	int	8000	Name or number of the port to access on the container, on which the server is listening
readinessProbe.initialDelaySeconds	int	5	Number of seconds after the container has started before readiness probe is initiated
readinessProbe.periodSeconds	int	5	How often (in seconds) to perform the readiness probe
replicaCount	int	1	Number of replicas
resources	object	{"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}	Resource configuration
resources.limits."nvidia.com/gpu"	int	1	Number of GPUs used
resources.limits.cpu	int	4	Number of CPUs
resources.limits.memory	string	"16Gi"	CPU memory configuration
resources.requests."nvidia.com/gpu"	int	1	Number of GPUs used
resources.requests.cpu	int	4	Number of CPUs
resources.requests.memory	string	"16Gi"	CPU memory configuration
secrets	object	{}	Secrets configuration
serviceName	string		Service name
servicePort	int	80	Service port
labels.environment	string	test	Environment name
labels.release	string	test	Release name
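To show how the keys in the table above map onto a values file, the sketch below writes a small override file. The keys come from the table; the chosen values, the file name values-override.yaml, and the write_override helper are illustrative assumptions.

```shell
# Write a small override file to the path given as the first argument.
# Keys are taken from the values table; the values are examples only.
write_override() {
  cat > "$1" <<'EOF'
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 4
  targetCPUUtilizationPercentage: 80
extraInit:
  s3modelpath: "relative_s3_model_path/opt-125m"
  pvcStorage: "50Gi"
EOF
}

write_override values-override.yaml
cat values-override.yaml
```

Such a file can be passed to helm with an additional -f flag after values.yaml; when several -f files are given, values in later files take precedence over earlier ones.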
