
Serving TensorFlow Model Predictions with Helm on Alibaba Cloud Kubernetes 1.9

阿里云云栖号 (Alibaba Cloud Yunqi)


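Abstract: TensorFlow Serving is a machine learning model serving system open-sourced by Google that simplifies and speeds up the path from a trained model to a production application. It is itself an online service, so we have to think about installation and configuration at deployment time, and about load balancing, elastic scaling, high availability, and rolling upgrades at run time. Fortunately, this is exactly what Kubernetes is good at.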

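TensorFlow Serving is a machine learning model serving system open-sourced by Google that simplifies and speeds up the path from a trained model to a production application. It deploys trained models online and exposes them to external callers through a gRPC interface. Even better, it supports model updates without downtime as well as model version management. This greatly reduces the complexity of managing models in production, so model providers can concentrate on optimizing the models themselves.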

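TensorFlow Serving is essentially an online service, so we have to think about installation and configuration at deployment time, and about load balancing, elastic scaling, high availability, and rolling upgrades at run time. Fortunately, this is exactly what Kubernetes is good at. Relying on the automation built into Kubernetes greatly lowers the cost of operating a TensorFlow Serving application.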

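This article shows how to use Helm, the official Kubernetes package manager, on Alibaba Cloud Container Service to prepare a model, deploy TensorFlow Serving, and scale it out manually.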

1. Prepare the model

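Because TensorFlow Serving loads the prediction model from persistent storage, that storage has to be prepared first. In Alibaba Cloud Container Service you can choose NAS, OSS, or cloud disks; see the document on storage management for Alibaba Cloud Kubernetes (阿里云Kubernetes的存储管理) for details. This article uses NAS storage as the example to show how to import the model data.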

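1.1 Create a NAS file system and set up a mount point inside the VPC; see the Alibaba Cloud NAS documentation. Then look up the mount point. Here we assume it is 3fcc94a4ec-rms76.cn-shanghai.nas.aliyuncs.com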

1.2 Use an Alibaba Cloud virtual machine to prepare the model data. First create the serving directory on the NAS.

mkdir /nfs
mount -t nfs -o vers=4.0 3fcc94a4ec-rms76.cn-shanghai.nas.aliyuncs.com:/ /nfs
mkdir -p /nfs/serving
umount /nfs

1.3 Download the prediction model and save it to the NAS

mkdir /serving
mount -t nfs -o vers=4.0 3fcc94a4ec-rms76.cn-shanghai.nas.aliyuncs.com:/serving /serving
mkdir -p /serving/model
cd /serving/model
curl -O <URL of mnist-export.tar.gz>
tar -xzvf mnist-export.tar.gz
rm -rf mnist-export.tar.gz
cd /

1.4 You can now inspect the contents of the prediction model directly. After checking them, unmount the mount point.

tree /serving/model/mnist
/serving/model/mnist
└── 1
    ├── saved_model.pb
    └── variables
        ├── variables.data-00000-of-00001
        └── variables.index
umount /serving

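The layout in the tree output can also be checked by a script before unmounting. The following is a minimal sketch, assuming the usual SavedModel layout in which each numeric version directory holds a saved_model.pb file and a variables directory; check_model_dir is a hypothetical helper name, not part of TensorFlow Serving.

```shell
# check_model_dir BASE_PATH
# Succeeds when BASE_PATH contains at least one numeric version directory
# and every such directory holds saved_model.pb and a variables/ directory.
# (Hypothetical helper; the layout rule is an assumption stated above.)
check_model_dir() {
    base="$1"
    found=0
    for dir in "$base"/*/; do
        [ -d "$dir" ] || continue
        version=$(basename "$dir")
        # Skip anything that is not a purely numeric version directory.
        case "$version" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ ! -f "$dir/saved_model.pb" ] || [ ! -d "$dir/variables" ]; then
            echo "version $version is incomplete" >&2
            return 1
        fi
        found=1
    done
    if [ "$found" -eq 0 ]; then
        echo "no version directory under $base" >&2
        return 1
    fi
    echo "ok: $base"
}
```

For the model in this article the call would be check_model_dir /serving/model/mnist, run while the NAS is still mounted.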
2. Create the persistent volume

2.1 The following is a sample nas.yaml for creating the NAS-backed volume

---
apiVersion: v1
kind: PersistentVolume
metadata:
  labels:
    model: mnist
  name: pv-nas
spec:
  persistentVolumeReclaimPolicy: Retain
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 5Gi
  flexVolume:
    driver: alicloud/nas
    options:
      mode: "755"
      path: /serving/model/mnist
      server: 3fcc94a4ec-rms76.cn-shanghai.nas.aliyuncs.com
      vers: "4.0"
  storageClassName: nas

Note that the label must be set to model: mnist and storageClassName must be nas. These two settings are what the PVC uses to select and bind to the PV, so both matter.

For the remaining NAS-specific settings, see the document on using Alibaba Cloud NAS with Kubernetes (Kubernetes使用阿里云NAS).
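To make the binding rule concrete, here is a sketch of a claim that would select this PV through the model label and the nas storage class. The actual PVC template inside the acs-tensorflow-serving chart is not shown in this article, so the manifest, the claim name mnist-pvc, and the helper name render_pvc are all assumptions for illustration.

```shell
# render_pvc NAME LABEL_VALUE SIZE
# Prints a PersistentVolumeClaim manifest that selects a PV by its
# "model" label and by the "nas" storage class.
# (Hypothetical helper; not the chart's own template.)
render_pvc() {
    cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  storageClassName: nas
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: $3
  selector:
    matchLabels:
      model: $2
EOF
}

# Usage (needs cluster access):
#   render_pvc mnist-pvc mnist 5Gi | kubectl create -f -
```

A claim binds only to a volume whose labels satisfy the selector and whose storage class matches, which is why a mismatch in either field leaves the claim pending.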

2.2 In the Kubernetes management console, choose Persistent Volumes

2.3 After a short wait, the persistent volume is shown as created

You can also create it with kubectl

kubectl create -f nas.yaml
persistentvolume "pv-nas" created

3. Deploy the TensorFlow Serving application with Helm

3.1 In the app catalog, click acs-tensorflow-serving

3.2 Click Parameters, adjust the configuration values, and then click Deploy

Custom configuration values with GPU support:

---
serviceType: LoadBalancer
## expose the service to the grpc client
port: 9090
replicas: 1
image: "registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/tensorflow-serving:1.4.0-devel-gpu"
imagePullPolicy: "IfNotPresent"
## the gpu resource to claim, for cpu, change it to 0
gpuCount: 1
## The command and args to run the pod
command: ["/usr/bin/tensorflow_model_server"]
args: [ "--port=9090", "--model_name=mnist", "--model_base_path=/serving/model/mnist"]
## the mount path inside the container
mountPath: /serving/model/mnist
persistence:
## The request and label to select the persistent volume
  pvc:
    storage: 5Gi
    matchLabels:
      model: mnist

Custom configuration values without GPU support:

---
serviceType: LoadBalancer
## expose the service to the grpc client
port: 9090
replicas: 1
command:
  - /usr/bin/tensorflow_model_server
args:
  - "--port=9090"
  - "--model_name=mnist"
  - "--model_base_path=/serving/model/mnist"
image: "registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/tensorflow-serving:1.4.0-devel"
imagePullPolicy: "IfNotPresent"
mountPath: /serving/model/mnist
persistence:
  mountPath: /serving/model/mnist
  pvc:
    matchLabels:
      model: mnist
    storage: 5Gi

Alternatively, log in to the Kubernetes master, save the values above as serving.yaml, and run the following command

# helm install --values serving.yaml --name mnist-deploy incubator/acs-tensorflow-serving

4. Inspect the TensorFlow Serving deployment

4.1 Log in to the Kubernetes master and list the deployed applications with helm

# helm list
NAME          REVISION  UPDATED                   STATUS    CHART                         NAMESPACE
mnist-deploy  1         Fri Mar 16 19:24:35 2018  DEPLOYED  acs-tensorflow-serving-0.1.0  default

4.2 Use helm status to check the details of the application

# helm status mnist-deploy
LAST DEPLOYED: Fri Mar 16 19:24:35 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/Service
NAME                                 TYPE          CLUSTER-IP    EXTERNAL-IP    PORT(S)         AGE
mnist-deploy-acs-tensorflow-serving  LoadBalancer  172.19.0.219  139.195.1.216  9090:32560/TCP  5h

==> v1beta1/Deployment
NAME                  DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
mnist-deploy-serving  1        1        1           1          5h

==> v1/Pod(related)
NAME                                   READY  STATUS   RESTARTS  AGE
mnist-deploy-serving-665fc69d84-pk9bk  1/1    Running  0         5h

The external address of TensorFlow Serving is the EXTERNAL-IP, 139.195.1.216, and the port is 9090.

The corresponding Deployment is mnist-deploy-serving. This name is needed later when scaling out.
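Both values can be pulled out of the helm status text with a small script instead of being copied by hand. This is a rough sketch that assumes the column layout printed by the Helm version used in this article (service TYPE in the second column, EXTERNAL-IP in the fourth); service_external_ip and deployment_name are hypothetical helper names.

```shell
# Both helpers read `helm status` text on stdin.
# (Hypothetical helpers; they assume the column layout described above.)

# Prints the EXTERNAL-IP column of the first LoadBalancer service row.
service_external_ip() {
    awk '!seen && $2 == "LoadBalancer" { print $4; seen = 1 }'
}

# Prints the name in the first data row of the Deployment section.
deployment_name() {
    awk '$0 ~ "^==> .*/Deployment" { in_dep = 1; next }
         $0 ~ "^==> "              { in_dep = 0 }
         in_dep && !seen && NF > 0 && $1 != "NAME" { print $1; seen = 1 }'
}

# Usage (needs helm and cluster access):
#   helm status mnist-deploy | service_external_ip
#   helm status mnist-deploy | deployment_name
```

Parsing human-readable output is fragile, so if the layout differs in your Helm version, querying the Service with kubectl and a jsonpath expression such as {.status.loadBalancer.ingress[0].ip} is the sturdier route.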

4.3 Check the log of the tensorflow-serving pod. It shows that the mnist model has been loaded into memory and that the GPU has started normally

# kubectl logs mnist-deploy-serving-665fc69d84-pk9bk
2018-03-16 11:28:08.393864: I tensorflow_serving/model_servers/main.cc:147] Building single TensorFlow model file config: model_name: mnist model_base_path: /serving/model/mnist
2018-03-16 11:28:08.394115: I tensorflow_serving/model_servers/server_core.cc:441] Adding/updating models.
2018-03-16 11:28:08.394174: I tensorflow_serving/model_servers/server_core.cc:492] (Re-)adding model: mnist
2018-03-16 11:28:08.504522: I tensorflow_serving/core/basic_manager.cc:705] Successfully reserved resources to load servable {name: mnist version: 1}
2018-03-16 11:28:08.504591: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: mnist version: 1}
2018-03-16 11:28:08.504610: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: mnist version: 1}
2018-03-16 11:28:08.504643: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /serving/model/mnist/1
2018-03-16 11:28:08.504674: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:236] Loading SavedModel from: /serving/model/mnist/1
2018-03-16 11:28:08.703464: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-16 11:28:08.703865: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:08.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-03-16 11:28:08.703899: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-03-16 11:28:08.898765: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:155] Restoring SavedModel bundle.
2018-03-16 11:30:26.306194: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running LegacyInitOp on SavedModel bundle.
2018-03-16 11:30:26.309782: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:284] Loading SavedModel: success. Took 137805089 microseconds.
2018-03-16 11:30:26.320057: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: mnist version: 1}
E0316 11:30:26.322709112 1 ev_epoll1_linux.c:1051] grpc epoll fd: 23
2018-03-16 11:30:26.324023: I tensorflow_serving/model_servers/main.cc:288] Running ModelServer at 0.0.0.0:9090 ...
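The two log lines that matter here are "Successfully loaded servable version" and "Running ModelServer at". A minimal sketch of checking for them automatically follows; it assumes the log wording of the TensorFlow Serving 1.4.0 image shown above (other versions may word these lines differently), and model_loaded is a hypothetical helper name.

```shell
# model_loaded MODEL_NAME
# Reads tensorflow_model_server log text on stdin and succeeds only if it
# contains both the "Successfully loaded servable" line for MODEL_NAME and
# the "Running ModelServer" line.
# (Hypothetical helper; the matched wording is taken from the log above.)
model_loaded() {
    log=$(cat)
    case "$log" in
        *"Successfully loaded servable version {name: $1 version: "*) ;;
        *) return 1 ;;
    esac
    case "$log" in
        *"Running ModelServer at "*) ;;
        *) return 1 ;;
    esac
}

# Usage (needs cluster access):
#   kubectl logs mnist-deploy-serving-665fc69d84-pk9bk | model_loaded mnist && echo ready
```

In this log the load took about 138 seconds, so a check like this is best retried for a few minutes before treating the pod as failed.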

5. Using the external address 139.195.1.216 obtained above, start the client program locally to test the service

# docker run -it --rm registry.cn-beijing.aliyuncs.com/tensorflow-samples/tf-mnist:grpcio_upgraded /serving/bazel-bin/tensorflow_serving/example/mnist_client --num_tests=1000 --server=139.195.1.216:9090
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/t10k-labels-idx1-ubyte.gz
................................................................................
Inference error rate: 10.4%

6. Scale out TensorFlow Serving

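Because helm itself does not provide a scale command, use the native kubectl command here. It takes two inputs: the target replica count, 2, and the Deployment name found with helm status.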

# kubectl scale --replicas 2 deployment/mnist-deploy-serving
deployment "mnist-deploy-serving" scaled

Running helm status mnist-deploy again shows that there are now 2 TensorFlow Serving instances

# helm status mnist-deploy
LAST DEPLOYED: Fri Mar 16 19:24:35 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/Service
NAME                                 TYPE          CLUSTER-IP    EXTERNAL-IP    PORT(S)         AGE
mnist-deploy-acs-tensorflow-serving  LoadBalancer  172.19.0.219  139.196.1.217  9090:32560/TCP  5h

==> v1beta1/Deployment
NAME                  DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
mnist-deploy-serving  2        2        2           2          5h

==> v1/Pod(related)
NAME                                   READY  STATUS   RESTARTS  AGE
mnist-deploy-serving-665fc69d84-7sfvn  1/1    Running  0         9m
mnist-deploy-serving-665fc69d84-pk9bk  1/1    Running  0         5h
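The scale step and the follow-up check can be combined so that the command returns only once the new replicas are available. The sketch below wraps kubectl scale together with kubectl rollout status; the function name scale_serving and the KUBECTL override variable are assumptions added for illustration and for dry runs.

```shell
# scale_serving DEPLOYMENT REPLICAS
# Scales the Deployment, then blocks until the rollout has completed.
# Set KUBECTL=echo to print the commands instead of running them.
# (Hypothetical helper; KUBECTL is not a variable kubectl itself reads.)
scale_serving() {
    case "$2" in
        ''|*[!0-9]*)
            echo "replicas must be a non-negative integer" >&2
            return 2
            ;;
    esac
    "${KUBECTL:-kubectl}" scale --replicas "$2" "deployment/$1" || return 1
    "${KUBECTL:-kubectl}" rollout status "deployment/$1"
}

# Usage (needs cluster access):
#   scale_serving mnist-deploy-serving 2
```

Keep in mind that a replica count changed this way lives only in the cluster, so a later helm upgrade that renders replicas from the chart values may reset it.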

Summary

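This article showed how to use Alibaba Cloud Kubernetes Container Service to get an out-of-the-box TensorFlow Serving deployment running quickly and to scale it in or out with a single command, putting the full power of deep learning to work. Alibaba Cloud Kubernetes also provides a rich set of infrastructure capabilities for deep learning, from elastic compute and load balancing to object storage, logging, and monitoring. Combining the two lets data scientists focus on the model itself instead of spending much of their energy on application operations.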

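The Alibaba Cloud Container Service team will keep working on simple, easy-to-use GPU acceleration and deep learning solutions to further improve the efficiency of deep learning training and prediction in the cloud.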
