龙空技术网

LXCFS: container resource view isolation for k8s

DevOpSec


Background

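We run k8s 1.25.6 and have containerized our business workloads. After processes moved from virtual machines into containers, the ops team was puzzled when troubleshooting with commands such as free -m and top: memory looked plentiful, yet the pod's container was OOM-killed; the CPU showed many cores and plenty of idle capacity, yet the process ran slowly. What we saw was the resource view of the physical machine, not the resources we had limited for the containers in the pod, and this got in the way of troubleshooting for both developers and ops.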

So why does ops still see the resource view of the physical machine?

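Containers use cgroups to limit resources such as CPU, memory, and swap, but a container is not fully isolated: it shares the kernel with the host, so it can still see some host information. On Linux, the /proc directory holds many virtual files that expose kernel and runtime information. /proc/meminfo, for example, contains memory usage and status such as total memory, available memory, and used memory. These files are not cgroup-aware, so when free -m runs inside a container, the /proc/meminfo it reads reports the host's values, and what it displays is the physical machine's memory.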
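To make the mismatch concrete, here is a minimal Python sketch that compares the MemTotal a tool like free would read with the container's cgroup memory limit. The function names are made up for illustration, and the cgroup v2 file memory.max is an assumption about the node (cgroup v1 uses memory.limit_in_bytes instead).

```python
def meminfo_total_kib(meminfo_text: str) -> int:
    """Return MemTotal (in KiB) from the text of a /proc/meminfo-style file."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line format: "MemTotal:       65806520 kB"
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def cgroup_limit_kib(memory_max_text: str):
    """Return the cgroup v2 memory limit in KiB, or None when it is "max" (unlimited)."""
    value = memory_max_text.strip()
    return None if value == "max" else int(value) // 1024


def describe_gap(meminfo_text: str, memory_max_text: str) -> str:
    """Summarise what free reports versus what the cgroup actually allows."""
    total = meminfo_total_kib(meminfo_text)
    limit = cgroup_limit_kib(memory_max_text)
    if limit is None or limit >= total:
        return f"free sees {total} KiB; no tighter cgroup limit"
    return f"free sees {total} KiB, but the cgroup allows only {limit} KiB"
```

Inside a container without LXCFS, calling describe_gap(open("/proc/meminfo").read(), open("/sys/fs/cgroup/memory.max").read()) would typically report the second case, assuming those paths exist on your system.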

Now that we know why the container resource view is not isolated: in practice it causes more than confusion. There are real pain points:

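1. nginx, for example, sets its worker count automatically from the number of CPU cores, so it sizes itself for the host rather than for the container's CPU limit.
2. A JVM program sizes its memory automatically from the system memory, so the process may fail to start or hit OOM frequently while running.
3. Exposing too much information may endanger the security of the physical machine.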
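As a rough illustration of the first pain point, the sketch below derives the number of CPUs a container can really use from its cgroup quota, which is the number a setting like nginx's worker_processes auto would ideally pick up. It assumes the cgroup v2 cpu.max format ("quota period" in microseconds, or "max period" when unlimited); effective_cpus is a hypothetical helper name.

```python
import math


def effective_cpus(cpu_max_text: str, host_cpus: int) -> int:
    """Return the CPU count a container can actually use.

    cpu_max_text is the content of a cgroup v2 cpu.max file:
    "<quota> <period>" in microseconds, or "max <period>" when unlimited.
    """
    quota, period = cpu_max_text.split()
    if quota == "max":
        return host_cpus
    # A fractional limit such as 0.5 CPU still needs one worker, so round up.
    limited = math.ceil(int(quota) / int(period))
    return max(1, min(limited, host_cpus))
```

With a limit of 500m on a 64-core node, a process that counts host cores would start 64 workers, while this calculation yields 1.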

So how do we isolate the container resource view? The Linux Containers (LXC) community recognized this problem long ago and developed LXCFS (Linux Containers File System) to solve it.

Let's look at how LXCFS works.

How LXCFS works

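LXCFS is a small virtual file system implemented with FUSE (Filesystem in Userspace), intended to make a Linux container feel more like a virtual machine. It started as a side project of LXC but can be used by any runtime.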

LXCFS makes sure that the information provided by key files in procfs is specific to the container, for example:

/proc/cpuinfo
/proc/diskstats
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime
/proc/slabinfo
/sys/devices/system/cpu/online

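LXCFS adapts this information to the container, so that the displayed values (for example /proc/uptime) truly reflect how long the container has been running, not the host.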

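LXCFS runs on the host and exposes a virtual file system (under /var/lib/lxcfs in this article). Its files are mounted over the corresponding files under /proc and /sys inside the container. When a process in the container reads one of them, LXCFS looks up the cgroup of that process and computes the values from the cgroup's limits and usage, so the container sees a virtual view of its own resources instead of the host's.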
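The idea of computing a virtual view from cgroup values can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how LXCFS itself is implemented (LXCFS is written in C and adjusts more fields); render_container_meminfo is a hypothetical name, and only three fields are rewritten.

```python
def render_container_meminfo(host_meminfo: str, limit_bytes: int, usage_bytes: int) -> str:
    """Rewrite MemTotal/MemFree/MemAvailable of a meminfo text to cgroup values.

    All other lines are passed through unchanged.
    """
    limit_kib = limit_bytes // 1024
    free_kib = max(0, limit_bytes - usage_bytes) // 1024
    overrides = {
        "MemTotal": limit_kib,
        "MemFree": free_kib,
        "MemAvailable": free_kib,
    }
    out = []
    for line in host_meminfo.splitlines():
        key = line.split(":", 1)[0]
        if key in overrides:
            label = key + ":"
            # Keep the usual meminfo column layout: padded label, right-aligned value.
            out.append(f"{label:<16}{overrides[key]:>8} kB")
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Given a 256 MiB limit and 2744 KiB in use, MemTotal becomes 262144 and MemFree 259400, whatever the host's own numbers are.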

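Taking free -m as an example, the flow is:

1. free -m runs in the container and reads /proc/meminfo.
2. Because /proc/meminfo is a mount, the read actually goes to /var/lib/lxcfs/proc/meminfo (see below), which triggers LXCFS.
3. The read goes through the glibc system call into the kernel VFS layer, which passes it to the FUSE kernel module.
4. FUSE calls back into the user-space LXCFS file system implementation, which obtains the container's cgroup information.
5. LXCFS uses the container ID to read the limited container's actual memory, CPU, and other values under its cgroup and computes the result, so what the user finally sees is the resource view limited by the cgroup.

Deploying LXCFS on a machine

a. Install lxcfs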

yum install meson fuse-devel fuse cmake help2man fuse3 fuse3-devel -y
# GitHub no longer serves the unauthenticated git:// protocol, so clone over https
git clone https://github.com/lxc/lxcfs
cd lxcfs
meson setup -Dinit-script=systemd --prefix=/usr build/
meson compile -C build/
meson install -C build/
b. Start lxcfs
mkdir -p /var/lib/lxcfs
lxcfs /var/lib/lxcfs
c. Test by running a container
docker run -it -m 256m --memory-swap 256m --cpus=1 \
      -v /var/lib/lxcfs/proc/cpuinfo:/proc/cpuinfo:rw \
      -v /var/lib/lxcfs/proc/diskstats:/proc/diskstats:rw \
      -v /var/lib/lxcfs/proc/meminfo:/proc/meminfo:rw \
      -v /var/lib/lxcfs/proc/stat:/proc/stat:rw \
      -v /var/lib/lxcfs/proc/swaps:/proc/swaps:rw \
      -v /var/lib/lxcfs/proc/uptime:/proc/uptime:rw \
      -v /var/lib/lxcfs/proc/slabinfo:/proc/slabinfo:rw \
      -v /var/lib/lxcfs/sys/devices/system/cpu:/sys/devices/system/cpu:rw \
      ubuntu:18.04 /bin/bash

After the container starts, run the following commands to confirm that it works:

1. uptime   # container uptime
2. free -m  # memory
3. lscpu  # number of online CPU cores, or cat /proc/cpuinfo
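Besides eyeballing the command output, you can check which paths are actually served by LXCFS. The sketch below parses text in the /proc/mounts layout and lists mount points whose filesystem type is fuse.lxcfs; that type string is an assumption about how the mounts appear on your system, and lxcfs_mount_points is a hypothetical helper.

```python
def lxcfs_mount_points(mounts_text: str) -> list:
    """Return mount points whose filesystem type is fuse.lxcfs.

    mounts_text uses the /proc/mounts layout:
    "<source> <mount point> <fstype> <options> <dump> <pass>".
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "fuse.lxcfs":
            points.append(fields[1])
    return points
```

Inside the test container, lxcfs_mount_points(open("/proc/mounts").read()) should then include /proc/meminfo, /proc/uptime, and the other mounted paths if the bind mounts took effect.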

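How do we add resource view isolation to pods in a k8s environment? Let's take a look.

Running LXCFS in a k8s environment

Steps:

1. First, the lxcfs process must run on every node; we use a DaemonSet for this.
2. Next, mount the node's /sys/fs/cgroup, /usr/lib64, and /usr/local into the lxcfs container, and mount the virtual file system /var/lib/lxcfs/ from the lxcfs container back onto the physical machine via hostPath.
3. Finally, create the pod yaml and mount the node's /var/lib/lxcfs/ into the pod's containers via hostPath. This completes resource view isolation for k8s containers with lxcfs.

a. Build the lxcfs image

a.1 Directory structure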

tree .
.
├── Dockerfile
├── build.sh
└── lxcfs-lxcfs-5.0.4.tar.gz
a.2 Dockerfile
# or specify your own base image
FROM centos:7.9
# install
RUN yum install meson fuse-devel fuse cmake help2man fuse3 fuse3-devel git -y
RUN git clone https://github.com/lxc/lxcfs /lxcfs
# each RUN starts in a fresh shell, so use WORKDIR instead of cd
WORKDIR /lxcfs
RUN meson setup -Dinit-script=systemd --prefix=/usr build/
RUN meson compile -C build/
RUN meson install -C build/
# run
RUN mkdir -p /var/lib/lxcfs
CMD ["sh", "-c", "lxcfs /var/lib/lxcfs"]
a.3 build.sh: build the image
#!/bin/bash
source /etc/profile
docker build -t yourharbor.domain.com/centos/7.9/lxcfs/5.0.4/lxcfs .
docker push yourharbor.domain.com/centos/7.9/lxcfs/5.0.4/lxcfs

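The lxcfs image is now built. Next, let's see how to use it.

b. Run the lxcfs DaemonSet yaml

Using the image we built, mount the node's files into the pod and mount /var/lib/lxcfs/ onto the node, as in the following yaml: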

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
  labels:
    app: lxcfs
  name: lxcfs
  namespace: default
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: lxcfs
  template:
    metadata:
      labels:
        app: lxcfs
    spec:
      containers:
      - image: yourharbor.domain.com/centos/7.9/lxcfs/5.0.4/lxcfs
        imagePullPolicy: Always
        name: lxcfs
        resources: {}
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /sys/fs/cgroup
          name: cgroup
        - mountPath: /var/lib/lxcfs
          mountPropagation: Bidirectional
          name: lxcfs
        - mountPath: /usr/local
          name: usr-local
        - mountPath: /usr/lib64
          name: usr-lib64
      hostPID: true
      imagePullSecrets:
      - name: your-docker-token
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: your-taint-key
        operator: Exists
      volumes:
      - hostPath:
          path: /sys/fs/cgroup
          type: ""
        name: cgroup
      - hostPath:
          path: /usr/local
          type: ""
        name: usr-local
      - hostPath:
          path: /usr/lib64
          type: ""
        name: usr-lib64
      - hostPath:
          path: /var/lib/lxcfs
          type: DirectoryOrCreate
        name: lxcfs

After applying the yaml above, the lxcfs DaemonSet pod on some nodes may fail to start with the following error:

Error: failed to generate container "974c6c0465adae1a244e3416b3e053ba2dccb0cbd123c2d02317c9301e3f83d0" spec: failed to apply OCI options: failed to stat "/var/lib/lxcfs": stat /var/lib/lxcfs: transport endpoint is not connected

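Fix: the mount point left behind by a previous lxcfs process is stale, so unmount it on the affected node: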

umount /var/lib/lxcfs
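If you want to find affected nodes before the DaemonSet fails, the condition can be detected programmatically: a stat on the mount point fails with ENOTCONN, the errno behind "transport endpoint is not connected". A minimal sketch, where is_stale_fuse_mount is a hypothetical helper and the stat parameter exists only so the check can be exercised without a real mount:

```python
import errno
import os


def is_stale_fuse_mount(path: str, stat=os.stat) -> bool:
    """Return True when stat(path) fails with ENOTCONN.

    Any other outcome (success, or a different error such as ENOENT)
    is reported as "not stale".
    """
    try:
        stat(path)
    except OSError as exc:
        return exc.errno == errno.ENOTCONN
    return False
```

On a node, is_stale_fuse_mount("/var/lib/lxcfs") returning True would suggest that the umount above is needed.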
c. Verify: Deployment pod yaml definition
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      volumes:
        - hostPath:
            path: /var/lib/lxcfs/proc/cpuinfo
            type: ""
          name: lxcfs-proc-cpuinfo
        - hostPath:
            path: /var/lib/lxcfs/proc/diskstats
            type: ""
          name: lxcfs-proc-diskstats
        - hostPath:
            path: /var/lib/lxcfs/proc/meminfo
            type: ""
          name: lxcfs-proc-meminfo
        - hostPath:
            path: /var/lib/lxcfs/proc/stat
            type: ""
          name: lxcfs-proc-stat
        - hostPath:
            path: /var/lib/lxcfs/proc/swaps
            type: ""
          name: lxcfs-proc-swaps
        - hostPath:
            path: /var/lib/lxcfs/proc/uptime
            type: ""
          name: lxcfs-proc-uptime
        - hostPath:
            path: /var/lib/lxcfs/proc/loadavg
            type: ""
          name: lxcfs-proc-loadavg
        - hostPath:
            path: /var/lib/lxcfs/sys/devices/system/cpu/online
            type: ""
          name: lxcfs-sys-devices-system-cpu-online
      containers:
        - name: web
          image: httpd:2.4.32
          imagePullPolicy: Always
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          volumeMounts:
            - mountPath: /proc/cpuinfo
              name: lxcfs-proc-cpuinfo
              readOnly: true
            - mountPath: /proc/meminfo
              name: lxcfs-proc-meminfo
              readOnly: true
            - mountPath: /proc/diskstats
              name: lxcfs-proc-diskstats
              readOnly: true
            - mountPath: /proc/stat
              name: lxcfs-proc-stat
              readOnly: true
            - mountPath: /proc/swaps
              name: lxcfs-proc-swaps
              readOnly: true
            - mountPath: /proc/uptime
              name: lxcfs-proc-uptime
              readOnly: true
            - mountPath: /proc/loadavg
              name: lxcfs-proc-loadavg
              readOnly: true
            - mountPath: /sys/devices/system/cpu/online
              name: lxcfs-sys-devices-system-cpu-online
              readOnly: true

The pod now has container resource view isolation through lxcfs.
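The volumes and volumeMounts in the Deployment follow a mechanical pattern: each path under /proc or /sys maps to a hostPath under /var/lib/lxcfs and a volume name derived from the path. A small Python sketch of that pattern, with hypothetical helper names, makes the rule explicit:

```python
LXCFS_ROOT = "/var/lib/lxcfs"

# Paths mounted in the Deployment above.
PATHS = [
    "/proc/cpuinfo",
    "/proc/diskstats",
    "/proc/meminfo",
    "/proc/stat",
    "/proc/swaps",
    "/proc/uptime",
    "/proc/loadavg",
    "/sys/devices/system/cpu/online",
]


def volume_name(path: str) -> str:
    """Derive the volume name, e.g. '/proc/cpuinfo' -> 'lxcfs-proc-cpuinfo'."""
    return "lxcfs" + path.replace("/", "-")


def lxcfs_volumes(paths=PATHS) -> list:
    """Build the spec.volumes entries (hostPath under the lxcfs root)."""
    return [
        {"name": volume_name(p), "hostPath": {"path": LXCFS_ROOT + p, "type": ""}}
        for p in paths
    ]


def lxcfs_volume_mounts(paths=PATHS) -> list:
    """Build the matching read-only volumeMounts entries for a container."""
    return [
        {"name": volume_name(p), "mountPath": p, "readOnly": True}
        for p in paths
    ]
```

The two lists can be dumped to YAML or JSON and merged into a pod template, which is essentially the repetitive part of the manifest.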

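But there is a problem: copying and pasting these settings is tolerable for one or two containers, but repeating it for thousands of containers is something you, as a follower of the KISS principle, cannot accept.

Is there a way around it? We can implement an admission webhook (Admission Control), which further validates requests or adds default parameters after authorization. Others have already built what we have in mind, so there is no need to reinvent the wheel; see lxcfs-admission-webhook.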

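Automatic mounting of /proc and /sys in containers via lxcfs-admission-webhook injection

lxcfs-admission-webhook implements a dynamic admission webhook, or more precisely a mutating webhook: it watches pod creation and patches the pod, injecting the mappings between lxcfs and the directories inside the container into the pod's yaml at creation time, which gives us automatic mounting.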
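To show what "patching the pod" amounts to, here is a minimal Python sketch that builds JSON Patch (RFC 6902) operations adding volumes and volumeMounts, and wraps them in the response shape a mutating webhook returns (a base64-encoded patch with patchType JSONPatch). The real webhook is written in Go; the function names here are hypothetical and the logic is a simplified illustration, not its actual code.

```python
import base64
import json


def build_patch(pod: dict, volumes: list, mounts: list) -> list:
    """Build JSON Patch operations that add volumes and mounts to a pod."""
    ops = []
    spec = pod["spec"]
    if spec.get("volumes"):
        # Append to the existing array one element at a time.
        ops += [{"op": "add", "path": "/spec/volumes/-", "value": v} for v in volumes]
    else:
        # "add" to ".../-" needs an existing array, so create it in one go.
        ops.append({"op": "add", "path": "/spec/volumes", "value": list(volumes)})
    for i, container in enumerate(spec["containers"]):
        base = f"/spec/containers/{i}/volumeMounts"
        if container.get("volumeMounts"):
            ops += [{"op": "add", "path": base + "/-", "value": m} for m in mounts]
        else:
            ops.append({"op": "add", "path": base, "value": list(mounts)})
    return ops


def admission_response(uid: str, ops: list) -> dict:
    """Wrap the patch in the response part of an AdmissionReview."""
    patch = base64.b64encode(json.dumps(ops).encode()).decode()
    return {"uid": uid, "allowed": True, "patchType": "JSONPatch", "patch": patch}
```

The distinction between appending to an existing array and creating it matters because a JSON Patch add to a missing parent path fails.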

It is also quite KISS to use: just add one annotation to the resource file.

Let's see how to use it.

1. Prepare the lxcfs-admission-webhook image

Build the binary with go build

git clone git@github.com:denverdino/lxcfs-admission-webhook.git
cd lxcfs-admission-webhook
# build lxcfs-admission-webhook; it is an old Go project, so convert it to Go modules first
# set GOPROXY to your Go module proxy (the value is not given here)
export GOPROXY=
go mod init v1
go mod tidy
CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o lxcfs-admission-webhook
chmod +x lxcfs-admission-webhook

Dockerfile

FROM alpine:latest
ADD lxcfs-admission-webhook /lxcfs-admission-webhook
ENTRYPOINT ["./lxcfs-admission-webhook"]

Build the image

docker build -t yourharbor.domain.com/alpine/lxcfs-admission-webhook:v1 .
docker push yourharbor.domain.com/alpine/lxcfs-admission-webhook:v1
2. Run the lxcfs-admission-webhook pod

Each cluster has its own CA certificate, so when deploying lxcfs-admission-webhook to a different cluster, do the following before applying the yaml.

2.1 Directory structure

tree .
.
├── dp.yaml    # lxcfs-admission-webhook deployment
├── mutatingwebhook.yaml   # MutatingWebhookConfiguration
├── svc.yaml   # webhook svc
└── webhook-create-signed-cert.sh  # creates the certificates that lxcfs-admission-webhook depends on
2.2 Modify webhook-create-signed-cert.sh

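Note: our k8s version is fairly new and lxcfs-admission-webhook has not been updated in recent years, so to fit the newer k8s we modified the certificate generation script webhook-create-signed-cert.sh from GitHub.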

#!/bin/bash

set -e

usage() {
    cat <<EOF
Generate certificate suitable for use with an sidecar-injector webhook service.

This script uses k8s' CertificateSigningRequest API to generate a
certificate signed by k8s CA suitable for use with sidecar-injector webhook
services. This requires permissions to create and approve CSR. See for
detailed explanation and additional instructions.

The server key/cert k8s CA cert are stored in a k8s secret.

usage: ${0} [OPTIONS]

The following flags are required.

       --service          Service name of webhook.
       --namespace        Namespace where webhook service and secret reside.
       --secret           Secret name for CA certificate and server certificate/key pair.
EOF
    exit 1
}

while [[ $# -gt 0 ]]; do
    case ${1} in
        --service)
            service="$2"
            shift
            ;;
        --secret)
            secret="$2"
            shift
            ;;
        --namespace)
            namespace="$2"
            shift
            ;;
        *)
            usage
            ;;
    esac
    shift
done

[ -z ${service} ] && service=lxcfs-admission-webhook-svc
[ -z ${secret} ] && secret=lxcfs-admission-webhook-certs
[ -z ${namespace} ] && namespace=default

if [ ! -x "$(command -v openssl)" ]; then
    echo "openssl not found"
    exit 1
fi

csrName=${service}.${namespace}
tmpdir=$(mktemp -d)
echo "creating certs in tmpdir ${tmpdir} "

cat <<EOF >> ${tmpdir}/csr.conf
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = ${service}
DNS.2 = ${service}.${namespace}
DNS.3 = ${service}.${namespace}.svc
EOF

openssl genrsa -out ${tmpdir}/server-key.pem 2048
#openssl req -new -key ${tmpdir}/server-key.pem -subj "/CN=${service}.${namespace}.svc" -out ${tmpdir}/server.csr -config ${tmpdir}/csr.conf
openssl req -new -key ${tmpdir}/server-key.pem -subj "/CN=system:node:${service}.${namespace}.svc;/O=system:nodes" -out ${tmpdir}/server.csr -config ${tmpdir}/csr.conf

# clean-up any previously created CSR for our service. Ignore errors if not present.
kubectl delete csr ${csrName} -n ${namespace} 2>/dev/null || true

# create server cert/key CSR and send to k8s API
cat <<EOF | kubectl -n ${namespace} create -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: ${csrName}
spec:
  groups:
  - system:authenticated
  signerName: kubernetes.io/kubelet-serving
  request: $(cat ${tmpdir}/server.csr | base64 | tr -d '\n')
  usages:
  - digital signature
  - key encipherment
  - server auth
EOF

# verify CSR has been created
while true; do
    kubectl get csr ${csrName}
    if [ "$?" -eq 0 ]; then
        break
    fi
done

# approve and fetch the signed certificate
kubectl certificate approve ${csrName}

# verify certificate has been signed
for x in $(seq 10); do
    serverCert=$(kubectl get csr ${csrName} -o jsonpath='{.status.certificate}')
    if [[ ${serverCert} != '' ]]; then
        break
    fi
    sleep 1
done
if [[ ${serverCert} == '' ]]; then
    echo "ERROR: After approving csr ${csrName}, the signed certificate did not appear on the resource. Giving up after 10 attempts." >&2
    exit 1
fi
echo ${serverCert} | openssl base64 -d -A -out ${tmpdir}/server-cert.pem

# create the secret with CA cert and server cert/key
# (--dry-run=client replaces the deprecated bare --dry-run)
kubectl create secret generic ${secret} \
        --from-file=key.pem=${tmpdir}/server-key.pem \
        --from-file=cert.pem=${tmpdir}/server-cert.pem \
        --dry-run=client -o yaml |
    kubectl -n ${namespace} apply -f -

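We changed the subject in the certificate request command to /CN=system:node:${service}.${namespace}.svc;/O=system:nodes and fixed a bug in the --namespace handling.

Then run the following on a k8s master node (the script uses bash syntax, so invoke it with bash): kubectl create ns lxcfs ; bash webhook-create-signed-cert.sh --namespace lxcfs

2.3 Get the cluster CA certificate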

kubectl config view --raw --flatten --minify -o jsonpath='{.clusters[].cluster.certificate-authority-data}'
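Before pasting the value into the webhook configuration, it can be worth checking that it really is a base64-encoded PEM certificate, since a truncated or mangled value only shows up later as TLS errors. A small sketch, assuming the output of the command above is passed in as a string; is_pem_ca_bundle is a hypothetical helper and does not validate the certificate itself:

```python
import base64


def is_pem_ca_bundle(ca_bundle_b64: str) -> bool:
    """Return True if the value base64-decodes to text holding a PEM certificate."""
    try:
        pem = base64.b64decode(ca_bundle_b64, validate=True).decode("ascii")
    except (ValueError, UnicodeDecodeError):
        # Not valid base64, or the decoded bytes are not plain ASCII text.
        return False
    return "-----BEGIN CERTIFICATE-----" in pem and "-----END CERTIFICATE-----" in pem
```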
2.4 Put the CA certificate content into the caBundle field of mutatingwebhook.yaml
# admissionregistration.k8s.io/v1beta1 was removed in k8s 1.22, so v1 is used here
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: mutating-lxcfs-admission-webhook-cfg
  labels:
    app: lxcfs-admission-webhook
webhooks:
  - name: mutating.lxcfs-admission-webhook.aliyun.com
    # v1 requires sideEffects and admissionReviewVersions;
    # "v1beta1" assumes the webhook binary speaks AdmissionReview v1beta1
    sideEffects: None
    admissionReviewVersions: ["v1beta1"]
    clientConfig:
      service:
        name: lxcfs-admission-webhook-svc
        # must match the namespace used in svc.yaml and for the certificate
        namespace: lxcfs
        path: "/mutate"
      caBundle: LS0tLS1CRUdJxiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJek1EY3hOekEwTXpNek5Gb1hEVE16TURjeE5EQTBNek16TkZvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTlVZCjd4SThpcXZtbEtNN0FDTUFDY0huRWxxTXgyakR1b3JkWk81cUNGYTBNalROOXNqZHhUbHNNTlMrUHpuOUxPSkMKZ2d5TW90MGNPaW0zQTd2bllRYzFCY2I3UHFLOGpjS0U2a0E5MWVyNlpNSHU0c3ZXRXEybjVyMlIvcnY5NUR2eQpIRzlzTUJnenQrWUFJNlR6OGJNazhnMzJZR1BJejEvTTJmalBCa292bVJ3U0c1UkVIYWVFNW1TdDBRMnJheGJQCmtEU0pDSEErVlV3QThuekpFRVpwdkIxbUZ6MytXKzhrOUpIYlFtSW40TzhNaCtYYXlGc2Vab2g5SC9kVERkSXUKN0JXVG5pcmg5YkNWZzJhSDJidG03ZVpSY2s1V3IrM0QxcmUrc1FxWnpVdlhFSzBQYTk4MENGd3BYTVhsenlFdQpqNkhQRjZzOUhmV0gxOVdJMUdrQ0F3RUFBYU5aTUZjd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZBQVVicWVyaklyUDRmOFV0ZjErUzRERzVSWStNQlVHQTFVZEVRUU8KTUF5Q0NtdDFZbVZ5Ym1WMFpYTXdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBTGx0OHBELzVtMnhVclJSdUJIdQpaODFKbnpDSzB6Y2ZhbHRROXFiWkFQb2syT1R6eTQrclh6SHQ4VzVHN01YVmN6TXVoZnh0OXFSeWVLekM3bmtICnpJSnIxcmxPbkkwaXdNcHJFeDlNQkpBTnBNdWNwN3ljaE82RGlOQ01ocFAwMXdDbWVENTBsVUladlIrMHhUbHEKaGVZdTFZS3Eza3Q0dzNuWVUxUGszUGU1Q3NweFNqd0NKNVF0RHpyUFY4bE5JaHNMZjRHV2U2bDN0N2J5ck9wWApsUWJiMXovazNRTDRTU3pqcEdkQVRmUnVmRmsrbk1RVkFCSmJwVWp5aHNFMlg1TjRvLzlKWFVpZVhLNlYxOHNiCnVtVUlLYlkySGIyTHNISXEveTBHeHpITnpGTndEeEdGNnNSWFF5SkFYVS9tekNWRWczbEhaWUlpUU9wdkc2VdfsZXFVPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0twx==
    rules:
      - operations: [ "CREATE" ]
        apiGroups: ["core", ""]
        apiVersions: ["v1"]
        resources: ["pods"]
    namespaceSelector:
      matchLabels:
        lxcfs-admission-webhook: enabled
2.5 dp.yaml for lxcfs-admission-webhook
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lxcfs-admission-webhook-deployment
  labels:
    app: lxcfs-admission-webhook
  namespace: lxcfs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lxcfs-admission-webhook
  template:
    metadata:
      labels:
        app: lxcfs-admission-webhook
    spec:
      imagePullSecrets:
        - name: your-docker-token
      containers:
        - name: lxcfs-admission-webhook
          image: yourharbor.domain.com/alpine/lxcfs-admission-webhook:v1
          imagePullPolicy: IfNotPresent
          args:
            - -tlsCertFile=/etc/webhook/certs/cert.pem
            - -tlsKeyFile=/etc/webhook/certs/key.pem
            - -alsologtostderr
            - -v=4
            - 2>&1
          volumeMounts:
            - name: webhook-certs
              mountPath: /etc/webhook/certs
              readOnly: true
      volumes:
        - name: webhook-certs
          secret:
            secretName: lxcfs-admission-webhook-certs
2.6 svc.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: lxcfs
  name: lxcfs-admission-webhook-svc
  labels:
    app: lxcfs-admission-webhook
spec:
  ports:
  - port: 443
    targetPort: 443
  selector:
    app: lxcfs-admission-webhook
3. Verify by applying the label

Enable lxcfs for the default namespace

kubectl label namespace default lxcfs-admission-webhook=enabled
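The label works because the webhook configuration's namespaceSelector uses matchLabels: the webhook is only called for namespaces that carry every listed label with the same value. A minimal sketch of that matching rule, with a hypothetical function name:

```python
def namespace_selected(namespace_labels: dict, match_labels: dict) -> bool:
    """matchLabels semantics: every listed key must be present with the same value."""
    return all(namespace_labels.get(k) == v for k, v in match_labels.items())
```

A namespace without the label, or with a different value such as disabled, is skipped, so pods created there are not patched.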

Deploy the Deployment

cd lxcfs-admission-webhook
kubectl apply -f deployment/web.yaml

Log in to the container and run free

$ kubectl get pod
NAME                                                 READY   STATUS    RESTARTS   AGE
lxcfs-admission-webhook-deployment-f4bdd6f66-5wrlg   1/1     Running   0          8m29s
lxcfs-pqs2d                                          1/1     Running   0          55m
lxcfs-zfh99                                          1/1     Running   0          55m
web-7c5464f6b9-6zxdf                                 1/1     Running   0          8m10s
web-7c5464f6b9-nktff                                 1/1     Running   0          8m10s
$ kubectl exec -ti web-7c5464f6b9-6zxdf sh
# free
             total       used       free     shared    buffers     cached
Mem:        262144       2744     259400          0          0        312
-/+ buffers/cache:       2432     259712
Swap:            0          0          0
#
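The total of 262144 reported by free is in KiB and corresponds to a 256Mi memory limit, assuming the web Deployment uses the same limit as the manifest shown earlier. A quick way to check the arithmetic, using a hypothetical helper that handles only the binary suffixes Ki, Mi, and Gi:

```python
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def quantity_to_kib(quantity: str) -> int:
    """Convert a binary-suffixed Kubernetes memory quantity such as "256Mi" to KiB."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor // 1024
    raise ValueError(f"unsupported quantity: {quantity}")
```

quantity_to_kib("256Mi") gives 262144, matching the Mem total column, which confirms that the container now sees its own limit.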
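Summary

To stress the point: what we implemented isolates the container's resource view from the physical machine's resource view; it applies per container, not per pod.

With the container resource view isolated, things look much more comfortable, and it helps a lot with troubleshooting, service startup, and security. Give it a try. Follow DevOpSec for weekly practical content, and let's improve together.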
