
Solving a KubeSphere Cluster Network Failure


Preface

KubeSphere is a Kubernetes distribution open-sourced by QingCloud, with a rather polished cluster management UI; we use KubeSphere as our development platform.

This article records the process of troubleshooting a network failure in a KubeSphere environment.

Symptoms

A developer reported that the Harbor registry he had set up kept acting up: image pulls sometimes succeeded and sometimes failed with net/http: TLS handshake timeout. Accessing harbor.xxx.cn with curl would also hang randomly and frequently. Yet ping showed everything was normal.
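Before digging in, it helps to quantify how flaky an endpoint like this really is. A minimal sketch (the host is the one from the report; the request count and timeout are arbitrary):

# Fire 20 requests with a hard 5-second cap; hung requests show up as code 000 (curl exit 28)
for i in $(seq 1 20); do
  curl -sk -o /dev/null -m 5 -w "%{http_code} %{time_total}s\n" https://harbor.xxx.cn/
done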

Root cause analysis

After the bug report came in, it took multiple rounds of analysis to finally pin down the cause: when installing KubeSphere, we had used the then-latest Kubernetes, 1.23.1.

Although ./kk version --show-supported-k8s shows that KubeSphere 3.2.1 supports Kubernetes 1.23.1, that support is in fact only experimental, and it has pitfalls.

The analysis process went roughly as follows:

1. When the Harbor registry access problem first appeared, I instinctively assumed the Harbor deployment itself was broken. But checking the harbor core logs, there were no error messages corresponding to the failures, not even info-level entries.
2. I then turned to harbor portal and examined its access logs; again, nothing abnormal.
3. Following the access chain, I moved on to kubesphere-router-kubesphere-system, i.e. KubeSphere's build of the nginx ingress controller. Still no abnormal logs.
4. Accessing Harbor's in-cluster service address from other pods inside the cluster never produced the timeout, so the preliminary judgment was that KubeSphere's bundled ingress was at fault.
5. I disabled KubeSphere's bundled ingress controller and installed the ingress-nginx-controller version recommended by upstream Kubernetes. The failure persisted, and the ingress logs again contained nothing abnormal.
6. Putting the above together, the problem had to sit between the client and the ingress controller. My ingress controller is exposed outside the cluster via NodePort, so I tested other services exposed through NodePort and hit exactly the same failure. That fully ruled out the Harbor deployment and pinned the problem on the client-to-ingress-controller path.
7. External clients reaching the ingress controller through a NodePort pass through the kube-proxy component. Analyzing the kube-proxy logs turned up this warning:

can't set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1

This warning appears because the kernel of my CentOS 7.6 nodes is too old: the current version is 3.10.0-1160.21.1.el7.x86_64, which has compatibility problems with the IPVS support in newer Kubernetes releases.
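To confirm a node is affected, compare the kernel version against the kube-proxy logs. A quick check, assuming kube-proxy runs as pods labeled k8s-app=kube-proxy in kube-system (the usual layout):

uname -r
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -i conn_reuse_mode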

This can be solved by upgrading the operating system kernel.

After upgrading the kernel, Calico failed to start, reporting the following error:

ipset v7.1: kernel and userspace incompatible: settype hash:ip,port with revision 6 not supported by userspace.

The cause is that the Calico version installed by default with KubeSphere is v3.20.0, which does not support the latest Linux kernels. The upgraded kernel is 5.18.1-1.el7.elrepo.x86_64, so Calico needs to be upgraded to v3.23.0 or later.
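To see which Calico image a cluster is actually running, before and after the upgrade, you can query the DaemonSet directly; this assumes it is named calico-node in kube-system, as in my cluster:

kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[*].image}'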

After upgrading Calico, it reported yet another error:

user "system:serviceaccount:kube-system:calico-node" cannot list resource "caliconodestatuses" in api group "crd.projectcalico.org"

There was one more similar error as well; both were caused by the ClusterRole lacking permissions on some resources, and can be fixed by amending the ClusterRole.
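The missing permission can be confirmed with kubectl's built-in access check; the resource and service account come straight from the error message above:

kubectl auth can-i list caliconodestatuses.crd.projectcalico.org \
  --as=system:serviceaccount:kube-system:calico-node

This prints no until the ClusterRole is amended (see the resolution below).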

At this point, the baffling network problem was solved.

Resolution process

Based on the analysis above, the main fixes were as follows:

Upgrade the operating system kernel

1. Switch to Alibaba Cloud's yum mirror:

wget -O /etc/yum.repos.d/CentOS-Base.repo https://mirrors.aliyun.com/repo/Centos-7.repo
yum clean all && yum -y update

2. Enable the ELRepo repository:

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm

3. Install the latest mainline kernel:

yum --enablerepo=elrepo-kernel install kernel-ml

4. List all kernels available on the system:

awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg

5. Set the new kernel as the grub2 default. Check the kernel list returned in step 4; barring surprises, the first entry (index 0) should be the newly installed kernel:

grub2-set-default 0

6. Regenerate the grub configuration and reboot:

grub2-mkconfig -o /boot/grub2/grub.cfg
reboot now

7. Verify after the reboot:

uname -r
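The reboot restarts kube-proxy along with the node, so no extra action should be needed there; a quick way to confirm the warning is gone (same k8s-app=kube-proxy label assumption as before):

kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -i conn_reuse_mode || echo "warning gone"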
Upgrade Calico

Calico on Kubernetes is normally deployed as a DaemonSet; in my cluster, the Calico DaemonSet is named calico-node.

Dump it straight to a YAML file, change every image version in the file to the latest v3.23.1, and recreate the DaemonSet.

Dump the YAML

kubectl -n kube-system get ds calico-node -o yaml > calico-node.yaml
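Editing the dumped file by hand works; a sed one-liner that bumps all three calico images at once is sketched below (the old tag v3.20.0 matches my cluster; adjust it if yours differs):

# Rewrites the calico/node, calico/cni and calico/pod2daemon-flexvol tags in place
sed -i 's#\(calico/[a-z0-9-]*\):v3\.20\.0#\1:v3.23.1#g' calico-node.yaml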
calico-node.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: calico-node
  name: calico-node
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: calico-node
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: calico-node
    spec:
      containers:
      - env:
        - name: DATASTORE_TYPE
          value: kubernetes
        - name: WAIT_FOR_DATASTORE
          value: "true"
        - name: NODENAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        - name: CLUSTER_TYPE
          value: k8s,bgp
        - name: NODEIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: IP_AUTODETECTION_METHOD
          value: can-reach=$(NODEIP)
        - name: IP
          value: autodetect
        - name: CALICO_IPV4POOL_IPIP
          value: Always
        - name: CALICO_IPV4POOL_VXLAN
          value: Never
        - name: FELIX_IPINIPMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_VXLANMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: FELIX_WIREGUARDMTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: CALICO_IPV4POOL_CIDR
          value: 10.233.64.0/18
        - name: CALICO_IPV4POOL_BLOCK_SIZE
          value: "24"
        - name: CALICO_DISABLE_FILE_LOGGING
          value: "true"
        - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
          value: ACCEPT
        - name: FELIX_IPV6SUPPORT
          value: "false"
        - name: FELIX_HEALTHENABLED
          value: "true"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/node:v3.23.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-live
            - -bird-live
          failureThreshold: 6
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: calico-node
        readinessProbe:
          exec:
            command:
            - /bin/calico-node
            - -felix-ready
            - -bird-ready
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        resources:
          requests:
            cpu: 250m
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /var/run/calico
          name: var-run-calico
        - mountPath: /var/lib/calico
          name: var-lib-calico
        - mountPath: /var/run/nodeagent
          name: policysync
        - mountPath: /sys/fs/
          mountPropagation: Bidirectional
          name: sysfs
        - mountPath: /var/log/calico/cni
          name: cni-log-dir
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - command:
        - /opt/cni/bin/calico-ipam
        - -upgrade
        env:
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CALICO_NETWORKING_BACKEND
          valueFrom:
            configMapKeyRef:
              key: calico_backend
              name: calico-config
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: upgrade-ipam
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/cni/networks
          name: host-local-net-dir
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
      - command:
        - /opt/cni/bin/install
        env:
        - name: CNI_CONF_NAME
          value: 10-calico.conflist
        - name: CNI_NETWORK_CONFIG
          valueFrom:
            configMapKeyRef:
              key: cni_network_config
              name: calico-config
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CNI_MTU
          valueFrom:
            configMapKeyRef:
              key: veth_mtu
              name: calico-config
        - name: SLEEP
          value: "false"
        envFrom:
        - configMapRef:
            name: kubernetes-services-endpoint
            optional: true
        image: calico/cni:v3.23.1
        imagePullPolicy: IfNotPresent
        name: install-cni
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          name: cni-bin-dir
        - mountPath: /host/etc/cni/net.d
          name: cni-net-dir
      - image: calico/pod2daemon-flexvol:v3.23.1
        imagePullPolicy: IfNotPresent
        name: flexvol-driver
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/driver
          name: flexvol-driver-host
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: calico-node
      serviceAccountName: calico-node
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoSchedule
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute
        operator: Exists
      volumes:
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
      - hostPath:
          path: /var/run/calico
          type: ""
        name: var-run-calico
      - hostPath:
          path: /var/lib/calico
          type: ""
        name: var-lib-calico
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /sys/fs/
          type: DirectoryOrCreate
        name: sysfs
      - hostPath:
          path: /opt/cni/bin
          type: ""
        name: cni-bin-dir
      - hostPath:
          path: /etc/cni/net.d
          type: ""
        name: cni-net-dir
      - hostPath:
          path: /var/log/calico/cni
          type: ""
        name: cni-log-dir
      - hostPath:
          path: /var/lib/cni/networks
          type: ""
        name: host-local-net-dir
      - hostPath:
          path: /var/run/nodeagent
          type: DirectoryOrCreate
        name: policysync
      - hostPath:
          path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
          type: DirectoryOrCreate
        name: flexvol-driver-host
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
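With the images updated, push the change and watch the rollout. Instead of deleting and recreating the DaemonSet, applying the edited manifest (file name as above) reaches the same end state and rolls the pods node by node:

kubectl apply -f calico-node.yaml
kubectl -n kube-system rollout status ds/calico-node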
ClusterRole

The ClusterRole also needs to be modified; otherwise Calico will keep reporting permission errors.

Dump the YAML

kubectl get clusterrole calico-node -o yaml > calico-node-clusterrole.yaml
calico-node-clusterrole.yaml:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  verbs:
  - watch
  - list
  - get
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  - serviceaccounts
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  - caliconodestatuses
  - ipreservations
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ippools
  - felixconfigurations
  - clusterinformations
  verbs:
  - create
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - bgpconfigurations
  - bgppeers
  verbs:
  - create
  - update
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  verbs:
  - get
  - list
  - create
  - update
  - delete
- apiGroups:
  - crd.projectcalico.org
  resources:
  - ipamconfigs
  verbs:
  - get
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  verbs:
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  verbs:
  - get
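Apply the amended role, then re-run the earlier permission check to confirm it now returns yes:

kubectl apply -f calico-node-clusterrole.yaml
kubectl auth can-i list caliconodestatuses.crd.projectcalico.org \
  --as=system:serviceaccount:kube-system:calico-node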
Summary

The root cause of this strange network failure was, in the end, a mismatch between the KubeSphere version and the Kubernetes version. In a working environment, stability should come first: don't rush to adopt the newest versions, or you will burn a lot of time untangling baffling problems.
