Using Rook-Ceph as Persistent Storage (PV) for Kubernetes

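How do we use Rook-Ceph to provide PVs for k8s?
Ceph currently supports object storage, CephFS (file storage), and block storage. Below I show the CephFS approach.
Before that, note that Ceph is distributed storage. If you have 3 nodes, each with a second disk, Ceph takes over those 3 disks and uses them as a single pool, sharding the data across them with its internal placement algorithm. We can manually schedule Pods onto specific nodes, for example to separate availability zones,
but that only places the Pods; the data pool is still the three disks merged into one. Even with error correction and the placement algorithm, there is still some chance that data becomes unreadable when just one disk fails. Cross-AZ high availability should be implemented at a lower layer, for example with something RAID-like. With cloud disks from a cloud provider, that is of course not possible.
Alternatively, use a commercial solution, for example one where each disk is its own pool and multiple replicas can be placed according to labels.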

Prerequisites

We use Rook-Ceph to orchestrate Ceph on k8s. Before starting, check the prerequisites.
https://rook.io/docs/rook/v1.10/Getting-Started/Prerequisites/prerequisites/#ceph-prerequisites

FSTYPE: the disk must be unformatted

$ lsblk -f
NAME                  FSTYPE      LABEL UUID                                   MOUNTPOINT
vda
└─vda1                LVM2_member       eSO50t-GkUV-YKTH-WsGq-hNJY-eKNf-3i07IB
  ├─ubuntu--vg-root   ext4              c2366f76-6e21-4f10-a8f3-6776212e2fe4   /
  └─ubuntu--vg-swap_1 swap              9492a3dc-ad75-47cd-9596-678e8cf17ff9   [SWAP]
vdb

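In the FSTYPE column, this disk must be unformatted (that is, the column is empty). If the second cloud disk was attached by default, it has usually been formatted already and needs to be reverted.
Reverting is fairly simple; we use fdisk.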

fdisk /dev/vdb

# At the interactive prompt, enter d (delete partition)
>> d
# Then enter w to write the change
>> w
$ lsblk -f
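
Checking the FSTYPE column by eye works for one node but gets tedious across several. A minimal sketch of a scripted check follows; the helper name fstype_is_empty is my own, and it only looks at filesystem signatures reported by lsblk, so treat it as a first filter rather than a guarantee that Rook will accept the disk.

```shell
# Hypothetical helper: reads the output of `lsblk -no FSTYPE <device>` on stdin.
# Succeeds (exit status 0) only when every line is blank, i.e. neither the
# disk nor any of its partitions reports a filesystem type.
fstype_is_empty() {
  ! grep -q '[^[:space:]]'
}

# Usage on a node (the device name /dev/vdb comes from the example above):
#   lsblk -no FSTYPE /dev/vdb | fstype_is_empty && echo "vdb looks unformatted"
```

If the check fails after deleting the partition, the disk probably still carries a signature somewhere, and the lsblk -f output shows which entry it is.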

Admission Controller

Add the Rook Admission Controller

kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.7.1/cert-manager.yaml

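LVM package update

# CentOS/RHEL
sudo yum install -y lvm2
# Ubuntu/Debian (the lsblk output above is from an Ubuntu host)
sudo apt-get install -y lvm2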

Install Rook-Ceph

Fetch the example YAML files from GitHub

Do not deploy with Helm and values.yaml here.

git clone https://github.com/rook/rook.git
cd rook/deploy/examples
kubectl create -f crds.yaml -f common.yaml -f operator.yaml
kubectl create -f cluster.yaml
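
The cluster takes a few minutes to come up, and the later steps fail in confusing ways if the OSD or mon pods are not running yet. As a rough sketch, the filter below prints only the pods that are not yet Running or Completed. The helper name pods_not_running is my own, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name and status of every pod whose STATUS column is
# neither Running nor Completed.
pods_not_running() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (prints nothing once everything has settled):
#   kubectl -n rook-ceph get pods --no-headers | pods_not_running
```

Note that this looks at the STATUS column only; a pod can be Running while its containers are not ready yet, so the READY column is still worth a glance.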

Basic deployment

Because we use CephFS, we need the StorageClass under csi/cephfs/. Before that, we also need the filesystem from the examples directory.

kubectl create -f filesystem.yaml

# or the erasure-coded variant
kubectl create -f filesystem-ec.yaml

Deploy the StorageClass from the files under /root/rook/deploy/examples/csi/cephfs/:

cd /root/rook/deploy/examples/csi/cephfs/
kubectl create -f storageclass.yaml
# or the erasure-coded variant
kubectl create -f storageclass-ec.yaml

StorageClass check

kubectl get sc
# kubectl get storageclass

root@iZ0xi8e6m9i2dxn2mfu8tzZ:~# kubectl get sc
NAME                             PROVISIONER                       RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
alibabacloud-cnfs-nas            nasplugin.csi.alibabacloud.com    Delete          Immediate              true                   9d
alicloud-disk-efficiency         diskplugin.csi.alibabacloud.com   Delete          Immediate              true                   9d
alicloud-disk-essd               diskplugin.csi.alibabacloud.com   Delete          Immediate              true                   9d
alicloud-disk-ssd                diskplugin.csi.alibabacloud.com   Delete          Immediate              true                   9d
alicloud-disk-topology-alltype   diskplugin.csi.alibabacloud.com   Delete          WaitForFirstConsumer   true                   9d
rook-cephfs                      rook-ceph.cephfs.csi.ceph.com     Delete          Immediate              true                   9d

As the output shows, rook-cephfs is not the default StorageClass. We make it the default with the following command:

# kubectl patch storageclass <storageClassName> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
root@k8s-manage:~# kubectl patch storageclass rook-cephfs -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

root@k8s-manage:~# kubectl get sc
NAME                             PROVISIONER                       RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
alibabacloud-cnfs-nas            nasplugin.csi.alibabacloud.com    Delete          Immediate              true                   9d
alicloud-disk-efficiency         diskplugin.csi.alibabacloud.com   Delete          Immediate              true                   9d
alicloud-disk-essd               diskplugin.csi.alibabacloud.com   Delete          Immediate              true                   9d
alicloud-disk-ssd                diskplugin.csi.alibabacloud.com   Delete          Immediate              true                   9d
alicloud-disk-topology-alltype   diskplugin.csi.alibabacloud.com   Delete          WaitForFirstConsumer   true                   9d
rook-cephfs (default)            rook-ceph.cephfs.csi.ceph.com     Delete          Immediate              true                   9d
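
On a cloud cluster another StorageClass may already be marked as default, and having more than one default is best avoided. A small sketch for listing whichever classes currently carry the marker; the helper name default_storage_classes is my own, and it relies on kubectl printing "(default)" right after the class name, as in the output above.

```shell
# Hypothetical helper: reads `kubectl get sc` output (with header) on stdin
# and prints the names of the classes marked "(default)".
default_storage_classes() {
  awk 'NR > 1 && $2 == "(default)" { print $1 }'
}

# Usage:
#   kubectl get sc | default_storage_classes
#
# To remove the marker from another class, the same patch can be applied
# with the annotation set to "false":
#   kubectl patch storageclass <storageClassName> -p \
#     '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
```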

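Sample PVC deployment:

kubectl create -f kube-registry.yaml

After deployment, you can query the result with:

kubectl get pv
kubectl get pvc -A   # -A lists all namespaces; the sample PVC is not necessarily in default

If you use Alibaba Cloud, you can also find them in the Web UI under Persistent Volumes and Persistent Volume Claims.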

Ceph Toolbox

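Install the Toolbox so that you can log in and inspect the state of Ceph.

# run from the root of the cloned rook repository
kubectl create -f deploy/examples/toolbox.yaml

Other basic commands: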

ceph status
ceph osd status
ceph df
rados df

Web management: Ceph Dashboard

It is enabled by default.
You can check it as follows.

root@k8s-ceph:~# kubectl get svc -n rook-ceph
NAME                         TYPE           CLUSTER-IP        EXTERNAL-IP      PORT(S)             AGE
rook-ceph-exporter           ClusterIP      192.168.36.5      <none>           9926/TCP            12d
rook-ceph-mgr                ClusterIP      192.168.44.127    <none>           9283/TCP            12d
rook-ceph-mgr-dashboard      ClusterIP      192.168.173.180   <none>           8443/TCP            8m7s
rook-ceph-mon-a              ClusterIP      192.168.196.149   <none>           6789/TCP,3300/TCP   19h
rook-ceph-mon-c              ClusterIP      192.168.116.46    <none>           6789/TCP,3300/TCP   12d
rook-ceph-mon-d              ClusterIP      192.168.226.212   <none>           6789/TCP,3300/TCP   23h

For convenience I added a LoadBalancer Service:

root@k8s-ceph:~# kubectl get svc -n rook-ceph
NAME                         TYPE           CLUSTER-IP        EXTERNAL-IP      PORT(S)             AGE
rook-ceph-exporter           ClusterIP      192.168.36.5      <none>           9926/TCP            12d
rook-ceph-mgr                ClusterIP      192.168.44.127    <none>           9283/TCP            12d
rook-ceph-mgr-dashboard      ClusterIP      192.168.173.180   <none>           8443/TCP            8m7s
rook-ceph-mgr-dashboard-lb   LoadBalancer   192.168.59.173    103.22.13.134    8443:31474/TCP      21m
rook-ceph-mon-a              ClusterIP      192.168.196.149   <none>           6789/TCP,3300/TCP   19h
rook-ceph-mon-c              ClusterIP      192.168.116.46    <none>           6789/TCP,3300/TCP   12d
rook-ceph-mon-d              ClusterIP      192.168.226.212   <none>           6789/TCP,3300/TCP   23h

The username is admin; get the password as follows:

kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode && echo

Note: the dashboard is served over HTTPS with a self-signed certificate by default.

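Editing the pool replica count

We can increase redundancy by editing the replica count of a storage pool. The same change can also be used to trigger rebalancing when placement groups (PGs) report errors.
From the dashboard we can see that the default replica count is 3.
We log in through the toolbox and change the replica count.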

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

# List the pools after logging in; there are 3 pools by default.
bash-5.1$ ceph osd pool ls
.mgr
myfs-metadata
myfs-replicated

# Then change the replica count to 4
bash-5.1$ ceph osd pool set myfs-replicated size 4
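
To confirm that the change took effect, it helps to print the replica count of every pool from inside the toolbox. The sketch below does that; the helper name show_pool_sizes is my own, while ceph osd pool ls and ceph osd pool get <pool> size are standard Ceph commands. One caveat, under the assumption that the pool's failure domain is the host: with fewer OSD hosts than the requested size, ceph status will probably report undersized placement groups.

```shell
# Hypothetical helper, meant to run inside the rook-ceph-tools pod:
# prints "<pool>: size: <n>" for every pool in the cluster.
show_pool_sizes() {
  for pool in $(ceph osd pool ls); do
    printf '%s: ' "$pool"
    ceph osd pool get "$pool" size
  done
}

# Usage:
#   show_pool_sizes
#   ceph status    # check that the PGs return to active+clean afterwards
```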

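Test application 1.1: nginx Deployment

I add a test application of my own to see whether everything works. Below is an nginx application.
The kind: PersistentVolumeClaim part can also be split out into a separate YAML file.

kubectl apply -f nginx-with-pvc.yaml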

nginx-with-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc-nginx
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
  storageClassName: rook-cephfs
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  generation: 1
  labels:
    app: nginx
  name: nginx
  namespace: default

spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 50%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        volumeMounts:
            - name: nginx-html
              mountPath: /usr/share/nginx/html/
        ports:
        - containerPort: 80
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File  
      volumes:
      - name: nginx-html
        persistentVolumeClaim:
            claimName: cephfs-pvc-nginx
            readOnly: false                  
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

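Test application 1.2: nginx loading nginx.conf from a ConfigMap

If we enter the container with kubectl exec -it <pod-name> -- /bin/bash and run ls /etc/nginx/, we can see the nginx.conf file. But that file lives inside the image, so how do we load a customized nginx.conf?
The answer is a ConfigMap.
First, create nginx-configMap.yaml. Its content is based on the output of cat /etc/nginx/nginx.conf. The content of nginx.conf must be indented (four spaces here; YAML does not allow tab characters for indentation) so that it is valid inside the YAML block.
Second, we add the comment "# This is Testing" to the file so that we can verify whether the ConfigMap was loaded.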

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: default
data:
  nginx.conf: |+
    user  nginx;
    worker_processes  auto;

    error_log  /var/log/nginx/error.log notice;
    pid        /var/run/nginx.pid;


    events {
        worker_connections  1024;
    }


    http {
        include       /etc/nginx/mime.types;
        default_type  application/octet-stream;

        log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for"';

        access_log  /var/log/nginx/access.log  main;

        sendfile        on;
        #tcp_nopush     on;

        keepalive_timeout  65;

        #gzip  on;
        # This is Testing
        include /etc/nginx/conf.d/*.conf;
    }
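
Indenting the file by hand is error-prone, so here is a small sketch that does it with sed. The helper name indent_block is my own. It prefixes every non-empty line with four spaces and leaves blank lines untouched, which matches the layout of the ConfigMap above.

```shell
# Hypothetical helper: indents every non-empty line read from stdin by
# four spaces so the text can be pasted under a YAML block scalar ("|+").
indent_block() {
  sed 's/^./    &/'
}

# Usage, with nginx.conf copied out of the container first:
#   indent_block < nginx.conf
#
# Alternatively, kubectl can generate the whole ConfigMap from the file:
#   kubectl create configmap nginx-config --from-file=nginx.conf \
#     --dry-run=client -o yaml
```

The second form avoids manual indentation entirely, although the generated YAML may quote the content differently from the hand-written version.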

Then we run:

kubectl apply -f nginx-configMap.yaml

After applying the ConfigMap, we modify deployment.yaml: add the ConfigMap to volumeMounts under spec.template.spec.containers and to volumes under spec.template.spec. The result is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  generation: 1
  labels:
    app: nginx
  name: nginx
  namespace: default

spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 50%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        volumeMounts:
            - name: nginx-html
              mountPath: /usr/share/nginx/html/
              #mountPath: /mnt/data
            - name: nginx-config-vol
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
        ports:
        - containerPort: 80
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      volumes:
      - name: nginx-html
        persistentVolumeClaim:
            claimName: cephfs-pvc-nginx
            readOnly: false
      - name: nginx-config-vol
        configMap:
          #defaultMode: 420
          name: nginx-config    
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Afterwards, we can enter the pod and check that /etc/nginx/nginx.conf now contains the "# This is Testing" line.

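Test application 1.3: nginx loading conf.d files from a ConfigMap

nginx.conf is mostly default configuration and rarely changes. What changes is /etc/nginx/conf.d/, where each site should get its own xxx.xxx.com.conf file.
The procedure is similar, and you could also merge the ConfigMaps into one. The only difference is that this time there is no subPath.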

nginx-configMap-Server.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config-server
  namespace: default
data:
  server1.conf: |+
        server {
            listen       80;
            listen  [::]:80;
            server_name  localhost;

            #access_log  /var/log/nginx/host.access.log  main;
            # This is test - server confd
            location / {
                root   /usr/share/nginx/html;
                index  index.html index.htm;
            }

            #error_page  404              /404.html;

            # redirect server error pages to the static page /50x.html
            #
            error_page   500 502 503 504  /50x.html;
            location = /50x.html {
                root   /usr/share/nginx/html;
            }

            # proxy the PHP scripts to Apache listening on 127.0.0.1:80
            #
            #location ~ \.php$ {
            #    proxy_pass   http://127.0.0.1;
            #}

            # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
            #
            #location ~ \.php$ {
            #    root           html;
            #    fastcgi_pass   127.0.0.1:9000;
            #    fastcgi_index  index.php;
            #    fastcgi_param  SCRIPT_FILENAME  /scripts$fastcgi_script_name;
            #    include        fastcgi_params;
            #}

            # deny access to .htaccess files, if Apache's document root
            # concurs with nginx's one
            #
            #location ~ /\.ht {
            #    deny  all;
            #}
        }

nginx-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  generation: 1
  labels:
    app: nginx
  name: nginx
  namespace: default

spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 50%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        volumeMounts:
            - name: nginx-html
              mountPath: /usr/share/nginx/html/
            - name: nginx-config-vol
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
            - name: nginx-config-server-vol
              mountPath: /etc/nginx/conf.d/
        ports:
        - containerPort: 80
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      volumes:
      - name: nginx-html
        persistentVolumeClaim:
            claimName: cephfs-pvc-nginx
            readOnly: false
      - name: nginx-config-vol
        configMap:
          #defaultMode: 420
          name: nginx-config    
      - name: nginx-config-server-vol
        configMap:
          name: nginx-config-server
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Test application 1.4: exposing nginx with a Service

The simplest way to expose a port is NodePort:

apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80

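To see which applications are currently exposed as Services, run:

kubectl get svc

If you are on a cloud product, LoadBalancer mode is recommended, and it is easiest to configure it directly in the web console. Remember to set the selector (app: nginx).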

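Test application 2: alpine

We can run a small OS image to see what the mounted files look like. We reuse cephfs-pvc-nginx as the volume (alpine gives us vi/cat).
Since the PersistentVolumeClaim already exists, the following is enough.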

alpine-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alpine-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: alpine
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 50%      
  template:
    metadata:
      labels:
        app: alpine
    spec:
      containers:
      - name: alpine
        image: alpine:latest
        # This is the standard command for testing logs
        # command: ["sh", "-c", "while true; do echo hello; sleep 10; done"]
        # Here the command is split into command and args; every 15 seconds it appends the time plus hello to /mnt/alpine/datetime.log
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo \"$(date): hello\" >> /mnt/alpine/datetime.log; sleep 15; done"]
        # command: ["sh", "-c", "while true; do echo \"$(date): hello\" >> /mnt/alpine/datetime.log; sleep 15; done"]
        volumeMounts:
        - name: dataval1
          mountPath: /mnt/alpine
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
      volumes:
      - name: dataval1
        persistentVolumeClaim:
          claimName: cephfs-pvc-nginx

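If the PVC does not exist yet, create a separate YAML file for it, for example pvc-nginx.yaml. Increasing the size later is then just a matter of editing the file and running kubectl apply -f pvc-nginx.yaml.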

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc-nginx
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
  storageClassName: rook-cephfs

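We first run kubectl get pods to get the pod name, then open a shell in it.

kubectl exec -it alpine-deployment-59b86bb64-hndhq -- sh

For example, after touch 1.txt or vi 1.txt in the mounted directory,
1.txt shows up under /mnt/alpine.

Suppose our earlier nginx test application looked like this: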

apiVersion: v1
kind: Pod
metadata:
  name: nginx-bash-pod
spec:
  restartPolicy: OnFailure
  containers:
#  - resources:
#      limits:
#        cpu: 6
#      requests:
#        cpu: 6
  - command: ["/bin/bash"]
    args: ["-c", "sleep 365d; exit 0"]
    image: nginx
    imagePullPolicy: IfNotPresent
    name: nginx-bash-pod
    volumeMounts:
    - name: dataval0
      mountPath: /mnt/data
  volumes:
  - name: dataval0
    persistentVolumeClaim:
      claimName: cephfs-pvc-nginx

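Then we can also get in as follows, but with this image we can only use touch, for example touch 2.txt

kubectl exec -it nginx-bash-pod -- bash

In fact, a pod of the ordinary nginx Deployment can also be entered with:

kubectl exec -it nginx-deployment-pod01 -- /bin/bash

Because CephFS is shared storage, it does not matter whether we mount it at /mnt/data or /mnt/alpine: both paths show the same CephFS pool, so you will see both 1.txt and 2.txt.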

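Verification

Besides reading and writing from different applications with vi and touch, we can also check that the volume is mounted with ls -l and df -h.
Building on the earlier nginx test application, we can change the server1.conf entry in the ConfigMap, replacing the default /usr/share/nginx/html with /usr/share/nginx/html/cfswww.
At the same time, we use the alpine pod to create the cfswww folder and move index.html and 50x.html on CephFS into it.
We can make some changes to index.html, such as adding an h1 heading like Test 1.
If the change does not take effect, we can restart the application.

kubectl rollout restart deployment <deployment-name> -n <namespace>

After that, we can stop all the machines in availability zone A and check whether the service is unaffected.
We found that with 2 nodes in AZ A and 1 node in AZ B, leaving only the single node in B running causes pods to hang in ContainerCreating. (A likely explanation is that with two of the three nodes down, the Ceph monitors lose their majority quorum.)
Using describe, we see that the error is caused by the PVC failing to mount: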

kubectl describe pod alpine-deployment-6fc7446598-982mr

... ...
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    11m                  default-scheduler  Successfully assigned default/alpine-deployment-6fc7446598-982mr to us-east-1.10.0.3.242
  Warning  FailedMount  78s (x4 over 9m26s)  kubelet            MountVolume.SetUp failed for volume "pvc-ed5a9231-9d49-4091-9c7b-a3b3924f5f67" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

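Once we start one of the nodes in AZ A again, the PVC mounts shortly afterwards and the service is available again.
A friendly tip: if you forcibly take a node down and then delete it, some workloads will still try to schedule onto the deleted node. Exporting the pod as YAML (kubectl get pod -o yaml), or listing the owners with

kubectl get rs -n <namespace>
kubectl get deploy -n <namespace>

and then describing them, converting to YAML and re-applying, has no effect.
Rook adds a replacement MDS automatically, so simply delete the Deployment that is stuck on the removed node.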

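Cross-AZ high availability

We can achieve cross-AZ high availability by configuring Topology Spread Constraints. You can of course also use Pod Anti-Affinity.

Alibaba Cloud ACK documentation: topology spread constraints for node pool high availability
Kubernetes documentation: explanation of maxSkew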

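# preceding fields omitted; this goes under spec.template.spec
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: DoNotSchedule # use DoNotSchedule; ScheduleAnyway is only a soft preference and may still place the pod in a skewed zone
          labelSelector: # without a selector the constraint does not count any pods
            matchLabels:
              app: nginx # example value; must match the labels in the pod template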
The node a pod runs on can be seen with describe pod, but doing that one pod at a time is tedious. The following is easier:

kubectl get pods -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName'
# Same for the rook-ceph namespace
kubectl get pods -n rook-ceph -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName'
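
With many pods, the NAME/NODE listing is still long. A short sketch that condenses it into a count per node; the helper name pods_per_node is my own, and it expects exactly the two-column output produced by the custom-columns commands above.

```shell
# Hypothetical helper: reads the NAME/NODE table (with header) on stdin
# and prints "<node> <number-of-pods>" for each node, sorted by node name.
pods_per_node() {
  awk 'NR > 1 { count[$2]++ } END { for (n in count) print n, count[n] }' | sort
}

# Usage:
#   kubectl get pods -n rook-ceph \
#     -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName' | pods_per_node
```

To see which zone each node belongs to, kubectl get nodes -L topology.kubernetes.io/zone adds the zone label as an extra column.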

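For pods that are already running, there is no need to go through kubectl edit deploy -n rook-ceph or to export the YAML and re-apply it; a patch is enough. For example, csi-cephfsplugin-provisioner has only two replicas by default: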

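# The labelSelector assumes the pods carry app=csi-cephfsplugin-provisioner;
# confirm with: kubectl get pods -n rook-ceph --show-labels
kubectl patch deployment csi-cephfsplugin-provisioner -n rook-ceph --patch '{
  "spec": {
    "template": {
      "spec": {
        "topologySpreadConstraints": [
          {
            "maxSkew": 1,
            "topologyKey": "topology.kubernetes.io/zone",
            "whenUnsatisfiable": "DoNotSchedule",
            "labelSelector": {
              "matchLabels": { "app": "csi-cephfsplugin-provisioner" }
            }
          }
        ]
      }
    }
  }
}'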

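In practice, Pod Anti-Affinity seemed more effective to me.

How to implement Pod Anti-Affinity:
Alibaba Cloud ACK documentation: configuring Pod anti-affinity
Kubernetes Multi AZ deployments using pod anti-affinity

If you follow the English article on cross-AZ deployment, note that its failure-domain.beta.kubernetes.io/zone label has been deprecated since 1.17 in favor of topology.kubernetes.io/zone.

So combine the ACK documentation with the Soft Pod Anti-Affinity part of the English article.
That is, change requiredDuringSchedulingIgnoredDuringExecution to preferredDuringSchedulingIgnoredDuringExecution: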

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-run-per-node # name of the app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-run-per-node # name of the app
  template:
    metadata:
      labels:
        app: app-run-per-node # name of the app
    spec:
      containers:
        - name: app-container
          image: app-image
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - app-run-per-node # name of the app
              topologyKey: "topology.kubernetes.io/zone"
            weight: 100

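However, each MDS pod has its own Deployment, so you cannot spread the MDS pods across AZs this way. In that case you can use nodeAffinity to schedule an MDS only to a particular AZ. (It is worth reading through each Ceph YAML file before deploying; high availability appears to be configurable there.)
K8S Docs - Assign Pod to Node

root@k8s-ceph:~# kubectl edit deploy rook-ceph-mds-myfs-a -n rook-ceph

In this Deployment, under spec.template.spec, there is an affinity section. It contains podAntiAffinity, which forbids scheduling onto the same node. We just add nodeAffinity next to it.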

   spec:
     affinity:
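############# begin addition #############
       nodeAffinity:
         # required (not preferred): nodeSelectorTerms is only valid here,
         # and the goal is to schedule only into this zone
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: topology.kubernetes.io/zone
               operator: In
               values:
               - us-east-1a
############# end addition #############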
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchExpressions:
             - key: app
               operator: In
               values:
               - rook-ceph-mds
           topologyKey: kubernetes.io/hostname