The Deadly 5-Second DNS Delay in k8s

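The Timeout Problem
A customer reported that when accessing a service from inside a pod, some requests consistently took up to 5 seconds to respond, whereas a normal response takes only milliseconds.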
 
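The 5-Second DNS Delay
Capturing packets inside the pod (with nsenter -n tcpdump) showed that some DNS requests received no response; after a 5-second timeout, the request was sent again and then answered successfully.

Capturing packets in the kube-dns pod showed that those DNS requests never reached the kube-dns pod: they were dropped along the way.

Why 5 seconds? According to man resolv.conf, the default timeout of the glibc resolver is 5 seconds.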
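
To confirm the symptom from inside a pod without tcpdump, one option is to time repeated lookups and look for outliers near the 5-second resolver timeout. The following is a minimal sketch under assumptions: the function names and the 4.5-second threshold are illustrative and not from the original investigation.

```python
import socket
import time


def time_lookups(host, attempts=20, resolve=socket.getaddrinfo, clock=time.monotonic):
    """Resolve `host` repeatedly and return each attempt's duration in seconds."""
    durations = []
    for _ in range(attempts):
        start = clock()
        try:
            # With no address family given, getaddrinfo asks for both
            # IPv4 and IPv6 addresses, which is the lookup pattern at issue.
            resolve(host, None)
        except socket.gaierror:
            # A failed lookup still took time, so keep the measurement.
            pass
        durations.append(clock() - start)
    return durations


def slow_lookups(durations, threshold=4.5):
    """Return the measurements at or above `threshold` seconds (assumed cutoff)."""
    return [d for d in durations if d >= threshold]
```

Calling time_lookups with the name of an affected service and passing the result to slow_lookups gives a rough count of how many lookups hit the timeout. The resolve and clock parameters exist only so the functions can be exercised without a network.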
 
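Cause of the Packet Loss
A search shows that this is a widespread problem.
The root cause is a bug in the kernel's conntrack module.

Martynas Pumputis, an engineer at Weaveworks, wrote a detailed analysis of the problem:
https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts

The relevant conclusions:

- The drop only occurs, with some probability, when multiple threads or processes concurrently send UDP packets with the same 5-tuple from the same socket.
- Both glibc and musl (the libc used by Alpine Linux) use "parallel query", i.e. they send several queries concurrently, so they easily hit this clash and a query gets dropped.
- Because ipvs also uses conntrack, running kube-proxy in ipvs mode does not avoid the problem.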
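
To make "same 5-tuple from the same socket" concrete, here is a minimal sketch of the traffic shape of a parallel query: an A and an AAAA query sent back to back from one UDP socket. It is an illustration under assumptions, not glibc's implementation; the function names and query IDs are made up, and because it sends from a single thread it is not a reliable reproducer of the race.

```python
import socket
import struct

QTYPE_A = 1
QTYPE_AAAA = 28


def build_query(name, qtype, query_id):
    """Encode a minimal DNS query for `name` in RFC 1035 wire format."""
    # Header: ID, flags (only RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Question: encoded name, terminating zero, QTYPE, QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def send_parallel(server, name, timeout=5.0):
    """Send A and AAAA queries from ONE UDP socket and count the replies.

    Both datagrams share source IP, source port, destination IP,
    destination port and protocol, i.e. the same 5-tuple.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_query(name, QTYPE_A, 0x1001), (server, 53))
        sock.sendto(build_query(name, QTYPE_AAAA, 0x1002), (server, 53))
        replies = 0
        for _ in range(2):
            try:
                sock.recvfrom(4096)
                replies += 1
            except socket.timeout:
                break
        return replies
    finally:
        sock.close()
```

If send_parallel returns 1 for the cluster DNS address, one of the two queries went unanswered, which is the situation in which the resolver waits out its timeout.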
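
The Fundamental Fix
Martynas submitted two patches to the kernel to fix the problem, but he notes that the problem is not fully resolved when the cluster has multiple DNS servers.

One of the patches was merged into the Linux kernel mainline on 2018-7-18: netfilter: nf_conntrack: resolve clash for matching conntracks

At the time of writing, only the 4.19-rc releases contain this patch.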
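
A rough way to tell whether a node's kernel is new enough to carry the merged patch is to compare its release string against 4.19. This is only a sketch with hypothetical function names: it assumes a mainline-style version number, and distribution kernels may backport fixes, so a False result does not prove the patch is absent.

```python
import re


def kernel_version(release):
    """Extract (major, minor) from a release string such as '4.19.0-rc3'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def may_include_clash_patch(release):
    """True if the version number is 4.19 or later (mainline assumption)."""
    return kernel_version(release) >= (4, 19)
```

On a node, the release string can be taken from platform.release() or uname -r.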
 
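Workarounds
Workaround 1: Send DNS requests over TCP
Since TCP does not have this problem, some have suggested adding options use-vc to the container's resolv.conf to force glibc to send DNS queries over TCP. man resolv.conf describes the option as follows: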
 
use-vc (since glibc 2.14)
                     Sets RES_USEVC in _res.options.  This option forces the
                     use of TCP for DNS resolutions.
I tested this with the image "busybox:1.29.3-glibc" (libc 2.24) and did not see that effect: the container still sent its DNS requests over UDP.
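
Independently of whether use-vc takes effect in a given image, it can be useful to check that the DNS server answers over TCP at all. The sketch below sends one query over TCP using the 2-byte length prefix that RFC 1035 requires for TCP transport. The function names and the hard-coded example.com query are assumptions for illustration.

```python
import socket
import struct

# Wire-format query for "example.com", type A, class IN (ID 0x1234, RD set).
EXAMPLE_QUERY = (
    b"\x12\x34\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
    b"\x07example\x03com\x00\x00\x01\x00\x01"
)


def frame_for_tcp(message):
    """Prefix a DNS message with its 2-byte big-endian length."""
    return struct.pack("!H", len(message)) + message


def recv_exact(sock, count):
    """Read exactly `count` bytes, raising if the peer closes early."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("connection closed before full reply")
        data += chunk
    return data


def query_over_tcp(server, message=EXAMPLE_QUERY, timeout=5.0):
    """Send one DNS query to TCP port 53 and return the raw reply bytes."""
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        sock.sendall(frame_for_tcp(message))
        (length,) = struct.unpack("!H", recv_exact(sock, 2))
        return recv_exact(sock, length)
```

A reply whose first two bytes match the query ID (0x1234 here) indicates that the server handled the TCP query.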
 
Workaround 2: Avoid concurrent DNS requests with the same 5-tuple
resolv.conf has two other relevant options:

- single-request-reopen (since glibc 2.9)
- single-request (since glibc 2.10)

man resolv.conf explains them as follows:
 
single-request-reopen (since glibc 2.9)
                     Sets RES_SNGLKUPREOP in _res.options.  The resolver
                     uses the same socket for the A and AAAA requests.  Some
                     hardware mistakenly sends back only one reply.  When
                     that happens the client system will sit and wait for
                     the second reply.  Turning this option on changes this
                     behavior so that if two requests from the same port are
                     not handled correctly it will close the socket and open
                     a new one before sending the second request.
                      
single-request (since glibc 2.10)
                     Sets RES_SNGLKUP in _res.options.  By default, glibc
                     performs IPv4 and IPv6 lookups in parallel since
                     version 2.9.  Some appliance DNS servers cannot handle
                     these queries properly and make the requests time out.
                     This option disables the behavior and makes glibc
                     perform the IPv6 and IPv4 requests sequentially (at the
                     cost of some slowdown of the resolving process).
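My tests showed the following effects:

- single-request-reopen: the A and AAAA requests are sent from different source ports. The two requests then do not occupy the same entry in the conntrack table, which avoids the clash.
- single-request: concurrency is avoided, and the A and AAAA requests are sent serially. With no concurrency, there is no clash either.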
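
Whichever option is chosen, it helps to verify that it actually ended up in the container's /etc/resolv.conf. The following is a small sketch of a parser for the options lines; the function name is hypothetical, and it assumes that options from several options lines accumulate.

```python
def resolv_options(text):
    """Collect the options set in resolv.conf content.

    Returns a dict mapping each option name to its value, or to True
    for flag-style options such as single-request-reopen.
    """
    options = {}
    for line in text.splitlines():
        # Lines starting with '#' or ';' are comments.
        if line.startswith(("#", ";")):
            continue
        fields = line.split()
        if not fields or fields[0] != "options":
            continue
        for item in fields[1:]:
            name, sep, value = item.partition(":")
            options[name] = value if sep else True
    return options
```

For example, reading /etc/resolv.conf inside the container and checking that "single-request-reopen" is a key of the returned dict confirms the option is present in the file.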
There are several ways to add options to a container's resolv.conf:

1) In the container's "ENTRYPOINT" or "CMD" script, run /bin/echo 'options single-request-reopen' >> /etc/resolv.conf
2) In the pod's postStart hook:
lifecycle:
  postStart:
    exec:
      command:
      - /bin/sh
      - -c 
      - "/bin/echo 'options single-request-reopen' >> /etc/resolv.conf"
3) Use template.spec.dnsConfig (supported in k8s v1.9 and later):
template:
  spec:
    dnsConfig:
      options:
        - name: single-request-reopen
4) Use a ConfigMap to override /etc/resolv.conf inside the pod
ConfigMap:
 
apiVersion: v1
data:
  resolv.conf: |
    nameserver 1.2.3.4
    search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
    options ndots:5 single-request-reopen timeout:1
kind: ConfigMap
metadata:
  name: resolvconf
POD spec:
 
        volumeMounts:
        - name: resolv-conf
          mountPath: /etc/resolv.conf
          subPath: resolv.conf
...
 
      volumes:
      - name: resolv-conf
        configMap:
          name: resolvconf
          items:
          - key: resolv.conf
            path: resolv.conf
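5) Use a MutatingAdmissionWebhook
MutatingAdmissionWebhook is an admission controller introduced in 1.9 that modifies a specified resource before an operation on it is carried out.
Istio's automatic sidecar injection is implemented with this feature. We can likewise use a MutatingAdmissionWebhook to automatically inject the content required by 3) or 4) into every pod.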
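
As a rough illustration of method 5) injecting the dnsConfig from method 3), the sketch below builds the JSON Patch a webhook could return for an incoming pod. It is a sketch under assumptions: the function names are hypothetical, the response follows the admission.k8s.io/v1 AdmissionReview format of newer clusters (not the API available in 1.9), and the HTTPS server and webhook registration are left out.

```python
import base64
import json

OPTION = "single-request-reopen"


def dns_option_patch(pod, option=OPTION):
    """Return JSON Patch operations that add `option` to pod.spec.dnsConfig."""
    dns_config = pod.get("spec", {}).get("dnsConfig")
    entry = {"name": option}
    if dns_config is None:
        # No dnsConfig yet: create it with a single option.
        return [{"op": "add", "path": "/spec/dnsConfig", "value": {"options": [entry]}}]
    options = dns_config.get("options")
    if options is None:
        return [{"op": "add", "path": "/spec/dnsConfig/options", "value": [entry]}]
    if any(item.get("name") == option for item in options):
        return []  # already present, nothing to change
    # "-" appends to the end of the existing options array.
    return [{"op": "add", "path": "/spec/dnsConfig/options/-", "value": entry}]


def admission_response(review, option=OPTION):
    """Build the AdmissionReview reply for an incoming pod AdmissionReview."""
    request = review["request"]
    response = {"uid": request["uid"], "allowed": True}
    patch = dns_option_patch(request["object"], option)
    if patch:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

The patch is skipped when the option is already set, so pods that configure dnsConfig themselves are left unchanged.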
 
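Of the methods above, 1) requires changing the image and 2) relies on the image providing /bin/sh and /bin/echo, while 3) and 4) only require changing the pod spec and work with any image. Even so, 3) and 4) are inconvenient in some ways:

- Every workload's YAML has to be modified, which is tedious.
- For workloads created through helm, the helm charts have to be modified.

Method 5) is the least effort for cluster users, who simply submit workloads as usual, but it requires some development work up front.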
 
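Workaround 3: Use a local DNS cache
Containers send all their DNS requests to a local DNS cache service (dnsmasq, nscd, etc.), so the requests do not go through DNAT and no conntrack clash occurs. A further benefit is that the DNS service does not become a performance bottleneck.

There are two ways to use a local DNS cache:

- Each container runs its own DNS cache service.
- Each node runs one DNS cache service, and all containers use the DNS cache on their own node as their nameserver.

From a resource-efficiency standpoint, the latter is recommended.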
 
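Implementation

A pod can reach the DNS cache service on the node via the node's IP. If the containers on the node are all attached to a virtual bridge, the IP of the bridge's layer-3 interface can be used instead (in TKE this interface is called cbr0). Make sure the DNS cache service listens on that address.

How do we set the nameserver in the pod's /etc/resolv.conf to the node IP?

One way is to set POD.spec.dnsPolicy to "Default", which means the pod's /etc/resolv.conf is taken from a file on the node. By default that is the node's /etc/resolv.conf (if kubelet is started with --resolv-conf pointing at another file, that file is used instead).

The other way is to give each node's kubelet a different --cluster-dns value, set to the node's IP, while POD.spec.dnsPolicy keeps its default value "ClusterFirst". The kops project even has an issue discussing how to point --cluster-dns at the node IP when deploying a cluster: https://github.com/kubernetes/kops/issues/5584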
posted @ 2024-07-09 11:34 david_cloud