cluster-proportional-autoscaler: Source Code Analysis and Solving the KubeDNS Performance Bottleneck


Note: This article is reposted from https://my.oschina.net/jxcdwangtao/blog/1581879 in order to share more information, and is intended for learning and discussion only. If this constitutes infringement, please contact me and I will remove it promptly.

Author: xidianwangtao@gmail.com

How It Works

cluster-proportional-autoscaler is one of the Kubernetes incubator projects. It dynamically scales a target in a given namespace according to cluster size; only RC, RS, and Deployment targets are supported, not StatefulSet. At the moment it offers just two autoscale modes, linear and ladder, and the code interfaces are clean enough that new modes are easy to develop.

cluster-proportional-autoscaler works in a very simple way: at a fixed interval (configured via --poll-period-seconds, default 10s) it repeats the following steps (a minimal loop sketch follows the list):

  • Count the cluster's SchedulableNodes and SchedulableCores;
  • Fetch the latest ConfigMap data from the apiserver;
  • Parse the ConfigMap parameters according to the configured autoscale mode;
  • Compute the new expected replica count according to that mode;
  • If it differs from the previous expected replica count, call the Scale interface to trigger the autoscale.
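For orientation, here is a minimal, self-contained Go sketch of how such a fixed-interval polling loop is typically driven with a time.Ticker. The type and method names are hypothetical stand-ins, not the project's actual Run implementation; the real per-step logic lives in pollAPIServer, analyzed below.

package main

import "time"

// autoScaler is a hypothetical stand-in for the project's AutoScaler type,
// used only to illustrate the polling loop.
type autoScaler struct {
	pollPeriod time.Duration // corresponds to --poll-period-seconds
}

func (s *autoScaler) pollAPIServer() {
	// 1. count schedulable nodes and cores
	// 2. fetch the ConfigMap from the apiserver
	// 3. ensure the right controller and parse its params
	// 4. compute the expected replica count
	// 5. call the Scale API if the expectation changed
}

// run invokes pollAPIServer at a fixed interval until stopCh is closed.
func (s *autoScaler) run(stopCh <-chan struct{}) {
	ticker := time.NewTicker(s.pollPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			s.pollAPIServer()
		case <-stopCh:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go (&autoScaler{pollPeriod: 10 * time.Second}).run(stop)
	time.Sleep(30 * time.Second) // let a few polls run, then exit
	close(stop)
}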

Configuration

cluster-proportional-autoscaler has the following six flags:

  • --namespace: the namespace of the object to autoscale;
  • --target: the object to autoscale; only deployment/replicationcontroller/replicaset are supported (case-insensitive);
  • --configmap: the name of a pre-created ConfigMap that stores the mode to use and its parameters (a concrete example follows below);
  • --default-params: if the ConfigMap named by --configmap does not exist or is later deleted, a new ConfigMap is created from this value; configuring it is strongly recommended;
  • --poll-period-seconds: the check interval, default 10s;
  • --version: print the version and exit.

Source Code Analysis

pollAPIServer

pkg/autoscaler/autoscaler_server.go:82

func (s *AutoScaler) pollAPIServer() {
	// Query the apiserver for the cluster status --- number of nodes and cores
	clusterStatus, err := s.k8sClient.GetClusterStatus()
	if err != nil {
		glog.Errorf("Error while getting cluster status: %v", err)
		return
	}
	glog.V(4).Infof("Total nodes %5d, schedulable nodes: %5d", clusterStatus.TotalNodes, clusterStatus.SchedulableNodes)
	glog.V(4).Infof("Total cores %5d, schedulable cores: %5d", clusterStatus.TotalCores, clusterStatus.SchedulableCores)

	// Sync autoscaler ConfigMap with apiserver
	configMap, err := s.syncConfigWithServer()
	if err != nil || configMap == nil {
		glog.Errorf("Error syncing configMap with apiserver: %v", err)
		return
	}

	// Only sync updated ConfigMap or before controller is set.
	if s.controller == nil || configMap.ObjectMeta.ResourceVersion != s.controller.GetParamsVersion() {
		// Ensure corresponding controller type and scaling params.
		s.controller, err = plugin.EnsureController(s.controller, configMap)
		if err != nil || s.controller == nil {
			glog.Errorf("Error ensuring controller: %v", err)
			return
		}
	}

	// Query the controller for the expected replicas number
	expReplicas, err := s.controller.GetExpectedReplicas(clusterStatus)
	if err != nil {
		glog.Errorf("Error calculating expected replicas number: %v", err)
		return
	}
	glog.V(4).Infof("Expected replica count: %3d", expReplicas)

	// Update resource target with expected replicas.
	_, err = s.k8sClient.UpdateReplicas(expReplicas)
	if err != nil {
		glog.Errorf("Update failure: %s", err)
	}
}

GetClusterStatus

GetClusterStatus counts the cluster's SchedulableNodes and SchedulableCores, which are used later to compute the new expected replica count.

pkg/autoscaler/k8sclient/k8sclient.go:142

func (k *k8sClient) GetClusterStatus() (clusterStatus *ClusterStatus, err error) {
	opt := metav1.ListOptions{Watch: false}

	nodes, err := k.clientset.CoreV1().Nodes().List(opt)
	if err != nil || nodes == nil {
		return nil, err
	}
	clusterStatus = &ClusterStatus{}
	clusterStatus.TotalNodes = int32(len(nodes.Items))
	var tc resource.Quantity
	var sc resource.Quantity
	for _, node := range nodes.Items {
		tc.Add(node.Status.Capacity[apiv1.ResourceCPU])
		if !node.Spec.Unschedulable {
			clusterStatus.SchedulableNodes++
			sc.Add(node.Status.Capacity[apiv1.ResourceCPU])
		}
	}

	tcInt64, tcOk := tc.AsInt64()
	scInt64, scOk := sc.AsInt64()
	if !tcOk || !scOk {
		return nil, fmt.Errorf("unable to compute integer values of schedulable cores in the cluster")
	}
	clusterStatus.TotalCores = int32(tcInt64)
	clusterStatus.SchedulableCores = int32(scInt64)
	k.clusterStatus = clusterStatus
	return clusterStatus, nil
}
  • When counting nodes, unschedulable nodes are excluded.
  • When counting cores, the Capacity of those unschedulable nodes is subtracted as well.
    • Note that the core count uses each node's Capacity, not its Allocatable.
    • In my opinion, Allocatable would be the better choice here.
    • The difference becomes visible at scale: if every node's Allocatable is 1 CPU and 4 GB less than its Capacity, then a 2,000-node cluster differs by 2,000 CPUs and 8,000 GB in total, which can lead to a very different replica count for the target object.

Some readers may ask: what is the difference between a node's Allocatable and its Capacity?

  • Capacity is the full amount of resources the node's hardware provides: how much memory, how many CPU cores, and so on, as determined by the machine itself.
  • Allocatable is Capacity minus the kube-reserved and system-reserved resources configured via kubelet flags; it is the amount of resources Kubernetes can actually allocate to applications. (A sketch of summing Allocatable instead of Capacity follows below.)
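To illustrate the suggestion above, here is a minimal, hedged sketch of what summing Allocatable instead of Capacity could look like. The function name sumAllocatableCPU is hypothetical and this is not an upstream patch; it reuses the same apiv1 (k8s.io/api/core/v1) and resource (k8s.io/apimachinery/pkg/api/resource) import aliases as the excerpts in this article, plus fmt.

// sumAllocatableCPU is a hypothetical helper: it computes schedulable node and
// CPU totals from each node's Allocatable rather than its Capacity.
func sumAllocatableCPU(nodes *apiv1.NodeList) (schedulableNodes, schedulableCores int32, err error) {
	var sc resource.Quantity
	for _, node := range nodes.Items {
		if node.Spec.Unschedulable {
			continue
		}
		schedulableNodes++
		// Allocatable already has the kubelet's reserved resources subtracted.
		sc.Add(node.Status.Allocatable[apiv1.ResourceCPU])
	}
	scInt64, ok := sc.AsInt64()
	if !ok {
		return 0, 0, fmt.Errorf("unable to compute integer value of allocatable cores")
	}
	return schedulableNodes, int32(scInt64), nil
}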

syncConfigWithServer

syncConfigWithServer fetches the latest ConfigMap data from the apiserver. Note that the ConfigMap is not watched; it is fetched periodically every --poll-period-seconds (default 10s), so by default there can be a delay of up to 10s.

pkg/autoscaler/autoscaler_server.go:124

func (s *AutoScaler) syncConfigWithServer() (*apiv1.ConfigMap, error) {
	// Fetch autoscaler ConfigMap data from apiserver
	configMap, err := s.k8sClient.FetchConfigMap(s.k8sClient.GetNamespace(), s.configMapName)
	if err == nil {
		return configMap, nil
	}
	if s.defaultParams == nil {
		return nil, err
	}
	glog.V(0).Infof("ConfigMap not found: %v, will create one with default params", err)
	configMap, err = s.k8sClient.CreateConfigMap(s.k8sClient.GetNamespace(), s.configMapName, s.defaultParams)
	if err != nil {
		return nil, err
	}
	return configMap, nil
}
  • If the ConfigMap named by --configmap already exists in the cluster, the latest version is fetched from the apiserver and returned;
  • If it does not exist, a new ConfigMap is created from the contents of --default-params and returned;
  • If it does not exist and --default-params is not configured either, nil is returned, which means failure, and the whole iteration ends there; keep this in mind when using the tool!

It is strongly recommended to always configure --default-params. The ConfigMap named by --configmap may be deleted, intentionally or not, by an administrator or user; if --default-params is not set, pollAPIServer simply stops at that point, the target is no longer autoscaled, and worse, you may not even notice that the cluster has ended up in this state.

EnsureController

EnsureController creates the Controller corresponding to the controller type configured in the ConfigMap and parses its parameters.

pkg/autoscaler/controller/plugin/plugin.go:32

// EnsureController ensures controller type and scaling params
func EnsureController(cont controller.Controller, configMap *apiv1.ConfigMap) (controller.Controller, error) {
	// Expect only one entry, which uses the name of control mode as the key
	if len(configMap.Data) != 1 {
		return nil, fmt.Errorf("invalid configMap format, expected only one entry, got: %v", configMap.Data)
	}
	for mode := range configMap.Data {
		// No need to reset controller if control pattern doesn't change
		if cont != nil && mode == cont.GetControllerType() {
			break
		}
		switch mode {
		case laddercontroller.ControllerType:
			cont = laddercontroller.NewLadderController()
		case linearcontroller.ControllerType:
			cont = linearcontroller.NewLinearController()
		default:
			return nil, fmt.Errorf("not a supported control mode: %v", mode)
		}
		glog.V(1).Infof("Set control mode to %v", mode)
	}

	// Sync config with controller
	if err := cont.SyncConfig(configMap); err != nil {
		return nil, fmt.Errorf("Error syncing configMap with controller: %v", err)
	}
	return cont, nil
}
  • Check that the ConfigMap data contains exactly one entry; if not, the ConfigMap is invalid and the iteration ends;
  • Check that the controller type is either linear or ladder and call the corresponding constructor; otherwise return an error;
    • linear --> NewLinearController
    • ladder --> NewLadderController
  • Call the controller's SyncConfig to parse the parameters in the ConfigMap data and store them, along with the ConfigMap's ResourceVersion, in the Controller object.

GetExpectedReplicas

The linear and ladder controllers each implement their own GetExpectedReplicas method, which computes the expected replica count for the cluster status observed in the current iteration. See the Linear Controller and Ladder Controller sections below for the details.

UpdateReplicas

UpdateReplicas takes the expected replica count computed by GetExpectedReplicas and passes it to the Scale interface of the corresponding target (rc/rs/deploy), which then scales the target up or down.

pkg/autoscaler/k8sclient/k8sclient.go:172

func (k *k8sClient) UpdateReplicas(expReplicas int32) (prevRelicas int32, err error) {
	scale, err := k.clientset.Extensions().Scales(k.target.namespace).Get(k.target.kind, k.target.name)
	if err != nil {
		return 0, err
	}
	prevRelicas = scale.Spec.Replicas
	if expReplicas != prevRelicas {
		glog.V(0).Infof("Cluster status: SchedulableNodes[%v], SchedulableCores[%v]", k.clusterStatus.SchedulableNodes, k.clusterStatus.SchedulableCores)
		glog.V(0).Infof("Replicas are not as expected : updating replicas from %d to %d", prevRelicas, expReplicas)
		scale.Spec.Replicas = expReplicas
		_, err = k.clientset.Extensions().Scales(k.target.namespace).Update(k.target.kind, scale)
		if err != nil {
			return 0, err
		}
	}
	return prevRelicas, nil
}

Below is a code-level look at the concrete implementations of the Linear Controller and the Ladder Controller.

Linear Controller

Let's first look at the linear controller's parameters:

pkg/autoscaler/controller/linearcontroller/linear_controller.go:50

type linearParams struct {
	CoresPerReplica           float64 `json:"coresPerReplica"`
	NodesPerReplica           float64 `json:"nodesPerReplica"`
	Min                       int     `json:"min"`
	Max                       int     `json:"max"`
	PreventSinglePointFailure bool    `json:"preventSinglePointFailure"`
}

A ConfigMap for this mode can be written like this:

kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-autoscaler
  namespace: default
data:
  linear: |-
    {
      "coresPerReplica": 2,
      "nodesPerReplica": 1,
      "preventSinglePointFailure": true,
      "min": 1,
      "max": 100
    }

The other parameters need little explanation. The one worth pointing out is PreventSinglePointFailure, which literally means preventing a single point of failure. It is a bool and is not explicitly initialized in the code, so it defaults to false. You can set "preventSinglePointFailure": true in the ConfigMap data or in --default-params; once it is true and schedulableNodes > 1, the target's replicas are kept at no fewer than 2, which protects the target from being a single point of failure.

pkg/autoscaler/controller/linearcontroller/linear_controller.go:101

func (c *LinearController) GetExpectedReplicas(status *k8sclient.ClusterStatus) (int32, error) {
	// Get the expected replicas for the currently schedulable nodes and cores
	expReplicas := int32(c.getExpectedReplicasFromParams(int(status.SchedulableNodes), int(status.SchedulableCores)))

	return expReplicas, nil
}

func (c *LinearController) getExpectedReplicasFromParams(schedulableNodes, schedulableCores int) int {
	replicasFromCore := c.getExpectedReplicasFromParam(schedulableCores, c.params.CoresPerReplica)
	replicasFromNode := c.getExpectedReplicasFromParam(schedulableNodes, c.params.NodesPerReplica)
	// Prevent single point of failure by having at least 2 replicas when
	// there are more than one node.
	if c.params.PreventSinglePointFailure &&
		schedulableNodes > 1 &&
		replicasFromNode < 2 {
		replicasFromNode = 2
	}

	// Returns the results which yields the most replicas
	if replicasFromCore > replicasFromNode {
		return replicasFromCore
	}
	return replicasFromNode
}

func (c *LinearController) getExpectedReplicasFromParam(schedulableResources int, resourcesPerReplica float64) int {
	if resourcesPerReplica == 0 {
		return 1
	}
	res := math.Ceil(float64(schedulableResources) / resourcesPerReplica)
	if c.params.Max != 0 {
		res = math.Min(float64(c.params.Max), res)
	}
	return int(math.Max(float64(c.params.Min), res))
}
  • From schedulableCores and the ConfigMap's coresPerReplica, compute replicasFromCore as:
    • replicasFromCore = ceil( schedulableCores * 1/coresPerReplica )
  • From schedulableNodes and the ConfigMap's nodesPerReplica, compute replicasFromNode as:
    • replicasFromNode = ceil( schedulableNodes * 1/nodesPerReplica )
  • If min or max is configured in the ConfigMap, the result is clamped to the [min, max] range;
    • replicas = min(replicas, max)
    • replicas = max(replicas, min)
  • If preventSinglePointFailure is true and schedulableNodes > 1, the single-point-of-failure protection described above applies and replicasFromNode is raised to at least 2;
    • replicasFromNode = max(2, replicasFromNode)
  • The larger of replicasFromNode and replicasFromCore is returned as the expected replica count.

To summarize, the linear controller computes replicas with:

replicas = max( ceil( cores * 1/coresPerReplica ), ceil( nodes * 1/nodesPerReplica ) )
replicas = min(replicas, max)
replicas = max(replicas, min)
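As a quick sanity check of the formula, here is a small, self-contained Go sketch (not part of the project) that applies it to the example ConfigMap above (coresPerReplica=2, nodesPerReplica=1, min=1, max=100) for a hypothetical cluster with 10 schedulable nodes and 40 schedulable cores. preventSinglePointFailure and the zero-parameter edge case are ignored for brevity.

package main

import (
	"fmt"
	"math"
)

// linearReplicas applies the linear formula summarized above.
func linearReplicas(nodes, cores int, coresPerReplica, nodesPerReplica float64, min, max int) int {
	fromCores := math.Ceil(float64(cores) / coresPerReplica)
	fromNodes := math.Ceil(float64(nodes) / nodesPerReplica)
	replicas := math.Max(fromCores, fromNodes)
	replicas = math.Min(replicas, float64(max))
	replicas = math.Max(replicas, float64(min))
	return int(replicas)
}

func main() {
	// ceil(40/2)=20 from cores, ceil(10/1)=10 from nodes;
	// max(20, 10)=20, clamped to [1, 100] -> 20 replicas.
	fmt.Println(linearReplicas(10, 40, 2, 1, 1, 100)) // prints 20
}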

Ladder Controller

Here is the ladder controller's parameter structure:

pkg/autoscaler/controller/laddercontroller/ladder_controller.go:66

type paramEntry [2]int
type paramEntries []paramEntry
type ladderParams struct {
	CoresToReplicas paramEntries `json:"coresToReplicas"`
	NodesToReplicas paramEntries `json:"nodesToReplicas"`
}

A ConfigMap for this mode can be written like this:

kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-autoscaler
  namespace: default
data:
  ladder: |-
    {
      "coresToReplicas":
      [
        [ 1, 1 ],
        [ 3, 3 ],
        [ 256, 4 ],
        [ 512, 5 ],
        [ 1024, 7 ]
      ],
      "nodesToReplicas":
      [
        [ 1, 1 ],
        [ 2, 2 ],
        [ 100, 5 ],
        [ 200, 12 ]
      ]
    }

Below is the ladder controller's method for computing the expected replica count.

func (c *LadderController) GetExpectedReplicas(status *k8sclient.ClusterStatus) (int32, error) {
	// Get the expected replicas for the currently schedulable nodes and cores
	expReplicas := int32(c.getExpectedReplicasFromParams(int(status.SchedulableNodes), int(status.SchedulableCores)))

	return expReplicas, nil
}

func (c *LadderController) getExpectedReplicasFromParams(schedulableNodes, schedulableCores int) int {
	replicasFromCore := getExpectedReplicasFromEntries(schedulableCores, c.params.CoresToReplicas)
	replicasFromNode := getExpectedReplicasFromEntries(schedulableNodes, c.params.NodesToReplicas)

	// Returns the results which yields the most replicas
	if replicasFromCore > replicasFromNode {
		return replicasFromCore
	}
	return replicasFromNode
}

func getExpectedReplicasFromEntries(schedulableResources int, entries []paramEntry) int {
	if len(entries) == 0 {
		return 1
	}
	// Binary search for the corresponding replicas number
	pos := sort.Search(
		len(entries),
		func(i int) bool {
			return schedulableResources < entries[i][0]
		})
	if pos > 0 {
		pos = pos - 1
	}
	return entries[pos][1]
}
  • Find the bracket of the ConfigMap's coresToReplicas that schedulableCores falls into and take that bracket's preset replica count.
  • Find the bracket of the ConfigMap's nodesToReplicas that schedulableNodes falls into and take that bracket's preset replica count.
  • Return the larger of the two as the expected replica count.

Notes:

  • In ladder mode there is no option to prevent a single point of failure; take care of this yourself when writing the ConfigMap;
  • In ladder mode, if nodesToReplicas or coresToReplicas is missing or empty, the corresponding replica count is set to 1.

For example, with the ConfigMap above, if the cluster has schedulableCores=400 (which maps to 4 replicas) and schedulableNodes=120 (which maps to 5 replicas), the final expected replica count is 5.
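To double-check that result, the following self-contained Go sketch (not part of the project) reproduces the lookup using the same sort.Search logic as getExpectedReplicasFromEntries, with the table values taken from the example ConfigMap above.

package main

import (
	"fmt"
	"sort"
)

// ladderLookup mirrors getExpectedReplicasFromEntries: find the last entry
// whose threshold is <= the observed resource count and return its replicas.
func ladderLookup(resources int, entries [][2]int) int {
	if len(entries) == 0 {
		return 1
	}
	pos := sort.Search(len(entries), func(i int) bool {
		return resources < entries[i][0]
	})
	if pos > 0 {
		pos--
	}
	return entries[pos][1]
}

func main() {
	coresToReplicas := [][2]int{{1, 1}, {3, 3}, {256, 4}, {512, 5}, {1024, 7}}
	nodesToReplicas := [][2]int{{1, 1}, {2, 2}, {100, 5}, {200, 12}}

	fromCores := ladderLookup(400, coresToReplicas) // 400 cores -> 4 replicas
	fromNodes := ladderLookup(120, nodesToReplicas) // 120 nodes -> 5 replicas
	if fromCores > fromNodes {
		fmt.Println(fromCores)
	} else {
		fmt.Println(fromNodes) // prints 5
	}
}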

Using kube-dns-autoscaler to Solve the KubeDNS Performance Bottleneck

Create the kube-dns-autoscaler Deployment and ConfigMap with the YAML below. kube-dns-autoscaler then recomputes and checks the replica count every 30s and may trigger an autoscale.

kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: |-
    {
      "coresPerReplica": 300,
      "nodesPerReplica": 10,
      "min": 1,
      "max": 100,
      "preventSinglePointFailure": true
    }
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
  labels:
    k8s-app: kube-dns-autoscaler
spec:
  template:
    metadata:
      labels:
        k8s-app: kube-dns-autoscaler
    spec:
      containers:
      - name: autoscaler
        image: gcr.io/google_containers/cluster-proportional-autoscaler-amd64:1.0.0
        resources:
          requests:
            cpu: "20m"
            memory: "10Mi"
        command:
          - /cluster-proportional-autoscaler
          - --namespace=kube-system
          - --configmap=kube-dns-autoscaler
          - --target=Deployment/kube-dns
          - --default-params={"linear":{"coresPerReplica":256,"nodesPerReplica":16,"min":1}}
          - --poll-period-seconds=30
          - --logtostderr=true
          - --v=2
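As a rough, hypothetical illustration of how this ConfigMap behaves: with coresPerReplica=300 and nodesPerReplica=10, a cluster with 100 schedulable nodes and 4,000 schedulable cores would get max(ceil(4000/300), ceil(100/10)) = max(14, 10) = 14 kube-dns replicas.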

Summary and Outlook

  • The cluster-proportional-autoscaler code is simple and its mechanism is straightforward. We hope to use it to scale KubeDNS dynamically with cluster size, in order to solve the large-scale DNS resolution performance problems in our TensorFlow on Kubernetes project.

  • At the moment it can only autoscale based on SchedulableNodes and SchedulableCores. In AI scenarios, where cluster resources are squeezed hard and the number of services and pods a cluster carries fluctuates widely, we may later develop a controller that autoscales KubeDNS based on the number of services.

  • I am also considering isolating the KubeDNS deployment from the AI training servers: training often drives server CPU above 95%, and if KubeDNS runs on the same machine its performance is bound to suffer.

Published on November 30, 2017, 08:33