OpenStack Nova Scheduling Policy Research Notes


Note: This article is reposted from https://my.oschina.net/LastRitter/blog/1649954 for information-sharing and learning purposes only. If it infringes your rights, please contact me and I will remove it promptly.

## Overview

When a new virtual machine instance is created, the Nova scheduler runs the configured Filter Scheduler, which first filters all compute nodes and then weighs the survivors, finally returning a list of candidate hosts ordered by weight and sized to the number of hosts the request asked for. If this fails, no suitable host is available.

### Standard Filters

*(Figure: filtering workflow, part 1)*

  • AllHostsFilter - no filtering; every visible host passes.

  • ImagePropertiesFilter - filters on image metadata.

  • AvailabilityZoneFilter - filters on the availability zone (Availability Zone metadata).

  • ComputeCapabilitiesFilter - filters on compute capabilities, matching the parameters given when the instance is requested against host attributes and state. The available operators are:

    * = (equal to or greater than as a number; same as vcpus case)
    * == (equal to as a number)
    * != (not equal to as a number)
    * >= (greater than or equal to as a number)
    * <= (less than or equal to as a number)
    * s== (equal to as a string)
    * s!= (not equal to as a string)
    * s>= (greater than or equal to as a string)
    * s> (greater than as a string)
    * s<= (less than or equal to as a string)
    * s< (less than as a string)
    * <in> (substring)
    * <all-in> (all elements contained in collection)
    * <or> (find one of these)

    Examples: ">= 5", "s== 2.1.0", "<in> gcc", "<all-in> aes mmx", and "<or> fpu <or> gpu"

    Some of the attributes that can be used:

    * free_ram_mb (compared as a number, e.g. ">= 4096")
    * free_disk_mb (compared as a number, e.g. ">= 10240")
    * host (compared as a string, e.g. "<in> compute", "s== compute_01")
    * hypervisor_type (compared as a string, e.g. "s== QEMU", "s== powervm")
    * hypervisor_version (compared as a number, e.g. ">= 1005003", "== 2000000")
    * num_instances (compared as a number, e.g. "<= 10")
    * num_io_ops (compared as a number, e.g. "<= 5")
    * vcpus_total (compared as a number, e.g. "= 48", ">= 24")
    * vcpus_used (compared as a number, e.g. "= 0", "<= 10")
  • AggregateInstanceExtraSpecsFilter - filters on extra host attributes (Host Aggregate metadata); similar to ComputeCapabilitiesFilter.

  • ComputeFilter - filters on host state and service availability.

  • CoreFilter / AggregateCoreFilter - filter on the number of available CPU cores.

  • IsolatedHostsFilter - filters using the isolated_images, isolated_hosts, and restrict_isolated_hosts_to_isolated_images options in nova.conf; used for host isolation.

  • JsonFilter - filters using a JSON expression (see the sketch after this list).

  • RamFilter / AggregateRamFilter - filter on memory.

  • DiskFilter / AggregateDiskFilter - filter on disk space.

  • NumInstancesFilter / AggregateNumInstancesFilter - filter on the number of instances on a node.

  • IoOpsFilter / AggregateIoOpsFilter - filter on I/O load.

  • PciPassthroughFilter - filters on the requested PCI devices.

  • SimpleCIDRAffinityFilter - places instances on hosts within a given IP subnet.

  • SameHostFilter - starts the instance on the same host as a specified instance.

  • RetryFilter - filters out hosts that have already been tried.

  • AggregateTypeAffinityFilter - restricts the instance types (flavors) that can be created in an aggregate.

  • ServerGroupAntiAffinityFilter - places the instances of a server group on different hosts.

  • ServerGroupAffinityFilter - places the instances of a server group on the same host.

  • AggregateMultiTenancyIsolation - isolates tenants to designated aggregates.

  • AggregateImagePropertiesIsolation - isolates hosts by matching image properties against aggregate properties.

  • MetricsFilter - filters hosts based on weight_setting; only hosts with the required metrics available pass.

  • NUMATopologyFilter - filters hosts on the instance's NUMA topology requirements.
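
To make the JsonFilter entry above concrete, here is a minimal sketch of the kind of query it accepts: a JSON expression evaluated against host-state variables such as `$free_ram_mb` and `$free_disk_mb`. The boot command in the comment is only an assumed usage example, not taken from this article's test environment:

```python
# Sketch of a JsonFilter query: pass hosts with at least 1 GiB of free
# RAM AND 200 GiB of free disk.
import json

query = ["and",
         [">=", "$free_ram_mb", 1024],
         [">=", "$free_disk_mb", 200 * 1024]]

# The serialized form is what would be passed as a scheduler hint, e.g.:
#   openstack server create --hint query='<output below>' ...
print(json.dumps(query))
```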

### Weight Calculation

*(Figure: filtering workflow, part 2)*

If several hosts remain after filtering, their weights are computed and the host with the highest weight is selected. The formula is:

```
weight = w1_multiplier * norm(w1) + w2_multiplier * norm(w2) + ...
```

Each term is a weight multiplier (wN_multiplier) times a normalized weight (norm(wN)). The multipliers come from the configuration file, while the raw weights are generated dynamically by weigher objects. The weighers currently available include RAMWeigher, DiskWeigher, MetricsWeigher, IoOpsWeigher, PCIWeigher, ServerGroupSoftAffinityWeigher, and ServerGroupSoftAntiAffinityWeigher.
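
As a minimal, self-contained sketch (not Nova's actual source) of this scheme: each weigher's raw values are max-min normalized into [0, 1] for the request, scaled by the configured multiplier, and summed per host. The host names and resource numbers below are made up:

```python
def normalize(values):
    """Max-min normalize a list of raw weights into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def weigh_hosts(hosts, weighers):
    """weighers: list of (multiplier, raw_weight_fn) pairs."""
    totals = {h: 0.0 for h in hosts}
    for multiplier, fn in weighers:
        raw = [fn(h) for h in hosts]
        for h, norm in zip(hosts, normalize(raw)):
            totals[h] += multiplier * norm
    # Highest combined weight first, as the scheduler would rank them.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

free_ram_mb = {'osdev-01': 4096, 'osdev-02': 16384, 'osdev-03': 8192}
free_disk_gb = {'osdev-01': 500, 'osdev-02': 100, 'osdev-03': 250}
weighers = [(1.0, free_ram_mb.get), (0.5, free_disk_gb.get)]
print(weigh_hosts(list(free_ram_mb), weighers))
# osdev-02 ranks first: its RAM lead outweighs its disk deficit because
# the RAM multiplier is twice the disk multiplier.
```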

### Common Policies

Different needs call for different scheduling policies, composed from scheduler plugins. Some common policies are listed below (a custom weigher sketch follows the list):

  • Packing: place VMs on the host that already holds the most VMs.

  • Stripping: place VMs on the host that holds the fewest VMs.

  • CPU load balance: place VMs on the host with the most available cores.

  • Memory load balance: place VMs on the host with the most available memory.

  • Affinity: several VMs must be placed on the same host.

  • AntiAffinity: several VMs must be placed on different hosts.

  • CPU utilization load balance: place VMs on the host with the lowest CPU utilization.
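
Many of these policies reduce to choosing or writing an appropriate weigher. As a minimal sketch, assuming Nova's weigher plugin interface (`nova.scheduler.weights.BaseHostWeigher`) and a hypothetical class name, a Packing policy only needs to return a raw weight that grows with the number of instances already on a host:

```python
from nova.scheduler import weights

class NumInstancesWeigher(weights.BaseHostWeigher):
    """Favor hosts that already run many instances (Packing)."""

    def _weigh_object(self, host_state, weight_properties):
        # Higher value -> higher weight -> instances packed together.
        # Return -host_state.num_instances instead to get Stripping.
        return host_state.num_instances
```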

## Metadata

### Scheduling Test

All the metadata-based filters work in much the same way, and flavor metadata has few predefined keys and can be used quite freely, so the test below uses the flavor-metadata filter.

#### Filter Configuration

Add the AggregateInstanceExtraSpecsFilter filter:

```
$ vi /etc/kolla/nova-scheduler/nova.conf
[DEFAULT]
...
scheduler_default_filters = AggregateInstanceExtraSpecsFilter, RetryFilter, RamFilter, DiskFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter
...

$ docker restart nova_scheduler
```

#### Aggregate Configuration

  • Create the io-fast aggregate:
```
$ nova aggregate-create io-fast
+----+---------+-------------------+-------+----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata | UUID                                 |
+----+---------+-------------------+-------+----------+--------------------------------------+
| 8  | io-fast | -                 |       |          | 2523c96a-46ee-4fac-ba8a-5b50a4d1ebbd |
+----+---------+-------------------+-------+----------+--------------------------------------+

$ nova aggregate-set-metadata io-fast io=fast
Metadata has been successfully updated for aggregate 8.
+----+---------+-------------------+-------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata  | UUID                                 |
+----+---------+-------------------+-------+-----------+--------------------------------------+
| 8  | io-fast | -                 |       | 'io=fast' | 2523c96a-46ee-4fac-ba8a-5b50a4d1ebbd |
+----+---------+-------------------+-------+-----------+--------------------------------------+

$ nova aggregate-add-host io-fast osdev-01
Host osdev-01 has been successfully added for aggregate 8
+----+---------+-------------------+------------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts      | Metadata  | UUID                                 |
+----+---------+-------------------+------------+-----------+--------------------------------------+
| 8  | io-fast | -                 | 'osdev-01' | 'io=fast' | 2523c96a-46ee-4fac-ba8a-5b50a4d1ebbd |
+----+---------+-------------------+------------+-----------+--------------------------------------+

$ nova aggregate-add-host io-fast osdev-02
Host osdev-02 has been successfully added for aggregate 8
+----+---------+-------------------+------------------------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts                  | Metadata  | UUID                                 |
+----+---------+-------------------+------------------------+-----------+--------------------------------------+
| 8  | io-fast | -                 | 'osdev-01', 'osdev-02' | 'io=fast' | 2523c96a-46ee-4fac-ba8a-5b50a4d1ebbd |
+----+---------+-------------------+------------------------+-----------+--------------------------------------+

$ nova aggregate-add-host io-fast osdev-03
Host osdev-03 has been successfully added for aggregate 8
+----+---------+-------------------+------------------------------------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts                              | Metadata  | UUID                                 |
+----+---------+-------------------+------------------------------------+-----------+--------------------------------------+
| 8  | io-fast | -                 | 'osdev-01', 'osdev-02', 'osdev-03' | 'io=fast' | 2523c96a-46ee-4fac-ba8a-5b50a4d1ebbd |
+----+---------+-------------------+------------------------------------+-----------+--------------------------------------+
```
  • Create the io-slow aggregate:
```
$ nova aggregate-create io-slow
+----+---------+-------------------+-------+----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata | UUID                                 |
+----+---------+-------------------+-------+----------+--------------------------------------+
| 9  | io-slow | -                 |       |          | d10d2eaf-43d7-464e-bc12-10f18897b476 |
+----+---------+-------------------+-------+----------+--------------------------------------+

$ nova aggregate-set-metadata io-slow io=slow
Metadata has been successfully updated for aggregate 9.
+----+---------+-------------------+-------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts | Metadata  | UUID                                 |
+----+---------+-------------------+-------+-----------+--------------------------------------+
| 9  | io-slow | -                 |       | 'io=slow' | d10d2eaf-43d7-464e-bc12-10f18897b476 |
+----+---------+-------------------+-------+-----------+--------------------------------------+

$ nova aggregate-add-host io-slow osdev-gpu
Host osdev-gpu has been successfully added for aggregate 9
+----+---------+-------------------+-------------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts       | Metadata  | UUID                                 |
+----+---------+-------------------+-------------+-----------+--------------------------------------+
| 9  | io-slow | -                 | 'osdev-gpu' | 'io=slow' | d10d2eaf-43d7-464e-bc12-10f18897b476 |
+----+---------+-------------------+-------------+-----------+--------------------------------------+

$ nova aggregate-add-host io-slow osdev-ceph
Host osdev-ceph has been successfully added for aggregate 9
+----+---------+-------------------+---------------------------+-----------+--------------------------------------+
| Id | Name    | Availability Zone | Hosts                     | Metadata  | UUID                                 |
+----+---------+-------------------+---------------------------+-----------+--------------------------------------+
| 9  | io-slow | -                 | 'osdev-gpu', 'osdev-ceph' | 'io=slow' | d10d2eaf-43d7-464e-bc12-10f18897b476 |
+----+---------+-------------------+---------------------------+-----------+--------------------------------------+
```

#### Flavor Configuration

  • Create the io-fast flavor:
```
$ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.fast
$ nova flavor-key machine.fast set io=fast

$ openstack flavor show machine.fast
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| access_project_ids         | None                                 |
| disk                       | 1                                    |
| id                         | 4c8a6d15-270d-464b-bd3b-303d167af4cb |
| name                       | machine.fast                         |
| os-flavor-access:is_public | True                                 |
| properties                 | io='fast'                            |
| ram                        | 64                                   |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 1                                    |
+----------------------------+--------------------------------------+
```
  • Create the io-slow flavor:
```
$ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.slow
$ nova flavor-key machine.slow set io=slow

$ openstack flavor show machine.slow
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| access_project_ids         | None                                 |
| disk                       | 1                                    |
| id                         | f6a0fdad-3f20-40ed-a4fc-0ba49ff4ff02 |
| name                       | machine.slow                         |
| os-flavor-access:is_public | True                                 |
| properties                 | io='slow'                            |
| ram                        | 64                                   |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 1                                    |
+----------------------------+--------------------------------------+
```

#### Creating Instances

  • Create the io-fast instances:
```
$ openstack server create --image cirros --flavor machine.fast --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.fast1

$ openstack server create --image cirros --flavor machine.fast --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.fast2

$ openstack server create --image cirros --flavor machine.fast --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.fast3
```
  • Create the io-slow instances:
```
$ openstack server create --image cirros --flavor machine.slow --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.slow1

$ openstack server create --image cirros --flavor machine.slow --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.slow2

$ openstack server create --image cirros --flavor machine.slow --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.slow3
```
  • Check which hosts the instances were scheduled to:
```
$ openstack server list --long --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host"
+--------------------+--------+--------------------+-------------------+------------+
| Name               | Status | Networks           | Availability Zone | Host       |
+--------------------+--------+--------------------+-------------------+------------+
| server.slow3       | ACTIVE | demo-net=10.0.0.20 | az02              | osdev-gpu  |
| server.slow2       | ACTIVE | demo-net=10.0.0.17 | az02              | osdev-ceph |
| server.slow1       | ACTIVE | demo-net=10.0.0.14 | az02              | osdev-gpu  |
| server.fast3       | ACTIVE | demo-net=10.0.0.13 | az01              | osdev-01   |
| server.fast2       | ACTIVE | demo-net=10.0.0.16 | az01              | osdev-02   |
| server.fast1       | ACTIVE | demo-net=10.0.0.15 | az02              | osdev-03   |
+--------------------+--------+--------------------+-------------------+------------+
```

### Related Source Code

#### Request Parameters

Command-line parameters:

  • View the command help:
```
$ openstack server create
usage: openstack server create [-h] [-f {json,shell,table,value,yaml}]
                               [-c COLUMN] [--max-width <integer>]
                               [--fit-width] [--print-empty] [--noindent]
                               [--prefix PREFIX]
                               (--image <image> | --volume <volume>) --flavor
                               <flavor>
                               [--security-group <security-group-name>]
                               [--key-name <key-name>]
                               [--property <key=value>]
                               [--file <dest-filename=source-filename>]
                               [--user-data <user-data>]
                               [--availability-zone <zone-name>]
                               [--block-device-mapping <dev-name=mapping>]
                               [--nic <net-id=net-uuid,v4-fixed-ip=ip-addr,v6-fixed-ip=ip-addr,port-id=port-uuid>]
                               [--hint <key=value>]
                               [--config-drive <config-drive-volume>|True]
                               [--min <count>] [--max <count>] [--wait]
                               <server-name>
openstack server create: error: too few arguments
```

The direct command-line inputs that influence scheduling are `--availability-zone <zone-name>` and `--hint <key=value>`.

  • The request parameters generated by the Nova API (nova/objects/request_spec.py):
```python
...
@base.NovaObjectRegistry.register
class RequestSpec(base.NovaObject):
    # Version 1.0: Initial version
    # Version 1.1: ImageMeta version 1.6
    # Version 1.2: SchedulerRetries version 1.1
    # Version 1.3: InstanceGroup version 1.10
    # Version 1.4: ImageMeta version 1.7
    # Version 1.5: Added get_by_instance_uuid(), create(), save()
    # Version 1.6: Added requested_destination
    # Version 1.7: Added destroy()
    # Version 1.8: Added security_groups
    VERSION = '1.8'

    fields = {
        'id': fields.IntegerField(),
        'image': fields.ObjectField('ImageMeta', nullable=True),
        'numa_topology': fields.ObjectField('InstanceNUMATopology',
                                            nullable=True),
        'pci_requests': fields.ObjectField('InstancePCIRequests',
                                           nullable=True),
        'project_id': fields.StringField(nullable=True),
        'availability_zone': fields.StringField(nullable=True),
        'flavor': fields.ObjectField('Flavor', nullable=False),
        'num_instances': fields.IntegerField(default=1),
        'ignore_hosts': fields.ListOfStringsField(nullable=True),
        'force_hosts': fields.ListOfStringsField(nullable=True),
        'force_nodes': fields.ListOfStringsField(nullable=True),
        'requested_destination': fields.ObjectField('Destination',
                                                    nullable=True,
                                                    default=None),
        'retry': fields.ObjectField('SchedulerRetries', nullable=True),
        'limits': fields.ObjectField('SchedulerLimits', nullable=True),
        'instance_group': fields.ObjectField('InstanceGroup', nullable=True),
        # NOTE(sbauza): Since hints are depending on running filters, we prefer
        # to leave the API correctly validating the hints per the filters and
        # just provide to the RequestSpec object a free-form dictionary
        'scheduler_hints': fields.DictOfListOfStringsField(nullable=True),
        'instance_uuid': fields.UUIDField(),
        'security_groups': fields.ObjectField('SecurityGroupList'),
    }
...
```

#### Host State

The main information tracked in the host state (nova/scheduler/host_manager.py):

```python
class HostState(object):
    """Mutable and immutable information tracked for a host.
    This is an attempt to remove the ad-hoc data structures
    previously used and lock down access.
    """

    def __init__(self, host, node):
        self.host = host
        self.nodename = node
        self._lock_name = (host, node)

        # Mutable available resources.
        # These will change as resources are virtually "consumed".
        self.total_usable_ram_mb = 0
        self.total_usable_disk_gb = 0
        self.disk_mb_used = 0
        self.free_ram_mb = 0
        self.free_disk_mb = 0
        self.vcpus_total = 0
        self.vcpus_used = 0
        self.pci_stats = None
        self.numa_topology = None

        # Additional host information from the compute node stats:
        self.num_instances = 0
        self.num_io_ops = 0

        # Other information
        self.host_ip = None
        self.hypervisor_type = None
        self.hypervisor_version = None
        self.hypervisor_hostname = None
        self.cpu_info = None
        self.supported_instances = None

        # Resource oversubscription values for the compute host:
        self.limits = {}

        # Generic metrics from compute nodes
        self.metrics = None

        # List of aggregates the host belongs to
        self.aggregates = []

        # Instances on this host
        self.instances = {}

        # Allocation ratios for this host
        self.ram_allocation_ratio = None
        self.cpu_allocation_ratio = None
        self.disk_allocation_ratio = None

        self.updated = None
```

#### Flavor Metadata and Filters

Flavor metadata lives mainly in the extra_specs field; almost none of its keys are predefined, so it can be used freely.

  • The AggregateInstanceExtraSpecsFilter filter matches keys in the aggregate_instance_extra_specs scope as well as un-scoped keys (nova/scheduler/filters/aggregate_instance_extra_specs.py):
```python
from oslo_log import log as logging

from nova.scheduler import filters
from nova.scheduler.filters import extra_specs_ops
from nova.scheduler.filters import utils

LOG = logging.getLogger(__name__)

_SCOPE = 'aggregate_instance_extra_specs'


class AggregateInstanceExtraSpecsFilter(filters.BaseHostFilter):
    """AggregateInstanceExtraSpecsFilter works with InstanceType records."""

    # Aggregate data and instance type does not change within a request
    run_filter_once_per_request = True

    RUN_ON_REBUILD = False

    def host_passes(self, host_state, spec_obj):
        """Return a list of hosts that can create instance_type

        Check that the extra specs associated with the instance type match
        the metadata provided by aggregates.  If not present return False.
        """
        instance_type = spec_obj.flavor
        # If 'extra_specs' is not present or extra_specs are empty then we
        # need not proceed further
        if (not instance_type.obj_attr_is_set('extra_specs')
                or not instance_type.extra_specs):
            return True

        metadata = utils.aggregate_metadata_get_by_host(host_state)

        for key, req in instance_type.extra_specs.items():
            # Either not scope format, or aggregate_instance_extra_specs scope
            scope = key.split(':', 1)
            if len(scope) > 1:
                if scope[0] != _SCOPE:
                    continue
                else:
                    del scope[0]
            key = scope[0]
            aggregate_vals = metadata.get(key, None)
            if not aggregate_vals:
                LOG.debug("%(host_state)s fails instance_type extra_specs "
                    "requirements. Extra_spec %(key)s is not in aggregate.",
                    {'host_state': host_state, 'key': key})
                return False
            for aggregate_val in aggregate_vals:
                if extra_specs_ops.match(aggregate_val, req):
                    break
            else:
                LOG.debug("%(host_state)s fails instance_type extra_specs "
                            "requirements. '%(aggregate_vals)s' do not "
                            "match '%(req)s'",
                          {'host_state': host_state, 'req': req,
                           'aggregate_vals': aggregate_vals})
                return False
        return True
```
  • The operators used to match metadata (nova/scheduler/filters/extra_specs_ops.py):
```python
import operator

# 1. The following operations are supported:
#   =, s==, s!=, s>=, s>, s<=, s<, <in>, <all-in>, <or>, ==, !=, >=, <=
# 2. Note that <or> is handled in a different way below.
# 3. If the first word in the extra_specs is not one of the operators,
#   it is ignored.
op_methods = {'=': lambda x, y: float(x) >= float(y),
              '<in>': lambda x, y: y in x,
              '<all-in>': lambda x, y: all(val in x for val in y),
              '==': lambda x, y: float(x) == float(y),
              '!=': lambda x, y: float(x) != float(y),
              '>=': lambda x, y: float(x) >= float(y),
              '<=': lambda x, y: float(x) <= float(y),
              's==': operator.eq,
              's!=': operator.ne,
              's<': operator.lt,
              's<=': operator.le,
              's>': operator.gt,
              's>=': operator.ge}


def match(value, req):
    words = req.split()

    op = method = None
    if words:
        op = words.pop(0)
        method = op_methods.get(op)

    if op != '<or>' and not method:
        return value == req

    if value is None:
        return False

    if op == '<or>':  # Ex: <or> v1 <or> v2 <or> v3
        while True:
            if words.pop(0) == value:
                return True
            if not words:
                break
            words.pop(0)  # remove a keyword <or>
            if not words:
                break
        return False

    if words:
        if op == '<all-in>':  # requires a list not a string
            return method(value, words)
        return method(value, words[0])
    return False
```
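
Assuming the module above is importable in a Nova environment, a few calls illustrate the operator semantics listed earlier (expected results shown as comments):

```python
from nova.scheduler.filters import extra_specs_ops

print(extra_specs_ops.match('4', '>= 2'))                 # True: numeric compare
print(extra_specs_ops.match('QEMU', 's== QEMU'))          # True: string compare
print(extra_specs_ops.match('gcc x86', '<in> gcc'))       # True: substring
print(extra_specs_ops.match('fpu', '<or> fpu <or> gpu'))  # True: one-of match
print(extra_specs_ops.match('fast', 'fast'))              # True: plain equality
```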

#### Image Metadata and Filters

  • The basic image attributes (nova/objects/image_meta.py):
```python
@base.NovaObjectRegistry.register
class ImageMeta(base.NovaObject):

    fields = {
        'id': fields.UUIDField(),
        'name': fields.StringField(),
        'status': fields.StringField(),
        'visibility': fields.StringField(),
        'protected': fields.FlexibleBooleanField(),
        'checksum': fields.StringField(),
        'owner': fields.StringField(),
        'size': fields.IntegerField(),
        'virtual_size': fields.IntegerField(),
        'container_format': fields.StringField(),
        'disk_format': fields.StringField(),
        'created_at': fields.DateTimeField(nullable=True),
        'updated_at': fields.DateTimeField(nullable=True),
        'tags': fields.ListOfStringsField(),
        'direct_url': fields.StringField(),
        'min_ram': fields.IntegerField(),
        'min_disk': fields.IntegerField(),
        'properties': fields.ObjectField('ImageMetaProps'),
    }
```
  • The properties available in image metadata (nova/objects/image_meta.py):
```python
@base.NovaObjectRegistry.register
class ImageMetaProps(base.NovaObject):
    # Version 1.0: Initial version
    # Version 1.1: added os_require_quiesce field
    # Version 1.2: added img_hv_type and img_hv_requested_version fields
    # Version 1.3: HVSpec version 1.1
    # Version 1.4: added hw_vif_multiqueue_enabled field
    # Version 1.5: added os_admin_user field
    # Version 1.6: Added 'lxc' and 'uml' enum types to DiskBusField
    # Version 1.7: added img_config_drive field
    # Version 1.8: Added 'lxd' to hypervisor types
    # Version 1.9: added hw_cpu_thread_policy field
    # Version 1.10: added hw_cpu_realtime_mask field
    # Version 1.11: Added hw_firmware_type field
    # Version 1.12: Added properties for image signature verification
    # Version 1.13: added os_secure_boot field
    # Version 1.14: Added 'hw_pointer_model' field
    # Version 1.15: Added hw_rescue_bus and hw_rescue_device.
    # Version 1.16: WatchdogActionField supports 'disabled' enum.
    VERSION = '1.16'

    def obj_make_compatible(self, primitive, target_version):
        super(ImageMetaProps, self).obj_make_compatible(primitive,
                                                        target_version)
        target_version = versionutils.convert_version_to_tuple(target_version)
        if target_version < (1, 16) and 'hw_watchdog_action' in primitive:
            # Check to see if hw_watchdog_action was set to 'disabled' and if
            # so, remove it since not specifying it is the same behavior.
            if primitive['hw_watchdog_action'] == \
                    fields.WatchdogAction.DISABLED:
                primitive.pop('hw_watchdog_action')
        if target_version < (1, 15):
            primitive.pop('hw_rescue_bus', None)
            primitive.pop('hw_rescue_device', None)
        if target_version < (1, 14):
            primitive.pop('hw_pointer_model', None)
        if target_version < (1, 13):
            primitive.pop('os_secure_boot', None)
        if target_version < (1, 11):
            primitive.pop('hw_firmware_type', None)
        if target_version < (1, 10):
            primitive.pop('hw_cpu_realtime_mask', None)
        if target_version < (1, 9):
            primitive.pop('hw_cpu_thread_policy', None)
        if target_version < (1, 7):
            primitive.pop('img_config_drive', None)
        if target_version < (1, 5):
            primitive.pop('os_admin_user', None)
        if target_version < (1, 4):
            primitive.pop('hw_vif_multiqueue_enabled', None)
        if target_version < (1, 2):
            primitive.pop('img_hv_type', None)
            primitive.pop('img_hv_requested_version', None)
        if target_version < (1, 1):
            primitive.pop('os_require_quiesce', None)

        if target_version < (1, 6):
            bus = primitive.get('hw_disk_bus', None)
            if bus in ('lxc', 'uml'):
                raise exception.ObjectActionError(
                    action='obj_make_compatible',
                    reason='hw_disk_bus=%s not supported in version %s' % (
                        bus, target_version))

    # Maximum number of NUMA nodes permitted for the guest topology
    NUMA_NODES_MAX = 128

    # 'hw_' - settings affecting the guest virtual machine hardware
    # 'img_' - settings affecting the use of images by the compute node
    # 'os_' - settings affecting the guest operating system setup

    fields = {
        # name of guest hardware architecture eg i686, x86_64, ppc64
        'hw_architecture': fields.ArchitectureField(),

        # used to decide to expand root disk partition and fs to full size of
        # root disk
        'hw_auto_disk_config': fields.StringField(),

        # whether to display BIOS boot device menu
        'hw_boot_menu': fields.FlexibleBooleanField(),

        # name of the CDROM bus to use eg virtio, scsi, ide
        'hw_cdrom_bus': fields.DiskBusField(),

        # preferred number of CPU cores per socket
        'hw_cpu_cores': fields.IntegerField(),

        # preferred number of CPU sockets
        'hw_cpu_sockets': fields.IntegerField(),

        # maximum number of CPU cores per socket
        'hw_cpu_max_cores': fields.IntegerField(),

        # maximum number of CPU sockets
        'hw_cpu_max_sockets': fields.IntegerField(),

        # maximum number of CPU threads per core
        'hw_cpu_max_threads': fields.IntegerField(),

        # CPU allocation policy
        'hw_cpu_policy': fields.CPUAllocationPolicyField(),

        # CPU thread allocation policy
        'hw_cpu_thread_policy': fields.CPUThreadAllocationPolicyField(),

        # CPU mask indicates which vCPUs will have realtime enable,
        # example ^0-1 means that all vCPUs except 0 and 1 will have a
        # realtime policy.
        'hw_cpu_realtime_mask': fields.StringField(),

        # preferred number of CPU threads per core
        'hw_cpu_threads': fields.IntegerField(),

        # guest ABI version for guest xentools either 1 or 2 (or 3 - depends on
        # Citrix PV tools version installed in image)
        'hw_device_id': fields.IntegerField(),

        # name of the hard disk bus to use eg virtio, scsi, ide
        'hw_disk_bus': fields.DiskBusField(),

        # allocation mode eg 'preallocated'
        'hw_disk_type': fields.StringField(),

        # name of the floppy disk bus to use eg fd, scsi, ide
        'hw_floppy_bus': fields.DiskBusField(),

        # This indicates the guest needs UEFI firmware
        'hw_firmware_type': fields.FirmwareTypeField(),

        # boolean - used to trigger code to inject networking when booting a CD
        # image with a network boot image
        'hw_ipxe_boot': fields.FlexibleBooleanField(),

        # There are sooooooooooo many possible machine types in
        # QEMU - several new ones with each new release - that it
        # is not practical to enumerate them all. So we use a free
        # form string
        'hw_machine_type': fields.StringField(),

        # One of the magic strings 'small', 'any', 'large'
        # or an explicit page size in KB (eg 4, 2048, ...)
        'hw_mem_page_size': fields.StringField(),

        # Number of guest NUMA nodes
        'hw_numa_nodes': fields.IntegerField(),

        # Each list entry corresponds to a guest NUMA node and the
        # set members indicate CPUs for that node
        'hw_numa_cpus': fields.ListOfSetsOfIntegersField(),

        # Each list entry corresponds to a guest NUMA node and the
        # list value indicates the memory size of that node.
        'hw_numa_mem': fields.ListOfIntegersField(),

        # Generic property to specify the pointer model type.
        'hw_pointer_model': fields.PointerModelField(),

        # boolean 'yes' or 'no' to enable QEMU guest agent
        'hw_qemu_guest_agent': fields.FlexibleBooleanField(),

        # name of the rescue bus to use with the associated rescue device.
        'hw_rescue_bus': fields.DiskBusField(),

        # name of rescue device to use.
        'hw_rescue_device': fields.BlockDeviceTypeField(),

        # name of the RNG device type eg virtio
        'hw_rng_model': fields.RNGModelField(),

        # number of serial ports to create
        'hw_serial_port_count': fields.IntegerField(),

        # name of the SCSI bus controller eg 'virtio-scsi', 'lsilogic', etc
        'hw_scsi_model': fields.SCSIModelField(),

        # name of the video adapter model to use, eg cirrus, vga, xen, qxl
        'hw_video_model': fields.VideoModelField(),

        # MB of video RAM to provide eg 64
        'hw_video_ram': fields.IntegerField(),

        # name of a NIC device model eg virtio, e1000, rtl8139
        'hw_vif_model': fields.VIFModelField(),

        # "xen" vs "hvm"
        'hw_vm_mode': fields.VMModeField(),

        # action to take when watchdog device fires eg reset, poweroff, pause,
        # none
        'hw_watchdog_action': fields.WatchdogActionField(),

        # boolean - If true, this will enable the virtio-multiqueue feature
        'hw_vif_multiqueue_enabled': fields.FlexibleBooleanField(),

        # if true download using bittorrent
        'img_bittorrent': fields.FlexibleBooleanField(),

        # Which data format the 'img_block_device_mapping' field is
        # using to represent the block device mapping
        'img_bdm_v2': fields.FlexibleBooleanField(),

        # Block device mapping - the may can be in one or two completely
        # different formats. The 'img_bdm_v2' field determines whether
        # it is in legacy format, or the new current format. Ideally
        # we would have a formal data type for this field instead of a
        # dict, but with 2 different formats to represent this is hard.
        # See nova/block_device.py from_legacy_mapping() for the complex
        # conversion code. So for now leave it as a dict and continue
        # to use existing code that is able to convert dict into the
        # desired internal BDM formats
        'img_block_device_mapping':
            fields.ListOfDictOfNullableStringsField(),

        # boolean - if True, and image cache set to "some" decides if image
        # should be cached on host when server is booted on that host
        'img_cache_in_nova': fields.FlexibleBooleanField(),

        # Compression level for images. (1-9)
        'img_compression_level': fields.IntegerField(),

        # hypervisor supported version, eg. '>=2.6'
        'img_hv_requested_version': fields.VersionPredicateField(),

        # type of the hypervisor, eg kvm, ironic, xen
        'img_hv_type': fields.HVTypeField(),

        # Whether the image needs/expected config drive
        'img_config_drive': fields.ConfigDrivePolicyField(),

        # boolean flag to set space-saving or performance behavior on the
        # Datastore
        'img_linked_clone': fields.FlexibleBooleanField(),

        # Image mappings - related to Block device mapping data - mapping
        # of virtual image names to device names. This could be represented
        # as a formal data type, but is left as dict for same reason as
        # img_block_device_mapping field. It would arguably make sense for
        # the two to be combined into a single field and data type in the
        # future.
        'img_mappings': fields.ListOfDictOfNullableStringsField(),

        # image project id (set on upload)
        'img_owner_id': fields.StringField(),

        # root device name, used in snapshotting eg /dev/<blah>
        'img_root_device_name': fields.StringField(),

        # boolean - if false don't talk to nova agent
        'img_use_agent': fields.FlexibleBooleanField(),

        # integer value 1
        'img_version': fields.IntegerField(),

        # base64 of encoding of image signature
        'img_signature': fields.StringField(),

        # string indicating hash method used to compute image signature
        'img_signature_hash_method': fields.ImageSignatureHashTypeField(),

        # string indicating Castellan uuid of certificate
        # used to compute the image's signature
        'img_signature_certificate_uuid': fields.UUIDField(),

        # string indicating type of key used to compute image signature
        'img_signature_key_type': fields.ImageSignatureKeyTypeField(),

        # string of username with admin privileges
        'os_admin_user': fields.StringField(),

        # string of boot time command line arguments for the guest kernel
        'os_command_line': fields.StringField(),

        # the name of the specific guest operating system distro. This
        # is not done as an Enum since the list of operating systems is
        # growing incredibly fast, and valid values can be arbitrarily
        # user defined. Nova has no real need for strict validation so
        # leave it freeform
        'os_distro': fields.StringField(),

        # boolean - if true, then guest must support disk quiesce
        # or snapshot operation will be denied
        'os_require_quiesce': fields.FlexibleBooleanField(),

        # Secure Boot feature will be enabled by setting the "os_secure_boot"
        # image property to "required". Other options can be: "disabled" or
        # "optional".
        # "os:secure_boot" flavor extra spec value overrides the image property
        # value.
        'os_secure_boot': fields.SecureBootField(),

        # boolean - if using agent don't inject files, assume someone else is
        # doing that (cloud-init)
        'os_skip_agent_inject_files_at_boot': fields.FlexibleBooleanField(),

        # boolean - if using agent don't try inject ssh key, assume someone
        # else is doing that (cloud-init)
        'os_skip_agent_inject_ssh': fields.FlexibleBooleanField(),

        # The guest operating system family such as 'linux', 'windows' - this
        # is a fairly generic type. For a detailed type consider os_distro
        # instead
        'os_type': fields.OSTypeField(),
    }

    # The keys are the legacy property names and
    # the values are the current preferred names
    _legacy_property_map = {
        'architecture': 'hw_architecture',
        'owner_id': 'img_owner_id',
        'vmware_disktype': 'hw_disk_type',
        'vmware_image_version': 'img_version',
        'vmware_ostype': 'os_distro',
        'auto_disk_config': 'hw_auto_disk_config',
        'ipxe_boot': 'hw_ipxe_boot',
        'xenapi_device_id': 'hw_device_id',
        'xenapi_image_compression_level': 'img_compression_level',
        'vmware_linked_clone': 'img_linked_clone',
        'xenapi_use_agent': 'img_use_agent',
        'xenapi_skip_agent_inject_ssh': 'os_skip_agent_inject_ssh',
        'xenapi_skip_agent_inject_files_at_boot':
            'os_skip_agent_inject_files_at_boot',
        'cache_in_nova': 'img_cache_in_nova',
        'vm_mode': 'hw_vm_mode',
        'bittorrent': 'img_bittorrent',
        'mappings': 'img_mappings',
        'block_device_mapping': 'img_block_device_mapping',
        'bdm_v2': 'img_bdm_v2',
        'root_device_name': 'img_root_device_name',
        'hypervisor_version_requires': 'img_hv_requested_version',
        'hypervisor_type': 'img_hv_type',
    }
```
  • The filter mainly compares image properties such as architecture against what the host supports (nova/scheduler/filters/image_props_filter.py):
```python
class ImagePropertiesFilter(filters.BaseHostFilter):
    """Filter compute nodes that satisfy instance image properties.

    The ImagePropertiesFilter filters compute nodes that satisfy
    any architecture, hypervisor type, or virtual machine mode properties
    specified on the instance's image properties.  Image properties are
    contained in the image dictionary in the request_spec.
    """

    RUN_ON_REBUILD = True

    # Image Properties and Compute Capabilities do not change within
    # a request
    run_filter_once_per_request = True

    def _instance_supported(self, host_state, image_props,
                            hypervisor_version):
        img_arch = image_props.get('hw_architecture')
        img_h_type = image_props.get('img_hv_type')
        img_vm_mode = image_props.get('hw_vm_mode')
        checked_img_props = (
            fields.Architecture.canonicalize(img_arch),
            fields.HVType.canonicalize(img_h_type),
            fields.VMMode.canonicalize(img_vm_mode)
        )

        # Supported if no compute-related instance properties are specified
        if not any(checked_img_props):
            return True

        supp_instances = host_state.supported_instances
        # Not supported if an instance property is requested but nothing
        # advertised by the host.
        if not supp_instances:
            LOG.debug("Instance contains properties %(image_props)s, "
                        "but no corresponding supported_instances are "
                        "advertised by the compute node",
                      {'image_props': image_props})
            return False

        def _compare_props(props, other_props):
            for i in props:
                if i and i not in other_props:
                    return False
            return True

        def _compare_product_version(hyper_version, image_props):
            version_required = image_props.get('img_hv_requested_version')
            if not(hypervisor_version and version_required):
                return True
            img_prop_predicate = versionpredicate.VersionPredicate(
                'image_prop (%s)' % version_required)
            hyper_ver_str = versionutils.convert_version_to_str(hyper_version)
            return img_prop_predicate.satisfied_by(hyper_ver_str)

        for supp_inst in supp_instances:
            if _compare_props(checked_img_props, supp_inst):
                if _compare_product_version(hypervisor_version, image_props):
                    return True

        LOG.debug("Instance contains properties %(image_props)s "
                    "that are not provided by the compute node "
                    "supported_instances %(supp_instances)s or "
                    "hypervisor version %(hypervisor_version)s do not match",
                  {'image_props': image_props,
                   'supp_instances': supp_instances,
                   'hypervisor_version': hypervisor_version})
        return False

    def host_passes(self, host_state, spec_obj):
        """Check if host passes specified image properties.

        Returns True for compute nodes that satisfy image properties
        contained in the request_spec.
        """
        image_props = spec_obj.image.properties if spec_obj.image else {}

        if not self._instance_supported(host_state, image_props,
                                        host_state.hypervisor_version):
            LOG.debug("%(host_state)s does not support requested "
                        "instance_properties", {'host_state': host_state})
            return False
        return True
```

#### The CoreFilter Filter

  • Source of the CoreFilter and AggregateCoreFilter filters (nova/scheduler/filters/core_filter.py):
```python
class BaseCoreFilter(filters.BaseHostFilter):

    RUN_ON_REBUILD = False

    def _get_cpu_allocation_ratio(self, host_state, spec_obj):
        raise NotImplementedError

    def host_passes(self, host_state, spec_obj):
        """Return True if host has sufficient CPU cores.

        :param host_state: nova.scheduler.host_manager.HostState
        :param spec_obj: filter options
        :return: boolean
        """
        if not host_state.vcpus_total:
            # Fail safe
            LOG.warning(_LW("VCPUs not set; assuming CPU collection broken"))
            return True

        instance_vcpus = spec_obj.vcpus
        cpu_allocation_ratio = self._get_cpu_allocation_ratio(host_state,
                                                              spec_obj)
        vcpus_total = host_state.vcpus_total * cpu_allocation_ratio

        # Only provide a VCPU limit to compute if the virt driver is reporting
        # an accurate count of installed VCPUs. (XenServer driver does not)
        if vcpus_total > 0:
            host_state.limits['vcpu'] = vcpus_total

            # Do not allow an instance to overcommit against itself, only
            # against other instances.
            if instance_vcpus > host_state.vcpus_total:
                LOG.debug("%(host_state)s does not have %(instance_vcpus)d "
                          "total cpus before overcommit, it only has %(cpus)d",
                          {'host_state': host_state,
                           'instance_vcpus': instance_vcpus,
                           'cpus': host_state.vcpus_total})
                return False

        free_vcpus = vcpus_total - host_state.vcpus_used
        if free_vcpus < instance_vcpus:
            LOG.debug("%(host_state)s does not have %(instance_vcpus)d "
                      "usable vcpus, it only has %(free_vcpus)d usable "
                      "vcpus",
                      {'host_state': host_state,
                       'instance_vcpus': instance_vcpus,
                       'free_vcpus': free_vcpus})
            return False

        return True


class CoreFilter(BaseCoreFilter):
    """CoreFilter filters based on CPU core utilization."""

    def _get_cpu_allocation_ratio(self, host_state, spec_obj):
        return host_state.cpu_allocation_ratio


class AggregateCoreFilter(BaseCoreFilter):
    """AggregateCoreFilter with per-aggregate CPU subscription flag.

    Fall back to global cpu_allocation_ratio if no per-aggregate setting
    found.
    """

    def _get_cpu_allocation_ratio(self, host_state, spec_obj):
        aggregate_vals = utils.aggregate_values_from_key(
            host_state,
            'cpu_allocation_ratio')
        try:
            ratio = utils.validate_num_values(
                aggregate_vals, host_state.cpu_allocation_ratio, cast_to=float)
        except ValueError as e:
            LOG.warning(_LW("Could not decode cpu_allocation_ratio: '%s'"), e)
            ratio = host_state.cpu_allocation_ratio

        return ratio
```
  • Validating per-aggregate values (nova/scheduler/filters/utils.py):
```python
def validate_num_values(vals, default=None, cast_to=int, based_on=min):
    """Returns a correctly casted value based on a set of values.

    This method is useful to work with per-aggregate filters, It takes
    a set of values then return the 'based_on'{min/max} converted to
    'cast_to' of the set or the default value.

    Note: The cast implies a possible ValueError
    """
    num_values = len(vals)
    if num_values == 0:
        return default

    if num_values > 1:
        if based_on == min:
            LOG.info(_LI("%(num_values)d values found, "
                         "of which the minimum value will be used."),
                     {'num_values': num_values})
        else:
            LOG.info(_LI("%(num_values)d values found, "
                         "of which the maximum value will be used."),
                     {'num_values': num_values})
    return based_on([cast_to(val) for val in vals])
```

When several host aggregates that a node belongs to all set the `cpu_allocation_ratio` parameter, the smallest value is used.
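
A quick illustration of that rule, assuming the `utils` module shown above is importable in a Nova environment:

```python
from nova.scheduler.filters.utils import validate_num_values

# Two aggregates advertise different ratios: the minimum wins.
print(validate_num_values({'16.0', '4.0'}, default=8.0, cast_to=float))  # 4.0
# No aggregate sets the key: fall back to the host's own value.
print(validate_num_values(set(), default=8.0, cast_to=float))            # 8.0
```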

## Disaster Recovery

### Node Partitioning

  • Region: partitions a deployment by physical location. Each region has its own independent endpoints, and regions are fully isolated from one another, but multiple regions share the same Keystone and Dashboard.

  • Availability Zone: roughly, a set of nodes that share independent power supply equipment, e.g. an independently powered machine room or an independently powered rack.

  • Host Aggregate: lets administrators partition hardware by node attributes; visible only to administrators.

  • Cell: addresses OpenStack's scalability and scale bottlenecks by partitioning components such as the database and AMQP, enabling hierarchical scheduling.

Summary:

  1. Region and Availability Zone can be used to let users pick where instances are deployed, with the options exposed to them; alternatively, placement can be managed internally while offering users various disaster-recovery options.
  2. Cells can be used to partition the cluster and improve its horizontal scalability.
  3. Host Aggregates can be used to classify hosts by their attributes; combined with the AggregateInstanceExtraSpecsFilter filter and flavor metadata, a single cluster can support several scheduling policies at the same time.
  4. The ServerGroupAntiAffinityFilter and ServerGroupAffinityFilter plugins can be used with the --hint option to deploy instances in groups.

#### Multiple Regions

  • Multiple regions are selected with the `--os-region-name` option:
```
$ nova --help
...
  --os-region-name <region-name>
                                Defaults to env[OS_REGION_NAME].
...
```

#### Aggregates

In the current CLI, availability zones and aggregates are managed with the same family of commands.

  • Machine rooms map to availability zones (AvailabilityZoneFilter) and racks to host aggregates (AggregateInstanceExtraSpecsFilter); combined with metadata, they are managed with the following commands:
```
$ nova --help
...
    aggregate-add-host          Add the host to the specified aggregate.
    aggregate-create            Create a new aggregate with the specified
                                details.
    aggregate-delete            Delete the aggregate.
    aggregate-list              Print a list of all aggregates.
    aggregate-remove-host       Remove the specified host from the specified
                                aggregate.
    aggregate-set-metadata      Update the metadata associated with the
                                aggregate.
    aggregate-show              Show details of the specified aggregate.
    aggregate-update            Update the aggregate's name and optionally
                                availability zone.
    availability-zone-list      List all the availability zones.
...
```

### Filter Configuration

Add the AvailabilityZoneFilter filter:

```
$ vi /etc/kolla/nova-scheduler/nova.conf
[DEFAULT]
...
scheduler_default_filters = AggregateInstanceExtraSpecsFilter, RetryFilter, AvailabilityZoneFilter, RamFilter, DiskFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter
...

$ docker restart nova_scheduler
```

### Scheduling Test

#### Availability Zone Configuration

  • Assume one region containing two machine rooms, each with two racks:
```
Host         Room    Rack        Rack metadata
osdev-01     az01    az01-ha01   addr=11
osdev-02     az01    az01-ha02   addr=12
osdev-03     az02    az02-ha01   addr=21
osdev-ceph   az02    az02-ha02   addr=22
osdev-gpu    az02    az02-ha02   addr=22
```
  • View the current availability zones:
```
$ nova availability-zone-list
+-----------------------+----------------------------------------+
| Name                  | Status                                 |
+-----------------------+----------------------------------------+
| internal              | available                              |
| |- osdev-01           |                                        |
| | |- nova-conductor   | enabled :-) 2018-03-15T09:51:30.000000 |
| | |- nova-scheduler   | enabled :-) 2018-03-15T09:51:30.000000 |
| | |- nova-consoleauth | enabled :-) 2018-03-15T09:51:31.000000 |
| nova                  | available                              |
| |- osdev-01           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T09:51:30.000000 |
| |- osdev-02           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T09:51:25.000000 |
| |- osdev-03           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T09:51:31.000000 |
| |- osdev-ceph         |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T09:51:28.000000 |
| |- osdev-gpu          |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T09:51:26.000000 |
+-----------------------+----------------------------------------+
```
  • Create the 2 availability zones and 4 host aggregates:
```
$ nova aggregate-create az01-ha01 az01
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts | Metadata                 | UUID                                 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| 3  | az01-ha01 | az01              |       | 'availability_zone=az01' | 3baac65d-2907-412a-98f5-60e582612548 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+

$ nova aggregate-create az01-ha02 az01
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts | Metadata                 | UUID                                 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| 4  | az01-ha02 | az01              |       | 'availability_zone=az01' | 5ea0f221-024d-43f5-b1f1-6e0cc364ad39 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+

$ nova aggregate-create az02-ha01 az02
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts | Metadata                 | UUID                                 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| 5  | az02-ha01 | az02              |       | 'availability_zone=az02' | 00cdcefa-bcd0-490c-bdee-9cc970268a03 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+

$ nova aggregate-create az02-ha02 az02
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts | Metadata                 | UUID                                 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
| 6  | az02-ha02 | az02              |       | 'availability_zone=az02' | 81de444f-4730-471a-bff5-ff11372c3096 |
+----+-----------+-------------------+-------+--------------------------+--------------------------------------+
```
  • Add the 5 nodes to the 4 host aggregates:
```
$ nova aggregate-add-host az01-ha01 osdev-01
Host osdev-01 has been successfully added for aggregate 3
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts      | Metadata                 | UUID                                 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| 3  | az01-ha01 | az01              | 'osdev-01' | 'availability_zone=az01' | 3baac65d-2907-412a-98f5-60e582612548 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+

$ nova aggregate-add-host az01-ha02 osdev-02
Host osdev-02 has been successfully added for aggregate 4
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts      | Metadata                 | UUID                                 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| 4  | az01-ha02 | az01              | 'osdev-02' | 'availability_zone=az01' | 5ea0f221-024d-43f5-b1f1-6e0cc364ad39 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+

$ nova aggregate-add-host az02-ha01 osdev-03
Host osdev-03 has been successfully added for aggregate 5
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts      | Metadata                 | UUID                                 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+
| 5  | az02-ha01 | az02              | 'osdev-03' | 'availability_zone=az02' | 00cdcefa-bcd0-490c-bdee-9cc970268a03 |
+----+-----------+-------------------+------------+--------------------------+--------------------------------------+

$ nova aggregate-add-host az02-ha02 osdev-ceph
Host osdev-ceph has been successfully added for aggregate 6
+----+-----------+-------------------+--------------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts        | Metadata                 | UUID                                 |
+----+-----------+-------------------+--------------+--------------------------+--------------------------------------+
| 6  | az02-ha02 | az02              | 'osdev-ceph' | 'availability_zone=az02' | 81de444f-4730-471a-bff5-ff11372c3096 |
+----+-----------+-------------------+--------------+--------------------------+--------------------------------------+

$ nova aggregate-add-host az02-ha02 osdev-gpu
Host osdev-gpu has been successfully added for aggregate 6
+----+-----------+-------------------+---------------------------+--------------------------+--------------------------------------+
| Id | Name      | Availability Zone | Hosts                     | Metadata                 | UUID                                 |
+----+-----------+-------------------+---------------------------+--------------------------+--------------------------------------+
| 6  | az02-ha02 | az02              | 'osdev-ceph', 'osdev-gpu' | 'availability_zone=az02' | 81de444f-4730-471a-bff5-ff11372c3096 |
+----+-----------+-------------------+---------------------------+--------------------------+--------------------------------------+
```
  • View the availability zones again:
```
$ nova availability-zone-list
+-----------------------+----------------------------------------+
| Name                  | Status                                 |
+-----------------------+----------------------------------------+
| internal              | available                              |
| |- osdev-01           |                                        |
| | |- nova-conductor   | enabled :-) 2018-03-15T10:09:10.000000 |
| | |- nova-scheduler   | enabled :-) 2018-03-15T10:09:10.000000 |
| | |- nova-consoleauth | enabled :-) 2018-03-15T10:09:01.000000 |
| az02                  | available                              |
| |- osdev-03           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T10:09:01.000000 |
| |- osdev-ceph         |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T10:09:08.000000 |
| |- osdev-gpu          |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T10:09:05.000000 |
| az01                  | available                              |
| |- osdev-01           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T10:09:10.000000 |
| |- osdev-02           |                                        |
| | |- nova-compute     | enabled :-) 2018-03-15T10:09:05.000000 |
+-----------------------+----------------------------------------+
```

####Specifying an Availability Zone

  • Create an instance in az02 (it gets scheduled onto osdev-gpu):
$ openstack network list +--------------------------------------+----------+--------------------------------------+ | ID                                   | Name     | Subnets                              | +--------------------------------------+----------+--------------------------------------+ | 8ab35b74-d680-4cfc-8c61-810965e3992e | public1  | 2e6f24b8-3482-4e68-9d61-6306ff1da8a2 | | 8d01509e-4a3a-497a-9118-3827c1e37672 | demo-net | 3b817b11-8fda-485f-bad9-0b7e30534d66 | +--------------------------------------+----------+--------------------------------------+  $ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az02 demo1   $ openstack server list --long --column "Name" --column "Status" --column "Flavor Name" --column "Networks" --column "Availability Zone" --column "Host" +-------+--------+--------------------+-------------------+-----------+ | Name  | Status | Networks           | Availability Zone | Host      | +-------+--------+--------------------+-------------------+-----------+ | demo1 | ACTIVE | demo-net=10.0.0.11 | az02              | osdev-gpu | +-------+--------+--------------------+-------------------+-----------+ 
  • Create an instance in az01 (it gets scheduled onto osdev-02):
$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01 demo2  $ openstack server list --long --column "Name" --column "Status" --column "Flavor Name" --column "Networks" --column "Availability Zone" --column "Host" +-------+--------+--------------------+-------------------+-----------+ | Name  | Status | Networks           | Availability Zone | Host      | +-------+--------+--------------------+-------------------+-----------+ | demo2 | ACTIVE | demo-net=10.0.0.3  | az01              | osdev-02  | | demo1 | ACTIVE | demo-net=10.0.0.11 | az02              | osdev-gpu | +-------+--------+--------------------+-------------------+-----------+ 

####Aggregate Configuration

  • Set a rack attribute (addr) on each aggregate:
$ nova aggregate-set-metadata az01-ha01 addr=11 Metadata has been successfully updated for aggregate 3. +----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+ | Id | Name      | Availability Zone | Hosts      | Metadata                            | UUID                                 | +----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+ | 3  | az01-ha01 | az01              | 'osdev-01' | 'addr=11', 'availability_zone=az01' | 3baac65d-2907-412a-98f5-60e582612548 | +----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+  $ nova aggregate-set-metadata az01-ha02 addr=12 Metadata has been successfully updated for aggregate 4. +----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+ | Id | Name      | Availability Zone | Hosts      | Metadata                            | UUID                                 | +----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+ | 4  | az01-ha02 | az01              | 'osdev-02' | 'addr=12', 'availability_zone=az01' | 5ea0f221-024d-43f5-b1f1-6e0cc364ad39 | +----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+  $ nova aggregate-set-metadata az02-ha01 addr=21 Metadata has been successfully updated for aggregate 5. +----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+ | Id | Name      | Availability Zone | Hosts      | Metadata                            | UUID                                 | +----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+ | 5  | az02-ha01 | az02              | 'osdev-03' | 'addr=21', 'availability_zone=az02' | 00cdcefa-bcd0-490c-bdee-9cc970268a03 | +----+-----------+-------------------+------------+-------------------------------------+--------------------------------------+  $ nova aggregate-set-metadata az02-ha02 addr=22 Metadata has been successfully updated for aggregate 6. +----+-----------+-------------------+---------------------------+-------------------------------------+--------------------------------------+ | Id | Name      | Availability Zone | Hosts                     | Metadata                            | UUID                                 | +----+-----------+-------------------+---------------------------+-------------------------------------+--------------------------------------+ | 6  | az02-ha02 | az02              | 'osdev-ceph', 'osdev-gpu' | 'addr=22', 'availability_zone=az02' | 81de444f-4730-471a-bff5-ff11372c3096 | +----+-----------+-------------------+---------------------------+-------------------------------------+--------------------------------------+ 

####Flavor Configuration

  • Create new flavors:
$ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.az01-ha01 +----------------------------+--------------------------------------+ | Field                      | Value                                | +----------------------------+--------------------------------------+ | OS-FLV-DISABLED:disabled   | False                                | | OS-FLV-EXT-DATA:ephemeral  | 0                                    | | disk                       | 1                                    | | id                         | 0c3bb453-146f-4093-b161-39c10978f0eb | | name                       | machine.az01-ha01                    | | os-flavor-access:is_public | True                                 | | properties                 |                                      | | ram                        | 64                                   | | rxtx_factor                | 1.0                                  | | swap                       |                                      | | vcpus                      | 1                                    | +----------------------------+--------------------------------------+  $ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.az01-ha02 +----------------------------+--------------------------------------+ | Field                      | Value                                | +----------------------------+--------------------------------------+ | OS-FLV-DISABLED:disabled   | False                                | | OS-FLV-EXT-DATA:ephemeral  | 0                                    | | disk                       | 1                                    | | id                         | ba9a94aa-f841-4529-8d34-e3e9e8484f90 | | name                       | machine.az01-ha02                    | | os-flavor-access:is_public | True                                 | | properties                 |                                      | | ram                        | 64                                   | | rxtx_factor                | 1.0                                  | | swap                       |                                      | | vcpus                      | 1                                    | +----------------------------+--------------------------------------+  $ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.az02-ha01 +----------------------------+--------------------------------------+ | Field                      | Value                                | +----------------------------+--------------------------------------+ | OS-FLV-DISABLED:disabled   | False                                | | OS-FLV-EXT-DATA:ephemeral  | 0                                    | | disk                       | 1                                    | | id                         | 326d7246-9d6a-4b73-8e89-83565887ada7 | | name                       | machine.az02-ha01                    | | os-flavor-access:is_public | True                                 | | properties                 |                                      | | ram                        | 64                                   | | rxtx_factor                | 1.0                                  | | swap                       |                                      | | vcpus                      | 1                                    | +----------------------------+--------------------------------------+  $ openstack flavor create --vcpus 1 --ram 64 --disk 1 machine.az02-ha02 +----------------------------+--------------------------------------+ | Field                      | Value        
                        | +----------------------------+--------------------------------------+ | OS-FLV-DISABLED:disabled   | False                                | | OS-FLV-EXT-DATA:ephemeral  | 0                                    | | disk                       | 1                                    | | id                         | f3fb1e5f-adcd-402c-9125-56dda997b52a | | name                       | machine.az02-ha02                    | | os-flavor-access:is_public | True                                 | | properties                 |                                      | | ram                        | 64                                   | | rxtx_factor                | 1.0                                  | | swap                       |                                      | | vcpus                      | 1                                    | +----------------------------+--------------------------------------+ 
  • Add the addr metadata to each flavor:
$ nova flavor-key machine.az01-ha01 set addr=11 $ nova flavor-key machine.az01-ha02 set addr=12 $ nova flavor-key machine.az02-ha01 set addr=21 $ nova flavor-key machine.az02-ha02 set addr=22  $ openstack flavor list --long --column "Name" --column "Properties" +-------------------+------------+ | Name              | Properties | +-------------------+------------+ | machine.az01-ha01 | addr='11'  | | m1.tiny           |            | | m1.small          |            | | m1.medium         |            | | machine.az02-ha01 | addr='21'  | | m1.large          |            | | m1.xlarge         |            | | machine.az01-ha02 | addr='12'  | | machine.az02-ha02 | addr='22'  | +-------------------+------------+ 

####Specifying an Aggregate

Create instances using the flavors that carry addr metadata; the AggregateInstanceExtraSpecsFilter matches each flavor's addr extra spec against the aggregates' addr metadata, so every instance lands on its intended rack:

$ openstack server create --image cirros --flavor machine.az01-ha01 --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.az01-ha01  $ openstack server create --image cirros --flavor machine.az01-ha02 --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.az01-ha02  $ openstack server create --image cirros --flavor machine.az02-ha01 --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.az02-ha01  $ openstack server create --image cirros --flavor machine.az02-ha02 --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 server.az02-ha02   $ openstack server list --long --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host" +------------------+--------+--------------------+-------------------+-----------+ | Name             | Status | Networks           | Availability Zone | Host      | +------------------+--------+--------------------+-------------------+-----------+ | server.az02-ha02 | ACTIVE | demo-net=10.0.0.15 | az02              | osdev-gpu | | server.az02-ha01 | ACTIVE | demo-net=10.0.0.12 | az02              | osdev-03  | | server.az01-ha02 | ACTIVE | demo-net=10.0.0.6  | az01              | osdev-02  | | server.az01-ha01 | ACTIVE | demo-net=10.0.0.4  | az01              | osdev-01  | +------------------+--------+--------------------+-------------------+-----------+ 

###Related Source Code

  • Availability zone filter source (nova/scheduler/filters/availability_zone_filter.py):
class AvailabilityZoneFilter(filters.BaseHostFilter):     """Filters Hosts by availability zone.      Works with aggregate metadata availability zones, using the key     'availability_zone'     Note: in theory a compute node can be part of multiple availability_zones     """      # Availability zones do not change within a request     run_filter_once_per_request = True      RUN_ON_REBUILD = False      def host_passes(self, host_state, spec_obj):         availability_zone = spec_obj.availability_zone          if not availability_zone:             return True          metadata = utils.aggregate_metadata_get_by_host(                 host_state, key='availability_zone')          if 'availability_zone' in metadata:             hosts_passes = availability_zone in metadata['availability_zone']             host_az = metadata['availability_zone']         else:             hosts_passes = availability_zone == CONF.default_availability_zone             host_az = CONF.default_availability_zone          if not hosts_passes:             LOG.debug("Availability Zone '%(az)s' requested. "                       "%(host_state)s has AZs: %(host_az)s",                       {'host_state': host_state,                        'az': availability_zone,                        'host_az': host_az})          return hosts_passes 

As the code shows, the availability zone check first looks at the host's aggregate metadata; if no 'availability_zone' key is present there, it falls back to the node's default zone (CONF.default_availability_zone).

  • Fetching a host's aggregate metadata (nova/scheduler/filters/utils.py):
def aggregate_metadata_get_by_host(host_state, key=None):     """Returns a dict of all metadata based on a metadata key for a specific     host. If the key is not provided, returns a dict of all metadata.     """     aggrlist = host_state.aggregates     metadata = collections.defaultdict(set)     for aggr in aggrlist:         if key is None or key in aggr.metadata:             for k, v in aggr.metadata.items():                 metadata[k].update(x.strip() for x in v.split(','))     return metadata 
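To make the merge semantics concrete, below is a minimal standalone sketch (the FakeAggregate and FakeHostState classes are stand-ins invented for illustration, not Nova objects) showing why a host in the aggregates created earlier passes the filter for az02:

```python
import collections


class FakeAggregate:
    """Illustrative stand-in for a Nova Aggregate object."""
    def __init__(self, metadata):
        self.metadata = metadata


class FakeHostState:
    """Illustrative stand-in for a Nova HostState object."""
    def __init__(self, aggregates):
        self.aggregates = aggregates


def aggregate_metadata_get_by_host(host_state, key=None):
    # Same logic as the Nova helper above: merge the metadata of every
    # aggregate the host belongs to, splitting comma-separated values
    # into sets.
    metadata = collections.defaultdict(set)
    for aggr in host_state.aggregates:
        if key is None or key in aggr.metadata:
            for k, v in aggr.metadata.items():
                metadata[k].update(x.strip() for x in v.split(','))
    return metadata


# A host belonging to two aggregates (hypothetical, to show the merge).
host = FakeHostState([
    FakeAggregate({'availability_zone': 'az02'}),
    FakeAggregate({'availability_zone': 'az02', 'addr': '22'}),
])

md = aggregate_metadata_get_by_host(host, key='availability_zone')
print(md['availability_zone'])             # {'az02'}
print('az02' in md['availability_zone'])   # True -> the host passes
```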

##Affinity

###Filter Configuration

  • Add the SameHostFilter, DifferentHostFilter, ServerGroupAntiAffinityFilter, and ServerGroupAffinityFilter filters:
$ vi /etc/kolla/nova-scheduler/nova.conf
[DEFAULT]
...
scheduler_default_filters = ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, SameHostFilter, DifferentHostFilter, AggregateInstanceExtraSpecsFilter, RetryFilter, AvailabilityZoneFilter, RamFilter, DiskFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter
...

$ docker restart nova_scheduler

###Affinity Tests

####Same Host

  • Create a new instance on the node hosting server.az02-ha02:
$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint same_host=6a928cc0-1509-4e00-91c8-6b43ceb05373 server.same   $ openstack server list --long --column "ID" --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host" +--------------------------------------+------------------+--------+--------------------+-------------------+-----------+ | ID                                   | Name             | Status | Networks           | Availability Zone | Host      | +--------------------------------------+------------------+--------+--------------------+-------------------+-----------+ | aadc3d24-965f-437d-af40-70adee984cad | server.same      | ACTIVE | demo-net=10.0.0.10 | az02              | osdev-gpu | | 6a928cc0-1509-4e00-91c8-6b43ceb05373 | server.az02-ha02 | ACTIVE | demo-net=10.0.0.15 | az02              | osdev-gpu | | 3605afc0-7a8e-4ae9-a0cf-e0df64f6bfd6 | server.az02-ha01 | ACTIVE | demo-net=10.0.0.12 | az02              | osdev-03  | | cdec034a-bdca-4651-88d5-c34d17ea12f1 | server.az01-ha02 | ACTIVE | demo-net=10.0.0.6  | az01              | osdev-02  | | 276a4af1-762b-43c9-a064-7f1b27d46356 | server.az01-ha01 | ACTIVE | demo-net=10.0.0.4  | az01              | osdev-01  | +--------------------------------------+------------------+--------+--------------------+-------------------+-----------+ 

####Different Host

Create a new instance on a node other than the one hosting server.az02-ha02:

$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint different_host=6a928cc0-1509-4e00-91c8-6b43ceb05373 server.different  $ openstack server list --long --column "ID" --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host" +--------------------------------------+------------------+--------+--------------------+-------------------+-----------+ | ID                                   | Name             | Status | Networks           | Availability Zone | Host      | +--------------------------------------+------------------+--------+--------------------+-------------------+-----------+ | d652614a-ce1b-4314-8c13-33c77527950d | server.different | ACTIVE | demo-net=10.0.0.8  | az02              | osdev-03  | | aadc3d24-965f-437d-af40-70adee984cad | server.same      | ACTIVE | demo-net=10.0.0.10 | az02              | osdev-gpu | | 6a928cc0-1509-4e00-91c8-6b43ceb05373 | server.az02-ha02 | ACTIVE | demo-net=10.0.0.15 | az02              | osdev-gpu | | 3605afc0-7a8e-4ae9-a0cf-e0df64f6bfd6 | server.az02-ha01 | ACTIVE | demo-net=10.0.0.12 | az02              | osdev-03  | | cdec034a-bdca-4651-88d5-c34d17ea12f1 | server.az01-ha02 | ACTIVE | demo-net=10.0.0.6  | az01              | osdev-02  | | 276a4af1-762b-43c9-a064-7f1b27d46356 | server.az01-ha01 | ACTIVE | demo-net=10.0.0.4  | az01              | osdev-01  | +--------------------------------------+------------------+--------+--------------------+-------------------+-----------+ 

####Anti-Affinity Scheduling

  • Create a server group:
$ openstack server group create --policy anti-affinity group-anti-affinity +----------+--------------------------------------+ | Field    | Value                                | +----------+--------------------------------------+ | id       | 855ea22c-d369-4e90-a7b1-318064c72b16 | | members  |                                      | | name     | group-anti-affinity                  | | policies | anti-affinity                        | +----------+--------------------------------------+ 
  • Create the instances:
$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa1  $ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa2  $ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa3  $ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa4  $ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa5  $ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa6  $ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa7  $ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=855ea22c-d369-4e90-a7b1-318064c72b16 server.aa8 
  • Check the scheduling result:
$ openstack server list --long --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host" +--------------------+--------+--------------------+-------------------+------------+ | Name               | Status | Networks           | Availability Zone | Host       | +--------------------+--------+--------------------+-------------------+------------+ | server.aa8         | ERROR  |                    |                   | None       | | server.aa7         | ERROR  |                    |                   | None       | | server.aa6         | ERROR  |                    |                   | None       | | server.aa5         | ACTIVE | demo-net=10.0.0.10 | az02              | osdev-ceph | | server.aa4         | ACTIVE | demo-net=10.0.0.7  | az01              | osdev-01   | | server.aa3         | ACTIVE | demo-net=10.0.0.18 | az01              | osdev-02   | | server.aa2         | ACTIVE | demo-net=10.0.0.4  | az02              | osdev-03   | | server.aa1         | ACTIVE | demo-net=10.0.0.11 | az02              | osdev-gpu  | +--------------------+--------+--------------------+-------------------+------------+ 
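Only five compute nodes (osdev-01, osdev-02, osdev-03, osdev-ceph, osdev-gpu) exist in this environment, so under the anti-affinity policy server.aa1 through server.aa5 each land on a distinct host; the remaining three requests find no valid host and end up in ERROR.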

####Affinity Scheduling

  • Create a server group:
$ openstack server group create --policy affinity group-affinity +----------+--------------------------------------+ | Field    | Value                                | +----------+--------------------------------------+ | id       | d44f51d3-676e-4bca-ac56-76e998da9467 | | members  |                                      | | name     | group-affinity                       | | policies | affinity                             | +----------+--------------------------------------+ 
  • Create the instances:
$ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=d44f51d3-676e-4bca-ac56-76e998da9467 server.a1  $ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=d44f51d3-676e-4bca-ac56-76e998da9467 server.a2  $ openstack server create --image cirros --flavor m1.tiny --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --hint group=d44f51d3-676e-4bca-ac56-76e998da9467 server.a3 
  • Check the scheduling result (all three land on the same host, osdev-gpu):
$ openstack server list --long --column "Name" --column "Status" --column "Networks" --column "Availability Zone" --column "Host" +------------------+--------+--------------------+-------------------+------------+ | Name             | Status | Networks           | Availability Zone | Host       | +------------------+--------+--------------------+-------------------+------------+ | server.a3        | ACTIVE | demo-net=10.0.0.24 | az02              | osdev-gpu  | | server.a2        | ACTIVE | demo-net=10.0.0.20 | az02              | osdev-gpu  | | server.a1        | ACTIVE | demo-net=10.0.0.19 | az02              | osdev-gpu  | +------------------+--------+--------------------+-------------------+------------+ 

###Related Source Code

  • Testing whether instances are on a given host (nova/scheduler/filters/utils.py):
def instance_uuids_overlap(host_state, uuids):     """Tests for overlap between a host_state and a list of uuids.      Returns True if any of the supplied uuids match any of the instance.uuid     values in the host_state.     """     if isinstance(uuids, six.string_types):         uuids = [uuids]     set_uuids = set(uuids)     # host_state.instances is a dict whose keys are the instance uuids     host_uuids = set(host_state.instances.keys())     return bool(host_uuids.intersection(set_uuids)) 
  • Same host filter (nova/scheduler/filters/affinity_filter.py):
class SameHostFilter(filters.BaseHostFilter):     """Schedule the instance on the same host as another instance in a set of     instances.     """     # The hosts the instances are running on doesn't change within a request     run_filter_once_per_request = True      RUN_ON_REBUILD = False      def host_passes(self, host_state, spec_obj):         affinity_uuids = spec_obj.get_scheduler_hint('same_host')         if affinity_uuids:             overlap = utils.instance_uuids_overlap(host_state, affinity_uuids)             return overlap         # With no same_host key         return True 
  • Different host filter (nova/scheduler/filters/affinity_filter.py):
class DifferentHostFilter(filters.BaseHostFilter):     """Schedule the instance on a different host from a set of instances."""     # The hosts the instances are running on doesn't change within a request     run_filter_once_per_request = True      RUN_ON_REBUILD = False      def host_passes(self, host_state, spec_obj):         affinity_uuids = spec_obj.get_scheduler_hint('different_host')         if affinity_uuids:             overlap = utils.instance_uuids_overlap(host_state, affinity_uuids)             return not overlap         # With no different_host key         return True 
  • Anti-affinity filter source (nova/scheduler/filters/affinity_filter.py):
class _GroupAntiAffinityFilter(filters.BaseHostFilter):     """Schedule the instance on a different host from a set of group     hosts.     """      RUN_ON_REBUILD = False      def host_passes(self, host_state, spec_obj):         # Only invoke the filter if 'anti-affinity' is configured         policies = (spec_obj.instance_group.policies                     if spec_obj.instance_group else [])         if self.policy_name not in policies:             return True         # NOTE(hanrong): Move operations like resize can check the same source         # compute node where the instance is. That case, AntiAffinityFilter         # must not return the source as a non-possible destination.         if spec_obj.instance_uuid in host_state.instances.keys():             return True          group_hosts = (spec_obj.instance_group.hosts                        if spec_obj.instance_group else [])         LOG.debug("Group anti affinity: check if %(host)s not "                   "in %(configured)s", {'host': host_state.host,                                         'configured': group_hosts})         if group_hosts:             return host_state.host not in group_hosts          # No groups configured         return True   class ServerGroupAntiAffinityFilter(_GroupAntiAffinityFilter):     def __init__(self):         self.policy_name = 'anti-affinity'         super(ServerGroupAntiAffinityFilter, self).__init__() 

An instance is never scheduled onto a host that already runs a member of the same anti-affinity group.

  • Affinity filter source (nova/scheduler/filters/affinity_filter.py):
class _GroupAffinityFilter(filters.BaseHostFilter):     """Schedule the instance on to host from a set of group hosts.     """      RUN_ON_REBUILD = False      def host_passes(self, host_state, spec_obj):         # Only invoke the filter if 'affinity' is configured         policies = (spec_obj.instance_group.policies                     if spec_obj.instance_group else [])         if self.policy_name not in policies:             return True          group_hosts = (spec_obj.instance_group.hosts                        if spec_obj.instance_group else [])         LOG.debug("Group affinity: check if %(host)s in "                   "%(configured)s", {'host': host_state.host,                                      'configured': group_hosts})         if group_hosts:             return host_state.host in group_hosts          # No groups configured         return True   class ServerGroupAffinityFilter(_GroupAffinityFilter):     def __init__(self):         self.policy_name = 'affinity'         super(ServerGroupAffinityFilter, self).__init__() 

##NUMA Binding

###Parameter Configuration

  • Add metadata (extra specs) to the flavor via the following keys:

hw:numa_nodes=N                         - number of NUMA nodes in the VM
hw:numa_mempolicy=preferred|strict      - NUMA memory allocation policy for the VM
hw:numa_cpus.0=<cpu-list>               - vCPUs placed on VM NUMA node 0
hw:numa_cpus.1=<cpu-list>               - vCPUs placed on VM NUMA node 1
hw:numa_mem.0=<ram-size>                - memory size (MB) on VM NUMA node 0
hw:numa_mem.1=<ram-size>                - memory size (MB) on VM NUMA node 1

  • Add metadata to the image via the following keys:

hw_numa_nodes=N                        - number of NUMA nodes to expose to the guest
hw_numa_mempolicy=preferred|strict     - memory allocation policy
hw_numa_cpus.0=<cpu-list>              - mapping of vCPUs N-M to NUMA node 0
hw_numa_cpus.1=<cpu-list>              - mapping of vCPUs N-M to NUMA node 1
hw_numa_mem.0=<ram-size>               - mapping N MB of RAM to NUMA node 0
hw_numa_mem.1=<ram-size>               - mapping N MB of RAM to NUMA node 1

  • The fields mean the following:

numa_nodes:  number of NUMA nodes in the guest;

numa_cpus.0: IDs of the guest vCPUs on guest NUMA node 0, in a format like "1-4,6". When laying the topology out manually, the CPU placement must be given for every guest NUMA node, and the vCPUs across all nodes must add up to the flavor's vcpus count;

numa_mem.0:  memory size (MB) on guest NUMA node 0. When laying the topology out manually, the memory must be given for every guest NUMA node, and the totals across all nodes must equal the flavor's memory_mb.

Note:

N is the index of a guest NUMA node and does not necessarily correspond to a host NUMA node. For example, on a host with two NUMA nodes, the scheduler may place guest NUMA node 0 (as selected via hw:numa_mem.0) on host NUMA node 1, and vice versa. Likewise, FLAVOR-CORES are guest vCPU numbers and do not correspond to host CPUs. This feature therefore cannot be used to constrain which host CPUs or NUMA nodes an instance lands on.

Warning:

If the value of hw:numa_cpus.N or hw:numa_mem.N is larger than the available CPUs or memory, an error is raised.

  • Constraints on automatic NUMA allocation:

1. numa_cpus and numa_mem must not be set;
2. resources are split evenly across the nodes, starting from node 0 (e.g. with hw:numa_nodes=2, a 4-vCPU flavor gets 2 vCPUs and half the RAM per guest node).

  • Constraints on manual NUMA layout (see the example after this list):

1. the total number of CPUs specified must equal the flavor's CPU count;
2. the total memory specified must equal the flavor's memory;
3. both numa_cpus and numa_mem must be set;
4. the resources of each NUMA node must be specified, starting from node 0.
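For instance, under the rules above a 4-vCPU, 4096 MB flavor could be laid out manually with the following extra specs (an illustrative sketch; the even split is just one layout that satisfies the constraints):

hw:numa_nodes=2
hw:numa_cpus.0=0,1
hw:numa_cpus.1=2,3
hw:numa_mem.0=2048
hw:numa_mem.1=2048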

###Filter Configuration

Add the NUMATopologyFilter filter:

$ vi /etc/kolla/nova-scheduler/nova.conf
[DEFAULT]
...
scheduler_default_filters = NUMATopologyFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, SameHostFilter, DifferentHostFilter, AggregateInstanceExtraSpecsFilter, RetryFilter, AvailabilityZoneFilter, RamFilter, DiskFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter
...

$ docker restart nova_scheduler

###Inspecting CPUs

####Global Information

  • View via libvirt:
$ virsh nodeinfo
CPU model:           x86_64
CPU(s):              72
CPU frequency:       1963 MHz
CPU socket(s):       1
Core(s) per socket:  18
Thread(s) per core:  2
NUMA cell(s):        2
Memory size:         401248288 KiB
  • View via numactl (requires the numactl package):
$ numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 node 0 size: 195236 MB node 0 free: 153414 MB node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 1 size: 196608 MB node 1 free: 167539 MB node distances: node   0   1    0:  10  21    1:  21  10  $ numactl --show policy: default preferred node: current physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71  cpubind: 0 1  nodebind: 0 1  membind: 0 1 
  • View NUMA memory allocation statistics (equivalent to cat /sys/devices/system/node/node0/numastat):
$ numastat                            node0           node1 numa_hit             25692061493     30097928824 numa_miss                      0               0 numa_foreign                   0               0 interleave_hit            110618          109691 local_node           25689725277     30096685088 other_node               2336216         1243736 

numa_hit is the number of allocations that were intended for this node and were satisfied from it;

numa_miss is the number of allocations that were intended for another node but ended up being satisfied from this node (a high value suggests the allocation policy needs adjusting);

numa_foreign is the number of allocations that were intended for this node but ended up being satisfied from another node;

interleave_hit is the number of interleave-policy allocations that were satisfied from this node;

local_node is the number of allocations made by processes running on this node that were satisfied from this node;

other_node is the number of allocations on this node made by processes running on other nodes.
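As a sanity check against the numastat output above: for node0, local_node (25,689,725,277) plus other_node (2,336,216) equals numa_hit (25,692,061,493), and numa_miss is 0, meaning every allocation that targeted node0 was satisfied there.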

  • View per-CPU load (requires the sysstat package):
$ mpstat -P ALL Linux 3.10.0-693.17.1.el7.x86_64 (osdev-01) 	2018年03月19日 	_x86_64_	(72 CPU)  16时28分12秒  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle 16时28分12秒  all    2.38    8.79    0.70    0.00    0.00    0.01    0.00    0.00    0.00   88.11 16时28分12秒    0    8.83    3.29    1.50    0.00    0.00    0.30    0.00    0.00    0.00   86.09 16时28分12秒    1    7.54    2.88    1.29    0.00    0.00    0.05    0.00    0.00    0.00   88.24 16时28分12秒    2    7.60    2.89    1.30    0.00    0.00    0.03    0.00    0.00    0.00   88.18 16时28分12秒    3    6.97    2.93    1.20    0.00    0.00    0.02    0.00    0.00    0.00   88.87 16时28分12秒    4    3.91    5.84    0.87    0.00    0.00    0.02    0.00    0.00    0.00   89.36 16时28分12秒    5    3.95    5.64    0.88    0.00    0.00    0.01    0.00    0.00    0.00   89.52 16时28分12秒    6    3.14    7.24    0.80    0.00    0.00    0.01    0.00    0.00    0.00   88.80 16时28分12秒    7    2.38    8.41    0.74    0.01    0.00    0.01    0.00    0.00    0.00   88.46 16时28分12秒    8    2.34    9.37    0.76    0.01    0.00    0.01    0.00    0.00    0.00   87.53 16时28分12秒    9    2.19    9.73    0.75    0.01    0.00    0.01    0.00    0.00    0.00   87.32 16时28分12秒   10    2.12   10.09    0.75    0.01    0.00    0.01    0.00    0.00    0.00   87.03 16时28分12秒   11    2.10   10.44    0.74    0.01    0.00    0.01    0.00    0.00    0.00   86.71 16时28分12秒   12    2.04   10.72    0.75    0.01    0.00    0.01    0.00    0.00    0.00   86.48 16时28分12秒   13    2.00   11.14    0.76    0.01    0.00    0.01    0.00    0.00    0.00   86.10 16时28分12秒   14    2.00   11.52    0.76    0.01    0.00    0.01    0.00    0.00    0.00   85.70 16时28分12秒   15    1.97   11.80    0.76    0.01    0.00    0.01    0.00    0.00    0.00   85.45 16时28分12秒   16    1.97   12.03    0.76    0.01    0.00    0.01    0.00    0.00    0.00   85.22 16时28分12秒   17    1.96   12.20    0.76    0.01    0.00    0.01    0.00    0.00    0.00   85.06 16时28分12秒   18    2.17   17.85    0.97    0.01    0.00    0.00    0.00    0.00    0.00   79.00 16时28分12秒   19    2.55   17.04    0.99    0.01    0.00    0.00    0.00    0.00    0.00   79.41 16时28分12秒   20    2.72   17.36    1.07    0.01    0.00    0.00    0.00    0.00    0.00   78.84 16时28分12秒   21    2.61   17.27    1.03    0.01    0.00    0.00    0.00    0.00    0.00   79.07 16时28分12秒   22    2.37   17.39    1.00    0.01    0.00    0.00    0.00    0.00    0.00   79.23 16时28分12秒   23    2.16   17.50    0.99    0.01    0.00    0.00    0.00    0.00    0.00   79.35 16时28分12秒   24    2.05   17.52    0.98    0.01    0.00    0.00    0.00    0.00    0.00   79.44 16时28分12秒   25    1.98   17.52    0.97    0.01    0.00    0.00    0.00    0.00    0.00   79.53 16时28分12秒   26    1.93   17.47    0.97    0.01    0.00    0.00    0.00    0.00    0.00   79.62 16时28分12秒   27    1.89   17.50    0.97    0.01    0.00    0.00    0.00    0.00    0.00   79.63 16时28分12秒   28    1.85   17.51    0.96    0.01    0.00    0.00    0.00    0.00    0.00   79.66 16时28分12秒   29    1.83   17.50    0.96    0.01    0.00    0.00    0.00    0.00    0.00   79.70 16时28分12秒   30    1.82   17.46    0.96    0.01    0.00    0.00    0.00    0.00    0.00   79.76 16时28分12秒   31    1.78   17.42    0.95    0.01    0.00    0.00    0.00    0.00    0.00   79.83 16时28分12秒   32    1.79   17.36    0.94    0.01    0.00    0.00    0.00    0.00    0.00   79.90 16时28分12秒   33    1.77   17.34    0.94    0.01    0.00    0.00    0.00    0.00    0.00   79.94 16时28分12秒   34    1.75   17.33    0.94    
0.01    0.00    0.00    0.00    0.00    0.00   79.97 16时28分12秒   35    1.73   17.30    0.94    0.01    0.00    0.00    0.00    0.00    0.00   80.02 16时28分12秒   36    4.90    3.96    0.61    0.00    0.00    0.00    0.00    0.00    0.00   90.53 16时28分12秒   37    9.35    6.24    1.46    0.00    0.00    0.00    0.00    0.00    0.00   82.95 16时28分12秒   38    5.90    4.43    0.87    0.00    0.00    0.00    0.00    0.00    0.00   88.79 16时28分12秒   39    6.10    3.83    0.81    0.00    0.00    0.00    0.00    0.00    0.00   89.25 16时28分12秒   40    2.95    3.55    0.53    0.00    0.00    0.00    0.00    0.00    0.00   92.95 16时28分12秒   41    2.28    3.48    0.44    0.00    0.00    0.00    0.00    0.00    0.00   93.79 16时28分12秒   42    1.44    3.34    0.35    0.00    0.00    0.00    0.00    0.00    0.00   94.87 16时28分12秒   43    0.94    3.24    0.30    0.00    0.00    0.00    0.00    0.00    0.00   95.52 16时28分12秒   44    0.89    3.41    0.29    0.00    0.00    0.00    0.00    0.00    0.00   95.41 16时28分12秒   45    0.86    3.43    0.28    0.00    0.00    0.00    0.00    0.00    0.00   95.42 16时28分12秒   46    0.83    3.44    0.28    0.00    0.00    0.00    0.00    0.00    0.00   95.45 16时28分12秒   47    0.84    3.43    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.46 16时28分12秒   48    0.83    3.50    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.40 16时28分12秒   49    0.88    3.31    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.53 16时28分12秒   50    0.91    3.16    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.65 16时28分12秒   51    0.90    3.15    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.68 16时28分12秒   52    0.89    3.19    0.27    0.00    0.00    0.00    0.00    0.00    0.00   95.65 16时28分12秒   53    0.91    3.40    0.28    0.00    0.00    0.00    0.00    0.00    0.00   95.40 16时28分12秒   54    1.00    5.91    0.37    0.00    0.00    0.00    0.00    0.00    0.00   92.71 16时28分12秒   55    5.70    7.12    1.16    0.00    0.00    0.00    0.00    0.00    0.00   86.01 16时28分12秒   56    2.49    6.33    0.69    0.00    0.00    0.00    0.00    0.00    0.00   90.49 16时28分12秒   57    2.93    6.12    0.66    0.00    0.00    0.00    0.00    0.00    0.00   90.29 16时28分12秒   58    2.15    6.06    0.60    0.00    0.00    0.00    0.00    0.00    0.00   91.19 16时28分12秒   59    1.52    6.15    0.56    0.00    0.00    0.00    0.00    0.00    0.00   91.77 16时28分12秒   60    1.23    6.25    0.48    0.00    0.00    0.00    0.00    0.00    0.00   92.03 16时28分12秒   61    1.11    5.68    0.42    0.00    0.00    0.00    0.00    0.00    0.00   92.79 16时28分12秒   62    1.03    5.67    0.41    0.00    0.00    0.00    0.00    0.00    0.00   92.89 16时28分12秒   63    1.01    5.72    0.39    0.00    0.00    0.00    0.00    0.00    0.00   92.88 16时28分12秒   64    0.96    5.62    0.38    0.00    0.00    0.00    0.00    0.00    0.00   93.02 16时28分12秒   65    0.94    5.50    0.37    0.00    0.00    0.00    0.00    0.00    0.00   93.18 16时28分12秒   66    0.93    5.50    0.37    0.00    0.00    0.00    0.00    0.00    0.00   93.20 16时28分12秒   67    0.92    5.53    0.37    0.00    0.00    0.00    0.00    0.00    0.00   93.17 16时28分12秒   68    0.91    5.52    0.36    0.00    0.00    0.00    0.00    0.00    0.00   93.21 16时28分12秒   69    0.91    5.54    0.36    0.00    0.00    0.00    0.00    0.00    0.00   93.18 16时28分12秒   70    0.89    5.61    0.36    0.00    0.00    0.00    0.00    0.00    0.00   93.13 16时28分12秒   71    0.89    5.94    0.37    0.00    0.00    0.00    0.00    0.00    
0.00   92.79 

####Individual Checks

  • View the number of sockets:
$ grep 'physical id' /proc/cpuinfo | awk -F: '{print $2 | "sort -un"}' | wc -l 2 

There are 2 sockets in total.

  • View the number of processors per socket:
$ grep 'physical id' /proc/cpuinfo | awk -F: '{print $2}' | sort | uniq -c      36  0      36  1 

Each socket has 36 processors.

  • View the number of cores per socket:
cat /proc/cpuinfo | grep 'core'  | sort -u core id		: 0 core id		: 1 core id		: 10 core id		: 11 core id		: 16 core id		: 17 core id		: 18 core id		: 19 core id		: 2 core id		: 20 core id		: 24 core id		: 25 core id		: 26 core id		: 27 core id		: 3 core id		: 4 core id		: 8 core id		: 9 cpu cores	: 18 

Each socket contains 18 cores, and each core provides 2 processors (hyper-threads): 2 sockets × 18 cores × 2 threads = 72 logical CPUs, matching the virsh nodeinfo output above.

####Topology Analysis

  • CPU topology analysis script:
#!/bin/bash  # Simple print cpu topology # Author: kodango  function get_nr_processor() {     grep '^processor' /proc/cpuinfo | wc -l }  function get_nr_socket() {     grep 'physical id' /proc/cpuinfo | awk -F: '{             print $2 | "sort -un"}' | wc -l }  function get_nr_siblings() {     grep 'siblings' /proc/cpuinfo | awk -F: '{             print $2 | "sort -un"}' }  function get_nr_cores_of_socket() {     grep 'cpu cores' /proc/cpuinfo | awk -F: '{             print $2 | "sort -un"}' }  echo '===== CPU Topology Table =====' echo  echo '+--------------+---------+-----------+' echo '| Processor ID | Core ID | Socket ID |' echo '+--------------+---------+-----------+'  while read line; do     if [ -z "$line" ]; then         printf '| %-12s | %-7s | %-9s |\n' $p_id $c_id $s_id         echo '+--------------+---------+-----------+'         continue     fi      if echo "$line" | grep -q "^processor"; then         p_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`      fi      if echo "$line" | grep -q "^core id"; then         c_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`      fi      if echo "$line" | grep -q "^physical id"; then         s_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`      fi done < /proc/cpuinfo  echo  awk -F: '{      if ($1 ~ /processor/) {         gsub(/ /,"",$2);         p_id=$2;     } else if ($1 ~ /physical id/){         gsub(/ /,"",$2);         s_id=$2;         arr[s_id]=arr[s_id] " " p_id     } }   END{     for (i in arr)          printf "Socket %s:%s\n", i, arr[i]; }' /proc/cpuinfo  echo echo '===== CPU Info Summary =====' echo  nr_processor=`get_nr_processor` echo "Logical processors: $nr_processor"  nr_socket=`get_nr_socket` echo "Physical socket: $nr_socket"  nr_siblings=`get_nr_siblings` echo "Siblings in one socket: $nr_siblings"  nr_cores=`get_nr_cores_of_socket` echo "Cores in one socket: $nr_cores"  let nr_cores*=nr_socket echo "Cores in total: $nr_cores"  if [ "$nr_cores" = "$nr_processor" ]; then     echo "Hyper-Threading: off" else     echo "Hyper-Threading: on" fi  echo echo '===== END =====' 
  • Run the CPU topology analysis script:
./cpu_view.sh  ===== CPU Topology Table =====  +--------------+---------+-----------+ | Processor ID | Core ID | Socket ID | +--------------+---------+-----------+ | 0            | 0       | 0         | +--------------+---------+-----------+ | 1            | 1       | 0         | +--------------+---------+-----------+ | 2            | 2       | 0         | +--------------+---------+-----------+ | 3            | 3       | 0         | +--------------+---------+-----------+ | 4            | 4       | 0         | +--------------+---------+-----------+ | 5            | 8       | 0         | +--------------+---------+-----------+ | 6            | 9       | 0         | +--------------+---------+-----------+ | 7            | 10      | 0         | +--------------+---------+-----------+ | 8            | 11      | 0         | +--------------+---------+-----------+ | 9            | 16      | 0         | +--------------+---------+-----------+ | 10           | 17      | 0         | +--------------+---------+-----------+ | 11           | 18      | 0         | +--------------+---------+-----------+ | 12           | 19      | 0         | +--------------+---------+-----------+ | 13           | 20      | 0         | +--------------+---------+-----------+ | 14           | 24      | 0         | +--------------+---------+-----------+ | 15           | 25      | 0         | +--------------+---------+-----------+ | 16           | 26      | 0         | +--------------+---------+-----------+ | 17           | 27      | 0         | +--------------+---------+-----------+ | 18           | 0       | 1         | +--------------+---------+-----------+ | 19           | 1       | 1         | +--------------+---------+-----------+ | 20           | 2       | 1         | +--------------+---------+-----------+ | 21           | 3       | 1         | +--------------+---------+-----------+ | 22           | 4       | 1         | +--------------+---------+-----------+ | 23           | 8       | 1         | +--------------+---------+-----------+ | 24           | 9       | 1         | +--------------+---------+-----------+ | 25           | 10      | 1         | +--------------+---------+-----------+ | 26           | 11      | 1         | +--------------+---------+-----------+ | 27           | 16      | 1         | +--------------+---------+-----------+ | 28           | 17      | 1         | +--------------+---------+-----------+ | 29           | 18      | 1         | +--------------+---------+-----------+ | 30           | 19      | 1         | +--------------+---------+-----------+ | 31           | 20      | 1         | +--------------+---------+-----------+ | 32           | 24      | 1         | +--------------+---------+-----------+ | 33           | 25      | 1         | +--------------+---------+-----------+ | 34           | 26      | 1         | +--------------+---------+-----------+ | 35           | 27      | 1         | +--------------+---------+-----------+ | 36           | 0       | 0         | +--------------+---------+-----------+ | 37           | 1       | 0         | +--------------+---------+-----------+ | 38           | 2       | 0         | +--------------+---------+-----------+ | 39           | 3       | 0         | +--------------+---------+-----------+ | 40           | 4       | 0         | +--------------+---------+-----------+ | 41           | 8       | 0         | +--------------+---------+-----------+ | 42           | 9       | 0         | +--------------+---------+-----------+ | 43           | 10      | 0         
| +--------------+---------+-----------+ | 44           | 11      | 0         | +--------------+---------+-----------+ | 45           | 16      | 0         | +--------------+---------+-----------+ | 46           | 17      | 0         | +--------------+---------+-----------+ | 47           | 18      | 0         | +--------------+---------+-----------+ | 48           | 19      | 0         | +--------------+---------+-----------+ | 49           | 20      | 0         | +--------------+---------+-----------+ | 50           | 24      | 0         | +--------------+---------+-----------+ | 51           | 25      | 0         | +--------------+---------+-----------+ | 52           | 26      | 0         | +--------------+---------+-----------+ | 53           | 27      | 0         | +--------------+---------+-----------+ | 54           | 0       | 1         | +--------------+---------+-----------+ | 55           | 1       | 1         | +--------------+---------+-----------+ | 56           | 2       | 1         | +--------------+---------+-----------+ | 57           | 3       | 1         | +--------------+---------+-----------+ | 58           | 4       | 1         | +--------------+---------+-----------+ | 59           | 8       | 1         | +--------------+---------+-----------+ | 60           | 9       | 1         | +--------------+---------+-----------+ | 61           | 10      | 1         | +--------------+---------+-----------+ | 62           | 11      | 1         | +--------------+---------+-----------+ | 63           | 16      | 1         | +--------------+---------+-----------+ | 64           | 17      | 1         | +--------------+---------+-----------+ | 65           | 18      | 1         | +--------------+---------+-----------+ | 66           | 19      | 1         | +--------------+---------+-----------+ | 67           | 20      | 1         | +--------------+---------+-----------+ | 68           | 24      | 1         | +--------------+---------+-----------+ | 69           | 25      | 1         | +--------------+---------+-----------+ | 70           | 26      | 1         | +--------------+---------+-----------+ | 71           | 27      | 1         | +--------------+---------+-----------+  Socket 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 Socket 1: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71  ===== CPU Info Summary =====  Logical processors: 72 Physical socket: 2 Siblings in one socket:  36 Cores in one socket:  18 Cores in total: 36 Hyper-Threading: on  ===== END ===== 

###Binding Tests

####Creating a Regular Instance

  • Create a flavor for NUMA testing:
$ openstack flavor create --vcpus 2 --ram 64 --disk 1 machine.numa 
  • Create a regular instance:
$ openstack server create --image cirros --flavor machine.numa --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.numa1 
  • View the libvirt configuration:
$ openstack server show server.numa1 | grep instance_name | awk '{print $4}' $ virsh edit instance-0000001f ...   <memory unit='KiB'>65536</memory>   <currentMemory unit='KiB'>65536</currentMemory>   <vcpu placement='static'>2</vcpu>   <cputune>     <shares>2048</shares>   </cputune> ...   <cpu mode='host-model' check='partial'>     <model fallback='allow'/>     <topology sockets='2' cores='1' threads='1'/>   </cpu> ... 
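Note the absence of <vcpupin> and <numatune> elements: with no NUMA extra specs on the flavor, nothing ties the instance to particular host CPUs or memory nodes, as the affinity check below confirms.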
  • View CPU affinity and placement:
$ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk 'NR==1 {print $2}' | xargs taskset -c -p pid 180564's current affinity list: 0-71  $ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk 'NR==1 {print $2}' | xargs ps -m -o pid,psr,comm -p    PID PSR COMMAND 180564   - qemu-kvm      -  61 -      -  22 -      -  13 -      -   1 -      -  58 - 
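The affinity list covers all 72 logical CPUs, and the PSR column shows the qemu-kvm threads running on CPUs from both sockets (e.g. 1 and 13 on node 0, 22 and 61 on node 1), confirming that the instance is unpinned.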
  • Set the flavor's NUMA properties:
$ nova flavor-key machine.numa set hw:numa_nodes=1 hw:numa_cpus.0=0,1 hw:numa_mem.0=64 # nova flavor-key machine.numa unset hw:numa_nodes hw:numa_cpus.0 hw:numa_mem.0  $ openstack flavor show machine.numa +----------------------------+-------------------------------------------------------------+ | Field                      | Value                                                       | +----------------------------+-------------------------------------------------------------+ | OS-FLV-DISABLED:disabled   | False                                                       | | OS-FLV-EXT-DATA:ephemeral  | 0                                                           | | access_project_ids         | None                                                        | | disk                       | 1                                                           | | id                         | fc37ea6f-3e69-422f-a05e-0ee56837a84d                        | | name                       | machine.numa                                                | | os-flavor-access:is_public | True                                                        | | properties                 | hw:numa_cpus.0='0,1', hw:numa_mem.0='64', hw:numa_nodes='1' | | ram                        | 64                                                          | | rxtx_factor                | 1.0                                                         | | swap                       |                                                             | | vcpus                      | 2                                                           | +----------------------------+-------------------------------------------------------------+ 
  • Stop and start the previously created instance:
$ openstack server stop server.numa1 $ openstack server start server.numa1 
  • The NUMA properties are unchanged (a flavor's extra specs are captured when the instance is built, so editing them later does not affect existing instances):
$ openstack server show server.numa1 | grep instance_name | awk '{print $4}' $ virsh edit instance-00000022 ...   <memory unit='KiB'>65536</memory>   <currentMemory unit='KiB'>65536</currentMemory>   <vcpu placement='static'>2</vcpu>   <cputune>     <shares>2048</shares>   </cputune> ...   <cpu mode='host-model' check='partial'>     <model fallback='allow'/>     <topology sockets='2' cores='1' threads='1'/>   </cpu> ...  
  • View CPU affinity and placement:
$ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk 'NR==1 {print $2}' | xargs taskset -c -p pid 219152's current affinity list: 0-71  $ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list /proc/219152/task/219152/status:Cpus_allowed_list:	0-71 /proc/219152/task/219220/status:Cpus_allowed_list:	0-71 /proc/219152/task/219225/status:Cpus_allowed_list:	0-71 /proc/219152/task/219227/status:Cpus_allowed_list:	0-71 /proc/219152/task/219250/status:Cpus_allowed_list:	0-71  $ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk 'NR==1 {print $2}' | xargs ps -m -o pid,psr,comm -p    PID PSR COMMAND 219152   - qemu-kvm      -  31 -      -  64 -      -   1 -      -  12 -      -  55 - 
  • View memory allocation:
$ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk 'NR==1 {print $2}' | xargs -I {} cat /proc/{}/numa_maps ...  55cdcffcd000 default file=/usr/libexec/qemu-kvm mapped=1328 mapmax=2 N0=1287 N1=41 kernelpagesize_kB=4 55cdd0944000 default file=/usr/libexec/qemu-kvm anon=248 dirty=248 N1=248 kernelpagesize_kB=4 55cdd0afc000 default file=/usr/libexec/qemu-kvm anon=95 dirty=95 N0=5 N1=90 kernelpagesize_kB=4 55cdd0b5c000 default anon=19 dirty=19 N0=4 N1=15 kernelpagesize_kB=4 55cdd1d83000 default heap anon=10419 dirty=10419 N0=1158 N1=9261 kernelpagesize_kB=4 7f35cbbba000 default 7f35cbbbb000 default anon=1 dirty=1 N0=1 kernelpagesize_kB=4 7f35cbcbb000 default 7f35cbcbc000 default anon=1 dirty=1 N0=1 kernelpagesize_kB=4 7f35cedc2000 default 7f35cedc3000 default anon=4 dirty=4 N0=4 kernelpagesize_kB=4 7f35d05c5000 default 7f35d05c6000 default anon=1 dirty=1 N1=1 kernelpagesize_kB=4 7f35d06c6000 default ... 

####Creating a NUMA-Bound Instance

  • Create an instance from the flavor with NUMA parameters set; its vCPUs all end up on node 0:
$ openstack server create --image cirros --flavor machine.numa --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.numa2 
  • View the libvirt configuration:
$ openstack server show server.numa2 | grep instance_name | awk '{print $4}' instance-00000024  $ virsh edit instance-00000024 ...   <memory unit='KiB'>65536</memory>   <currentMemory unit='KiB'>65536</currentMemory>   <vcpu placement='static'>2</vcpu>   <cputune>     <shares>2048</shares>     <vcpupin vcpu='0' cpuset='0-17,36-53'/>     <vcpupin vcpu='1' cpuset='0-17,36-53'/>     <emulatorpin cpuset='0-17,36-53'/>   </cputune>   <numatune>     <memory mode='strict' nodeset='0'/>     <memnode cellid='0' mode='strict' nodeset='0'/>   </numatune> ...    <cpu mode='host-model' check='partial'>     <model fallback='allow'/>     <topology sockets='2' cores='1' threads='1'/>     <numa>       <cell id='0' cpus='0-1' memory='65536' unit='KiB'/>     </numa> ... 
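The cpuset 0-17,36-53 in <vcpupin> and <emulatorpin> is exactly the CPU list of host NUMA node 0 reported by numactl --hardware earlier, and <numatune> additionally pins the guest memory to host node 0 in strict mode.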
  • View CPU affinity and pinning:
$ ps -aux | grep `openstack server show server.numa2 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs taskset -c -p pid 1139's current affinity list: 0-17,36-53  $ ps -aux | grep `openstack server show server.numa2 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list /proc/1139/task/1139/status:Cpus_allowed_list:	0-17,36-53 /proc/1139/task/1143/status:Cpus_allowed_list:	0-17,36-53 /proc/1139/task/1148/status:Cpus_allowed_list:	0-17,36-53 /proc/1139/task/1149/status:Cpus_allowed_list:	0-17,36-53 /proc/1139/task/1151/status:Cpus_allowed_list:	0-17,36-53  $ ps -aux | grep `openstack server show server.numa2 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p    PID PSR COMMAND   1139   - qemu-kvm      -   3 -      -   8 -      -   2 -      -   6 -      -  51 -  
  • View memory allocation:
$ ps -aux | grep `openstack server show server.numa2 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps ...  56133b0b3000 default file=/usr/libexec/qemu-kvm mapped=1326 mapmax=2 N0=1292 N1=34 kernelpagesize_kB=4 56133ba2a000 default file=/usr/libexec/qemu-kvm anon=248 dirty=248 N0=248 kernelpagesize_kB=4 56133bbe2000 default file=/usr/libexec/qemu-kvm anon=95 dirty=95 N0=95 kernelpagesize_kB=4 56133bc42000 default anon=19 dirty=19 N0=19 kernelpagesize_kB=4 56133db66000 default heap anon=3415 dirty=3415 N0=3415 kernelpagesize_kB=4 ... 
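With mode='strict' and nodeset='0', the guest's memory (most visibly the heap mapping) is allocated entirely from node 0 (N0=3415 pages, none on N1), unlike the regular instance above, whose heap pages were spread across both nodes.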

####Memory Allocation Comparison

  • Memory allocation comparison script (numa-maps-summary.pl):
#!/usr/bin/perl  # Copyright (c) 2010, Jeremy Cole <jeremy@jcole.us>  # This program is free software; you can redistribute it and/or modify it # under the terms of either: the GNU General Public License as published # by the Free Software Foundation; or the Artistic License. #  # See http://dev.perl.org/licenses/ for more information.  # # This script expects a numa_maps file as input.  It is normally run in # the following way: # #     # perl numa-maps-summary.pl < /proc/pid/numa_maps # # Additionally, it can be used (of course) with saved numa_maps, and it # will also accept numa_maps output with ">" prefixes from an email quote. # It doesn't care what's in the output, it merely summarizes whatever it # finds. # # The output should look something like the following: # #     N0        :      7983584 ( 30.45 GB) #     N1        :      5440464 ( 20.75 GB) #     active    :     13406601 ( 51.14 GB) #     anon      :     13422697 ( 51.20 GB) #     dirty     :     13407242 ( 51.14 GB) #     mapmax    :          977 (  0.00 GB) #     mapped    :         1377 (  0.01 GB) #     swapcache :      3619780 ( 13.81 GB) #  use Data::Dumper;  sub parse_numa_maps_line($$) {   my ($line, $map) = @_;    if($line =~ /^[> ]*([0-9a-fA-F]+) (\S+)(.*)/)   {     my ($address, $policy, $flags) = ($1, $2, $3);      $map->{$address}->{'policy'} = $policy;      $flags =~ s/^\s+//g;     $flags =~ s/\s+$//g;     foreach my $flag (split / /, $flags)     {       my ($key, $value) = split /=/, $flag;       $map->{$address}->{'flags'}->{$key} = $value;     }   }  }  sub parse_numa_maps() {   my ($fd) = @_;   my $map = {};    while(my $line = <$fd>)   {     &parse_numa_maps_line($line, $map);    }   return $map; }  my $map = &parse_numa_maps(\*STDIN);  my $sums = {};  foreach my $address (keys %{$map}) {   if(exists($map->{$address}->{'flags'}))   {     my $flags = $map->{$address}->{'flags'};     foreach my $flag (keys %{$flags})     {       next if $flag eq 'file';       $sums->{$flag} += $flags->{$flag} if defined $flags->{$flag};     }   } }  foreach my $key (sort keys %{$sums}) {   printf "%-10s: %12i (%6.2f GB)\n", $key, $sums->{$key}, $sums->{$key}/262144; }  
  • An ordinary (unpinned) VM has a sizeable amount of memory allocated on both nodes:
$ ps -aux | grep `openstack server show server.numa1 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps | perl numa-maps-summary.pl
N0        :         8721 (  0.03 GB)
N1        :        21186 (  0.08 GB)
active    :            0 (  0.00 GB)
anon      :        26787 (  0.10 GB)
dirty     :        26797 (  0.10 GB)
kernelpagesize_kB:         1660 (  0.01 GB)
mapmax    :         4116 (  0.02 GB)
mapped    :         3110 (  0.01 GB)
  • The NUMA-pinned VM's memory sits almost entirely on a single node (node 0 in this output):
$ ps -aux | grep `openstack server show server.numa2 | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps | perl numa-maps-summary.pl
N0        :        21731 (  0.08 GB)
N1        :           34 (  0.00 GB)
active    :            0 (  0.00 GB)
anon      :        18647 (  0.07 GB)
dirty     :        18657 (  0.07 GB)
kernelpagesize_kB:         1708 (  0.01 GB)
mapmax    :         4116 (  0.02 GB)
mapped    :         3108 (  0.01 GB)

###Related Source Code

####Instance Creation

  • When an instance is created, the request parameters are built and validated (nova/compute/api.py):
    @hooks.add_hook("create_instance")
    def create(self, context, instance_type,
               image_href, kernel_id=None, ramdisk_id=None,
               min_count=None, max_count=None,
               display_name=None, display_description=None,
               key_name=None, key_data=None, security_groups=None,
               availability_zone=None, forced_host=None, forced_node=None,
               user_data=None, metadata=None, injected_files=None,
               admin_password=None, block_device_mapping=None,
               access_ip_v4=None, access_ip_v6=None, requested_networks=None,
               config_drive=None, auto_disk_config=None, scheduler_hints=None,
               legacy_bdm=True, shutdown_terminate=False,
               check_server_group_quota=False):
        """Provision instances, sending instance information to the
        scheduler.  The scheduler will determine where the instance(s)
        go and will handle creating the DB entries.

        Returns a tuple of (instances, reservation_id)
        """
        if requested_networks and max_count is not None and max_count > 1:
            self._check_multiple_instances_with_specified_ip(
                requested_networks)
            if utils.is_neutron():
                self._check_multiple_instances_with_neutron_ports(
                    requested_networks)

        if availability_zone:
            available_zones = availability_zones.\
                get_availability_zones(context.elevated(), True)
            if forced_host is None and availability_zone not in \
                    available_zones:
                msg = _('The requested availability zone is not available')
                raise exception.InvalidRequest(msg)

        filter_properties = scheduler_utils.build_filter_properties(
                scheduler_hints, forced_host, forced_node, instance_type)

        return self._create_instance(
                       context, instance_type,
                       image_href, kernel_id, ramdisk_id,
                       min_count, max_count,
                       display_name, display_description,
                       key_name, key_data, security_groups,
                       availability_zone, user_data, metadata,
                       injected_files, admin_password,
                       access_ip_v4, access_ip_v6,
                       requested_networks, config_drive,
                       block_device_mapping, auto_disk_config,
                       filter_properties=filter_properties,
                       legacy_bdm=legacy_bdm,
                       shutdown_terminate=shutdown_terminate,
                       check_server_group_quota=check_server_group_quota)

    def _create_instance(self, context, instance_type,
               image_href, kernel_id, ramdisk_id,
               min_count, max_count,
               display_name, display_description,
               key_name, key_data, security_groups,
               availability_zone, user_data, metadata, injected_files,
               admin_password, access_ip_v4, access_ip_v6,
               requested_networks, config_drive,
               block_device_mapping, auto_disk_config, filter_properties,
               reservation_id=None, legacy_bdm=True, shutdown_terminate=False,
               check_server_group_quota=False):
        """Verify all the input parameters regardless of the provisioning
        strategy being performed and schedule the instance(s) for
        creation.
        """

        # ...

        base_options, max_net_count, key_pair, security_groups = \
                self._validate_and_build_base_options(
                    context, instance_type, boot_meta, image_href, image_id,
                    kernel_id, ramdisk_id, display_name, display_description,
                    key_name, key_data, security_groups, availability_zone,
                    user_data, metadata, access_ip_v4, access_ip_v6,
                    requested_networks, config_drive, auto_disk_config,
                    reservation_id, max_count)

        # ...

    def _validate_and_build_base_options(self, context, instance_type,
                                         boot_meta, image_href, image_id,
                                         kernel_id, ramdisk_id, display_name,
                                         display_description, key_name,
                                         key_data, security_groups,
                                         availability_zone, user_data,
                                         metadata, access_ip_v4, access_ip_v6,
                                         requested_networks, config_drive,
                                         auto_disk_config, reservation_id,
                                         max_count):
        """Verify all the input parameters regardless of the provisioning
        strategy being performed.
        """
        # ...

        numa_topology = hardware.numa_get_constraints(
                instance_type, image_meta)

        # ...
  • Build the NUMA constraints (nova/virt/hardware.py):
# TODO(sahid): Move numa related to hardware/numa.py
def numa_get_constraints(flavor, image_meta):
    """Return topology related to input request.

    :param flavor: a flavor object to read extra specs from
    :param image_meta: nova.objects.ImageMeta object instance

    :raises: exception.InvalidNUMANodesNumber if the number of NUMA
             nodes is less than 1 or not an integer
    :raises: exception.ImageNUMATopologyForbidden if an attempt is made
             to override flavor settings with image properties
    :raises: exception.MemoryPageSizeInvalid if flavor extra spec or
             image metadata provides an invalid hugepage value
    :raises: exception.MemoryPageSizeForbidden if flavor extra spec
             request conflicts with image metadata request
    :raises: exception.ImageNUMATopologyIncomplete if the image
             properties are not correctly specified
    :raises: exception.ImageNUMATopologyAsymmetric if the number of
             NUMA nodes is not a factor of the requested total CPUs or
             memory
    :raises: exception.ImageNUMATopologyCPUOutOfRange if an instance
             CPU given in a NUMA mapping is not valid
    :raises: exception.ImageNUMATopologyCPUDuplicates if an instance
             CPU is specified in CPU mappings for two NUMA nodes
    :raises: exception.ImageNUMATopologyCPUsUnassigned if an instance
             CPU given in a NUMA mapping is not assigned to any NUMA node
    :raises: exception.ImageNUMATopologyMemoryOutOfRange if sum of memory from
             each NUMA node is not equal with total requested memory
    :raises: exception.ImageCPUPinningForbidden if a CPU policy
             specified in a flavor conflicts with one defined in image
             metadata
    :raises: exception.RealtimeConfigurationInvalid if realtime is
             requested but dedicated CPU policy is not also requested
    :raises: exception.RealtimeMaskNotFoundOrInvalid if realtime is
             requested but no mask provided
    :raises: exception.CPUThreadPolicyConfigurationInvalid if a CPU thread
             policy conflicts with CPU allocation policy
    :raises: exception.ImageCPUThreadPolicyForbidden if a CPU thread policy
             specified in a flavor conflicts with one defined in image metadata
    :returns: objects.InstanceNUMATopology, or None
    """
    flavor_nodes, image_nodes = _get_flavor_image_meta(
        'numa_nodes', flavor, image_meta)
    if flavor_nodes and image_nodes:
        raise exception.ImageNUMATopologyForbidden(
            name='hw_numa_nodes')

    nodes = None
    if flavor_nodes:
        _validate_numa_nodes(flavor_nodes)
        nodes = int(flavor_nodes)
    else:
        _validate_numa_nodes(image_nodes)
        nodes = image_nodes

    pagesize = _numa_get_pagesize_constraints(
        flavor, image_meta)

    numa_topology = None
    if nodes or pagesize:
        nodes = nodes or 1

        cpu_list = _numa_get_cpu_map_list(flavor, image_meta)
        mem_list = _numa_get_mem_map_list(flavor, image_meta)

        # If one property list is specified both must be
        if ((cpu_list is None and mem_list is not None) or
            (cpu_list is not None and mem_list is None)):
            raise exception.ImageNUMATopologyIncomplete()

        # If any node has data set, all nodes must have data set
        if ((cpu_list is not None and len(cpu_list) != nodes) or
            (mem_list is not None and len(mem_list) != nodes)):
            raise exception.ImageNUMATopologyIncomplete()

        if cpu_list is None:
            numa_topology = _numa_get_constraints_auto(
                nodes, flavor)
        else:
            numa_topology = _numa_get_constraints_manual(
                nodes, flavor, cpu_list, mem_list)

        # We currently support same pagesize for all cells.
        [setattr(c, 'pagesize', pagesize) for c in numa_topology.cells]

    cpu_policy = _get_cpu_policy_constraints(flavor, image_meta)
    cpu_thread_policy = _get_cpu_thread_policy_constraints(flavor, image_meta)
    rt_mask = _get_realtime_mask(flavor, image_meta)

    # sanity checks

    rt = is_realtime_enabled(flavor)

    if rt and cpu_policy != fields.CPUAllocationPolicy.DEDICATED:
        raise exception.RealtimeConfigurationInvalid()

    if rt and not rt_mask:
        raise exception.RealtimeMaskNotFoundOrInvalid()

    if cpu_policy == fields.CPUAllocationPolicy.SHARED:
        if cpu_thread_policy:
            raise exception.CPUThreadPolicyConfigurationInvalid()
        return numa_topology

    if numa_topology:
        for cell in numa_topology.cells:
            cell.cpu_policy = cpu_policy
            cell.cpu_thread_policy = cpu_thread_policy
    else:
        single_cell = objects.InstanceNUMACell(
                id=0,
                cpuset=set(range(flavor.vcpus)),
                memory=flavor.memory_mb,
                cpu_policy=cpu_policy,
                cpu_thread_policy=cpu_thread_policy)
        numa_topology = objects.InstanceNUMATopology(cells=[single_cell])

    return numa_topology


def _get_cpu_policy_constraints(flavor, image_meta):
    """Validate and return the requested CPU policy."""
    flavor_policy, image_policy = _get_flavor_image_meta(
        'cpu_policy', flavor, image_meta)

    if flavor_policy == fields.CPUAllocationPolicy.DEDICATED:
        cpu_policy = flavor_policy
    elif flavor_policy == fields.CPUAllocationPolicy.SHARED:
        if image_policy == fields.CPUAllocationPolicy.DEDICATED:
            raise exception.ImageCPUPinningForbidden()
        cpu_policy = flavor_policy
    elif image_policy == fields.CPUAllocationPolicy.DEDICATED:
        cpu_policy = image_policy
    else:
        cpu_policy = fields.CPUAllocationPolicy.SHARED

    return cpu_policy


def _get_cpu_thread_policy_constraints(flavor, image_meta):
    """Validate and return the requested CPU thread policy."""
    flavor_policy, image_policy = _get_flavor_image_meta(
        'cpu_thread_policy', flavor, image_meta)

    if flavor_policy in [None, fields.CPUThreadAllocationPolicy.PREFER]:
        policy = flavor_policy or image_policy
    elif image_policy and image_policy != flavor_policy:
        raise exception.ImageCPUThreadPolicyForbidden()
    else:
        policy = flavor_policy

    return policy

As the code shows, flavor parameters take precedence over image parameters, and an image may not override NUMA settings already present in the flavor.
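A quick sketch of that rule in practice, using the flavor from the test section below and a hypothetical image name:

$ nova flavor-key machine.cpu set hw:numa_nodes=2
$ openstack image set --property hw_numa_nodes=1 centos7.img
# Booting this combination is rejected with an ImageNUMATopologyForbidden
# error: image properties may not override the flavor's NUMA settings.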

####NUMA Filter

  • Source of the NUMA topology / CPU pinning filter (nova/scheduler/filters/numa_topology_filter.py):
from oslo_log import log as logging

from nova import objects
from nova.objects import fields
from nova.scheduler import filters
from nova.virt import hardware

LOG = logging.getLogger(__name__)


class NUMATopologyFilter(filters.BaseHostFilter):
    """Filter on requested NUMA topology."""

    RUN_ON_REBUILD = True

    def _satisfies_cpu_policy(self, host_state, extra_specs, image_props):
        """Check that the host_state provided satisfies any available
        CPU policy requirements.
        """
        host_topology, _ = hardware.host_topology_and_format_from_host(
            host_state)
        # NOTE(stephenfin): There can be conflicts between the policy
        # specified by the image and that specified by the instance, but this
        # is not the place to resolve these. We do this during scheduling.
        cpu_policy = [extra_specs.get('hw:cpu_policy'),
                      image_props.get('hw_cpu_policy')]
        cpu_thread_policy = [extra_specs.get('hw:cpu_thread_policy'),
                             image_props.get('hw_cpu_thread_policy')]

        if not host_topology:
            return True

        if fields.CPUAllocationPolicy.DEDICATED not in cpu_policy:
            return True

        if fields.CPUThreadAllocationPolicy.REQUIRE not in cpu_thread_policy:
            return True

        # the presence of siblings in at least one cell indicates
        # hyperthreading (HT)
        has_hyperthreading = any(cell.siblings for cell in host_topology.cells)

        if not has_hyperthreading:
            LOG.debug("%(host_state)s fails CPU policy requirements. "
                      "Host does not have hyperthreading or "
                      "hyperthreading is disabled, but 'require' threads "
                      "policy was requested.", {'host_state': host_state})
            return False

        return True

    def host_passes(self, host_state, spec_obj):
        # TODO(stephenfin): The 'numa_fit_instance_to_host' function has the
        # unfortunate side effect of modifying 'spec_obj.numa_topology' - an
        # InstanceNUMATopology object - by populating the 'cpu_pinning' field.
        # This is rather rude and said function should be reworked to avoid
        # doing this. That's a large, non-backportable cleanup however, so for
        # now we just duplicate spec_obj to prevent changes propagating to
        # future filter calls.
        spec_obj = spec_obj.obj_clone()

        ram_ratio = host_state.ram_allocation_ratio
        cpu_ratio = host_state.cpu_allocation_ratio
        extra_specs = spec_obj.flavor.extra_specs
        image_props = spec_obj.image.properties
        requested_topology = spec_obj.numa_topology
        host_topology, _fmt = hardware.host_topology_and_format_from_host(
                host_state)
        pci_requests = spec_obj.pci_requests

        if pci_requests:
            pci_requests = pci_requests.requests

        if not self._satisfies_cpu_policy(host_state, extra_specs,
                                          image_props):
            return False

        if requested_topology and host_topology:
            limits = objects.NUMATopologyLimits(
                cpu_allocation_ratio=cpu_ratio,
                ram_allocation_ratio=ram_ratio)
            instance_topology = (hardware.numa_fit_instance_to_host(
                        host_topology, requested_topology,
                        limits=limits,
                        pci_requests=pci_requests,
                        pci_stats=host_state.pci_stats))
            if not instance_topology:
                LOG.debug("%(host)s, %(node)s fails NUMA topology "
                          "requirements. The instance does not fit on this "
                          "host.", {'host': host_state.host,
                                    'node': host_state.nodename},
                          instance_uuid=spec_obj.instance_uuid)
                return False
            host_state.limits['numa_topology'] = limits
            return True
        elif requested_topology:
            LOG.debug("%(host)s, %(node)s fails NUMA topology requirements. "
                      "No host NUMA topology while the instance specified "
                      "one.",
                      {'host': host_state.host, 'node': host_state.nodename},
                      instance_uuid=spec_obj.instance_uuid)
            return False
        else:
            return True

This filter handles CPU, memory, and NUMA requirements together, performing straightforward resource-count comparisons and feature matching.

  • Read the host's NUMA/CPU attributes from the host state and fit the instance's request against them (nova/virt/hardware.py):
# TODO(ndipanov): Remove when all code paths are using objects
def host_topology_and_format_from_host(host):
    """Extract numa topology from myriad host representations.

    Until the RPC version is bumped to 5.x, a host may be represented
    as a dict, a db object, an actual ComputeNode object, or an
    instance of HostState class. Identify the type received and return
    either an instance of objects.NUMATopology if host's NUMA topology
    is available, else None.

    :returns: A two-tuple. The first element is either an instance of
              objects.NUMATopology or None. The second element is a
              boolean set to True if topology was in JSON format.
    """
    was_json = False
    try:
        host_numa_topology = host.get('numa_topology')
    except AttributeError:
        host_numa_topology = host.numa_topology

    if host_numa_topology is not None and isinstance(
            host_numa_topology, six.string_types):
        was_json = True

        host_numa_topology = (objects.NUMATopology.obj_from_db_obj(
            host_numa_topology))

    return host_numa_topology, was_json


def numa_fit_instance_to_host(
        host_topology, instance_topology, limits=None,
        pci_requests=None, pci_stats=None):
    """Fit the instance topology onto the host topology.

    Given a host, instance topology, and (optional) limits, attempt to
    fit instance cells onto all permutations of host cells by calling
    the _fit_instance_cell method, and return a new InstanceNUMATopology
    with its cell ids set to host cell ids of the first successful
    permutation, or None.

    :param host_topology: objects.NUMATopology object to fit an
                          instance on
    :param instance_topology: objects.InstanceNUMATopology to be fitted
    :param limits: objects.NUMATopologyLimits that defines limits
    :param pci_requests: instance pci_requests
    :param pci_stats: pci_stats for the host

    :returns: objects.InstanceNUMATopology with its cell IDs set to host
              cell ids of the first successful permutation, or None
    """
    if not (host_topology and instance_topology):
        LOG.debug("Require both a host and instance NUMA topology to "
                  "fit instance on host.")
        return
    elif len(host_topology) < len(instance_topology):
        LOG.debug("There are not enough NUMA nodes on the system to schedule "
                  "the instance correctly. Required: %(required)s, actual: "
                  "%(actual)s",
                  {'required': len(instance_topology),
                   'actual': len(host_topology)})
        return

    # TODO(ndipanov): We may want to sort permutations differently
    # depending on whether we want packing/spreading over NUMA nodes
    for host_cell_perm in itertools.permutations(
            host_topology.cells, len(instance_topology)):
        cells = []
        for host_cell, instance_cell in zip(
                host_cell_perm, instance_topology.cells):
            try:
                got_cell = _numa_fit_instance_cell(
                    host_cell, instance_cell, limits)
            except exception.MemoryPageSizeNotSupported:
                # This exception will been raised if instance cell's
                # custom pagesize is not supported with host cell in
                # _numa_cell_supports_pagesize_request function.
                break
            if got_cell is None:
                break
            cells.append(got_cell)

        if len(cells) != len(host_cell_perm):
            continue

        if not pci_requests or ((pci_stats is not None) and
                pci_stats.support_requests(pci_requests, cells)):
            return objects.InstanceNUMATopology(cells=cells)

####Virtualization Driver

  • The compute driver base class and its instance-creation method spawn() (nova/virt/driver.py):
class ComputeDriver(object):
    """Base class for compute drivers.

    The interface to this class talks in terms of 'instances' (Amazon EC2 and
    internal Nova terminology), by which we mean 'running virtual machine'
    (XenAPI terminology) or domain (Xen or libvirt terminology).

    An instance has an ID, which is the identifier chosen by Nova to represent
    the instance further up the stack.  This is unfortunately also called a
    'name' elsewhere.  As far as this layer is concerned, 'instance ID' and
    'instance name' are synonyms.

    Note that the instance ID or name is not human-readable or
    customer-controlled -- it's an internal ID chosen by Nova.  At the
    nova.virt layer, instances do not have human-readable names at all -- such
    things are only known higher up the stack.

    Most virtualization platforms will also have their own identity schemes,
    to uniquely identify a VM or domain.  These IDs must stay internal to the
    platform-specific layer, and never escape the connection interface.  The
    platform-specific layer is responsible for keeping track of which instance
    ID maps to which platform-specific ID, and vice versa.

    Some methods here take an instance of nova.compute.service.Instance.  This
    is the data structure used by nova.compute to store details regarding an
    instance, and pass them into this layer.  This layer is responsible for
    translating that generic data structure into terms that are specific to
    the virtualization platform.

    """

    def spawn(self, context, instance, image_meta, injected_files,
              admin_password, network_info=None, block_device_info=None):
        """Create a new instance/VM/domain on the virtualization platform.

        Once this successfully completes, the instance should be
        running (power_state.RUNNING).

        If this fails, any partial instance should be completely
        cleaned up, and the virtualization platform should be in the state
        that it was before this call began.

        :param context: security context
        :param instance: nova.objects.instance.Instance
                         This function should use the data there to guide
                         the creation of the new instance.
        :param nova.objects.ImageMeta image_meta:
            The metadata of the image of the instance.
        :param injected_files: User files to inject into instance.
        :param admin_password: Administrator password to set in instance.
        :param network_info: instance network information
        :param block_device_info: Information about block devices to be
                                  attached to the instance.
        """
        raise NotImplementedError()
  • Main attributes of an instance object (nova/objects/instance.py):
# TODO(berrange): Remove NovaObjectDictCompat
@base.NovaObjectRegistry.register
class Instance(base.NovaPersistentObject, base.NovaObject,
               base.NovaObjectDictCompat):
    # Version 2.0: Initial version
    # Version 2.1: Added services
    # Version 2.2: Added keypairs
    # Version 2.3: Added device_metadata
    VERSION = '2.3'

    fields = {
        'id': fields.IntegerField(),

        'user_id': fields.StringField(nullable=True),
        'project_id': fields.StringField(nullable=True),

        'image_ref': fields.StringField(nullable=True),
        'kernel_id': fields.StringField(nullable=True),
        'ramdisk_id': fields.StringField(nullable=True),
        'hostname': fields.StringField(nullable=True),

        'launch_index': fields.IntegerField(nullable=True),
        'key_name': fields.StringField(nullable=True),
        'key_data': fields.StringField(nullable=True),

        'power_state': fields.IntegerField(nullable=True),
        'vm_state': fields.StringField(nullable=True),
        'task_state': fields.StringField(nullable=True),

        'services': fields.ObjectField('ServiceList'),

        'memory_mb': fields.IntegerField(nullable=True),
        'vcpus': fields.IntegerField(nullable=True),
        'root_gb': fields.IntegerField(nullable=True),
        'ephemeral_gb': fields.IntegerField(nullable=True),
        'ephemeral_key_uuid': fields.UUIDField(nullable=True),

        'host': fields.StringField(nullable=True),
        'node': fields.StringField(nullable=True),

        'instance_type_id': fields.IntegerField(nullable=True),

        'user_data': fields.StringField(nullable=True),

        'reservation_id': fields.StringField(nullable=True),

        'launched_at': fields.DateTimeField(nullable=True),
        'terminated_at': fields.DateTimeField(nullable=True),

        'availability_zone': fields.StringField(nullable=True),

        'display_name': fields.StringField(nullable=True),
        'display_description': fields.StringField(nullable=True),

        'launched_on': fields.StringField(nullable=True),

        # NOTE(jdillaman): locked deprecated in favor of locked_by,
        # to be removed in Icehouse
        'locked': fields.BooleanField(default=False),
        'locked_by': fields.StringField(nullable=True),

        'os_type': fields.StringField(nullable=True),
        'architecture': fields.StringField(nullable=True),
        'vm_mode': fields.StringField(nullable=True),
        'uuid': fields.UUIDField(),

        'root_device_name': fields.StringField(nullable=True),
        'default_ephemeral_device': fields.StringField(nullable=True),
        'default_swap_device': fields.StringField(nullable=True),
        'config_drive': fields.StringField(nullable=True),

        'access_ip_v4': fields.IPV4AddressField(nullable=True),
        'access_ip_v6': fields.IPV6AddressField(nullable=True),

        'auto_disk_config': fields.BooleanField(default=False),
        'progress': fields.IntegerField(nullable=True),

        'shutdown_terminate': fields.BooleanField(default=False),
        'disable_terminate': fields.BooleanField(default=False),

        'cell_name': fields.StringField(nullable=True),

        'metadata': fields.DictOfStringsField(),
        'system_metadata': fields.DictOfNullableStringsField(),

        'info_cache': fields.ObjectField('InstanceInfoCache',
                                         nullable=True),

        'security_groups': fields.ObjectField('SecurityGroupList'),

        'fault': fields.ObjectField('InstanceFault', nullable=True),

        'cleaned': fields.BooleanField(default=False),

        'pci_devices': fields.ObjectField('PciDeviceList', nullable=True),
        'numa_topology': fields.ObjectField('InstanceNUMATopology',
                                            nullable=True),
        'pci_requests': fields.ObjectField('InstancePCIRequests',
                                           nullable=True),
        'device_metadata': fields.ObjectField('InstanceDeviceMetadata',
                                              nullable=True),
        'tags': fields.ObjectField('TagList'),
        'flavor': fields.ObjectField('Flavor'),
        'old_flavor': fields.ObjectField('Flavor', nullable=True),
        'new_flavor': fields.ObjectField('Flavor', nullable=True),
        'vcpu_model': fields.ObjectField('VirtCPUModel', nullable=True),
        'ec2_ids': fields.ObjectField('EC2Ids'),
        'migration_context': fields.ObjectField('MigrationContext',
                                                nullable=True),
        'keypairs': fields.ObjectField('KeyPairList'),
        }

    obj_extra_fields = ['name']
  • The libvirt driver's spawn() implementation (nova/virt/libvirt/driver.py):
class LibvirtDriver(driver.ComputeDriver):

    # NOTE(ilyaalekseyev): Implementation like in multinics
    # for xenapi(tr3buchet)
    def spawn(self, context, instance, image_meta, injected_files,
              admin_password, network_info=None, block_device_info=None):
        disk_info = blockinfo.get_disk_info(CONF.libvirt.virt_type,
                                            instance,
                                            image_meta,
                                            block_device_info)
        injection_info = InjectionInfo(network_info=network_info,
                                       files=injected_files,
                                       admin_pass=admin_password)
        gen_confdrive = functools.partial(self._create_configdrive,
                                          context, instance,
                                          injection_info)
        self._create_image(context, instance, disk_info['mapping'],
                           injection_info=injection_info,
                           block_device_info=block_device_info)

        # Required by Quobyte CI
        self._ensure_console_log_for_instance(instance)

        xml = self._get_guest_xml(context, instance, network_info,
                                  disk_info, image_meta,
                                  block_device_info=block_device_info)
        self._create_domain_and_network(
            context, xml, instance, network_info, disk_info,
            block_device_info=block_device_info,
            post_xml_callback=gen_confdrive,
            destroy_disks_on_failure=True)
        LOG.debug("Instance is running", instance=instance)

        def _wait_for_boot():
            """Called at an interval until the VM is running."""
            state = self.get_info(instance).state

            if state == power_state.RUNNING:
                LOG.info(_LI("Instance spawned successfully."),
                         instance=instance)
                raise loopingcall.LoopingCallDone()

        timer = loopingcall.FixedIntervalLoopingCall(_wait_for_boot)
        timer.start(interval=0.5).wait()
  • Generate the guest XML configuration (nova/virt/libvirt/driver.py):
    def _get_guest_xml(self, context, instance, network_info, disk_info,
                       image_meta, rescue=None,
                       block_device_info=None):
        # NOTE(danms): Stringifying a NetworkInfo will take a lock. Do
        # this ahead of time so that we don't acquire it while also
        # holding the logging lock.
        network_info_str = str(network_info)
        msg = ('Start _get_guest_xml '
               'network_info=%(network_info)s '
               'disk_info=%(disk_info)s '
               'image_meta=%(image_meta)s rescue=%(rescue)s '
               'block_device_info=%(block_device_info)s' %
               {'network_info': network_info_str, 'disk_info': disk_info,
                'image_meta': image_meta, 'rescue': rescue,
                'block_device_info': block_device_info})
        # NOTE(mriedem): block_device_info can contain auth_password so we
        # need to sanitize the password in the message.
        LOG.debug(strutils.mask_password(msg), instance=instance)
        conf = self._get_guest_config(instance, network_info, image_meta,
                                      disk_info, rescue, block_device_info,
                                      context)
        xml = conf.to_xml()

        LOG.debug('End _get_guest_xml xml=%(xml)s',
                  {'xml': xml}, instance=instance)
        return xml
  • Build the basic guest config (nova/virt/libvirt/driver.py):
    def _get_guest_config(self, instance, network_info, image_meta,
                          disk_info, rescue=None, block_device_info=None,
                          context=None):
        """Get config data for parameters.

        :param rescue: optional dictionary that should contain the key
            'ramdisk_id' if a ramdisk is needed for the rescue image and
            'kernel_id' if a kernel is needed for the rescue image.
        """
        flavor = instance.flavor
        inst_path = libvirt_utils.get_instance_path(instance)
        disk_mapping = disk_info['mapping']

        virt_type = CONF.libvirt.virt_type
        guest = vconfig.LibvirtConfigGuest()
        guest.virt_type = virt_type
        guest.name = instance.name
        guest.uuid = instance.uuid
        # We are using default unit for memory: KiB
        guest.memory = flavor.memory_mb * units.Ki
        guest.vcpus = flavor.vcpus
        allowed_cpus = hardware.get_vcpu_pin_set()

        guest_numa_config = self._get_guest_numa_config(
            instance.numa_topology, flavor, allowed_cpus, image_meta)

        guest.cpuset = guest_numa_config.cpuset
        guest.cputune = guest_numa_config.cputune
        guest.numatune = guest_numa_config.numatune

        guest.membacking = self._get_guest_memory_backing_config(
            instance.numa_topology,
            guest_numa_config.numatune,
            flavor)

        guest.metadata.append(self._get_guest_config_meta(instance))
        guest.idmaps = self._get_guest_idmaps()

        for event in self._supported_perf_events:
            guest.add_perf_event(event)

        self._update_guest_cputune(guest, flavor, virt_type)

        guest.cpu = self._get_guest_cpu_config(
            flavor, image_meta, guest_numa_config.numaconfig,
            instance.numa_topology)

        # Notes(yjiang5): we always sync the instance's vcpu model with
        # the corresponding config file.
        instance.vcpu_model = self._cpu_config_to_vcpu_model(
            guest.cpu, instance.vcpu_model)

        if 'root' in disk_mapping:
            root_device_name = block_device.prepend_dev(
                disk_mapping['root']['dev'])
        else:
            root_device_name = None

        if root_device_name:
            # NOTE(yamahata):
            # for nova.api.ec2.cloud.CloudController.get_metadata()
            instance.root_device_name = root_device_name

        guest.os_type = (fields.VMMode.get_from_instance(instance) or
                self._get_guest_os_type(virt_type))
        caps = self._host.get_capabilities()

        self._configure_guest_by_virt_type(guest, virt_type, caps, instance,
                                           image_meta, flavor,
                                           root_device_name)
        if virt_type not in ('lxc', 'uml'):
            self._conf_non_lxc_uml(virt_type, guest, root_device_name, rescue,
                    instance, inst_path, image_meta, disk_info)

        self._set_features(guest, instance.os_type, caps, virt_type)
        self._set_clock(guest, instance.os_type, image_meta, virt_type)

        storage_configs = self._get_guest_storage_config(
                instance, image_meta, disk_info, rescue, block_device_info,
                flavor, guest.os_type)
        for config in storage_configs:
            guest.add_device(config)

        for vif in network_info:
            config = self.vif_driver.get_config(
                instance, vif, image_meta,
                flavor, virt_type, self._host)
            guest.add_device(config)

        self._create_consoles(virt_type, guest, instance, flavor, image_meta)

        pointer = self._get_guest_pointer_model(guest.os_type, image_meta)
        if pointer:
            guest.add_device(pointer)

        if (CONF.spice.enabled and CONF.spice.agent_enabled and
                virt_type not in ('lxc', 'uml', 'xen')):
            channel = vconfig.LibvirtConfigGuestChannel()
            channel.type = 'spicevmc'
            channel.target_name = "com.redhat.spice.0"
            guest.add_device(channel)

        # NB some versions of libvirt support both SPICE and VNC
        # at the same time. We're not trying to second guess which
        # those versions are. We'll just let libvirt report the
        # errors appropriately if the user enables both.
        add_video_driver = False
        if ((CONF.vnc.enabled and
             virt_type not in ('lxc', 'uml'))):
            graphics = vconfig.LibvirtConfigGuestGraphics()
            graphics.type = "vnc"
            graphics.keymap = CONF.vnc.keymap
            graphics.listen = CONF.vnc.vncserver_listen
            guest.add_device(graphics)
            add_video_driver = True

        if (CONF.spice.enabled and
                virt_type not in ('lxc', 'uml', 'xen')):
            graphics = vconfig.LibvirtConfigGuestGraphics()
            graphics.type = "spice"
            graphics.keymap = CONF.spice.keymap
            graphics.listen = CONF.spice.server_listen
            guest.add_device(graphics)
            add_video_driver = True

        if add_video_driver:
            self._add_video_driver(guest, image_meta, flavor)

        # Qemu guest agent only support 'qemu' and 'kvm' hypervisor
        if virt_type in ('qemu', 'kvm'):
            self._set_qemu_guest_agent(guest, flavor, instance, image_meta)

        if virt_type in ('xen', 'qemu', 'kvm'):
            # Get all generic PCI devices (non-SR-IOV).
            for pci_dev in pci_manager.get_instance_pci_devs(instance):
                guest.add_device(self._get_guest_pci_device(pci_dev))
        else:
            # PCI devices is only supported for hypervisor 'xen', 'qemu' and
            # 'kvm'.
            pci_devs = pci_manager.get_instance_pci_devs(instance, 'all')
            if len(pci_devs) > 0:
                raise exception.PciDeviceUnsupportedHypervisor(
                    type=virt_type)

        # image meta takes precedence over flavor extra specs; disable the
        # watchdog action by default
        watchdog_action = (flavor.extra_specs.get('hw:watchdog_action')
                           or 'disabled')
        watchdog_action = image_meta.properties.get('hw_watchdog_action',
                                                    watchdog_action)

        # NB(sross): currently only actually supported by KVM/QEmu
        if watchdog_action != 'disabled':
            if watchdog_action in fields.WatchdogAction.ALL:
                bark = vconfig.LibvirtConfigGuestWatchdog()
                bark.action = watchdog_action
                guest.add_device(bark)
            else:
                raise exception.InvalidWatchdogAction(action=watchdog_action)

        # Memory balloon device only support 'qemu/kvm' and 'xen' hypervisor
        if (virt_type in ('xen', 'qemu', 'kvm') and
                CONF.libvirt.mem_stats_period_seconds > 0):
            balloon = vconfig.LibvirtConfigMemoryBalloon()
            if virt_type in ('qemu', 'kvm'):
                balloon.model = 'virtio'
            else:
                balloon.model = 'xen'
            balloon.period = CONF.libvirt.mem_stats_period_seconds
            guest.add_device(balloon)

        return guest
  • Build the guest NUMA config (nova/virt/libvirt/driver.py):
    def _get_guest_numa_config(self, instance_numa_topology, flavor,
                               allowed_cpus=None, image_meta=None):
        """Returns the config objects for the guest NUMA specs.

        Determines the CPUs that the guest can be pinned to if the guest
        specifies a cell topology and the host supports it. Constructs the
        libvirt XML config object representing the NUMA topology selected
        for the guest. Returns a tuple of:

            (cpu_set, guest_cpu_tune, guest_cpu_numa, guest_numa_tune)

        With the following caveats:

            a) If there is no specified guest NUMA topology, then
               all tuple elements except cpu_set shall be None. cpu_set
               will be populated with the chosen CPUs that the guest
               allowed CPUs fit within, which could be the supplied
               allowed_cpus value if the host doesn't support NUMA
               topologies.

            b) If there is a specified guest NUMA topology, then
               cpu_set will be None and guest_cpu_numa will be the
               LibvirtConfigGuestCPUNUMA object representing the guest's
               NUMA topology. If the host supports NUMA, then guest_cpu_tune
               will contain a LibvirtConfigGuestCPUTune object representing
               the optimized chosen cells that match the host capabilities
               with the instance's requested topology. If the host does
               not support NUMA, then guest_cpu_tune and guest_numa_tune
               will be None.
        """

        if (not self._has_numa_support() and
                instance_numa_topology is not None):
            # We should not get here, since we should have avoided
            # reporting NUMA topology from _get_host_numa_topology
            # in the first place. Just in case of a scheduler
            # mess up though, raise an exception
            raise exception.NUMATopologyUnsupported()

        topology = self._get_host_numa_topology()

        # We have instance NUMA so translate it to the config class
        guest_cpu_numa_config = self._get_cpu_numa_config_from_instance(
                instance_numa_topology,
                self._wants_hugepages(topology, instance_numa_topology))

        if not guest_cpu_numa_config:
            # No NUMA topology defined for instance - let the host kernel deal
            # with the NUMA effects.
            # TODO(ndipanov): Attempt to spread the instance
            # across NUMA nodes and expose the topology to the
            # instance as an optimisation
            return GuestNumaConfig(allowed_cpus, None, None, None)
        else:
            if topology:
                # Now get the CpuTune configuration from the numa_topology
                guest_cpu_tune = vconfig.LibvirtConfigGuestCPUTune()
                guest_numa_tune = vconfig.LibvirtConfigGuestNUMATune()
                emupcpus = []

                numa_mem = vconfig.LibvirtConfigGuestNUMATuneMemory()
                numa_memnodes = [vconfig.LibvirtConfigGuestNUMATuneMemNode()
                                 for _ in guest_cpu_numa_config.cells]

                vcpus_rt = set([])
                wants_realtime = hardware.is_realtime_enabled(flavor)
                if wants_realtime:
                    if not self._host.has_min_version(
                            MIN_LIBVIRT_REALTIME_VERSION):
                        raise exception.RealtimePolicyNotSupported()
                    # Prepare realtime config for libvirt
                    vcpus_rt = hardware.vcpus_realtime_topology(
                        flavor, image_meta)
                    vcpusched = vconfig.LibvirtConfigGuestCPUTuneVCPUSched()
                    vcpusched.vcpus = vcpus_rt
                    vcpusched.scheduler = "fifo"
                    vcpusched.priority = (
                        CONF.libvirt.realtime_scheduler_priority)
                    guest_cpu_tune.vcpusched.append(vcpusched)

                for host_cell in topology.cells:
                    for guest_node_id, guest_config_cell in enumerate(
                            guest_cpu_numa_config.cells):
                        if guest_config_cell.id == host_cell.id:
                            node = numa_memnodes[guest_node_id]
                            node.cellid = guest_node_id
                            node.nodeset = [host_cell.id]
                            node.mode = "strict"

                            numa_mem.nodeset.append(host_cell.id)

                            object_numa_cell = (
                                    instance_numa_topology.cells[guest_node_id]
                                )
                            for cpu in guest_config_cell.cpus:
                                pin_cpuset = (
                                    vconfig.LibvirtConfigGuestCPUTuneVCPUPin())
                                pin_cpuset.id = cpu
                                # If there is pinning information in the cell
                                # we pin to individual CPUs, otherwise we float
                                # over the whole host NUMA node

                                if (object_numa_cell.cpu_pinning and
                                        self._has_cpu_policy_support()):
                                    pcpu = object_numa_cell.cpu_pinning[cpu]
                                    pin_cpuset.cpuset = set([pcpu])
                                else:
                                    pin_cpuset.cpuset = host_cell.cpuset
                                if not wants_realtime or cpu not in vcpus_rt:
                                    # - If realtime IS NOT enabled, the
                                    #   emulator threads are allowed to float
                                    #   across all the pCPUs associated with
                                    #   the guest vCPUs ("not wants_realtime"
                                    #   is true, so we add all pcpus)
                                    # - If realtime IS enabled, then at least
                                    #   1 vCPU is required to be set aside for
                                    #   non-realtime usage. The emulator
                                    #   threads are allowed to float acros the
                                    #   pCPUs that are associated with the
                                    #   non-realtime VCPUs (the "cpu not in
                                    #   vcpu_rt" check deals with this
                                    #   filtering)
                                    emupcpus.extend(pin_cpuset.cpuset)
                                guest_cpu_tune.vcpupin.append(pin_cpuset)

                # TODO(berrange) When the guest has >1 NUMA node, it will
                # span multiple host NUMA nodes. By pinning emulator threads
                # to the union of all nodes, we guarantee there will be
                # cross-node memory access by the emulator threads when
                # responding to guest I/O operations. The only way to avoid
                # this would be to pin emulator threads to a single node and
                # tell the guest OS to only do I/O from one of its virtual
                # NUMA nodes. This is not even remotely practical.
                #
                # The long term solution is to make use of a new QEMU feature
                # called "I/O Threads" which will let us configure an explicit
                # I/O thread for each guest vCPU or guest NUMA node. It is
                # still TBD how to make use of this feature though, especially
                # how to associate IO threads with guest devices to eliminiate
                # cross NUMA node traffic. This is an area of investigation
                # for QEMU community devs.
                emulatorpin = vconfig.LibvirtConfigGuestCPUTuneEmulatorPin()
                emulatorpin.cpuset = set(emupcpus)
                guest_cpu_tune.emulatorpin = emulatorpin
                # Sort the vcpupin list per vCPU id for human-friendlier XML
                guest_cpu_tune.vcpupin.sort(key=operator.attrgetter("id"))

                guest_numa_tune.memory = numa_mem
                guest_numa_tune.memnodes = numa_memnodes

                # normalize cell.id
                for i, (cell, memnode) in enumerate(
                                            zip(guest_cpu_numa_config.cells,
                                                guest_numa_tune.memnodes)):
                    cell.id = i
                    memnode.cellid = i

                return GuestNumaConfig(None, guest_cpu_tune,
                                       guest_cpu_numa_config,
                                       guest_numa_tune)
            else:
                return GuestNumaConfig(allowed_cpus, None,
                                       guest_cpu_numa_config, None)
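The config objects assembled here end up in the guest XML as <cputune> (vcpupin/emulatorpin) and <numatune> (memory/memnode) elements; a sketch for inspecting them, assuming the same hypothetical domain name used earlier:

$ virsh dumpxml instance-000026fa | grep -E 'vcpupin|emulatorpin|memnode|numatune'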
  • Build the guest memory backing config (nova/virt/libvirt/driver.py):
    def _get_guest_memory_backing_config(
            self, inst_topology, numatune, flavor):
        wantsmempages = False
        if inst_topology:
            for cell in inst_topology.cells:
                if cell.pagesize:
                    wantsmempages = True
                    break

        wantsrealtime = hardware.is_realtime_enabled(flavor)

        membacking = None
        if wantsmempages:
            pages = self._get_memory_backing_hugepages_support(
                inst_topology, numatune)
            if pages:
                membacking = vconfig.LibvirtConfigGuestMemoryBacking()
                membacking.hugepages = pages
        if wantsrealtime:
            if not membacking:
                membacking = vconfig.LibvirtConfigGuestMemoryBacking()
            membacking.locked = True
            membacking.sharedpages = False

        return membacking
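For reference, this path is driven by the hw:mem_page_size flavor extra spec (or the hw_mem_page_size image property); a hedged sketch that requests 2048 KiB hugepages for the test flavor used later:

$ nova flavor-key machine.cpu set hw:mem_page_size=2048
# The generated guest XML then carries a <memoryBacking><hugepages/> element.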
  • Build the guest config metadata (nova/virt/libvirt/driver.py):
    def _get_guest_config_meta(self, instance):
        """Get metadata config for guest."""

        meta = vconfig.LibvirtConfigGuestMetaNovaInstance()
        meta.package = version.version_string_with_package()
        meta.name = instance.display_name
        meta.creationTime = time.time()

        if instance.image_ref not in ("", None):
            meta.roottype = "image"
            meta.rootid = instance.image_ref

        system_meta = instance.system_metadata
        ometa = vconfig.LibvirtConfigGuestMetaNovaOwner()
        ometa.userid = instance.user_id
        ometa.username = system_meta.get('owner_user_name', 'N/A')
        ometa.projectid = instance.project_id
        ometa.projectname = system_meta.get('owner_project_name', 'N/A')
        meta.owner = ometa

        fmeta = vconfig.LibvirtConfigGuestMetaNovaFlavor()
        flavor = instance.flavor
        fmeta.name = flavor.name
        fmeta.memory = flavor.memory_mb
        fmeta.vcpus = flavor.vcpus
        fmeta.ephemeral = flavor.ephemeral_gb
        fmeta.disk = flavor.root_gb
        fmeta.swap = flavor.swap

        meta.flavor = fmeta

        return meta
  • Update the guest cputune settings (nova/virt/libvirt/driver.py):
    def _update_guest_cputune(self, guest, flavor, virt_type):
        is_able = self._host.is_cpu_control_policy_capable()

        cputuning = ['shares', 'period', 'quota']
        wants_cputune = any([k for k in cputuning
            if "quota:cpu_" + k in flavor.extra_specs.keys()])

        if wants_cputune and not is_able:
            raise exception.UnsupportedHostCPUControlPolicy()

        if not is_able or virt_type not in ('lxc', 'kvm', 'qemu'):
            return

        if guest.cputune is None:
            guest.cputune = vconfig.LibvirtConfigGuestCPUTune()
        # Setting the default cpu.shares value to be a value
        # dependent on the number of vcpus
        guest.cputune.shares = 1024 * guest.vcpus

        for name in cputuning:
            key = "quota:cpu_" + name
            if key in flavor.extra_specs:
                setattr(guest.cputune, name,
                        int(flavor.extra_specs[key]))
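The shares/period/quota fields map to the quota:cpu_* flavor extra specs. A sketch that roughly caps each guest vCPU thread at half a pCPU (values are in microseconds; the flavor name is from the test section below):

$ nova flavor-key machine.cpu set quota:cpu_period=100000 quota:cpu_quota=50000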
  • Build the guest CPU config (nova/virt/libvirt/driver.py):
    def _get_guest_cpu_config(self, flavor, image_meta,
                              guest_cpu_numa_config, instance_numa_topology):
        cpu = self._get_guest_cpu_model_config()

        if cpu is None:
            return None

        topology = hardware.get_best_cpu_topology(
                flavor, image_meta, numa_topology=instance_numa_topology)

        cpu.sockets = topology.sockets
        cpu.cores = topology.cores
        cpu.threads = topology.threads
        cpu.numa = guest_cpu_numa_config

        return cpu
  • Convert the libvirt CPU config into Nova's internal vCPU model (nova/virt/libvirt/driver.py):
    def _cpu_config_to_vcpu_model(self, cpu_config, vcpu_model):
        """Update VirtCPUModel object according to libvirt CPU config.

        :param:cpu_config: vconfig.LibvirtConfigGuestCPU presenting the
                           instance's virtual cpu configuration.
        :param:vcpu_model: VirtCPUModel object. A new object will be created
                           if None.

        :return: Updated VirtCPUModel object, or None if cpu_config is None

        """

        if not cpu_config:
            return
        if not vcpu_model:
            vcpu_model = objects.VirtCPUModel()

        vcpu_model.arch = cpu_config.arch
        vcpu_model.vendor = cpu_config.vendor
        vcpu_model.model = cpu_config.model
        vcpu_model.mode = cpu_config.mode
        vcpu_model.match = cpu_config.match

        if cpu_config.sockets:
            vcpu_model.topology = objects.VirtCPUTopology(
                sockets=cpu_config.sockets,
                cores=cpu_config.cores,
                threads=cpu_config.threads)
        else:
            vcpu_model.topology = None

        features = [objects.VirtCPUFeature(
            name=f.name,
            policy=f.policy) for f in cpu_config.features]
        vcpu_model.features = features

        return vcpu_model

##CPU Pinning

To reduce CPU contention and improve CPU cache hit rates, a guest's vCPUs can be pinned to the host's pCPUs.
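Pinning can also be done by hand through libvirt, which is essentially what the Nova settings below automate; a sketch with a hypothetical domain name:

$ virsh vcpupin instance-000026fa 0 4    # pin vCPU 0 to host pCPU 4
$ virsh vcpupin instance-000026fa        # verify the resulting affinity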

###Parameter Configuration

####Deprecated Parameters

Note: the following parameters are deprecated!

Flavor extra specs for CPU pinning:

hw:cpu_policy=shared|dedicated
hw:cpu_thread_policy=avoid|separate|isolate|prefer

Image metadata for CPU pinning:

hw_cpu_policy=shared|dedicated
hw_cpu_thread_policy=avoid|separate|isolate|prefer

With hw:cpu_policy=shared, behaviour matches the existing default CPU configuration: no CPU pinning is done.

With hw:cpu_policy=dedicated, vCPUs are pinned.

The thread policies are:

  • avoid - the guest will not be scheduled onto a host that has hyperthreading.


  • separate - each vCPU is placed on a different core.


  • isolate - each vCPU is placed on a different core and gets that core exclusively; no other vCPU may be placed on it.


  • prefer - the guest's vCPUs are placed on the same core, making them sibling threads.


####Current Parameters

Testing the parameters above on the Ocata release turned up problems; the current semantics are described below.

  • The explanation in the Ocata source code (nova/objects/fields.py):
class CPUThreadAllocationPolicy(BaseNovaEnum):

    # prefer (default): The host may or may not have hyperthreads. This
    #  retains the legacy behavior, whereby siblings are preferred when
    #  available. This is the default if no policy is specified.
    PREFER = "prefer"
    # isolate: The host may or many not have hyperthreads. If hyperthreads are
    #  present, each vCPU will be placed on a different core and no vCPUs from
    #  other guests will be able to be placed on the same core, i.e. one
    #  thread sibling is guaranteed to always be unused. If hyperthreads are
    #  not present, each vCPU will still be placed on a different core and
    #  there are no thread siblings to be concerned with.
    ISOLATE = "isolate"
    # require: The host must have hyperthreads. Each vCPU will be allocated on
    #   thread siblings.
    REQUIRE = "require"

    ALL = (PREFER, ISOLATE, REQUIRE)
Proposed change

The flavor extra specs will be enhanced to support one new parameter:

hw:cpu_thread_policy=prefer|isolate|require

This policy is an extension to the already implemented CPU policy parameter:

hw:cpu_policy=shared|dedicated

The threads policy will control how the scheduler / virt driver places guests with respect to CPU threads. It will only apply if the CPU policy is 'dedicated', i.e. guest vCPUs are being pinned to host pCPUs.

prefer: The host may or may not have an SMT architecture. This retains the legacy behavior, whereby siblings are preferred when available. This is the default if no policy is specified.

isolate: The host must not have an SMT architecture, or must emulate a non-SMT architecture. If the host does not have an SMT architecture, each vCPU will simply be placed on a different core as expected. If the host does have an SMT architecture (i.e. one or more cores have "thread siblings") then each vCPU will be placed on a different physical core and no vCPUs from other guests will be placed on the same core. As such, one thread sibling is always guaranteed to always be unused.

require: The host must have an SMT architecture. Each vCPU will be allocated on thread siblings. If the host does not have an SMT architecture then it will not be used. If the host has an SMT architecture, but not enough cores with free thread siblings are available, then scheduling will fail.

The image metadata properties will also allow specification of the threads policy:

hw_cpu_thread_policy=prefer|isolate|require

This will only be honored if the flavor specifies the 'prefer' policy, either explicitly or implicitly as the default option. This ensures that the cloud administrator can have absolute control over threads policy if desired.
  • The new parameters in brief:
prefer  - the default policy: if the host has SMT, sibling threads on the same core are preferred;
isolate - each vCPU is placed on a different core, and no other guest may use those cores;
require - the host must have SMT; vCPUs are allocated onto thread siblings of the same core.
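A typical combination under these semantics is dedicated pinning with isolated thread siblings; a sketch against the flavor used in the test below (the image name is hypothetical, and the image-side property is only honored when the flavor policy is 'prefer'):

$ nova flavor-key machine.cpu set hw:cpu_policy=dedicated hw:cpu_thread_policy=isolate
$ openstack image set --property hw_cpu_policy=dedicated centos7.img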

####Other Parameters

  • Kernel parameter zone_reclaim_mode

When a node runs short of free memory: with 0, the system prefers to satisfy allocations from remote NUMA nodes; with 1, it prefers to reclaim local cache memory first. Since the cache usually matters a lot for performance, 0 is the better choice in most cases.

$ echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf $ sysctl -p 

  • Kernel parameter overcommit_memory

Possible values: 0, 1, 2; a sysctl sketch follows below.

0: the kernel heuristically checks whether enough memory is available before granting an allocation; if there is, the request succeeds, otherwise it fails and an error is returned to the process.

1: the kernel allows all allocations of physical memory, regardless of the current memory state.

2: the kernel never overcommits; total commitments are limited to swap plus a configurable fraction (vm.overcommit_ratio) of physical memory.
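
As with zone_reclaim_mode, the current value can be read from /proc and a persistent setting applied via sysctl; a minimal sketch, shown here with the kernel default of 0:

```
$ cat /proc/sys/vm/overcommit_memory
$ echo "vm.overcommit_memory = 0" >> /etc/sysctl.conf
$ sysctl -p
```
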
  • Nova parameter vcpu_pin_set

```
$ vi /etc/nova/nova.conf

[DEFAULT]
vcpu_pin_set = 4-12,^8,15       # Presumably, this would ensure that all instances only run on CPUs 4,5,6,7,9,10,11,12,15
```

This restricts which host CPU cores Nova may use.

###Filter Configuration

Enable the NUMATopologyFilter scheduler filter; a configuration sketch follows.
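
A minimal nova.conf sketch for enabling it; the option's name and section moved between releases (older releases use scheduler_default_filters under [DEFAULT]), and the filter list below is only an example:

```
$ vi /etc/nova/nova.conf

[filter_scheduler]
enabled_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,DiskFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,NUMATopologyFilter
```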

###Pinning Tests

  • Create a flavor with 16 vCPUs:

```
$ openstack flavor create --vcpus 16 --ram 64 --disk 1 machine.cpu
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| disk                       | 1                                    |
| id                         | 82b10589-5a06-4e48-a770-e8c0e275ba4d |
| name                       | machine.cpu                          |
| os-flavor-access:is_public | True                                 |
| properties                 |                                      |
| ram                        | 64                                   |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 16                                   |
+----------------------------+--------------------------------------+
```

  • Place all vCPUs on a single NUMA node:

```
$ nova flavor-key machine.cpu set hw:numa_nodes=1 hw:numa_cpus.0=0-15 hw:numa_mem.0=64
$ openstack flavor show machine.cpu
+----------------------------+--------------------------------------------------------------+
| Field                      | Value                                                        |
+----------------------------+--------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                        |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                            |
| access_project_ids         | None                                                         |
| disk                       | 1                                                            |
| id                         | 7dce75ff-ee3a-41c4-a4c8-69416c92f5c1                         |
| name                       | machine.cpu                                                  |
| os-flavor-access:is_public | True                                                         |
| properties                 | hw:numa_cpus.0='0-15', hw:numa_mem.0='64', hw:numa_nodes='1' |
| ram                        | 64                                                           |
| rxtx_factor                | 1.0                                                          |
| swap                       |                                                              |
| vcpus                      | 16                                                           |
+----------------------------+--------------------------------------------------------------+
```

  • Create a script that looks up a processor's core and socket IDs:

```
#!/bin/bash
# get processor $1 's core and socket id.

while read line; do
    if [[ -z "$line" && $1 = $p_id ]]; then
        printf '%-3s %-3s %-3s\n' $p_id $c_id $s_id
        break
    fi

    if echo "$line" | grep -q "^processor"; then
        p_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`
    fi

    if echo "$line" | grep -q "^core id"; then
        c_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`
    fi

    if echo "$line" | grep -q "^physical id"; then
        s_id=`echo "$line" | awk -F: '{print $2}' | tr -d ' '`
    fi
done < /proc/cpuinfo
```
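
A usage sketch: save the script as cpu_id.sh, make it executable, and pass it a logical processor ID; it prints three columns (processor, core id, physical/socket id), which is how xargs drives it in the tests below:

```
$ chmod +x cpu_id.sh
$ ./cpu_id.sh 0        # prints: processor, core id, socket id
0   0   0
```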

####Default VM

  • Create a VM with the default policy:

```
$ openstack server create --image cirros --flavor machine.cpu --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.cpu.default
```

  • Check the VM's CPU affinity:

```
$ ps -aux | grep `openstack server show server.cpu.default | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs taskset -c -p
pid 40296's current affinity list: 0-17,36-53

$ ps -aux | grep `openstack server show server.cpu.default | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list
/proc/40296/task/40296/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40310/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40313/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40314/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40315/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40316/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40317/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40318/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40319/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40320/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40321/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40322/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40323/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40324/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40325/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40326/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40327/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40328/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40329/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40337/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40580/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40587/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40590/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40591/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40902/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/40903/status:Cpus_allowed_list:	0-17,36-53
/proc/40296/task/41007/status:Cpus_allowed_list:	0-17,36-53
```

  • Check which pCPUs the vCPUs are running on:

```
$ ps -aux | grep `openstack server show server.cpu.default | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p
  PID PSR COMMAND
40296   - qemu-kvm
    -  15 -
    -  42 -
    -   7 -
    -   5 -
    -   2 -
    -   0 -
    -  14 -
    -  41 -
    -   0 -
    -   7 -
    -   0 -
    -  11 -
    -   5 -
    -  43 -
    -   9 -
    -   0 -
    -   5 -
    -   2 -
    -  53 -

$ ps -aux | grep `openstack server show server.cpu.default | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p
  PID PSR COMMAND
40296   - qemu-kvm
    -  48 -
    -  42 -
    -   1 -
    -  16 -
    -   3 -
    -   8 -
    -   5 -
    -   8 -
    -  11 -
    -   0 -
    -   4 -
    -   5 -
    -   7 -
    -   5 -
    -   8 -
    -   0 -
    -  13 -
    -  10 -
    -  53 -
```

As the two samples show, the CPU placement keeps changing, and at any moment some vCPUs share a core while others do not; nothing is pinned.
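
To observe this drift continuously instead of sampling twice by hand, the same ps invocation can be wrapped in watch; a sketch using pid 40296 from the output above:

```
$ watch -n 1 'ps -m -o pid,psr,comm -p 40296'
```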

####avoid VM

  • Set the avoid pinning policy:

```
$ nova flavor-key machine.cpu set hw:cpu_policy=dedicated hw:cpu_thread_policy=avoid
```

  • Create the VM:

```
$ openstack server create --image cirros --flavor machine.cpu --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.cpu.avoid
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<type 'exceptions.ValueError'> (HTTP 500) (Request-ID: req-fbb4aef8-a2bb-47af-bea0-e776c83ae5e9)
```

The VM cannot be created.

  • Check the error log:

```
$ tailf /var/lib/docker/volumes/kolla_logs/_data/nova/nova-api.log
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions [req-fbb4aef8-a2bb-47af-bea0-e776c83ae5e9 03e0cf5adea04b73a13bc45a0306171b 1b50364d35624d0e8affe0721866fda1 - default default] Unexpected exception in API method
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions Traceback (most recent call last):
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/openstack/extensions.py", line 338, in wrapped
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return f(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return func(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/api/openstack/compute/servers.py", line 642, in create
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     **create_kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/hooks.py", line 154, in inner
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     rv = f(*args, **kwargs)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/api.py", line 1620, in create
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     check_server_group_quota=check_server_group_quota)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/api.py", line 1186, in _create_instance
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     reservation_id, max_count)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/api.py", line 889, in _validate_and_build_base_options
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     instance_type, image_meta)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/virt/hardware.py", line 1293, in numa_get_constraints
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     cell.cpu_thread_policy = cpu_thread_policy
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 72, in setter
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     field_value = field.coerce(self, name, value)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_versionedobjects/fields.py", line 195, in coerce
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     return self._type.coerce(obj, attr, value)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions   File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_versionedobjects/fields.py", line 317, in coerce
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions     raise ValueError(msg)
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions ValueError: Field value avoid is invalid
2018-03-21 19:34:03.100 27 ERROR nova.api.openstack.extensions
2018-03-21 19:34:03.102 27 INFO nova.api.openstack.wsgi [req-fbb4aef8-a2bb-47af-bea0-e776c83ae5e9 03e0cf5adea04b73a13bc45a0306171b 1b50364d35624d0e8affe0721866fda1 - default default] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. <type 'exceptions.ValueError'>
```

The avoid parameter is invalid.
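
The accepted values can be confirmed against the installed code. A sketch, assuming it is run with the same kolla virtualenv Python that appears in the traceback:

```
$ /var/lib/kolla/venv/bin/python -c "from nova.objects import fields; print(fields.CPUThreadAllocationPolicy.ALL)"
('prefer', 'isolate', 'require')
```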

####prefer VM

  • Set the prefer pinning policy:

```
$ nova flavor-key machine.cpu set hw:cpu_policy=dedicated hw:cpu_thread_policy=prefer
```

  • Create the VM:

```
$ openstack server create --image cirros --flavor machine.cpu --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.cpu.prefer
```

  • Check the VM's CPU affinity:

```
$ ps -aux | grep `openstack server show server.cpu.prefer | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs taskset -c -p
pid 187669's current affinity list: 0-2,7,8,14-16,36-38,43,44,50-52

$ ps -aux | grep `openstack server show server.cpu.prefer | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list
/proc/187669/task/187669/status:Cpus_allowed_list:	0-2,7-8,14-16,36-38,43-44,50-52
/proc/187669/task/187671/status:Cpus_allowed_list:	0-2,7-8,14-16,36-38,43-44,50-52
/proc/187669/task/187675/status:Cpus_allowed_list:	43
/proc/187669/task/187676/status:Cpus_allowed_list:	7
/proc/187669/task/187677/status:Cpus_allowed_list:	16
/proc/187669/task/187678/status:Cpus_allowed_list:	52
/proc/187669/task/187679/status:Cpus_allowed_list:	2
/proc/187669/task/187680/status:Cpus_allowed_list:	38
/proc/187669/task/187681/status:Cpus_allowed_list:	8
/proc/187669/task/187682/status:Cpus_allowed_list:	44
/proc/187669/task/187683/status:Cpus_allowed_list:	50
/proc/187669/task/187684/status:Cpus_allowed_list:	14
/proc/187669/task/187685/status:Cpus_allowed_list:	0
/proc/187669/task/187686/status:Cpus_allowed_list:	36
/proc/187669/task/187687/status:Cpus_allowed_list:	51
/proc/187669/task/187688/status:Cpus_allowed_list:	15
/proc/187669/task/187689/status:Cpus_allowed_list:	1
/proc/187669/task/187690/status:Cpus_allowed_list:	37
/proc/187669/task/187692/status:Cpus_allowed_list:	0-2,7-8,14-16,36-38,43-44,50-52
```

  • Check which pCPUs the vCPUs are running on:

```
$ ps -aux | grep `openstack server show server.cpu.prefer | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p | awk 'NR>2{print $2}' | sort -n | uniq | xargs -n1 ./cpu_id.sh
0   0   0
1   1   0
2   2   0
7   10  0
8   11  0
14  24  0
15  25  0
16  26  0
36  0   0
37  1   0
38  2   0
43  10  0
44  11  0
50  24  0
51  25  0
52  26  0
```

The VM's CPUs are now pinned, and every allocated core carries two vCPUs (this host has SMT enabled, and prefer places siblings together first).
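
The pairing can be cross-checked against sysfs: in the table above, processors 0 and 36 report the same core id and socket id, so they should be listed as thread siblings (the output below is inferred from that table):

```
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,36
```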

  • Check the memory allocation:

```
$ ps -aux | grep `openstack server show server.cpu.prefer | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps | perl numa-maps-summary.pl
N0        :        22942 (  0.09 GB)
N1        :           34 (  0.00 GB)
active    :          168 (  0.00 GB)
anon      :        19827 (  0.08 GB)
dirty     :        19852 (  0.08 GB)
kernelpagesize_kB:         1848 (  0.01 GB)
mapmax    :         4217 (  0.02 GB)
mapped    :         3127 (  0.01 GB)
```
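
As a hedged alternative for checking per-node memory placement, numastat from the numactl package gives a similar per-node breakdown when pointed at the qemu-kvm pid found above:

```
$ numastat -p 187669
```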

####isolate VM

  • Set the isolate pinning policy:

```
$ nova flavor-key machine.cpu set hw:cpu_policy=dedicated hw:cpu_thread_policy=isolate
```

  • Create the VM:

```
$ openstack server create --image cirros --flavor machine.cpu --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.cpu.isolate
```

  • Check the VM's CPU affinity:

```
$ ps -aux | grep `openstack server show server.cpu.isolate | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs taskset -c -p
pid 51203's current affinity list: 18,19,24-26,32-35,57-59,64-67

$ ps -aux | grep `openstack server show server.cpu.isolate | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list
/proc/51203/task/51203/status:Cpus_allowed_list:	18-19,24-26,32-35,57-59,64-67
/proc/51203/task/51206/status:Cpus_allowed_list:	18-19,24-26,32-35,57-59,64-67
/proc/51203/task/51210/status:Cpus_allowed_list:	59
/proc/51203/task/51211/status:Cpus_allowed_list:	65
/proc/51203/task/51212/status:Cpus_allowed_list:	18
/proc/51203/task/51213/status:Cpus_allowed_list:	34
/proc/51203/task/51214/status:Cpus_allowed_list:	24
/proc/51203/task/51215/status:Cpus_allowed_list:	33
/proc/51203/task/51216/status:Cpus_allowed_list:	58
/proc/51203/task/51217/status:Cpus_allowed_list:	67
/proc/51203/task/51218/status:Cpus_allowed_list:	66
/proc/51203/task/51219/status:Cpus_allowed_list:	26
/proc/51203/task/51220/status:Cpus_allowed_list:	35
/proc/51203/task/51221/status:Cpus_allowed_list:	57
/proc/51203/task/51222/status:Cpus_allowed_list:	25
/proc/51203/task/51223/status:Cpus_allowed_list:	19
/proc/51203/task/51224/status:Cpus_allowed_list:	64
/proc/51203/task/51225/status:Cpus_allowed_list:	32
/proc/51203/task/51227/status:Cpus_allowed_list:	18-19,24-26,32-35,57-59,64-67
```

  • Check which pCPUs the vCPUs are running on:

```
$ ps -aux | grep `openstack server show server.cpu.isolate | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p | awk 'NR>2{print $2}' | sort -n | uniq | xargs -n1 ./cpu_id.sh
18  0   1
19  1   1
24  9   1
25  10  1
26  11  1
32  24  1
33  25  1
34  26  1
35  27  1
57  3   1
58  4   1
59  8   1
64  17  1
65  18  1
66  19  1
67  20  1
```

The VM's CPUs are pinned, and each vCPU is placed on a different core.
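
isolate additionally guarantees that the sibling thread of every pinned pCPU stays unused by other guests. A quick sketch listing which siblings must therefore remain free, using the pinned CPU IDs from the affinity output above:

```
$ for c in 18 19 24 25 26 32 33 34 35 57 58 59 64 65 66 67; do cat /sys/devices/system/cpu/cpu$c/topology/thread_siblings_list; done | sort -u
```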

  • Check the memory allocation:

```
$ ps -aux | grep `openstack server show server.cpu.isolate | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps | perl numa-maps-summary.pl
N0        :         3077 (  0.01 GB)
N1        :        22653 (  0.09 GB)
active    :          168 (  0.00 GB)
anon      :        22581 (  0.09 GB)
dirty     :        22600 (  0.09 GB)
kernelpagesize_kB:         1844 (  0.01 GB)
mapmax    :         4217 (  0.02 GB)
mapped    :         3131 (  0.01 GB)
```

####require VM

  • Set the require pinning policy:

```
$ nova flavor-key machine.cpu set hw:cpu_policy=dedicated hw:cpu_thread_policy=require
```

  • Create the VM (raising the core quota first):

```
$ openstack quota set --cores 100 admin

$ openstack server create --image cirros --flavor machine.cpu --key-name mykey --nic net-id=8d01509e-4a3a-497a-9118-3827c1e37672 --availability-zone az01:osdev-01 server.cpu.require
```

  • Check the VM's CPU affinity:

```
$ ps -aux | grep `openstack server show server.cpu.require | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs taskset -c -p
pid 194063's current affinity list: 3,5,6,9-13,39,41,42,45-49

$ ps -aux | grep `openstack server show server.cpu.require | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} find /proc/{}/task/ -name "status" | xargs grep Cpus_allowed_list
/proc/194063/task/194063/status:Cpus_allowed_list:	3,5-6,9-13,39,41-42,45-49
/proc/194063/task/194065/status:Cpus_allowed_list:	3,5-6,9-13,39,41-42,45-49
/proc/194063/task/194069/status:Cpus_allowed_list:	10
/proc/194063/task/194070/status:Cpus_allowed_list:	46
/proc/194063/task/194071/status:Cpus_allowed_list:	11
/proc/194063/task/194072/status:Cpus_allowed_list:	47
/proc/194063/task/194073/status:Cpus_allowed_list:	42
/proc/194063/task/194074/status:Cpus_allowed_list:	6
/proc/194063/task/194075/status:Cpus_allowed_list:	41
/proc/194063/task/194076/status:Cpus_allowed_list:	5
/proc/194063/task/194077/status:Cpus_allowed_list:	9
/proc/194063/task/194078/status:Cpus_allowed_list:	45
/proc/194063/task/194079/status:Cpus_allowed_list:	3
/proc/194063/task/194080/status:Cpus_allowed_list:	39
/proc/194063/task/194081/status:Cpus_allowed_list:	48
/proc/194063/task/194082/status:Cpus_allowed_list:	12
/proc/194063/task/194083/status:Cpus_allowed_list:	49
/proc/194063/task/194084/status:Cpus_allowed_list:	13
/proc/194063/task/194088/status:Cpus_allowed_list:	3,5-6,9-13,39,41-42,45-49
```

  • Check which pCPUs the vCPUs are running on:

```
$ ps -aux | grep `openstack server show server.cpu.require | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs ps -m -o pid,psr,comm -p | awk 'NR>2{print $2}' | sort -n | uniq | xargs -n1 ./cpu_id.sh
3   3   0
5   8   0
6   9   0
9   16  0
10  17  0
11  18  0
12  19  0
13  20  0
39  3   0
41  8   0
42  9   0
45  16  0
46  17  0
47  18  0
48  19  0
49  20  0
```

The VM's CPUs are pinned, and every allocated core carries two vCPUs, as with prefer; note that none of them landed on the cores used by the isolate instance.

  • Check the memory allocation:

```
$ ps -aux | grep `openstack server show server.cpu.require | grep instance_name | awk '{print $4}'` | awk '{{if($11=="/usr/libexec/qemu-kvm") {print $2}}}' | xargs -I {} cat /proc/{}/numa_maps | perl numa-maps-summary.pl
N0        :        22939 (  0.09 GB)
N1        :           34 (  0.00 GB)
active    :          168 (  0.00 GB)
anon      :        19824 (  0.08 GB)
dirty     :        19850 (  0.08 GB)
kernelpagesize_kB:         1848 (  0.01 GB)
mapmax    :         4217 (  0.02 GB)
mapped    :         3127 (  0.01 GB)
```

###Related Source Code

See the NUMA binding source code above for details.
