Geneve: Generic Network Virtualization Encapsulation
jesse@kernel.org
Intel Corporation, 2200 Mission College Blvd., Santa Clara, CA 95054, United States of America, ilango.s.ganga@intel.com
VMware, Inc., 3401 Hillview Ave., Palo Alto, CA 94304, United States of America, tsridhar@utexas.edu
Keywords: overlay, tunnel, extensible, variable, metadata, options, endpoint, transit
Network virtualization involves the cooperation of devices with a
wide variety of capabilities such as software and hardware tunnel
endpoints, transit fabrics, and centralized control clusters. As a
result of their role in tying together different elements of the
system, the requirements on tunnels are influenced by all of these
components. Therefore, flexibility is the most important aspect of a
tunneling protocol if it is to keep pace with the evolution of technology.
This document describes Geneve, an encapsulation protocol designed to
recognize and accommodate these changing capabilities and needs.
Introduction
Networking has long featured a variety of tunneling, tagging, and
other encapsulation mechanisms. However, the advent of network
virtualization has caused a surge of renewed interest and a
corresponding increase in the introduction of new protocols. The
large number of protocols in this space -- ranging, for example, from
VLANs and MPLS through the more recent
VXLAN (Virtual eXtensible Local Area Network)
and NVGRE (Network Virtualization
Using Generic Routing Encapsulation) -- often
leads to questions about the need for new encapsulation formats and
what it is about network virtualization in particular that leads to
their proliferation. Note that the list of protocols presented above is non-exhaustive.
While many encapsulation protocols seek to simply partition the
underlay network or bridge two domains, network
virtualization views the transit network as providing connectivity
between multiple components of a distributed system. In many ways,
this system is similar to a chassis switch with the IP underlay
network playing the role of the backplane and tunnel endpoints on the
edge as line cards. When viewed in this light, the requirements
placed on the tunneling protocol are significantly different in terms of
the quantity of metadata necessary and the role of transit nodes.
Work such as "VL2: A Scalable and Flexible Data Center Network" and "NVO3 Data Plane Requirements"
has described some of the properties that the data plane must have to support network
virtualization. However, one additional defining requirement is the
need to carry metadata (e.g., system state) along with the packet data;
example use cases of metadata are noted below. The use of
some metadata is certainly not a foreign concept -- nearly all
protocols used for network virtualization have at least 24 bits of identifier
space as a way to partition between tenants. This is often described
as overcoming the limits of 12-bit VLANs; when seen in that
context or any context where it is a true tenant identifier, 16
million possible entries is a large number. However, the reality is
that the metadata is not exclusively used to identify tenants, and
encoding other information quickly starts to crowd the space. In
fact, when compared to the tags used to exchange metadata between
line cards on a chassis switch, 24-bit identifiers start to look
quite small. There are nearly endless uses for this metadata,
ranging from storing input port identifiers for simple security policies to
sending service-based context for advanced middlebox applications
that terminate and re-encapsulate Geneve traffic.
Existing tunneling protocols have each attempted to solve different
aspects of these new requirements only to be quickly rendered out of
date by changing control plane implementations and advancements.
Furthermore, software and hardware components and controllers all
have different advantages and rates of evolution -- a fact that should
be viewed as a benefit, not a liability or limitation. This document describes Geneve, a protocol that seeks to avoid these problems by
providing a framework for tunneling for network virtualization rather
than being prescriptive about the entire system.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14 when, and only when, they appear in all
capitals, as shown here.
Terminology
The Network
Virtualization over Layer 3 (NVO3) Framework defines many of the concepts commonly
used in network virtualization. In addition, the following terms are
specifically meaningful in this document:
Checksum offload:
An optimization implemented by many NICs (Network Interface Controllers)
that enables computation and verification of upper-layer protocol
checksums in hardware on transmit and receive, respectively. This
typically includes IP and TCP/UDP checksums that would otherwise be
computed by the protocol stack in software.
Clos network:
A technique for composing network fabrics larger than
a single switch while maintaining non-blocking bandwidth across
connection points. ECMP is used to divide traffic across the
multiple links and switches that constitute the fabric. Sometimes
termed "leaf and spine" or "fat tree" topologies.
ECMP:
Equal Cost Multipath. A routing mechanism for selecting from
among multiple best next-hop paths by hashing packet headers in order
to better utilize network bandwidth while avoiding reordering of packets
within a flow.
Geneve:
Generic Network Virtualization Encapsulation. The tunneling
protocol described in this document.
LRO:
Large Receive Offload. The receiver-side equivalent function
of LSO, in which multiple protocol segments (primarily TCP) are coalesced into
larger data units.
LSO:
Large Segmentation Offload. A function provided by many
commercial NICs that allows data units larger than the MTU to be
passed to the NIC to improve performance, the NIC being responsible
for creating smaller segments of a size less than or equal to the MTU
with correct protocol headers. When referring specifically to TCP/IP, this
feature is often known as TSO (TCP Segmentation Offload).
Middlebox:
In the context of this document, the term "middlebox" refers to network
service functions or service interposition appliances that typically implement tunnel endpoint functionality, terminating and re-encapsulating Geneve traffic.
NIC:
Network Interface Controller. Also called "Network Interface Card" or "Network Adapter".
A NIC could be part of a tunnel endpoint or transit device and can either
process or aid in the processing of Geneve packets.
Transit device:
A forwarding element (e.g., router or switch) along the path of the tunnel
making up part of the underlay network. A transit device may be
capable of understanding the Geneve packet format but does not
originate or terminate Geneve packets.
Tunnel endpoint:
A component performing encapsulation and
decapsulation of packets, such as Ethernet frames or IP datagrams, in
Geneve headers. As the ultimate consumer of any tunnel metadata,
tunnel endpoints have the highest level of requirements for parsing and
interpreting tunnel headers. Tunnel endpoints may consist of either
software or hardware implementations or a combination of the two.
Tunnel endpoints are frequently a component of a Network Virtualization Edge (NVE)
but may also be found in middleboxes or other elements making up an NVO3 network.
VM:
Virtual Machine.
Design Requirements
Geneve is designed to support network virtualization use cases for data center environments. In these situations,
tunnels are typically established to act as a backplane between the
virtual switches residing in hypervisors, physical switches, or
middleboxes or other appliances. An arbitrary IP network can be used
as an underlay, although Clos networks composed using ECMP links are a
common choice to provide consistent bisectional bandwidth across all
connection points. Many of the concepts of network virtualization overlays
over IP networks are described in the NVO3 Framework.
shows an example of a
hypervisor, a top-of-rack switch for connectivity to physical servers, and a WAN uplink
connected using Geneve tunnels over a simplified Clos network. These
tunnels are used to encapsulate and forward frames from the attached
components, such as VMs or physical links.
To support the needs of network virtualization, the tunneling protocol
should be able to take advantage of the differing (and evolving)
capabilities of each type of device in both the underlay and overlay
networks. This results in the following requirements being placed on
the data plane tunneling protocol:
The data plane is generic and extensible enough to support current
and future control planes.
Tunnel components are efficiently implementable in both hardware
and software without restricting capabilities to the lowest common
denominator.
High performance over existing IP fabrics is maintained.
These requirements are described further in the following
subsections.
Control Plane Independence
Although some protocols for network virtualization have included a
control plane as part of the tunnel format specification (most
notably, VXLAN prescribed a multicast-learning-based control plane), these specifications have largely been treated
as describing only the data format. The VXLAN packet format has
actually seen a wide variety of control planes built on top of it.
There is a clear advantage in settling on a data format: most of the
protocols are only superficially different and there is little
advantage in duplicating effort. However, the same cannot be said of
control planes, which are diverse in very fundamental ways. The case
for standardization is also less clear given the wide variety in
requirements, goals, and deployment scenarios.
As a result of this reality, Geneve is a pure tunnel format
specification that is capable of fulfilling the needs of many control
planes by explicitly not selecting any one of them. This
simultaneously promotes a shared data format and reduces the
chance of obsolescence by future control plane
enhancements.
Data Plane Extensibility
Achieving the level of flexibility needed to support current and
future control planes effectively requires an options infrastructure
to allow new metadata types to be defined, deployed, and either
finalized or retired. Options also allow for differentiation of
products by encouraging independent development in each vendor's core
specialty, leading to an overall faster pace of advancement. By far,
the most common mechanism for implementing options is the Type-Length-Value (TLV) format.
It should be noted that, while options can be used to support non-wirespeed
control packets, they are equally important in data packets
for segregating and directing forwarding. (For instance, the
examples given before regarding input-port-based security policies and
terminating/re-encapsulating service interposition both require tags
to be placed on data packets.) Therefore, while it would be desirable to limit the
extensibility to only control packets for the purposes of simplifying
the datapath, that would not satisfy the design requirements.
Efficient Implementation
There is often a conflict between software flexibility and hardware
performance that is difficult to resolve. For a given set of
functionality, it is obviously desirable to maximize performance.
However, that does not mean new features that cannot be run at a desired
speed today should be disallowed. Therefore, for a protocol to be considered
efficiently implementable, it is expected to have a set of common capabilities that can
be reasonably handled across platforms as well as a graceful
mechanism to handle more advanced features in the appropriate
situations.
The use of a variable-length header and options in a protocol often
raises questions about whether the protocol is truly efficiently
implementable in hardware. To answer this question in the context of Geneve, it is
important to first divide "hardware" into two categories: tunnel
endpoints and transit devices.
Tunnel endpoints must be able to parse the variable-length header, including any
options, and take action. Since these devices are actively
participating in the protocol, they are the most affected by Geneve.
However, as tunnel endpoints are the ultimate consumers of the data,
transmitters can tailor their output to the capabilities of the
recipient.
Transit devices may be able to interpret the options; however,
as non-terminating devices, transit devices
do not originate or terminate the Geneve packet. Hence, they MUST NOT modify Geneve headers and
MUST NOT insert or delete options, as that is the responsibility of tunnel endpoints.
Options, if present in the packet, MUST only be generated and terminated by tunnel endpoints.
The participation of transit devices in interpreting options is
OPTIONAL.
Further, either tunnel endpoints or transit devices MAY use offload
capabilities of NICs, such as checksum offload, to improve the
performance of Geneve packet processing. The presence of a Geneve
variable-length header should not prevent the tunnel endpoints and
transit devices from using such offload capabilities.
Use of Standard IP Fabrics
IP has clearly cemented its place as the dominant transport mechanism,
and many techniques have evolved over time to make it robust,
efficient, and inexpensive. As a result, it is natural to use IP
fabrics as a transit network for Geneve. Fortunately, the use of IP
encapsulation and addressing is enough to achieve the primary goal of
delivering packets to the correct point in the network through
standard switching and routing.
In addition, nearly all underlay fabrics are designed to exploit
parallelism in traffic to spread load across multiple links without
introducing reordering in individual flows. These ECMP techniques typically involve parsing and hashing
the addresses and port numbers from the packet to select an outgoing
link. However, the use of tunnels often results in poor ECMP
performance, as without additional knowledge of the protocol, the
encapsulated traffic is hidden from the fabric by design, and only
tunnel endpoint addresses are available for hashing.
Since it is desirable for Geneve to perform well on these existing
fabrics, it is necessary for entropy from encapsulated packets to be
exposed in the tunnel header. The most common technique for this is
to use the UDP source port, which is discussed further in .
Geneve Encapsulation Details
The Geneve packet format consists of a compact tunnel header
encapsulated in UDP over either IPv4 or IPv6. A small fixed tunnel
header provides control information plus a base level of
functionality and interoperability with a focus on simplicity. This
header is then followed by a set of variable-length options to allow for
future innovation. Finally, the payload consists of a protocol data
unit of the indicated type, such as an Ethernet frame. Sections
and illustrate the Geneve packet format transported (for
example) over Ethernet along with an Ethernet payload.
Geneve Packet Format over IPv4
Geneve Packet Format over IPv6
UDP Header
The use of an encapsulating UDP header follows the
connectionless semantics of Ethernet and IP in addition to providing
entropy to routers performing ECMP. Therefore, header fields are
interpreted as follows:
Source Port:
A source port selected by the originating tunnel endpoint. This source port SHOULD be the same for all packets
belonging to a single encapsulated flow to prevent reordering due
to the use of different paths. To encourage an even distribution
of flows across multiple links, the source port SHOULD be
calculated using a hash of the encapsulated packet headers using,
for example, a traditional 5-tuple. Since the port represents a
flow identifier rather than a true UDP connection, the entire
16-bit range MAY be used to maximize entropy. In addition to setting the source port,
for IPv6, the flow label MAY also be used for providing entropy. For an example of
using the IPv6 flow label for tunnel use cases, see .
If Geneve traffic is shared with other UDP listeners
on the same IP address, tunnel endpoints SHOULD implement a mechanism
to ensure ICMP return traffic arising from network errors is directed
to the correct listener. The definition of such a mechanism is beyond
the scope of this document.
Dest Port:
IANA has assigned port 6081 as the fixed well-known destination port
for Geneve. Although the well-known value should be used by default, it is RECOMMENDED that implementations make
this configurable. The chosen port is used for identification of
Geneve packets and MUST NOT be reversed for different ends of a
connection as is done with TCP. It is the responsibility of the control plane to manage any reconfiguration of the assigned port and its interpretation by respective devices.
The definition of the control plane is beyond the scope of this document.
UDP Length:
The length of the UDP packet including the UDP header.
UDP Checksum:
In order to protect the Geneve header, options, and payload from
potential data corruption, the UDP checksum SHOULD be generated as
specified in and when
Geneve is encapsulated in IPv4. To protect the IP header, Geneve header,
options, and payload from potential data corruption, the UDP checksum MUST
be generated by default as specified in
and when Geneve
is encapsulated in IPv6, except under certain conditions, which are outlined in the next paragraph.
Upon receiving such packets with a non-zero UDP checksum,
the receiving tunnel endpoints MUST validate the checksum.
If the checksum is not correct, the packet MUST be dropped; otherwise,
the packet MUST be accepted for decapsulation.
Under certain conditions, the UDP checksum MAY be set to zero on transmit
for packets encapsulated in both IPv4 and IPv6 .
See for additional
requirements that apply when using zero
UDP checksum with IPv4 and IPv6. Disabling the use of UDP checksums is
an operational consideration that should take into account the risks
and effects of packet corruption.
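The source port selection described above can be illustrated with a minimal sketch. It assumes the encapsulated flow is identified by a traditional 5-tuple and uses CRC-32 purely as a stand-in hash; this document does not mandate any particular hash function, and the function name `geneve_source_port` is hypothetical.

```python
import struct
import zlib

GENEVE_UDP_PORT = 6081  # IANA-assigned well-known destination port


def geneve_source_port(src_ip: bytes, dst_ip: bytes, proto: int,
                       src_port: int, dst_port: int) -> int:
    """Derive a 16-bit outer UDP source port from an inner 5-tuple.

    CRC-32 is an assumption made for illustration only; any hash that
    is stable for a given flow would serve the same purpose.
    """
    key = src_ip + dst_ip + struct.pack("!BHH", proto, src_port, dst_port)
    crc = zlib.crc32(key)
    # Fold the 32-bit hash into 16 bits so the entire port range can be used.
    return (crc ^ (crc >> 16)) & 0xFFFF


# Packets of the same encapsulated flow always map to the same source port,
# so ECMP hashing in the underlay keeps them on one path.
port = geneve_source_port(bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]), 6, 49152, 443)
```

Because the result depends only on the inner headers, every packet of a flow carries the same outer source port, while different flows tend to spread across the available links.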
Tunnel Header Fields
Ver (2 bits):
The current version number is 0. Packets received by a tunnel endpoint with an unknown version MUST be dropped. Transit
devices interpreting Geneve packets with an unknown
version number MUST treat them as UDP packets with an unknown
payload.
Opt Len (6 bits):
The length of the option fields, expressed in 4-byte multiples, not including the 8-byte fixed tunnel
header. This results in a minimum total Geneve header size of 8
bytes and a maximum of 260 bytes. The start of the payload
headers can be found using this offset from the end of the base
Geneve header.
Transit devices MUST maintain consistent forwarding behavior
irrespective of the value of 'Opt Len', including ECMP link
selection.
O (1 bit):
Control packet. This packet contains a control message. Control messages are sent between tunnel endpoints.
Tunnel endpoints MUST NOT forward the payload,
and transit devices MUST NOT attempt to interpret it.
Since control messages are less frequent, it is RECOMMENDED
that tunnel endpoints direct these packets to a high-priority control
queue (for example, to direct the packet to a general purpose CPU
from a forwarding Application-Specific Integrated Circuit (ASIC) or to separate out control traffic on a
NIC). Transit devices MUST NOT alter forwarding behavior on the
basis of this bit, such as ECMP link selection.
C (1 bit):
Critical options present. One or more options have the critical bit set (see ). If this bit is set, then
tunnel endpoints MUST parse the options list to interpret any
critical options. On tunnel endpoints where option parsing is not
supported, the packet MUST be dropped on the basis of the 'C' bit
in the base header. If the bit is not set, tunnel endpoints MAY
strip all options using 'Opt Len' and forward the decapsulated
packet. Transit devices MUST NOT drop packets on the
basis of this bit.
Rsvd. (6 bits):
Reserved field, which MUST be zero on transmission and MUST be ignored on receipt.
Protocol Type (16 bits):
The type of protocol data unit appearing after the Geneve header. This follows the Ethertype
convention, with Ethernet itself being represented by the
value 0x6558.
Virtual Network Identifier (VNI) (24 bits):
An identifier for a unique element of a virtual network. In many situations, this may
represent an L2 segment; however, the control plane defines the
forwarding semantics of decapsulated packets. The VNI MAY be used
as part of ECMP forwarding decisions or MAY be used as a mechanism
to distinguish between overlapping address spaces contained in the
encapsulated packet when load balancing across CPUs.
Reserved (8 bits):
Reserved field, which MUST be zero on transmission and ignored on receipt.
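As an illustration of the fields above, the following sketch packs and parses the 8-byte fixed tunnel header. It assumes the fields are laid out in the order listed, most significant bit first, in network byte order; the names `GeneveBase`, `pack_base`, `unpack_base`, and `payload_offset` are hypothetical and not part of this specification.

```python
import struct
from typing import NamedTuple

ETHERTYPE_ETHERNET = 0x6558  # Protocol Type value for an Ethernet payload


class GeneveBase(NamedTuple):
    ver: int            # 2 bits
    opt_len: int        # 6 bits, options length in 4-byte multiples
    control: bool       # 'O' bit
    critical: bool      # 'C' bit
    protocol_type: int  # 16 bits
    vni: int            # 24 bits


def pack_base(h: GeneveBase) -> bytes:
    if not (0 <= h.ver < 4 and 0 <= h.opt_len < 64 and 0 <= h.vni < 1 << 24):
        raise ValueError("field out of range")
    byte0 = (h.ver << 6) | h.opt_len
    # The six Rsvd. bits are left as zero on transmission.
    byte1 = (int(h.control) << 7) | (int(h.critical) << 6)
    # The VNI occupies the upper 24 bits; the low 8 bits are Reserved (zero).
    return struct.pack("!BBHI", byte0, byte1, h.protocol_type, h.vni << 8)


def unpack_base(data: bytes) -> GeneveBase:
    if len(data) < 8:
        raise ValueError("truncated Geneve header")
    byte0, byte1, ptype, vni_rsvd = struct.unpack("!BBHI", data[:8])
    ver = byte0 >> 6
    if ver != 0:
        # A tunnel endpoint drops packets with an unknown version.
        raise ValueError("unknown Geneve version")
    # Reserved bits are ignored on receipt.
    return GeneveBase(ver, byte0 & 0x3F, bool(byte1 & 0x80),
                      bool(byte1 & 0x40), ptype, vni_rsvd >> 8)


def payload_offset(h: GeneveBase) -> int:
    """Offset of the payload from the start of the Geneve header."""
    return 8 + 4 * h.opt_len
```

With 'Opt Len' at its maximum of 63, `payload_offset` yields 8 + 4 * 63 = 260 bytes, which matches the maximum total header size stated above.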
Tunnel Options
The base Geneve header is followed by zero or more options in Type-Length-Value format. Each option consists of a 4-byte option
header and a variable amount of option data interpreted according to
the type.
Option Class (16 bits):
Namespace for the 'Type' field. IANA has created a "Geneve Option Class" registry to
allocate identifiers for organizations, technologies, and vendors
that have an interest in creating types for options. Each
organization may allocate types independently to allow
experimentation and rapid innovation. It is expected that, over
time, certain options will become well known, and a given
implementation may use option types from a variety of sources. In
addition, IANA has reserved specific ranges for
allocation by IETF Review and for Experimental Use (see ).
Type (8 bits):
Type indicating the format of the data contained in this option. Options are primarily designed to encourage future
extensibility and innovation, and standardized forms of these
options will be defined in separate documents.
The high-order bit of the option type indicates that this is a
critical option. If the receiving tunnel endpoint does not recognize
the option and this bit is set, then the packet MUST be dropped.
If this bit is set in any option, then the 'C' bit in the
Geneve base header MUST also be set. Transit devices MUST NOT
drop packets on the basis of this bit. The following figure shows
the location of the 'C' bit in the 'Type' field:
The requirement to drop a packet with an unknown option with the 'C' bit set
applies to the entire tunnel endpoint system and not a particular
component of the implementation. For example, in a system
composed of a forwarding ASIC and a general purpose CPU, this
does not mean that the packet must be dropped in the ASIC. An
implementation may send the packet to the CPU using a rate-limited
control channel for slow-path exception handling.
R (3 bits):
Option control flags reserved for future use. These bits MUST be
zero on transmission and MUST be ignored on receipt.
Length (5 bits):
Length of the option, expressed in 4-byte
multiples, excluding the option header. The total length of each
option may be between 4 and 128 bytes. A value of 0 in the 'Length' field implies
an option with only an option header and no option data. Packets in which the total
length of all options is not equal to the 'Opt Len' in the base
header are invalid and MUST be silently dropped if received by a
tunnel endpoint that processes the options.
Variable-Length Option Data:
Option data interpreted according to 'Type'.
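The option layout above can be sketched as a simple encoder and parser. The sketch assumes the 4-byte option header packs the fields in the order listed (Option Class, Type, then the 3 'R' bits followed by the 5-bit Length); the function names and the option class value used in the usage lines are hypothetical and chosen only for illustration.

```python
import struct
from typing import List, Tuple

Option = Tuple[int, int, bytes]  # (option class, type, option data)


def pack_option(opt_class: int, opt_type: int, data: bytes = b"") -> bytes:
    # Length is a 5-bit count of 4-byte words, excluding the option header.
    if len(data) % 4 or len(data) > 31 * 4:
        raise ValueError("option data must be 0..124 bytes in 4-byte multiples")
    return struct.pack("!HBB", opt_class, opt_type, len(data) // 4) + data


def parse_options(buf: bytes) -> List[Option]:
    """Split the option area (exactly 'Opt Len' * 4 bytes) into options.

    Raises ValueError when the option lengths do not add up to the size
    of the option area; an endpoint would drop such a packet.
    """
    options = []
    offset = 0
    while offset < len(buf):
        if len(buf) - offset < 4:
            raise ValueError("truncated option header")
        opt_class, opt_type, flags_len = struct.unpack_from("!HBB", buf, offset)
        length = (flags_len & 0x1F) * 4  # the three 'R' bits are ignored
        end = offset + 4 + length
        if end > len(buf):
            raise ValueError("option extends past 'Opt Len'")
        options.append((opt_class, opt_type, buf[offset + 4:end]))
        offset = end
    return options


# Two options: one carrying 4 bytes of data, one with a header only.
area = pack_option(0xFFF0, 0x01, b"\x00" * 4) + pack_option(0xFFF0, 0x82)
parsed = parse_options(area)
```

Here `len(area) // 4` is the value the sender would place in 'Opt Len' of the base header, and a header-only option occupies the minimum of 4 bytes.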
Options Processing
Geneve options are intended to be originated and processed
by tunnel endpoints. However, options MAY be interpreted by transit
devices along the tunnel path. Transit devices not
interpreting Geneve headers (which may or may not include options) MUST handle
Geneve packets as any other UDP packet and maintain consistent forwarding behavior.
In tunnel endpoints, the generation and interpretation of options is
determined by the control plane, which is beyond the scope of this
document. However, to ensure interoperability between heterogeneous
devices, some requirements are imposed on options and the devices that
process them:
Receiving tunnel endpoints MUST drop packets containing unknown options
with the 'C' bit set in the option type. Conversely, transit
devices MUST NOT drop packets as a result of encountering unknown
options, including those with the 'C' bit set.
The contents of the options and their ordering MUST NOT be
modified by transit devices.
If a tunnel endpoint receives a Geneve packet with an 'Opt Len' (the total length of all options)
that exceeds the options-processing capability of the tunnel endpoint, then
the tunnel endpoint MUST drop such packets. An implementation may raise an
exception to the control plane in such an event. It is the responsibility
of the control plane to ensure the communicating peer tunnel endpoints
have the processing capability to handle the total length of options.
The definition of the control plane is beyond the scope of this document.
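The receive-side requirements above can be summarized in a short decision function. This is a sketch under stated assumptions: options are represented as (option class, type) pairs, the set of options the endpoint understands is supplied by some control plane, and the names `endpoint_accepts`, `known`, and `max_opt_bytes` are hypothetical.

```python
from typing import Iterable, Set, Tuple

CRITICAL_BIT = 0x80  # high-order bit of the option 'Type' field


def endpoint_accepts(options: Iterable[Tuple[int, int]],
                     known: Set[Tuple[int, int]],
                     opt_len_bytes: int,
                     max_opt_bytes: int) -> bool:
    """Return False when a receiving tunnel endpoint must drop the packet."""
    # Total option length beyond the endpoint's processing capability: drop.
    if opt_len_bytes > max_opt_bytes:
        return False
    for opt_class, opt_type in options:
        unknown = (opt_class, opt_type) not in known
        # Unknown option with the critical bit set: drop.
        if unknown and opt_type & CRITICAL_BIT:
            return False
    # Unknown non-critical options are simply skipped.
    return True
```

A transit device would not apply this function at all, since it is not permitted to drop packets because of unknown options, whether or not they are critical.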
When designing a Geneve option, it is important to consider how the
option will evolve in the future. Once an option is defined, it is
reasonable to expect that implementations may come to depend on a
specific behavior. As a result, the scope of any future changes must
be carefully described upfront.
Architecturally, options are intended to be self-descriptive and independent.
This enables parallelism in options processing and reduces implementation complexity.
However, the control plane may impose certain ordering restrictions, as
described in .
Unexpectedly significant interoperability issues may result from
changing the length of an option that was defined to be a certain
size. A particular option is specified to have either a fixed
length, which is constant, or a variable length, which may change
over time or for different use cases. This property is part of the
definition of the option and is conveyed by the 'Type'. For fixed-length options, some implementations may choose to ignore the 'Length'
field in the option header and instead parse based on the well-known
length associated with the type. In this case, redefining the length
will impact not only the parsing of the option in question but also any
options that follow. Therefore, options that are defined to be a fixed
length in size MUST NOT be redefined to a different length. Instead,
a new 'Type' should be allocated. Actual definition of the option type is beyond
the scope of this document. The option type and its interpretation should be
defined by the entity that owns the option class.
Options may be processed by NIC hardware utilizing offloads (e.g., LSO and LRO)
as described in . Careful consideration should be
given to how the offload capabilities outlined in
impact an option's design.
Implementation and Deployment Considerations
Applicability Statement
Geneve is a UDP-based network virtualization overlay encapsulation protocol
designed to establish tunnels between NVEs over an existing IP network.
It is intended for use in public or private data center environments,
for deploying multi-tenant overlay networks over an existing IP underlay network.
As a UDP-based protocol, Geneve adheres
to the UDP usage guidelines as specified in .
The applicability of these guidelines is dependent on the underlay
IP network and the nature of the Geneve payload protocol
(for example, TCP/IP, IP/Ethernet).
Geneve is intended to be deployed in a data center network environment
operated by a single operator or an adjacent set of cooperating network
operators that fits with the definition of controlled environments
in . A network in a controlled environment can be
managed to operate under certain conditions, whereas in the general
Internet, this cannot be done. Hence, requirements for a tunneling
protocol operating under a controlled environment can be less
restrictive than the requirements of the general Internet.
For the purpose of this document, a traffic-managed controlled environment
(TMCE) is defined as an IP network that is traffic engineered and/or otherwise
managed (e.g., via use of traffic rate limiters) to avoid congestion. The concept
of a TMCE is outlined in . Significant portions of the text
in through are based
on as applicable to Geneve.
It is the responsibility of the operator to ensure that the guidelines/requirements
in this section are followed as applicable to their Geneve deployment(s).
Congestion-Control Functionality
Geneve does not natively provide congestion-control functionality and relies
on the payload protocol traffic for congestion control. As such, Geneve MUST
be used with congestion-controlled traffic or within a TMCE to avoid congestion. An operator of a TMCE may avoid congestion through careful provisioning
of their networks, rate-limiting user data traffic, and managing traffic
engineering according to path capacity.
UDP Checksum
The outer UDP checksum SHOULD be used with Geneve when transported
over IPv4; this is to provide integrity for the Geneve headers,
options, and payload in case of data corruption (for example, to
avoid misdelivery of the payload to different tenant systems). The UDP checksum provides a statistical guarantee
that a payload was not corrupted in transit. These integrity checks are not
strong from a coding or cryptographic perspective and are not designed to
detect physical-layer errors or malicious modification of the datagram
(see ). In deployments where such a risk exists,
an operator SHOULD use additional data integrity
mechanisms such as those offered
by IPsec (see ).
An operator MAY choose to disable UDP checksums
and use zero UDP checksum if Geneve packet integrity is provided by other data
integrity mechanisms, such as IPsec or additional checksums, or if one of
the conditions (a, b, or c) in is met.
By default, UDP checksums MUST be used when Geneve is transported over IPv6.
A tunnel endpoint MAY be configured for use with zero UDP checksum if
additional requirements in are met.
Zero UDP Checksum Handling with IPv6
When Geneve is used over IPv6, the UDP checksum is used to protect IPv6 headers,
UDP headers, and Geneve headers, options, and payload from potential data corruption.
As such, by default, Geneve MUST use UDP checksums when transported over IPv6.
An operator MAY choose to configure zero UDP checksum if
operating in a TMCE as stated in
if one of the following conditions is met.
a. It is known that packet corruption is exceptionally
unlikely (perhaps based on knowledge of equipment types in their underlay
network) and the operator is willing to risk undetected packet
corruption.
b. It is judged through observational measurements (perhaps through historic
or current traffic flows that use non-zero checksum) that the level of packet
corruption is tolerably low and the operator is willing to risk undetected corruption.
c. The Geneve payload is carrying applications that are tolerant of misdelivered
or corrupted packets (perhaps through higher-layer checksum validation
and/or reliability through retransmission).
In addition, Geneve tunnel implementations using zero UDP checksum MUST meet
the following requirements:
1. Use of UDP checksum over IPv6 MUST be the default
configuration for all Geneve tunnels.
2. If Geneve is used with zero UDP checksum over IPv6, then such
a tunnel
endpoint implementation MUST meet all the requirements specified
in and requirement 1 as specified in since it is relevant to Geneve.
3. The Geneve tunnel endpoint that decapsulates the tunnel
SHOULD check that the
source and destination IPv6 addresses are valid for the Geneve tunnel that
is configured to receive zero UDP checksum and discard other packets
for which such a check fails.
4. The Geneve tunnel endpoint that encapsulates the tunnel MAY use different
IPv6 source addresses for each Geneve tunnel that uses zero UDP checksum mode
in order to strengthen the decapsulator's check of the IPv6 source address
(i.e., the same IPv6 source address is not to be used with more than one IPv6
destination address, irrespective of whether that destination address is
a unicast or multicast address). When this is not possible, it is RECOMMENDED
to use each source address for as few Geneve tunnels that use zero UDP
checksum as is feasible.
Note that for requirements 3 and 4, the receiving tunnel endpoint can apply
these checks only if it has out-of-band knowledge that the encapsulating tunnel
endpoint is applying the indicated behavior. One possibility to obtain this out-of-band
knowledge is through signaling by the control plane. The definition of
the control plane is beyond the scope of this document.
5. Measures SHOULD be taken to prevent Geneve traffic over IPv6 with zero UDP
checksum from escaping into the general Internet. Examples of such measures include
employing packet filters at the gateways or edge of the Geneve network and/or
keeping logical or physical separation of the Geneve network from networks
carrying general Internet traffic.
The above requirements do not change the requirements
specified in either or
.
The use of the source IPv6 address in addition to the
destination IPv6 address, plus the recommendation against
reuse of source IPv6 addresses among Geneve tunnels, collectively
provide some mitigation for the absence of UDP checksum coverage of
the IPv6 header. A traffic-managed controlled environment that satisfies
at least one of the three conditions listed at the beginning of
this section provides additional assurance.
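The decapsulation-side address check described above can be illustrated with a minimal sketch. The names `ZERO_CSUM_TUNNELS` and `accept_ipv6_geneve`, and the example addresses (taken from the IPv6 documentation prefix), are hypothetical and are not defined by this document; the sketch also assumes that non-zero checksums are verified elsewhere in the receive path.

```python
import ipaddress

# Hypothetical configuration: (source, destination) IPv6 address pairs of the
# tunnels that have been explicitly configured to receive zero UDP checksum.
ZERO_CSUM_TUNNELS = {
    (ipaddress.IPv6Address("2001:db8::1"), ipaddress.IPv6Address("2001:db8::2")),
}


def accept_ipv6_geneve(src: str, dst: str, udp_checksum: int) -> bool:
    """Return True if the decapsulating endpoint should accept the packet.

    Packets with a non-zero UDP checksum are assumed to be validated by the
    normal checksum path, so only zero-checksum packets get the address check.
    """
    if udp_checksum != 0:
        return True
    pair = (ipaddress.IPv6Address(src), ipaddress.IPv6Address(dst))
    # Discard zero-checksum packets whose addresses do not match a tunnel
    # configured for zero UDP checksum reception.
    return pair in ZERO_CSUM_TUNNELS
```

Keeping the lookup keyed on the full address pair, rather than on the destination alone, is what allows distinct source addresses per tunnel to strengthen the check.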
Encapsulation of Geneve in IP
As an IP-based tunneling protocol, Geneve shares many properties and
techniques with existing protocols. The application of some of these
is described in further detail, although, in general, most concepts
applicable to the IP layer or to IP tunnels generally also function
in the context of Geneve.
IP Fragmentation
It is RECOMMENDED that Path MTU Discovery (see and ) be used to prevent or minimize fragmentation.
The use of Path MTU Discovery on the transit network provides the
encapsulating tunnel endpoint with soft-state information about the link that it may use
to prevent or minimize fragmentation depending on its role in the
virtualized network. The NVE can maintain this state (the MTU size of
the tunnel link(s) associated with the tunnel endpoint), so if a
tenant system sends large packets that, when encapsulated, exceed the
MTU size of the tunnel link, the tunnel endpoint can discard such
packets and send exception messages to the tenant system(s). If the
tunnel endpoint is associated with a routing or forwarding function and/or has the capability
to send ICMP messages, the encapsulating tunnel endpoint MAY send ICMP fragmentation
needed or Packet Too Big messages to the tenant system(s).
When determining the MTU size of a tunnel link, the maximum length of options MUST be assumed as options may vary
on a per-packet basis. Recommendations and guidance for handling fragmentation in
similar overlay encapsulation services like Pseudowire Emulation
Edge-to-Edge (PWE3) are provided in .
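As a rough illustration of the MTU arithmetic, the following sketch computes the largest inner packet that fits a tunnel link when the maximum options length is assumed. It relies on assumptions drawn from the base header format rather than from this section: an 8-byte fixed Geneve header, an options length field of 6 bits counted in 4-byte units (at most 252 bytes of options), and outer IP headers without IPv4 options or IPv6 extension headers. The function names are hypothetical.

```python
GENEVE_BASE_HEADER = 8        # fixed Geneve header, in bytes (assumed)
GENEVE_MAX_OPTIONS = 63 * 4   # 6-bit length field in 4-byte units (assumed)
UDP_HEADER = 8
OUTER_IP_HEADER = {4: 20, 6: 40}  # no IPv4 options or IPv6 extension headers


def max_inner_size(link_mtu: int, ip_version: int) -> int:
    """Largest encapsulated payload (including any inner Ethernet header)
    that fits in link_mtu when the maximum options length is assumed."""
    overhead = (OUTER_IP_HEADER[ip_version] + UDP_HEADER
                + GENEVE_BASE_HEADER + GENEVE_MAX_OPTIONS)
    return link_mtu - overhead


def must_discard(inner_len: int, link_mtu: int, ip_version: int) -> bool:
    """True if the packet would exceed the tunnel link MTU once encapsulated,
    so the endpoint would discard it and may signal the tenant system."""
    return inner_len > max_inner_size(link_mtu, ip_version)
```

Under these assumptions, a 1500-byte tunnel link leaves 1212 bytes for the inner packet over IPv4 and 1192 bytes over IPv6.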
Note that some implementations may not be capable of supporting
fragmentation or other less common features of the IP header, such as
options and extension headers. Some of the issues associated
with MTU size and fragmentation in IP tunneling and use of ICMP messages are
outlined in .
DSCP, ECN, and TTL
When encapsulating IP (including over Ethernet) packets in Geneve,
there are several considerations for propagating Differentiated Services
Code Point (DSCP) and Explicit Congestion Notification (ECN) bits
from the inner header to the tunnel on transmission and the reverse
on reception. provides guidance for mapping DSCP between inner and outer
IP headers. Network virtualization is typically more closely aligned
with the Pipe model described, where the DSCP value on the tunnel
header is set based on a policy (which may be a fixed value, one
based on the inner traffic class or some other mechanism for
grouping traffic). Aspects of the Uniform model (which treats the
inner and outer DSCP values as a single field by copying on ingress
and egress) may also apply, such as the ability to re-mark the inner
header on tunnel egress based on transit marking. However, the
Uniform model is not conceptually consistent with network
virtualization, which seeks to provide strong isolation between
encapsulated traffic and the physical network. describes the mechanism for exposing ECN capabilities on IP
tunnels and propagating congestion markers to the inner packets.
This behavior MUST be followed for IP packets encapsulated in Geneve.
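The difference between the Pipe and Uniform models for DSCP, described earlier in this section, can be sketched as follows. The policy functions and the specific DSCP values (46 for expedited forwarding, 0 for best effort) are illustrative assumptions and not values mandated by this document.

```python
from typing import Callable


def outer_dscp_pipe(inner_dscp: int, policy: Callable[[int], int]) -> int:
    """Pipe model: the outer DSCP is chosen by local policy and the inner
    header is left untouched on both ingress and egress."""
    outer = policy(inner_dscp)
    if not 0 <= outer <= 63:
        raise ValueError("DSCP is a 6-bit field")
    return outer


def outer_dscp_uniform(inner_dscp: int) -> int:
    """Uniform model: inner and outer DSCP are treated as a single field,
    so the inner value is copied to the outer header."""
    return inner_dscp


def fixed_policy(_inner_dscp: int) -> int:
    """Hypothetical policy: every tunnel packet gets the same outer DSCP."""
    return 0


def class_policy(inner_dscp: int) -> int:
    """Hypothetical policy: preserve expedited forwarding, else best effort."""
    return 46 if inner_dscp == 46 else 0
```

With the Pipe model, the tenant's marking influences the outer header only to the extent that the operator's policy allows, which matches the isolation goal stated above.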
Though either the Uniform or Pipe models could be used for handling TTL (or Hop Limit in the case of IPv6) when tunneling IP packets, the Pipe model is more consistent with network virtualization.
provides guidance on handling TTL between inner IP header and outer IP tunnels;
this model is similar to the Pipe model and is RECOMMENDED for
use with Geneve for network virtualization applications.
Broadcast and Multicast
Geneve tunnels may either be point-to-point unicast between two
tunnel endpoints or utilize broadcast or multicast addressing. It is
not required that inner and outer addressing match in this respect.
For example, in physical networks that do not support multicast,
encapsulated multicast traffic may be replicated into multiple
unicast tunnels or forwarded by policy to a unicast location
(possibly to be replicated there).
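Replication of encapsulated multicast traffic into multiple unicast tunnels can be sketched as below. The `FLOOD_LIST` table, its keying by a virtual network identifier, and the example addresses are hypothetical; in practice such state would come from the control plane, which is outside the scope of this document.

```python
from typing import Dict, List, Tuple

# Hypothetical control-plane state: remote tunnel endpoints per virtual network.
FLOOD_LIST: Dict[int, List[str]] = {
    5001: ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
}


def replicate(vni: int, inner_frame: bytes,
              local_endpoint: str) -> List[Tuple[str, bytes]]:
    """Return one (remote endpoint, frame) pair per unicast tunnel that should
    carry a copy of the inner multicast or broadcast frame."""
    return [(remote, inner_frame)
            for remote in FLOOD_LIST.get(vni, [])
            if remote != local_endpoint]
```

Each returned pair would then be encapsulated and sent as an ordinary unicast Geneve packet.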
With physical networks that do support multicast, it may be desirable
to use this capability to take advantage of hardware replication for
encapsulated packets. In this case, multicast addresses may be
allocated in the physical network corresponding to tenants,
encapsulated multicast groups, or some other factor. The allocation
of these groups is a component of the control plane and, therefore,
is beyond the scope of this document.
When physical multicast is in
use, devices with heterogeneous capabilities may be present in the same group.
Some options may only be interpretable by a subset of the devices in the group.
Other devices can safely ignore such options unless the 'C' bit is set to
mark the unknown option as critical. The requirements outlined in
apply for critical options.
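A receiver's handling of unknown options might look like the following sketch. It assumes the option layout defined elsewhere in this document: a 16-bit option class, an 8-bit type whose high-order bit marks the option as critical, and a length held in the low 5 bits of the fourth byte, counted in 4-byte units and excluding the 4-byte option header. The function names and the `known` set are hypothetical.

```python
import struct
from typing import Iterator, Set, Tuple


def parse_options(buf: bytes) -> Iterator[Tuple[int, int, bool, bytes]]:
    """Yield (option_class, option_type, critical, data) for each option TLV."""
    offset = 0
    while offset < len(buf):
        if len(buf) - offset < 4:
            raise ValueError("truncated option header")
        opt_class, opt_type, rsvd_len = struct.unpack_from("!HBB", buf, offset)
        data_len = (rsvd_len & 0x1F) * 4   # low 5 bits, in 4-byte units
        end = offset + 4 + data_len
        if end > len(buf):
            raise ValueError("option data overruns the options field")
        critical = bool(opt_type & 0x80)   # high-order bit of the type
        yield opt_class, opt_type, critical, buf[offset + 4:end]
        offset = end


def must_drop(buf: bytes, known: Set[Tuple[int, int]]) -> bool:
    """True if any option is both unknown to this device and critical.
    Unknown non-critical options are simply ignored."""
    return any(critical and (cls, typ) not in known
               for cls, typ, critical, _ in parse_options(buf))
```

A device that understands only a subset of the options in a multicast group can therefore keep processing packets as long as the options it does not recognize are non-critical.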
In addition, provides examples of various mechanisms that can
be used for multicast handling in network virtualization overlay networks.
Unidirectional Tunnels
Generally speaking, a Geneve tunnel is a unidirectional concept. IP
is not a connection-oriented protocol, and it is possible for two
tunnel endpoints to communicate with each other using different paths or to
have one side not transmit anything at all. As Geneve is an IP-based
protocol, the tunnel layer inherits these same characteristics.
It is possible for a tunnel to encapsulate a protocol, such as TCP,
that is connection oriented and maintains session state at that
layer. In addition, implementations MAY model Geneve tunnels as
connected, bidirectional links, for example, to provide the abstraction of
a virtual port. In both of these cases, bidirectionality of the
tunnel is handled at a higher layer and does not affect the operation
of Geneve itself.
Constraints on Protocol Features
Geneve is intended to be flexible for use with a wide range of current and
future applications. As a result, certain constraints may be placed
on the use of metadata or other aspects of the protocol in order to
optimize for a particular use case. For example, some applications
may limit the types of options that are supported or enforce a
maximum number or length of options. Other applications may only
handle certain encapsulated payload types, such as Ethernet or IP.
These optimizations can be implemented either globally (throughout
the system) or locally (for example, restricted to certain classes
of devices or network paths).
These constraints may be communicated to tunnel endpoints either
explicitly through a control plane or implicitly by the nature of the
application. As Geneve is defined as a data plane protocol that is
control plane agnostic, definition of such mechanisms is beyond the scope of this
document.
Constraints on Options
While Geneve options are flexible, a control plane may restrict
the number of option TLVs as well as the order and size of the TLVs
between tunnel endpoints to make it simpler for a data plane
implementation in software or hardware to handle (see ).
For example, there may be some critical information, such as a secure
hash, that must be processed in a certain order to provide the lowest
latency, or there may be other scenarios where the options must be
processed in a given order due to protocol semantics.
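The kinds of restriction described above (number, order, and size of option TLVs) could be represented and checked as in the sketch below. `OptionPolicy` and `conforms` are hypothetical names, and the rule that each permitted option appears at most once is an assumption made to keep the ordering check simple.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OptionPolicy:
    """Hypothetical description of the constraints a control plane might set."""
    allowed_order: Tuple[Tuple[int, int], ...]  # (class, type) in required order
    max_options: int
    max_total_bytes: int


def conforms(options: List[Tuple[int, int, int]], policy: OptionPolicy) -> bool:
    """options: (option_class, option_type, total_length_in_bytes) per TLV,
    listed in the order they appear in the packet."""
    if len(options) > policy.max_options:
        return False
    if sum(length for _, _, length in options) > policy.max_total_bytes:
        return False
    rank = {key: i for i, key in enumerate(policy.allowed_order)}
    last = -1
    for opt_class, opt_type, _ in options:
        pos = rank.get((opt_class, opt_type))
        # Reject options outside the permitted subset, out-of-order options,
        # and repeated options (assumed to be disallowed in this sketch).
        if pos is None or pos <= last:
            return False
        last = pos
    return True
```

An encapsulating endpoint could apply such a check before transmission so that a constrained receiver never sees an option layout it cannot parse.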
A control plane may negotiate a subset of option TLVs and certain TLV
ordering; it may also limit the total number of option TLVs present
in the packet, for example, to accommodate hardware capable of
processing fewer options. Hence, a control plane
needs to have the ability to describe the supported TLV subset and
its ordering to the tunnel endpoints. In the absence of a control
plane, alternative configuration mechanisms may be used for this
purpose. Such mechanisms are beyond the scope of this document.
NIC Offloads
Modern NICs currently provide a variety of offloads to enable the
efficient processing of packets. The implementation of many of these
offloads requires only that the encapsulated packet be easily parsed
(for example, checksum offload). However, optimizations such as LSO
and LRO involve some processing of the options themselves since they
must be replicated/merged across multiple packets. In these
situations, it is desirable not to require changes to the offload
logic to handle the introduction of new options. To enable this,
some constraints are placed on the definitions of options to allow
for simple processing rules:
When performing LSO, a NIC MUST replicate the entire Geneve header
and all options, including those unknown to the device, onto each
resulting segment unless an option allows an exception.
Conversely, when performing LRO, a NIC may assume that a
binary comparison of the options (including unknown options) is
sufficient to ensure equality and MAY merge packets with equal
Geneve headers.
Options MUST NOT be reordered during the course of offload
processing, including when merging packets for the purpose of LRO.
NICs performing offloads MUST NOT drop packets with unknown
options, including those marked as critical, unless explicitly configured to do so.
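The LSO and LRO rules above can be illustrated with a deliberately simplified sketch. It treats the Geneve header and options as one opaque byte string and ignores the inner protocol headers that a real offload engine would also have to adjust; the function names are hypothetical.

```python
from typing import List


def lso_segments(geneve_header: bytes, payload: bytes, mss: int) -> List[bytes]:
    """Split payload into chunks of at most mss bytes, prefixing each with an
    identical, unmodified copy of the Geneve header and all of its options."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [geneve_header + payload[i:i + mss]
            for i in range(0, len(payload), mss)]


def lro_can_merge(header_a: bytes, header_b: bytes) -> bool:
    """Packets are candidates for merging only if their Geneve headers,
    including unknown options, are byte-for-byte equal."""
    return header_a == header_b
```

Because neither function inspects individual options, the same logic keeps working when new option types are introduced, which is the property these rules are meant to preserve.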
There is no requirement that a given implementation of Geneve employ
the offloads listed as examples above. However, as these offloads
are currently widely deployed in commercially available NICs, the
rules described here are intended to enable efficient handling of
current and future options across a variety of devices.
Inner VLAN Handling
Geneve is capable of encapsulating a wide range of protocols; therefore, a given implementation is likely to support only a small
subset of the possibilities. However, as Ethernet is expected to be
widely deployed, it is useful to describe the behavior of VLANs
inside encapsulated Ethernet frames.
As with any protocol, support for inner VLAN headers is OPTIONAL. In
many cases, the use of encapsulated VLANs may be disallowed due to
security or implementation considerations. However, in other cases, the trunking of VLAN frames across a Geneve tunnel can prove useful. As
a result, the processing of inner VLAN tags upon ingress or egress
from a tunnel endpoint is based upon the configuration of the tunnel
endpoint and/or control plane and is not explicitly defined as part of
the data format.
Transition Considerations
Viewed exclusively from the data plane, Geneve is compatible with existing IP networks
as it appears to most devices as UDP packets.
However, as there are already a number of tunneling protocols deployed
in network virtualization environments, there is a practical question
of transition and coexistence.
Since Geneve builds on the base data plane functionality provided by the most
common protocols used for network virtualization (VXLAN and NVGRE),
it should be straightforward to port an existing control plane
to run on top of it with minimal effort. With both the old and new
packet formats supporting the same set of capabilities, there is no
need for a hard transition; tunnel endpoints directly communicating with
each other can use any common protocol, which may be different even
within a single overall system.
As transit devices are primarily
forwarding packets on the basis of the IP header, all protocols
appear to be similar, and these devices do not introduce additional
interoperability concerns.
To assist with this transition, it is strongly suggested that
implementations support simultaneous operation of both Geneve and
existing tunneling protocols, as it is expected to be common for a single
node to communicate with a mixture of other nodes. Eventually, older
protocols may be phased out as they are no longer in use.
Security Considerations
As it is encapsulated within a UDP/IP packet, Geneve does not have any inherent security
mechanisms.
As a result, an attacker with access to the underlay
network transporting the IP packets has the ability to snoop on, alter, or
inject packets. Compromised tunnel endpoints or transit devices may also
spoof identifiers in the tunnel header to gain access to networks
owned by other tenants.
Within a particular security domain, such as a data center operated
by a single service provider, the most common and highest-performing security
mechanism is isolation of trusted components. Tunnel traffic can be
carried over a separate VLAN and filtered at any untrusted
boundaries.
When crossing an untrusted link, such as the general Internet, VPN technologies such as IPsec
should be used to provide authentication and/or encryption of
the IP packets formed as part of Geneve encapsulation (see ).
Geneve does not otherwise affect the security of the encapsulated
packets. As per the guidelines of BCP 72, the following sections
describe potential security risks that may be applicable to Geneve deployments
and approaches to mitigate such risks. It is also noted that not all such risks are applicable
to all Geneve deployment scenarios, i.e., only a subset may be applicable to certain deployments.
An operator has to make an assessment based on their network
environment, determine the risks that are applicable to their specific environment, and use appropriate mitigation approaches as applicable.
Data Confidentiality
Geneve is a network virtualization overlay encapsulation protocol
designed to establish tunnels between NVEs
over an existing IP network. It can be used to deploy multi-tenant overlay networks
over an existing IP underlay network in a public or private data center.
The overlay service is typically provided by a service provider, such as a
cloud service provider or a private data center operator. This may or may not be
the same provider as an underlay service provider. Due to the nature of multi-tenancy in such environments,
a tenant system may expect data confidentiality to ensure its packet data is not tampered with
(i.e., active attack) in transit and is not a target of unauthorized
monitoring (i.e., passive attack), for example, by other tenant systems or the underlay service provider.
A compromised network node or a transit device within a
data center may passively monitor Geneve packet data between NVEs or route
traffic for further inspection. A tenant may
expect the overlay service provider to provide data confidentiality as part of the service, or
a tenant may bring its own data confidentiality mechanisms like IPsec or TLS to protect the data
end to end between its tenant systems. The overlay provider is expected to provide
cryptographic protection in cases where the underlay provider is not the
same as the overlay provider to ensure the payload is not exposed to the underlay.
If an operator determines data confidentiality is necessary in their environment
based on their risk analysis -- for example, in multi-tenant
environments -- then an encryption mechanism SHOULD be used to encrypt the tenant
data end to end between the NVEs. The NVEs may use existing well-established
encryption mechanisms, such as IPsec, DTLS, etc.
Inter-Data Center Traffic
A tenant system in a customer premises (private data center) may want to connect
to tenant systems on their tenant overlay network in a public cloud data center, or a tenant may want to have its tenant systems located in multiple geographically
separated data centers for high availability. Geneve data traffic between tenant systems
across such separated networks should be protected from threats when traversing public networks.
Any Geneve overlay data leaving the data center network beyond the operator's security domain
SHOULD be secured by encryption mechanisms, such as
IPsec or other VPN technologies, to protect the communications between the NVEs
when they are geographically separated over untrusted network links. Specification of
data protection mechanisms employed between data centers is beyond the scope of this document.
The principles described in regarding controlled environments still apply to
the geographically separated data center usage outlined in this section.
Data Integrity
Geneve encapsulation is used between NVEs to establish overlay tunnels over an existing
IP underlay network. In a multi-tenant data center, a rogue or compromised tenant system
may try to launch a passive attack, such as monitoring the traffic of other tenants, or an
active attack, such as injecting unauthorized Geneve-encapsulated traffic (for example,
spoofed or replayed packets) into the network. To prevent such attacks, an NVE MUST NOT
propagate Geneve packets beyond the NVE to tenant systems and SHOULD employ packet-filtering
mechanisms so as not to forward unauthorized traffic between tenant systems in different tenant networks.
An NVE MUST NOT interpret Geneve packets from tenant systems other than as frames to be encapsulated.
A compromised network node or a transit device within a data center may launch an active
attack trying to tamper with the Geneve packet data between NVEs. Malicious tampering of
Geneve header fields may cause the packet from one tenant to be forwarded to a different
tenant network. If an operator determines there is a possibility of such a threat in their environment,
the operator may choose to employ data integrity mechanisms between NVEs. In order to prevent
such risks, a data integrity mechanism SHOULD be used in such environments to protect the
integrity of Geneve packets, including packet headers, options, and payload on communications
between NVE pairs. A cryptographic data protection mechanism, such as IPsec, may be used to
provide data integrity protection. A data center operator may choose to deploy any other
data integrity mechanisms as applicable and supported in their underlay networks,
although non-cryptographic mechanisms may not protect the Geneve portion of the packet from tampering.
Authentication of NVE Peers
A rogue network device or a compromised NVE in a data center environment might be able to
spoof Geneve packets as if they came from a legitimate NVE. In order to mitigate such a risk,
an operator SHOULD use an authentication mechanism, such as IPsec, to ensure that the
Geneve packet originated from the intended NVE peer in environments where the operator
determines spoofing or rogue devices are potential threats. Other simpler source checks,
such as ingress filtering for VLAN/MAC/IP addresses, reverse path forwarding checks, etc.,
may be used in certain trusted environments to ensure Geneve packets originated
from the intended NVE peer.
Options Interpretation by Transit Devices
Options, if present in the packet, are generated and terminated by tunnel endpoints. As indicated
in , transit devices may interpret the options. However,
if the packet is protected by encryption from tunnel endpoint
to tunnel endpoint (for example, through IPsec), transit devices will not have visibility into the Geneve header or options
in the packet. In such cases, transit devices MUST handle Geneve packets as any other IP packet
and maintain consistent forwarding behavior. In cases where options are interpreted by transit devices, the operator
MUST ensure that transit devices are trusted and not compromised. The definition of
a mechanism to ensure this trust is beyond the scope of this document.
Multicast/Broadcast
In typical data center networks where IP multicasting is not supported in the underlay
network, multicasting may be supported using multiple unicast tunnels. The same security
requirements as described in the above sections can be used to protect Geneve communications
between NVE peers. If IP multicasting is supported in the underlay network and the operator
chooses to use it for multicast traffic among tunnel endpoints, then the operator in such
environments may use data protection mechanisms, such as IPsec with multicast
extensions, to protect multicast traffic among Geneve NVE groups.
Control Plane Communications
A Network Virtualization Authority (NVA) as outlined in may
be used as a control plane for configuring and managing the Geneve NVEs. The data center
operator is expected to use security mechanisms to protect the communications between
the NVA and NVEs and to use authentication mechanisms to detect any rogue or compromised
NVEs within their administrative domain. Data protection mechanisms for control plane
communication or authentication mechanisms between the NVA and NVEs are beyond
the scope of this document.
IANA Considerations
IANA has allocated UDP port 6081 in the "Service Name and Transport Protocol
Port Number Registry" as the well-known destination port
for Geneve:
In addition, IANA has created a new subregistry titled "Geneve Option Class"
for option classes. This registry has been placed under
a new "Network Virtualization Overlay (NVO3)" heading in the IANA protocol registries.
The "Geneve Option Class" registry consists of
16-bit hexadecimal values along with descriptive strings, assignee/contact information, and references.
The registration rules for the new registry are (as defined by ):
Geneve Option Class Registry Ranges
Range            Registration Procedures
0x0000-0x00FF    IETF Review
0x0100-0xFEFF    First Come First Served
0xFF00-0xFFFF    Experimental Use
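For implementers who want to classify an option class value against these ranges, a small lookup such as the following may be convenient. The constant and function names are hypothetical; the port number and range boundaries are taken from the text and table above.

```python
GENEVE_UDP_PORT = 6081  # well-known destination port allocated by IANA

# (first, last, registration procedure), as listed in the registry ranges.
OPTION_CLASS_RANGES = [
    (0x0000, 0x00FF, "IETF Review"),
    (0x0100, 0xFEFF, "First Come First Served"),
    (0xFF00, 0xFFFF, "Experimental Use"),
]


def registration_procedure(option_class: int) -> str:
    """Return the registration procedure that governs a 16-bit option class."""
    for first, last, procedure in OPTION_CLASS_RANGES:
        if first <= option_class <= last:
            return procedure
    raise ValueError("option class must be a 16-bit value")
```

For example, an option class of 0xFF01 falls in the range reserved for experimental use.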
References
Normative References
Informative References
IANA, "IEEE 802 Numbers"
IANA, "Protocol Registries"
IANA, "Service Name and Transport Protocol Port Number Registry"
IEEE, "IEEE Standard for Local and Metropolitan Area Networks--Bridges and Bridged Networks"
"VL2: A Scalable and Flexible Data Center Network", ACM SIGCOMM Computer Communication Review
Acknowledgements
The authors wish to acknowledge the many members of the NVO3 Working Group for their reviews, comments, and suggestions.
The authors would also like to thank those who provided guidance throughout the process.
Contributors
The following individuals were authors of an earlier version of this
document and made significant contributions:
Microsoft Corporation, 1 Microsoft Way, Redmond, WA 98052, United States of America, pankajg@microsoft.com
Red Hat Inc., 1801 Varsity Drive, Raleigh, NC 27606, United States of America, chrisw@redhat.com
Arista Networks, 5453 Great America Parkway, Santa Clara, CA 95054, United States of America, kduda@arista.com
Independent, didutt@gmail.com
Independent, jon.hudson@gmail.com
Facebook, Inc., 1 Hacker Way, Menlo Park, CA 94025, United States of America, ahendel@fb.com