ZeroTier Mapping for Scalability Protocols

draft

Abstract

This document defines the ZeroTier mapping for scalability protocols. This enables SP protocols to run over a ZeroTier network. The transport defined here sits on top of an unreliable virtual Layer 2 transport, and does not require a TCP/IP stack.

License

Copyright 2018 Staysail Systems, Inc.
Copyright 2018 Capitar IT Group BV

This specification is licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the license online.

Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Underlying protocol

ZeroTier expresses an 802.3 style layer 2, where frames may be exchanged as if they were Ethernet frames. Virtual broadcast domains are created within a numbered "network", and frames may then be exchanged with any peers on that network.

Frames may arrive in any order, or be lost, just as with Ethernet (best effort delivery), but they are strongly protected by a cryptographic checksum, so frames that do arrive will be uncorrupted. Furthermore, ZeroTier guarantees that a given frame will be received at most once.

Each application on a ZeroTier network has its own address[1], called a ZeroTier ID (ZTID), which is globally unique — this is generated from a hash of the public key associated with the application.

A given application may participate in multiple ZeroTier networks.

Sharing of ZeroTier IDs between applications, use of multiple ZTID values within a single application, and management of the associated ZeroTier-specific state are out of scope for this document.

ZeroTier networks have a standard MTU of 2800 bytes, but over typical public networks an "optimum" MTU of 1400 bytes is used. ZeroTier may be configured to have larger MTUs, but typically this involves extensive reassembly at underlying layers, and implementations SHOULD use the optimum MTU advertised by the ZeroTier implementation.

Note that at this time, broadcast and multicast are not supported by this mapping. (A future update may resolve this.)

Packet layout

Each SP message sent over ZeroTier is composed of one or more fragments, where each fragment is mapped to a single underlying ZeroTier L2 frame. We use the EtherType value 0x0901 to indicate the SP over ZeroTier protocol (number to be registered with IEEE).

The ZeroTier L2 payload shall be encoded with a header as follows:

[Figure: zerotier0 header]

All numeric fields are in big-endian byte order. Note that ZeroTier APIs present this as the L2 payload, but ZeroTier itself may prepend additional data such as the Ethernet type, and source and destination MAC addresses, as well as ZeroTier specific headers. The details of such headers are out of scope for this document.

As above, the start of each frame is just as for a normal Ethernet payload. The Ethernet type (ethertype) we use for these frames is 0x0901, with a VLAN ID of 0.

The op is a field that indicates the type of message being sent. The following values are defined:

0x00  DATA
0x10  CONN-REQ
0x12  CONN-ACK
0x20  DISC
0x30  PING
0x32  PONG
0x40  ERR

These are discussed further below. Implementations MUST discard messages where the op is not one of these.

The flags field is reserved for future use, and MUST be zero. Implementations MUST discard frames for which this is not true.

The version byte MUST be set to 0x1. Implementations MUST discard any messages received for any other version.
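The three discard rules above (unknown op, non-zero flags, wrong version) can be sketched as a single check. This is only an illustration that works on header fields which have already been extracted; it does not assume any byte offsets, and the names `Op` and `should_discard` are hypothetical.

```python
from enum import IntEnum


class Op(IntEnum):
    """Op codes as listed in this document."""
    DATA = 0x00
    CONN_REQ = 0x10
    CONN_ACK = 0x12
    DISC = 0x20
    PING = 0x30
    PONG = 0x32
    ERR = 0x40


VERSION = 0x01
_VALID_OPS = frozenset(int(op) for op in Op)


def should_discard(op: int, flags: int, version: int) -> bool:
    """Return True if a frame with these header fields must be dropped."""
    if op not in _VALID_OPS:
        return True   # unknown op
    if flags != 0:
        return True   # flags are reserved and must be zero
    if version != VERSION:
        return True   # only version 0x1 is accepted
    return False
```

A receiver would run a check like this before looking at any op-specific payload.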

The source port and destination port are used to construct a logical conversation. These are 24 bits wide, and are discussed further below. The reserved fields must be set to zero.

The remainder of the frame varies depending on the op used.

Note that it is not by accident that the payload is 32-bit aligned in this message format. The payload is actually 64-bit aligned.

Port Fields

The port fields are used to discriminate different uses, allowing one application to have multiple connections or sockets open. The purpose is analogous to TCP port numbers, except that instead of the operating system performing the discrimination the application or library code must do so. Note that port numbers are 24 bits wide. This width was chosen to allow a peer to allocate a unique port number for each local conversation, allowing up to 16 million concurrent conversations. It also allows a 40-bit node number to be combined with the 24-bit port number to create a 64-bit unique address.
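As an illustration of the 64-bit address mentioned above, the sketch below combines a 40-bit node number with a 24-bit port. Placing the node number in the high-order bits is an assumption made for this example; the text does not fix the bit order. The function names are hypothetical.

```python
NODE_BITS = 40
PORT_BITS = 24


def make_address(node_id: int, port: int) -> int:
    """Combine a 40-bit node number and a 24-bit port into one 64-bit value.

    Assumption: the node number occupies the high-order 40 bits.
    """
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node number must fit in 40 bits")
    if not 0 <= port < (1 << PORT_BITS):
        raise ValueError("port must fit in 24 bits")
    return (node_id << PORT_BITS) | port


def split_address(address: int) -> tuple:
    """Inverse of make_address: return (node_id, port)."""
    return address >> PORT_BITS, address & ((1 << PORT_BITS) - 1)
```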

DATA messages

DATA messages carry SP protocol payload data. They can only be sent on an established session (see CONN messages below), and are never acknowledged (in this version). The op-specific payload they carry is formed like this:

[Figure: zerotier0 data]

All fragments, except for the last, MUST be the same size. The fragment size field carries the size of every fragment, except that the last fragment may be shorter; however even for the last fragment, the fragment size MUST be the size of the rest of the fragments. This is necessary to allow a receiver to know the fragment size of the other fragments even if the final fragment is received before any others. (Typically this may occur if a message consisting of two fragments arrives with fragments out of order.)

The last fragment shall have the fragment number equal to the total fragments minus one, and the first fragment shall have fragment number 0. Under typical optimal conditions, with an optimal MTU of 1400 bytes, the largest message that can be transmitted is approximately 86 MB. Specifically the limit is (65534 * (1400 - 20)) = 90,436,920 bytes. (Larger MTUs may be used, if the implementation determines that it is advantageous to do so. Doing so would necessarily give a larger maximum message size.)
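The fragmentation rules and the size limit above can be checked with a short sketch. It assumes the 20 bytes of per-frame overhead implied by the formula, and it assumes that a message shorter than one fragment still advertises the full fragment capacity as its fragment size; `split_message` and `max_message_size` are hypothetical names.

```python
MAX_FRAGMENTS = 65534  # fragment count used in the limit stated above


def max_message_size(mtu: int = 1400, overhead: int = 20) -> int:
    """Largest message that fits in MAX_FRAGMENTS fragments."""
    return MAX_FRAGMENTS * (mtu - overhead)


def split_message(body: bytes, mtu: int = 1400, overhead: int = 20) -> list:
    """Split body into (fragment number, total, fragment size, data) tuples.

    Every tuple carries the same fragment size, even the final (possibly
    shorter) fragment, so a receiver can size its buffer from any of them.
    """
    fragsz = mtu - overhead
    total = max(1, -(-len(body) // fragsz))  # ceiling division, at least 1
    if total > MAX_FRAGMENTS:
        raise ValueError("message too large to fragment")
    return [
        (i, total, fragsz, body[i * fragsz:(i + 1) * fragsz])
        for i in range(total)
    ]
```

With the defaults, `max_message_size()` evaluates to 90,436,920, matching the figure quoted above.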

However, transmitting such a large message would require sending over 65 thousand fragments, and given the likelihood of fragment loss, and the lack of acknowledgment, it is likely that the entire message would be lost. As a result, implementations are encouraged to limit the amount of data that they send to at most a few megabytes. Implementations receiving the first fragment can easily calculate the worst case for the message size (the size of the user payload multiplied by the total number of fragments), and MAY reply to the sender with an ERR message using the code 0x05, indicating that the message is larger than the receiver is willing to accept.
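A receiver-side check of the worst-case size might look like the following sketch. The 4 MiB default limit is an arbitrary illustrative value, not something this document prescribes, and the function names are hypothetical.

```python
ERR_MSG_TOO_BIG = 0x05  # "Message size too big" error code


def worst_case_size(fragment_size: int, total_fragments: int) -> int:
    """Upper bound on the message size, computable from any one fragment."""
    return fragment_size * total_fragments


def check_fragment(fragment_size: int, total_fragments: int,
                   limit: int = 4 * 1024 * 1024):
    """Return None to accept the message, or the ERR code to reply with."""
    if worst_case_size(fragment_size, total_fragments) > limit:
        return ERR_MSG_TOO_BIG
    return None
```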

Each fragment for a given message must carry the same message ID. Implementations MUST initialize this to a random value when starting a conversation, and MUST increment this each time a new message is sent. Message IDs of zero are not permitted; implementations MUST skip past zero when incrementing message IDs.

Implementations may detect the loss of a message by noticing skips in the message IDs that are received, accounting for the expected skip past zero.
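The message ID rules can be sketched as follows. The width of the message ID field is not stated in the surrounding text, so the 16-bit width used here is an assumption, as are the function names.

```python
import random

MSG_ID_BITS = 16                     # assumption: field width not stated here
MSG_ID_MASK = (1 << MSG_ID_BITS) - 1


def initial_message_id(rng=random) -> int:
    """Pick a random, non-zero starting message ID."""
    return rng.randrange(1, MSG_ID_MASK + 1)


def next_message_id(current: int) -> int:
    """Increment the message ID, skipping past zero on wrap-around."""
    nxt = (current + 1) & MSG_ID_MASK
    return nxt or 1


def messages_lost(previous: int, received: int) -> int:
    """Count the IDs skipped between two consecutively received messages.

    Valid IDs run 1..MSG_ID_MASK and then wrap to 1, so the sequence has
    period MSG_ID_MASK and the skip past zero is not counted as a loss.
    """
    return (received - previous - 1) % MSG_ID_MASK
```

For example, receiving ID 1 directly after the highest ID reports no loss, because zero is never used.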

Note that no field conveys the length of the fragment itself, as this can be determined from the L2 length — the user data within the fragment extends to the end of the L2 payload supplied by ZeroTier. (And, all fragments other than the final fragment for a message must therefore have the same length.)

CONN-REQ and CONN-ACK messages

CONN-REQ frames represent a request from an initiator to establish a session, i.e. a new conversation or connection, and CONN-ACK messages are the normal successful reply from the responder. They both take the same form, which consists of the usual headers along with the sender's 16-bit (big-endian) SP protocol ID appended:

[Figure: zerotier0 conn]

The connection is initiated by the initiator sending this message, with its own SP protocol ID, with the op set to CONN-REQ. The initiator must choose a source port number that is not currently being used with the remote peer. (Most implementations will choose a source port that is not used at all. Source port numbers SHOULD be chosen randomly.)

The responder will acknowledge this by replying with its SP protocol ID in the 4-byte payload, using the CONN-ACK op. Additionally, the source port number that the responder replies with MUST be the one the initiator requested.

(Responders will identify the session using the initiator's chosen source port, which the initiator MUST NOT concurrently use for any other sessions.)

Alternatively, a responder MAY reject the connection attempt by sending a suitably formed ERR message (see below).

If a sender does not receive a reply, it SHOULD retry this message before giving up and reporting an error to the user. It is recommended that a configurable number of retries and time interval be used.

Given modern Internet latencies of generally less than 500 ms, resending up to 12 CONN-REQ requests, once every 5 seconds, before giving up seems reasonable. (These times are somewhat larger to allow for ZeroTier path discovery to take place; this results in a timeout of approximately a minute.)
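The suggested retry policy can be sketched as a small loop. The two callables are hypothetical hooks supplied by the caller: `send_conn_req` transmits one CONN-REQ frame, and `wait_for_reply` blocks for up to the given number of seconds and returns True if a CONN-ACK arrived.

```python
from typing import Callable


def connect_with_retry(send_conn_req: Callable[[], None],
                       wait_for_reply: Callable[[float], bool],
                       attempts: int = 12,
                       interval: float = 5.0) -> bool:
    """Send CONN-REQ up to `attempts` times, `interval` seconds apart.

    Returns True as soon as a reply is seen, or False once all attempts
    have timed out (about a minute with the defaults).
    """
    for _ in range(attempts):
        send_conn_req()
        if wait_for_reply(interval):
            return True
    return False
```

Because CONN-REQ is idempotent, repeating the request this way is safe even if an earlier copy was delivered and only the reply was lost.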

The initiator MUST NOT send any DATA messages for a conversation until it has received a CONN-ACK from the other party, and it MUST send all further messages for the conversation to the port number supplied by the responder.

If a CONN-REQ frame is received by a responder for a conversation that already exists, the responder MUST reply. Further, the source port it replies with, and the SP protocol IDs MUST be identical to what it first sent. This ensures that the CONN-REQ request is idempotent.

DISC messages

DISC messages are used to request a session be terminated. This notifies the remote sender that no more data will be sent or accepted, and the session resources may be released. There is no payload. There is no acknowledgment.

PING and PONG messages

In order to keep session state, implementations will generally store data for each session. In order to prevent a stale session from consuming these resources forever, and in order to keep underlying ZeroTier sessions alive, a PING message MAY be sent to a peer with whom a session has been established. This message has no payload.

If the PING is successful, then the responder MUST reply with a PONG message. As with PING, the PONG message carries no payload.

There is no response to a PONG message.

In the event of an error, an implementation MAY reply with an ERR message.

Implementations SHOULD NOT initiate PING messages if they have received other session messages recently.

Implementations SHOULD wait for a timeout of T1 seconds before initiating a PING message the first time. In the absence of a reply, up to N further attempts SHOULD be made, separated by T2 seconds. If no reply to the Nth attempt is received after T2 seconds have passed, then the remote peer should be assumed offline or dead, and the session closed.

The values for T1, T2, and N SHOULD be configurable, with recommended default values of 60, 10, and 5. With these values, sessions that appear dead after 2 minutes will be closed, and their resources reclaimed.
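The two-minute figure follows from the defaults: the first PING goes out after T1 seconds of silence, N further attempts follow at T2 intervals, and the session is closed T2 seconds after the last one. A minimal sketch of that arithmetic, using the hypothetical helper name `keepalive_schedule`:

```python
def keepalive_schedule(t1: int = 60, t2: int = 10, n: int = 5):
    """Return (ping_times, close_time) in seconds since the last message.

    The first PING is sent at t1; up to n further PINGs follow, t2 apart.
    The session is closed t2 seconds after the final unanswered PING.
    """
    ping_times = [t1 + k * t2 for k in range(n + 1)]
    close_time = ping_times[-1] + t2
    return ping_times, close_time
```

With the recommended defaults this yields PINGs at 60, 70, 80, 90, 100 and 110 seconds and a close time of 120 seconds.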

ERR messages

ERR messages indicate a failure in the session, and abruptly terminate the session. The payload for these messages consists of a single byte error code, followed by an ASCII message describing the error (not terminated by zero). This message MUST NOT be more than 128 bytes in length.

The following error codes are defined:

0x01  No party listening at that address or port.
0x02  No such session found.
0x03  SP protocol ID invalid.
0x04  Generic protocol error.
0x05  Message size too big.
0xff  Other uncategorized error.

Implementations MUST discard any session state upon receiving an ERR message. These messages are not acknowledged.
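The ERR payload described above (one code byte followed by unterminated ASCII text) can be encoded and decoded as in the sketch below. It covers only the op-specific payload, not the common header, and it reads the 128-byte limit as applying to the text portion, which is an interpretation. The function names are hypothetical.

```python
MAX_ERR_TEXT = 128  # assumption: the limit applies to the ASCII text


def encode_err_payload(code: int, text: str) -> bytes:
    """Build the ERR payload: a single code byte, then ASCII text."""
    if not 0 <= code <= 0xFF:
        raise ValueError("error code must fit in one byte")
    message = text.encode("ascii")
    if len(message) > MAX_ERR_TEXT:
        raise ValueError("error text longer than 128 bytes")
    return bytes([code]) + message


def decode_err_payload(payload: bytes) -> tuple:
    """Split an ERR payload into (code, text)."""
    if not payload:
        raise ValueError("ERR payload is missing the error code")
    # Be lenient with text from peers: replace any non-ASCII bytes.
    return payload[0], payload[1:].decode("ascii", errors="replace")
```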

Message Reassembly

Implementations MUST accept and reassemble fragmented DATA messages. Implementations MUST discard fragmented messages of other types.

Messages larger than the ZeroTier MTU MUST be fragmented.

Implementations SHOULD limit the number of unassembled messages retained for reassembly, to minimize the likelihood of intentional abuse. It is suggested that at most 2 unassembled messages be retained. It is further suggested that if 2 or more unfragmented messages arrive before a message is reassembled, or more than 5 seconds pass before the reassembly is complete, that the unassembled fragments be discarded.
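A bounded reassembly buffer along these lines might look like the sketch below. It keeps at most two partial messages, expires them after five seconds, and evicts the oldest partial message when a third one starts; that eviction policy is one possible reading of the suggestion above, and the class and parameter names are hypothetical.

```python
import time


class Reassembler:
    """Collect DATA fragments per message ID, with a cap and a timeout."""

    def __init__(self, max_pending=2, timeout=5.0, clock=time.monotonic):
        self.max_pending = max_pending
        self.timeout = timeout
        self.clock = clock
        self.pending = {}  # message ID -> (start time, total, {number: data})

    def add(self, msg_id, frag_no, total, data):
        """Store one fragment; return the whole message once complete."""
        now = self.clock()
        # Drop partial messages that have been waiting too long.
        expired = [m for m, e in self.pending.items()
                   if now - e[0] > self.timeout]
        for m in expired:
            del self.pending[m]
        entry = self.pending.get(msg_id)
        if entry is None:
            if len(self.pending) >= self.max_pending:
                oldest = min(self.pending, key=lambda m: self.pending[m][0])
                del self.pending[oldest]
            entry = self.pending[msg_id] = (now, total, {})
        _, expected_total, frags = entry
        if total != expected_total or not 0 <= frag_no < total:
            return None  # inconsistent fragment; ignore it
        frags[frag_no] = data
        if len(frags) < total:
            return None
        del self.pending[msg_id]
        return b"".join(frags[i] for i in range(total))
```

Passing the clock in as a parameter keeps the timeout logic testable without real delays.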

Ports

The port numbers are 24-bit fields, allowing a single ZTID to service multiple application layer protocols, which could be treated as separate end points, or as separate sockets in the application. The implementation is responsible for discriminating on these and delivering to the appropriate consumer.

As with UDP or TCP, it is intended that each party have its own port number, and that a pair of ports (combined with ZeroTier IDs) be used to identify a single conversation.

An SP server SHOULD allocate a port number for advertisement. It is expected that clients will generate ephemeral port numbers.

Implementations are free to choose how to allocate port numbers, but it is RECOMMENDED that administratively configured port numbers be small, with the high order bit clear, and that numbers larger than 2^23 (high order bit set) be used for ephemeral allocations.
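One way to follow this recommendation is to draw ephemeral ports at random from the half of the 24-bit space that has the high-order bit set. The sketch below treats every value from 2^23 upward as ephemeral; the function name and the bounded retry count are illustrative choices.

```python
import random

EPHEMERAL_MIN = 1 << 23   # lowest port with the high-order bit set
PORT_LIMIT = 1 << 24      # ports are 24 bits wide


def pick_ephemeral_port(in_use, rng=random, attempts=1000) -> int:
    """Pick a random ephemeral port that is not already in use."""
    for _ in range(attempts):
        port = rng.randrange(EPHEMERAL_MIN, PORT_LIMIT)
        if port not in in_use:
            return port
    raise RuntimeError("no free ephemeral port found")
```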

It is RECOMMENDED that separate short queues (perhaps just one or two messages long) be kept per local port in implementations, to prevent head-of-line blocking issues where backpressure on one consumer (perhaps just a single thread or socket) blocks others.

URI Format

The URI scheme used to represent ZeroTier addresses makes use of ZeroTier IDs, ZeroTier network IDs, and our own 24-bit ports.

The format SHALL be zt://ztid.nwid:port, where the nwid component represents the 64-bit hexadecimal ZeroTier network ID, the ztid represents the 40-bit hexadecimal ZeroTier Device ID, and the port is the 24-bit port number (decimal) previously described.

An implementation MAY allow the ztid to be replaced with * to indicate that the node's local ZTID be used.

An implementation MAY permit the use of port number of 0 when listening, to indicate that a random ephemeral port should be chosen.
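A parser for this URI format could be sketched as follows. It accepts the optional `*` wildcard for the ztid and returns `None` in that case; the function name is hypothetical, and the IDs in the usage note are made-up values.

```python
import re

_ZT_URI = re.compile(
    r"^zt://(\*|[0-9a-fA-F]{1,10})\.([0-9a-fA-F]{1,16}):([0-9]+)$"
)


def parse_zt_uri(uri: str) -> tuple:
    """Parse zt://ztid.nwid:port into (ztid, nwid, port).

    ztid is None when the '*' wildcard is used; ztid and nwid are parsed
    as hexadecimal, and the port as decimal.
    """
    match = _ZT_URI.match(uri)
    if match is None:
        raise ValueError("not a valid zt:// URI: %r" % uri)
    ztid_text, nwid_text, port_text = match.groups()
    ztid = None if ztid_text == "*" else int(ztid_text, 16)
    nwid = int(nwid_text, 16)
    port = int(port_text)
    if port >= 1 << 24:
        raise ValueError("port does not fit in 24 bits")
    return ztid, nwid, port
```

For instance, `parse_zt_uri("zt://*.0123456789abcdef:0")` would describe a listener on the local node that asks for a random ephemeral port.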

Security Considerations

The mapping isn't intended to provide any additional security beyond that provided by ZeroTier itself. Managing the key materials used by ZeroTier is implementation-specific, and implementations must take appropriate care when dealing with them.


1. Technically an application may have more than one ZeroTier address, but such uses are unusual.
"nanomsg" is a trademark of Garrett D'Amore.