UDP Peer Connection
This is a streaming protocol that works almost exactly like TCP, except that it is managed by the application layer and carried over UDP datagrams. The main advantage is that UDP traffic can sometimes be hole-punched through restrictive NAT routers and firewalls.
Every packet has a 20-byte header in the following format:
Type and Version 1 byte
The lowest 4 bits are the version, which should be set to 1.
The highest 4 bits indicate the type of packet:
0
Data
This type of packet always has a data payload. Sending it increments the sequence number.
1
Finalize
This is like a data packet, but with no payload. It increments the sequence number, and that number is the last sequence number in the stream.
2
State
This is used to transmit an acknowledgement without any data. The sequence number is not incremented.
3
Reset
This indicates a full loss of the connection. The sender no longer has any running sockets matching this connection ID.
4
SYN
This is the first packet sent when initiating a connection. Its sequence number is always 1. There is no data payload.
First Extension ID 1 byte
If the header has any extensions, this byte is nonzero and holds the ID of the first extension. Each extension follows the header as a 1-byte next extension ID, then a 1-byte data length, then that many bytes of data. If the next extension ID is nonzero, another extension with the same structure follows the data of the previous one, repeating for as many extensions as there are.
There is currently only one extension ID:
1
Selective Acks. This allows non-sequential acknowledgement to reduce unnecessary repeat transmission of packets. More information about this extension is below.
Connection ID 2 bytes
This is chosen at random by the connection initiator.
In the SYN packet, this is the ID the remote peer is expected to use in the packets it sends. In every packet other than the SYN, the connection initiator uses the ID sent with the SYN plus 1.
Timestamp 4 bytes big-endian
This is a monotonic timestamp in microseconds for this packet.
Timestamp Diff 4 bytes big-endian
This is calculated each time the peer sending this packet receives a packet, and the value from the most recently received packet is sent. It is the sending peer's monotonic timestamp minus the timestamp in the received packet. This should be zero if the sending peer hasn't yet received any packets from the remote peer.
Window Size 4 bytes big-endian
This is set to the number of bytes the sending peer can receive without overflowing its receive buffer. New data packets should not be sent if doing so would leave more than this many bytes sent but not yet acknowledged.
Sequence Number 2 bytes big-endian
Each new data packet (not retransmissions) pre-increments the sequence number. The connection initiator's SYN packet has a sequence number of 1. The first data packet will therefore have a sequence number of 2.
The connection receiver initializes its first sending sequence number with a random number when it replies to the first SYN packet with a State packet. Its first data packet will have a sequence number one higher than the random number sent with the first State packet.
Acknowledgement Number 2 bytes big-endian
This is the sequence number of the last contiguous packet that has been received by the sender of this packet. For example, if sequences 1, 2, 3, and 5 have been received, this number should be 3. In this case, sequence 5 would be acknowledged with a Selective Ack header extension (see below).
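The header layout above can be sketched with Python's struct module. This is a minimal illustration rather than a reference implementation: the names Header, HEADER, and the TYPE_ constants are made up for the example, and packing the connection ID as big-endian is an assumption, since the field list only states the byte order for the other multi-byte fields.

```python
import struct
from dataclasses import dataclass

# Field order from the header description: type/version, first extension ID,
# connection ID, timestamp, timestamp diff, window size, sequence, ack.
# ">" selects big-endian; applying it to the connection ID is an assumption.
HEADER = struct.Struct(">BBHIIIHH")  # 1+1+2+4+4+4+2+2 = 20 bytes

TYPE_DATA, TYPE_FINALIZE, TYPE_STATE, TYPE_RESET, TYPE_SYN = range(5)
VERSION = 1


@dataclass
class Header:
    packet_type: int
    connection_id: int
    timestamp: int        # microseconds; truncated to 32 bits when packed
    timestamp_diff: int   # microseconds; truncated to 32 bits when packed
    window_size: int
    seq_nr: int
    ack_nr: int
    first_extension: int = 0
    version: int = VERSION

    def pack(self) -> bytes:
        # Type goes in the highest 4 bits, version in the lowest 4 bits.
        type_and_version = ((self.packet_type & 0x0F) << 4) | (self.version & 0x0F)
        return HEADER.pack(
            type_and_version,
            self.first_extension,
            self.connection_id,
            self.timestamp & 0xFFFFFFFF,
            self.timestamp_diff & 0xFFFFFFFF,
            self.window_size,
            self.seq_nr & 0xFFFF,
            self.ack_nr & 0xFFFF,
        )

    @classmethod
    def unpack(cls, packet: bytes) -> "Header":
        if len(packet) < HEADER.size:
            raise ValueError("packet shorter than the 20-byte header")
        tv, ext, conn_id, ts, ts_diff, window, seq, ack = HEADER.unpack_from(packet)
        return cls(
            packet_type=tv >> 4,
            version=tv & 0x0F,
            first_extension=ext,
            connection_id=conn_id,
            timestamp=ts,
            timestamp_diff=ts_diff,
            window_size=window,
            seq_nr=seq,
            ack_nr=ack,
        )


# A SYN as the initiator might build it: sequence number 1, no payload.
syn = Header(TYPE_SYN, connection_id=0x1234, timestamp=1_000_000,
             timestamp_diff=0, window_size=65536, seq_nr=1, ack_nr=0)
raw = syn.pack()
```

Masking the timestamps to 32 bits reflects that a 4-byte microsecond counter wraps roughly every 71 minutes, so a receiver should only ever compare differences, not absolute values.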
Selective Ack is a header extension with ID 1. When it is present, the First Extension ID in the header will be 1.
After the header, add a single byte that indicates the next extension ID (which should always be zero because there are no other extensions) and then another byte which specifies the length (in bytes) of this extension's data.
The data represents a bitfield that indicates which sequences have been received ahead of the currently acknowledged last contiguous data sequence. The bitfield's length must be a multiple of 4 bytes. The lowest bit of the first byte represents the currently acknowledged sequence + 2, because sequence + 1 is assumed not to have been received; otherwise the ack sequence number would have moved forward.
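As a rough sketch of the Selective Ack layout described above, the following builds and parses the extension bytes that follow the header. The function names are hypothetical. The text only pins down the meaning of the lowest bit of the first byte; this example assumes the remaining bits continue upward in the same order (bit 1 is ack + 3, and so on into the next byte), and that sequence numbers wrap at 16 bits because the field is 2 bytes wide.

```python
from typing import Iterable, List

# The length field is one byte and the bitfield must be a multiple of 4 bytes,
# so 252 bytes is the largest bitfield that can be expressed.
MAX_BITFIELD_BYTES = 252


def build_selective_ack(ack_nr: int, received_ahead: Iterable[int]) -> bytes:
    """Return next-extension byte + length byte + bitfield, or b'' if empty."""
    # Bit offset 0 stands for ack_nr + 2; arithmetic wraps at 16 bits.
    offsets = sorted(
        off
        for off in ((seq - ack_nr - 2) & 0xFFFF for seq in received_ahead)
        if off < MAX_BITFIELD_BYTES * 8
    )
    if not offsets:
        return b""
    n_bytes = offsets[-1] // 8 + 1
    n_bytes = (n_bytes + 3) // 4 * 4          # round up to a multiple of 4
    bits = bytearray(n_bytes)
    for off in offsets:
        bits[off // 8] |= 1 << (off % 8)      # lowest bit first (assumed order)
    # Next extension ID is 0 because no other extensions are defined.
    return bytes([0, n_bytes]) + bytes(bits)


def parse_selective_ack(ack_nr: int, bitfield: bytes) -> List[int]:
    """Return the sequence numbers marked as received in the bitfield."""
    acked = []
    for i, byte in enumerate(bitfield):
        for bit in range(8):
            if byte & (1 << bit):
                acked.append((ack_nr + 2 + i * 8 + bit) & 0xFFFF)
    return acked


# The example from the Acknowledgement Number field: 1, 2, 3 and 5 received,
# so the ack number is 3 and sequence 5 is bit 0 of the bitfield.
extension = build_selective_ack(3, {5})
```

With this layout, the sender would also set First Extension ID to 1 in the header whenever the returned bytes are non-empty.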
The time it takes for the remote side to acknowledge a data packet after its first transmission should be tracked continuously so that an accurate timeout threshold for retransmission can be calculated. There are many methods to do this, such as keeping the average highest round trip time over a period of a few seconds. It may also be useful to consider whether the packet has been skipped (i.e. a later packet has been acknowledged with a Selective Ack) and perhaps use a shorter timeout in that situation. It is ultimately an implementation choice, with various tradeoffs between false positives and fast loss detection and recovery.
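One possible reading of the "highest round trip time over a few seconds" approach is sketched below. Everything here is an assumption made for illustration: the class name, the 3-second horizon, the multiplier, the floor, and the initial timeout are not specified by the protocol, and a TCP-style smoothed estimator would be an equally valid choice.

```python
from collections import deque


class TimeoutEstimator:
    """Retransmission timeout from the highest recent RTT (illustrative only)."""

    def __init__(self, horizon=3.0, multiplier=2.0, floor=0.5, initial=1.0):
        self.horizon = horizon        # seconds of history to keep
        self.multiplier = multiplier  # safety margin over the highest RTT
        self.floor = floor            # never time out faster than this
        self.initial = initial        # used before any sample exists
        self.samples = deque()        # (time observed, rtt) pairs, oldest first

    def add_sample(self, now: float, rtt: float) -> None:
        # Only feed samples from packets acknowledged after their first
        # transmission; a retransmitted packet gives an ambiguous RTT.
        self.samples.append((now, rtt))
        self._expire(now)

    def _expire(self, now: float) -> None:
        while self.samples and now - self.samples[0][0] > self.horizon:
            self.samples.popleft()

    def timeout(self, now: float) -> float:
        self._expire(now)
        if not self.samples:
            return self.initial
        highest = max(rtt for _, rtt in self.samples)
        return max(self.floor, self.multiplier * highest)


est = TimeoutEstimator(horizon=3.0, multiplier=2.0, floor=0.1, initial=1.0)
est.add_sample(now=0.0, rtt=0.2)
est.add_sample(now=1.0, rtt=0.05)
```

Passing the clock in as `now` keeps the class independent of any particular timer source; an implementation that shortens the timeout for skipped packets could scale the returned value down for those packets.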
Data can be sent, unacknowledged, up to the size of the remote receive window, which is advertised in the remote's header. Out-of-sequence data packets that have been selectively acknowledged still take up space in the remote receive window. The window therefore specifies the maximum number of bytes that can be transmitted past the base acknowledged sequence number in the header.
A maximum send window should also be used internally to initially limit pipelining and then slowly increase the number of outstanding data packets allowed as acknowledgements are received. This send window must also slowly contract if there is nothing more to send.
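The interaction between the advertised remote window and an internal send window might look like the sketch below. The growth and contraction rules are placeholders chosen for simplicity (roughly one packet of growth per window of acknowledged data, one packet of shrinkage per idle tick), not something the protocol prescribes, and the class and method names are hypothetical.

```python
class SendWindow:
    """Limits bytes in flight by both the remote window and a local window."""

    def __init__(self, packet_size=1200, initial_packets=2, max_bytes=1 << 20):
        self.packet_size = packet_size
        self.min_bytes = initial_packets * packet_size
        self.max_bytes = max_bytes
        self.limit = self.min_bytes   # internal send window, in bytes
        # Bytes sent past the remote's cumulative ack number. Selectively
        # acked packets still count, since they occupy the remote buffer.
        self.in_flight = 0

    def can_send(self, size: int, remote_window: int) -> bool:
        return self.in_flight + size <= min(self.limit, remote_window)

    def on_send(self, size: int) -> None:
        self.in_flight += size

    def on_cumulative_ack(self, size: int) -> None:
        """Call when the remote's ack number advances past `size` bytes."""
        self.in_flight = max(0, self.in_flight - size)
        # Additive increase: about one packet per full window acknowledged.
        growth = self.packet_size * size // self.limit
        self.limit = min(self.max_bytes, self.limit + growth)

    def on_idle(self) -> None:
        """Call periodically while there is nothing more to send."""
        self.limit = max(self.min_bytes, self.limit - self.packet_size)


window = SendWindow(packet_size=1000, initial_packets=2, max_bytes=10_000)
window.on_send(1000)
window.on_send(1000)
blocked = window.can_send(1000, remote_window=5000)   # local window is full
window.on_cumulative_ack(1000)                        # window grows to 2500
```

A real implementation would also shrink the window on loss; that policy is left out here because the text leaves it to the implementer.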
An acknowledgement should be sent as soon as a data packet is received, either in a State packet or carried by a Data packet going the other way. There should be as little delay as possible before sending an acknowledgement, as any delay will affect the remote peer's RTT calculations and the rate of send window increase.
When a local socket receives a call to Send from the layers above, it should impose a very small delay (50-100 ms) before sending any data packets if there is not enough data to fill an entire packet, so that several small writes can be combined into one packet.
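The short send delay can be sketched as a small buffer with a deadline. The class name, the 1180-byte payload size (1200 bytes minus the 20-byte header), and the choice to start the timer when the first pending byte arrives are all assumptions for the example. The clock is passed in as `now` so the logic can be driven from any event loop.

```python
from typing import List, Optional


class SendCoalescer:
    """Holds back partial packets briefly so small writes can be merged."""

    def __init__(self, max_payload: int = 1180, delay: float = 0.05):
        self.max_payload = max_payload
        self.delay = delay                       # 50 ms, low end of 50-100 ms
        self.buffer = bytearray()
        self.deadline: Optional[float] = None    # when a partial packet may go

    def send(self, data: bytes, now: float) -> List[bytes]:
        """Queue data from the layer above; return payloads ready to send."""
        if not self.buffer and data:
            self.deadline = now + self.delay     # timer starts with first byte
        self.buffer += data
        return self.poll(now)

    def poll(self, now: float) -> List[bytes]:
        """Return full payloads immediately, partial ones after the delay."""
        ready = []
        while len(self.buffer) >= self.max_payload:
            ready.append(bytes(self.buffer[: self.max_payload]))
            del self.buffer[: self.max_payload]
        if self.buffer and self.deadline is not None and now >= self.deadline:
            ready.append(bytes(self.buffer))
            self.buffer.clear()
        if not self.buffer:
            self.deadline = None
        return ready


coalescer = SendCoalescer(max_payload=4, delay=0.05)
first = coalescer.send(b"ab", now=0.0)       # too small, held back
second = coalescer.send(b"cdef", now=0.01)   # one full payload is released
```

The caller would invoke `poll` from its timer loop to flush whatever is left once the deadline passes, and each returned payload would become one Data packet.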
There is no MTU discovery built into the protocol. It may be possible to use ICMP messages to do this but it is probably not practical. A simple 1200-byte maximum data packet size works well. In the very rare case where this is too large, the system's IP layer will fragment and reassemble the datagram.
Some implementations use the timestamps provided in the header to regulate the send window, but this method can easily be gamed by the remote peer through manipulation of the timestamp and timestamp difference header fields to artificially speed up and prioritize the remote socket.
For more insight into flow control algorithms and other implementation details, refer to the various well-known TCP congestion control algorithms and evaluate their various compromises and trade-offs.