ZeroTierOne/node/Switch.cpp
Adam Ierymenko 0e5651f353
1.12.0 merge to main (#2104)
* add note about forceTcpRelay

* Create a sample systemd unit for tcp proxy

* set gitattributes for rust & cargo so hashes dont conflict on Windows

* Revert "set gitattributes for rust & cargo so hashes dont conflict on Windows"

This reverts commit 032dc5c108195f6bbc2e224f00da5b785df4b7f9.

* Turn off autocrlf for rust source

Autocrlf doesn't appear to play well with git and vendored cargo package hashes

* Fix #1883 (#1886)

Still unknown as to why, but the call to `nc->GetProperties()` can fail
when setting a friendly name on the Windows virtual ethernet adapter.
Ensure that `ncp` is not null before continuing and accessing the device
GUID.

* Don't vendor packages for zeroidc (#1885)

* Added docker environment way to join networks (#1871)

* add StringUtils

* fix headers
use recommended headers and remove unused headers

* move extern "C"
only JNI functions need to be exported

* cleanup

* fix ANDROID-50: RESULT_ERROR_BAD_PARAMETER typo

* fix typo in log message

* fix typos in JNI method signatures

* fix typo

* fix ANDROID-51: fieldName is uninitialized

* fix ANDROID-35: memory leak

* fix missing DeleteLocalRef in loops

* update to use unique error codes

* add GETENV macro

* add LOG_TAG defines

* ANDROID-48: add ZT_jnicache.cpp

* ANDROID-48: use ZT_jnicache.cpp and remove ZT_jnilookup.cpp and ZT_jniarray.cpp

* add Event.fromInt

* add PeerRole.fromInt

* add ResultCode.fromInt

* fix ANDROID-36: issues with ResultCode

* add VirtualNetworkConfigOperation.fromInt

* fix ANDROID-40: VirtualNetworkConfigOperation out-of-sync with ZT_VirtualNetworkConfigOperation enum

* add VirtualNetworkStatus.fromInt

* fix ANDROID-37: VirtualNetworkStatus out-of-sync with ZT_VirtualNetworkStatus enum

* add VirtualNetworkType.fromInt

* make NodeStatus a plain data class

* fix ANDROID-52: synchronization bug with nodeMap

* Node init work: separate Node construction and init

* add Node.toString

* make PeerPhysicalPath a plain data class

* remove unused PeerPhysicalPath.fixed

* add array functions

* make Peer a plain data class

* make Version a plain data class

* fix ANDROID-42: copy/paste error

* fix ANDROID-49: VirtualNetworkConfig.equals is wrong

* reimplement VirtualNetworkConfig.equals

* reimplement VirtualNetworkConfig.compareTo

* add VirtualNetworkConfig.hashCode

* make VirtualNetworkConfig a plain data class

* remove unused VirtualNetworkConfig.enabled

* reimplement VirtualNetworkDNS.equals

* add VirtualNetworkDNS.hashCode

* make VirtualNetworkDNS a plain data class

* reimplement VirtualNetworkRoute.equals

* reimplement VirtualNetworkRoute.compareTo

* reimplement VirtualNetworkRoute.toString

* add VirtualNetworkRoute.hashCode

* make VirtualNetworkRoute a plain data class

* add isSocketAddressEmpty

* add addressPort

* add fromSocketAddressObject

* invert logic in a couple of places and return early

* newInetAddress and newInetSocketAddress work
allow newInetSocketAddress to return NULL if given empty address

* fix ANDROID-38: stack corruption in onSendPacketRequested

* use GETENV macro

* JniRef work
JniRef does not use callbacks struct, so remove
fix NewGlobalRef / DeleteGlobalRef mismatch

* use PRId64 macros

* switch statement work

* comments and logging

* Modifier 'public' is redundant for interface members

* NodeException can be made a checked Exception

* 'NodeException' does not define a 'serialVersionUID' field

* 'finalize()' should not be overridden
this is fine to do because ZeroTierOneService calls close() when it is done

* error handling, error reporting, asserts, logging

* simplify loadLibrary

* rename Node.networks -> Node.networkConfigs

* Windows file permissions fix (#1887)

* Allow macOS interfaces to use multiple IP addresses (#1879)

Co-authored-by: Sean OMeara <someara@users.noreply.github.com>
Co-authored-by: Grant Limberg <glimberg@users.noreply.github.com>

* Fix condition where full HELLOs might not be sent when necessary (#1877)

Co-authored-by: Grant Limberg <glimberg@users.noreply.github.com>

* 1.10.4 version bumps

* Add security policy to repo (#1889)

* [+] add e2k64 arch (#1890)

* temp fix for ANDROID-56: crash inside newNetworkConfig from too many args

* 1.10.4 release notes

* Windows 1.10.4 Advanced Installer bump

* Revert "temp fix for ANDROID-56: crash inside newNetworkConfig from too many args"

This reverts commit dd627cd7f44ad623a110bb14f72d0bea72a09e30.

* actual fix for ANDROID-56: crash inside newNetworkConfig
cast all arguments to varargs functions as good style

* Fix addIp being called with applied ips (#1897)

This was getting called outside of the check for existing IPs
because of the added ifdef and a brace that got moved to the
wrong place.

```
if (! n.tap()->addIp(*ip)) {
	fprintf(stderr, "ERROR: unable to add ip address %s" ZT_EOL_S, ip->toString(ipbuf));
}
WinFWHelper::newICMPRule(*ip, n.config().nwid);

```

* 1.10.5 (#1905)

* 1.10.5 bump

* 1.10.5 for Windows

* 1.10.5

* Prevent path-learning loops (#1914)

* Prevent path-learning loops

* Only allow new overwrite if not bonded

* fix binding temporary ipv6 addresses on macos (#1910)

The check code wasn't running.

I don't know why !defined(TARGET_OS_IOS) would exclude code on
desktop macOS. I did a quick search and changed it to defined(TARGET_OS_MAC).
Not 100% sure what the most correct solution there is.

You can verify the old and new versions with

`ifconfig | grep temporary`

plus

`zerotier-cli info -j` -> listeningOn

* 1.10.6 (#1929)

* 1.10.5 bump

* 1.10.6

* 1.10.6 AIP for Windows.

* Release notes for 1.10.6 (#1931)

* Minor tweak to Synology Docker image script (#1936)

* Change if_def again so ios can build (#1937)

All apple's variables are "defined"
but sometimes they are defined as "0"
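
The difference between "defined" and "defined as non-zero" can be sketched as follows. This is only an illustration: `DEMO_TARGET_IOS` is a hypothetical stand-in for a platform macro, not one of Apple's actual variables, and the helper names are made up for the example.

```cpp
#include <cassert>

// Hypothetical stand-in for a platform macro (assumption, not a real Apple macro).
// It is defined, but its value is 0, meaning "not this platform".
#define DEMO_TARGET_IOS 0

// True when the macro merely exists, regardless of its value.
constexpr bool demoIsDefined()
{
#if defined(DEMO_TARGET_IOS)
	return true;
#else
	return false;
#endif
}

// True only when the macro exists and evaluates to non-zero.
constexpr bool demoIsEnabled()
{
#if defined(DEMO_TARGET_IOS) && DEMO_TARGET_IOS
	return true;
#else
	return false;
#endif
}

// Usage: the two checks disagree for a macro that is defined as 0.
inline void demoMacroUsage()
{
	assert(demoIsDefined());
	assert(!demoIsEnabled());
}
```

A check written as `!defined(X)` therefore excludes code on every platform where `X` exists at all, even when its value is 0.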

* move begin/commit into try/catch block (#1932)

Thread was exiting in some cases

* Bump openssl from 0.10.45 to 0.10.48 in /zeroidc (#1938)

Bumps [openssl](https://github.com/sfackler/rust-openssl) from 0.10.45 to 0.10.48.
- [Release notes](https://github.com/sfackler/rust-openssl/releases)
- [Commits](https://github.com/sfackler/rust-openssl/compare/openssl-v0.10.45...openssl-v0.10.48)

---
updated-dependencies:
- dependency-name: openssl
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* new drone bits

* Fix multiple network join from environment entrypoint.sh.release (#1961)

* _bond_m guards _bond, not _paths_m (#1965)

* Fix: warning: mutex '_aqm_m' is not held on every path through here [-Wthread-safety-analysis] (#1964)

* Bump h2 from 0.3.16 to 0.3.17 in /zeroidc (#1963)

Bumps [h2](https://github.com/hyperium/h2) from 0.3.16 to 0.3.17.
- [Release notes](https://github.com/hyperium/h2/releases)
- [Changelog](https://github.com/hyperium/h2/blob/master/CHANGELOG.md)
- [Commits](https://github.com/hyperium/h2/compare/v0.3.16...v0.3.17)

---
updated-dependencies:
- dependency-name: h2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Grant Limberg <glimberg@users.noreply.github.com>

* Add note that binutils is required on FreeBSD (#1968)

* Add prometheus metrics for Central controllers (#1969)

* add header-only prometheus lib to ext

* rename folder

* Undo rename directory

* prometheus simpleapi included on mac & linux

* wip

* wire up some controller stats

* Get windows building with prometheus

* bsd build flags for prometheus

* Serve prom metrics from /metrics endpoint

* Add prom metrics for Central controller specific things

* reorganize metric initialization

* testing out a labled gauge on Networks

* increment error counter on throw

* Consolidate metrics definitions

Put all metric definitions into node/Metrics.hpp.  Accessed as needed
from there.

* Revert "testing out a labled gauge on Networks"

This reverts commit 499ed6d95e11452019cdf48e32ed4cd878c2705b.

* still blows up but adding to the record for completeness right now

* Fix runtime issues with metrics

* Add metrics files to visual studio project

* Missed an "extern"

* add copyright headers to new files

* Add metrics for sent/received bytes (total)

* put /metrics endpoint behind auth

* sendto returns int on Win32

---------

Co-authored-by: Leonardo Amaral <leleobhz@users.noreply.github.com>
Co-authored-by: Brenton Bostick <bostick@gmail.com>

* Central startup update (#1973)

* allow specifying authtoken in central startup

* set allowManagedFrom

* move redis_mem_notification to the correct place

* add node checkins metric

* wire up min/max connection pool size metrics

* x86_64-unknown-linux-gnu on ubuntu runner (#1975)

* adding incoming zt packet type metrics (#1976)

* use cpp-httplib for HTTP control plane (#1979)

Refactored the old control plane code to use [cpp-httplib](https://github.com/yhirose/cpp-httplib) instead of a hand-rolled HTTP server. This makes the control plane code much more legible, and it no longer randomly stops responding.

* Outgoing Packet Metrics (#1980)

add tx/rx labels to packet counters and add metrics for outgoing packets

* Add short-term validation test workflow (#1974)

Add short-term validation test workflow

* Brenton/curly braces (#1971)

* fix formatting

* properly adjust various lines
breakup multiple statements onto multiple lines

* insert {} around if, for, etc.

* Fix rust dependency caching (#1983)

* fun with rust caching

* kick

* comment out invalid yaml keys for now

* Caching should now work

* re-add/rename key directives

* bump

* bump

* bump

* Don't force rebuild on Windows build GH Action (#1985)

Switching `/t:ZeroTierOne:Rebuild` to just `/t:ZeroTierOne` allows the Windows build to use the rust cache.  `/t:ZeroTierOne:Rebuild` cleared the cache before building.

* More packet metrics (#1982)

* found path negotiation sends that weren't accounted for

* Fix histogram so it will actually compile

* Found more places for packet metrics

* separate the bind & listen calls on the http backplane (#1988)

* fix memory leak (#1992)

* fix a couple of metrics (#1989)

* More aggressive CLI spamming (#1993)

* fix type signatures (#1991)

* Network-metrics (#1994)

* Add a couple quick functions for converting a uint64_t network ID/node ID into std::string

* Network metrics

* Peer metrics (#1995)

* Adding peer metrics

still need to be wired up for use

* per peer packet metrics

* Fix crash from bad instantiation of histogram

* separate alive & dead path counts

* Add peer metric update block

* add peer latency values in doPingAndKeepalive

* prevent deadlock

* peer latency histogram actually works now

* cleanup

* capture counts of packets to specific peers

---------

Co-authored-by: Joseph Henry <joseph.henry@zerotier.com>

* Metrics consolidation (#1997)

* Rename zt_packet_incoming -> zt_packet

Also consolidate zt_peer_packets into a single metric with tx and rx labels.  Same for ztc_tcp_data and ztc_udp_data

* Further collapse tcp & udp into metric labels for zt_data

* Fix zt_data metric description

* zt_peer_packets description fix

* Consolidate incoming/outgoing network packets to a single metric

* zt_incoming_packet_error -> zt_packet_error

* Disable peer metrics for central controllers

Can change in the future if needed, but given the traffic our controllers serve, that's going to be a *lot* of data

* Disable peer metrics for controllers pt 2

* Update readme files for metrics (#2000)

* Controller Metrics & Network Config Request Fix (#2003)

* add new metrics for network config request queue size and sso expirations
* move sso expiration to its own thread in the controller
* fix potential undefined behavior when modifying a set
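
The commit does not say which set modification was undefined. One common form of the problem, shown here purely as an assumption, is erasing from a `std::set` while iterating over it and then advancing the invalidated iterator. A minimal sketch of the safe idiom, with hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper: removes every timestamp <= now from `expirations`
// and returns how many entries were removed.
inline std::size_t demoExpire(std::set<int64_t> &expirations, int64_t now)
{
	std::size_t removed = 0;
	for (auto it = expirations.begin(); it != expirations.end(); ) {
		if (*it <= now) {
			// erase() returns the next valid iterator; reusing `it` after
			// erasing it would be undefined behavior.
			it = expirations.erase(it);
			++removed;
		} else {
			++it;
		}
	}
	return removed;
}

// Usage: two of the three entries have expired at time 5.
inline void demoExpireUsage()
{
	std::set<int64_t> s{1, 5, 10};
	assert(demoExpire(s, 5) == 2);
	assert(s.size() == 1);
}
```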

* Enable RTTI in Windows build

The new prometheus histogram stuff needs it.

Access violation - no RTTI data!INVALID packet 636ebd9ee8cac6c0 from cafe9efeb9(2605:9880:200:1200:30:571:e34:51/9993) (unexpected exception in tryDecode())

* Don't re-apply routes on BSD

See issue #1986

* Capture setContent by-value instead of by-reference (#2006)

Co-authored-by: Grant Limberg <glimberg@users.noreply.github.com>
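
The general hazard behind the by-value capture change can be sketched as below. The names are hypothetical and this is not the control plane code itself; it only shows why a callback that outlives its enclosing scope should copy what it captures.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical factory: returns a callback that outlives the local `content`.
inline std::function<std::string()> demoMakeHandler(const std::string &prefix)
{
	std::string content = prefix + " ok";
	// [content] copies the string into the closure, so the callback stays valid.
	// [&content] would leave a dangling reference once this function returns.
	return [content]() { return content; };
}

// Usage: the callback is invoked after demoMakeHandler has returned.
inline void demoHandlerUsage()
{
	const std::function<std::string()> handler = demoMakeHandler("200");
	assert(handler() == "200 ok");
}
```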

* fix typos (#2010)

* central controller metrics & request path updates (#2012)

* internal db metrics

* use shared mutexes for read/write locks

* remove this lock. only used for a metric

* more metrics

* remove exploratory metrics

place controller request benchmarks behind ifdef

* Improve validation test (#2013)

* fix init order for EmbeddedNetworkController (#2014)

* add constant for getifaddrs cache time

* cache getifaddrs - mac

* cache getifaddrs - linux

* cache getifaddrs - bsd

* cache getifaddrs - windows

* Fix oidc client lookup query

join condition referenced the wrong table.  Worked fine unless there were multiple identical client IDs

* Fix udp sent metric

was only incrementing by 1 for each packet sent
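
A minimal sketch of the distinction, assuming a simple counter type. `DemoCounter` and both helpers are hypothetical and do not come from node/Metrics.hpp.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a monotonically increasing metrics counter.
struct DemoCounter {
	uint64_t value = 0;
	DemoCounter &operator++() { ++value; return *this; }
	DemoCounter &operator+=(uint64_t n) { value += n; return *this; }
};

// Counts packets: one increment per send, which undercounts a byte metric.
inline void demoRecordPacket(DemoCounter &c)
{
	++c;
}

// Counts bytes: adds the number of bytes the send call reported.
inline void demoRecordBytes(DemoCounter &c, long sent)
{
	if (sent > 0) { // a failed send (negative result) adds nothing
		c += static_cast<uint64_t>(sent);
	}
}

// Usage: one 1400-byte datagram is one packet but 1400 bytes.
inline void demoCounterUsage()
{
	DemoCounter packets, bytes;
	demoRecordPacket(packets);
	demoRecordBytes(bytes, 1400);
	assert(packets.value == 1);
	assert(bytes.value == 1400);
}
```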

* Allow sending all surface addresses to peer in low-bandwidth mode

* allow enabling of low bandwidth mode on controllers

* don't unborrow bad connections

pool will clean them up later

* Multi-arch controller container (#2037)

create arm64 & amd64 images for central controller

* Update README.md

issue #2009

* docker tags change

* fix oidc auth url memory leak (#2031)

getAuthURL() was not calling zeroidc::free_cstr(url);

In the only place getAuthURL is called, the url can be retrieved
from the network config instead.

You could alternatively copy the string and call free_cstr in getAuthURL.
If that's better we can change the PR.

Since now there are no callers of getAuthURL I deleted it.

Co-authored-by: Grant Limberg <glimberg@users.noreply.github.com>

* Bump openssl from 0.10.48 to 0.10.55 in /zeroidc (#2034)

Bumps [openssl](https://github.com/sfackler/rust-openssl) from 0.10.48 to 0.10.55.
- [Release notes](https://github.com/sfackler/rust-openssl/releases)
- [Commits](https://github.com/sfackler/rust-openssl/compare/openssl-v0.10.48...openssl-v0.10.55)

---
updated-dependencies:
- dependency-name: openssl
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Grant Limberg <glimberg@users.noreply.github.com>

* zeroidc cargo warnings (#2029)

* fix unused struct member cargo warning

* fix unused import cargo warning

* fix unused return value cargo warning

---------

Co-authored-by: Grant Limberg <glimberg@users.noreply.github.com>

* fix memory leak in macos ipv6/dns helper (#2030)

Co-authored-by: Grant Limberg <glimberg@users.noreply.github.com>

* Consider ZEROTIER_JOIN_NETWORKS in healthcheck (#1978)

* Add a 2nd auth token only for access to /metrics (#2043)

* Add a 2nd auth token for /metrics

Allows administrators to distribute a token that only has access to read
metrics and nothing else.

Also added support for using bearer auth tokens for both types of tokens

Separate endpoint for metrics #2041

* Update readme

* fix a couple of cases of writing the wrong token

* Add warning to cli for allow default on FreeBSD

It doesn't work.
Not possible to fix with deficient network
stack and APIs.

ZeroTierOne-freebsd # zerotier-cli set 9bee8941b5xxxxxx allowDefault=1
400 set Allow Default does not work properly on FreeBSD. See #580
root@freebsd13-a:~/ZeroTierOne-freebsd # zerotier-cli get 9bee8941b5xxxxxx allowDefault
1

* ARM64 Support for TapDriver6 (#1949)

* Release memory previously allocated by UPNP_GetValidIGD

* Fix ifdef that breaks libzt on iOS (#2050)

* less drone (#2060)

* Exit if loading an invalid identity from disk (#2058)

* Exit if loading an invalid identity from disk

Previously, if an invalid identity was loaded from disk, ZeroTier would
generate a brand new identity and chug along as if nothing happened.
This introduces the possibility of key material loss, especially when
running in containers where the identity files are mounted read only.
In this case, ZT will continue chugging along with a brand new identity
with no possibility of recovering the private key.

ZeroTier should exit upon loading of invalid identity.public/identity.secret #2056

* add validation test for #2056

* tcp-proxy: fix build

* Adjust tcp-proxy makefile to support metrics

There's no way to get the metrics yet. Someone will
have to add the http service.

* remove ZT_NO_METRIC ifdef

* Implement recvmmsg() for Linux to reduce syscalls. (#2046)

Between 5% and 40% speed improvement on Linux, depending on system configuration and load.
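
A minimal, Linux-only sketch of batched receive with `recvmmsg()`. The batch size, buffer size, and all `Demo*` names are assumptions for illustration and are not taken from the ZeroTier implementation.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // recvmmsg() is a GNU/Linux extension
#endif
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

constexpr unsigned int DEMO_BATCH = 8;      // assumed batch size
constexpr std::size_t DEMO_BUF_SIZE = 2048; // assumed per-datagram buffer

struct DemoBatch {
	char data[DEMO_BATCH][DEMO_BUF_SIZE];
	unsigned int len[DEMO_BATCH];
};

// Reads up to DEMO_BATCH datagrams with a single system call.
// Returns the number of datagrams received, or -1 on error
// (including when nothing is queued, since MSG_DONTWAIT is used).
inline int demoReceiveBatch(int fd, DemoBatch &b)
{
	assert(fd >= 0);
	struct mmsghdr msgs[DEMO_BATCH];
	struct iovec iov[DEMO_BATCH];
	std::memset(msgs, 0, sizeof(msgs));
	for (unsigned int i = 0; i < DEMO_BATCH; ++i) {
		iov[i].iov_base = b.data[i];
		iov[i].iov_len = DEMO_BUF_SIZE;
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}
	const int n = recvmmsg(fd, msgs, DEMO_BATCH, MSG_DONTWAIT, nullptr);
	for (int i = 0; i < n; ++i) {
		b.len[i] = msgs[i].msg_len; // bytes received for datagram i
	}
	return n;
}
```

With three datagrams queued on a socket, one call returns 3 where a `recvfrom()` loop would need three system calls.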

* suppress warnings: comparison of integers of different signs: 'int64_t' (aka 'long') and 'uint64_t' (aka 'unsigned long') [-Wsign-compare] (#2063)

* fix warning: 'OS_STRING' macro redefined [-Wmacro-redefined] (#2064)

Even though this is in ext, these particular chunks of code were added
by us, so are ok to modify.

* Apply default route a different way - macOS

The original way we applied the default route, by splitting
0.0.0.0/0 into 0/1 and 128/1, works, but if macOS has any networking
hiccups (if you change SSIDs or sleep/wake), macOS erases the system default route.
And then all networking on the computer is broken.

To summarize the new way:
allowDefault=1
```
sudo route delete default 192.168.82.1
sudo route add default 10.2.0.2
sudo route add -ifscope en1 default 192.168.82.1
```

gives us this routing table
```
Destination        Gateway            RT_IFA             Flags        Refs      Use    Mtu          Netif Expire    rtt(ms) rttvar(ms)
default            10.2.0.2           10.2.0.18          UGScg          90        1   2800       feth4823
default            192.168.82.1       192.168.82.217     UGScIg
```

allowDefault=0
```
sudo route delete default
sudo route delete -ifscope en1 default
sudo route add default 192.168.82.1
```

Notice the I flag, for -ifscope, on the physical default route.

route change does not seem to work reliably.

* fix docker tag for controllers (#2066)

* Update build.sh (#2068)

fix mkwork compilation errors

* Fix network DNS on macOS

It stopped working for ipv4 only networks in Monterey.
See #1696

We add some config like so to System Configuration

```
scutil
show State:/Network/Service/9bee8941b5xxxxxx/IPv4
<dictionary> {
  Addresses : <array> {
    0 : 10.2.1.36
  }
  InterfaceName : feth4823
  Router : 10.2.1.36
  ServerAddress : 127.0.0.1
}

```

* Add search domain to macos dns configuration

Stumbled upon this while debugging something else.
If we add search domain to our system configuration for
network DNS, then search domains work:

```
ping server1
PING server1.my.domain (10.123.3.1): 56 data bytes
64 bytes from 10.123.3.1
```

* Fix reporting of secondaryPort and tertiaryPort See: #2039

* Fix typos (#2075)

* Disable executable stacks on assembly objects (#2071)

Add `--noexecstack` to the assembler flags so the resulting binary
will link with a non-executable stack.

Fixes zerotier/ZeroTierOne#1179

Co-authored-by: Joseph Henry <joseph.henry@zerotier.com>

* Test that starting zerotier before internet works

* Don't skip hellos when there are no paths available

working on #2082

* Update validate-1m-linux.sh

* Save zt node log files on abort

* Separate test and summary step in validator script

* Don't apply default route until zerotier is "online"

I was running into issues with restarting the zerotier service while
"full tunnel" mode is enabled.
When zerotier first boots, it gets network state from the cache
on disk, so it immediately applies all the routes it knew about
before it shut down.
The network config may have changed in the meantime.
If it has, then your default route is via a route
you are blocked from talking on. So you can't get the current
network config, and your internet does not work.

Other options include
- don't use cached network state on boot
- find a better criterion than "online"

* Fix node time-to-online counter in validator script

* Export variables so that they are accessible by exit function

* Fix PortMapper issue on ZeroTier startup

See issue #2082

We use a call to libnatpmp's initnatpmp to make sure the computer
has working network sockets before we go into the main
nat-pmp/upnp logic.

Retries use a basic exponential delay of up to 30 seconds.
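
A capped exponential delay of this kind might look like the sketch below. The function name, the 1-second starting delay, and the doubling factor are assumptions; only the 30-second cap comes from the note above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: seconds to wait before retry number `attempt`
// (0-based). The delay doubles each time and is capped at `maxSeconds`.
inline uint32_t demoBackoffSeconds(uint32_t attempt, uint32_t maxSeconds = 30)
{
	assert(maxSeconds > 0);
	// Clamp the shift so the 64-bit shift below cannot overflow.
	const uint32_t shift = std::min<uint32_t>(attempt, 31);
	const uint64_t delay = uint64_t(1) << shift;
	return static_cast<uint32_t>(std::min<uint64_t>(delay, maxSeconds));
}
```

Under these assumptions the delays run 1, 2, 4, 8, 16 and then stay at 30 seconds for every later attempt.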

* testing

* Comment out PortMapper debug

this got left turned on in a confusing merge previously

* fix macos default route again

See commit fb6af1971 * Fix network DNS on macOS
Adding that stuff to System Config causes this extra route to be added,
which breaks the ipv4 default route.
We figured out a weird System Configuration setting
that works.

--- old
couldn't figure out how to fix it in SystemConfiguration
so here we are

We also moved the dns setter to before the syncIps stuff
to help with a race condition. It didn't always work when
you re-joined a network with default route enabled.

* Catch all conditions in switch statement, remove trailing whitespaces

* Add setmtu command, fix bond lifetime issue

* Basic cleanups

* Check if null is passed to VirtualNetworkConfig.equals and name fixes

* ANDROID-96: Simplify and use return code from node_init directly

* Windows arm64 (#2099)

* ARM64 changes for 1.12

* 1.12 Windows advanced installer updates and updates for ARM64

* 1.12.0

* Linux build fixes for old distros.

* release notes

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: travis laduke <travisladuke@gmail.com>
Co-authored-by: Grant Limberg <grant.limberg@zerotier.com>
Co-authored-by: Grant Limberg <glimberg@users.noreply.github.com>
Co-authored-by: Leonardo Amaral <leleobhz@users.noreply.github.com>
Co-authored-by: Brenton Bostick <bostick@gmail.com>
Co-authored-by: Sean OMeara <someara@users.noreply.github.com>
Co-authored-by: Joseph Henry <joseph-henry@users.noreply.github.com>
Co-authored-by: Roman Peshkichev <roman.peshkichev@gmail.com>
Co-authored-by: Joseph Henry <joseph.henry@zerotier.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Stavros Kois <47820033+stavros-k@users.noreply.github.com>
Co-authored-by: Jake Vis <jakevis@outlook.com>
Co-authored-by: Jörg Thalheim <joerg@thalheim.io>
Co-authored-by: lison <imlison@foxmail.com>
Co-authored-by: Kenny MacDermid <kenny@macdermid.ca>
2023-08-23 14:24:21 -04:00

1224 lines
42 KiB
C++

/*
* Copyright (c)2013-2020 ZeroTier, Inc.
*
* Use of this software is governed by the Business Source License included
* in the LICENSE.TXT file in the project's root directory.
*
* Change Date: 2025-01-01
*
* On the date above, in accordance with the Business Source License, use
* of this software will be governed by version 2.0 of the Apache License.
*/
/****/
#include <stdio.h>
#include <stdlib.h>
#include <algorithm>
#include <utility>
#include <stdexcept>
#include "../version.h"
#include "../include/ZeroTierOne.h"
#include "Constants.hpp"
#include "RuntimeEnvironment.hpp"
#include "Switch.hpp"
#include "Node.hpp"
#include "InetAddress.hpp"
#include "Topology.hpp"
#include "Peer.hpp"
#include "SelfAwareness.hpp"
#include "Packet.hpp"
#include "Trace.hpp"
#include "Metrics.hpp"
namespace ZeroTier {

Switch::Switch(const RuntimeEnvironment *renv) :
	RR(renv),
	_lastBeaconResponse(0),
	_lastCheckedQueues(0),
	_lastUniteAttempt(8) // only really used on root servers and upstreams, and it'll grow there just fine
{
}

// Returns true if packet appears valid; pos and proto will be set
static bool _ipv6GetPayload(const uint8_t *frameData,unsigned int frameLen,unsigned int &pos,unsigned int &proto)
{
	if (frameLen < 40) {
		return false;
	}
	pos = 40;
	proto = frameData[6];
	while (pos <= frameLen) {
		switch(proto) {
			case 0: // hop-by-hop options
			case 43: // routing
			case 60: // destination options
			case 135: // mobility options
				if ((pos + 8) > frameLen) {
					return false; // invalid!
				}
				proto = frameData[pos];
				pos += ((unsigned int)frameData[pos + 1] * 8) + 8;
				break;
			//case 44: // fragment -- we currently can't parse these and they are deprecated in IPv6 anyway
			//case 50:
			//case 51: // IPSec ESP and AH -- we have to stop here since this is encrypted stuff
			default:
				return true;
		}
	}
	return false; // overflow == invalid
}

void Switch::onRemotePacket(void *tPtr,const int64_t localSocket,const InetAddress &fromAddr,const void *data,unsigned int len)
{
	int32_t flowId = ZT_QOS_NO_FLOW;
	try {
		const int64_t now = RR->node->now();
		const SharedPtr<Path> path(RR->topology->getPath(localSocket,fromAddr));
		path->received(now);
		if (len == 13) {
			/* LEGACY: before VERB_PUSH_DIRECT_PATHS, peers used broadcast
			 * announcements on the LAN to solve the 'same network problem.' We
			 * no longer send these, but we'll listen for them for a while to
			 * locate peers with versions <1.0.4. */
			const Address beaconAddr(reinterpret_cast<const char *>(data) + 8,5);
			if (beaconAddr == RR->identity.address()) {
				return;
			}
			if (!RR->node->shouldUsePathForZeroTierTraffic(tPtr,beaconAddr,localSocket,fromAddr)) {
				return;
			}
			const SharedPtr<Peer> peer(RR->topology->getPeer(tPtr,beaconAddr));
			if (peer) { // we'll only respond to beacons from known peers
				if ((now - _lastBeaconResponse) >= 2500) { // limit rate of responses
					_lastBeaconResponse = now;
					Packet outp(peer->address(),RR->identity.address(),Packet::VERB_NOP);
					outp.armor(peer->key(),true,peer->aesKeysIfSupported());
					Metrics::pkt_nop_out++;
					path->send(RR,tPtr,outp.data(),outp.size(),now);
				}
			}
		} else if (len > ZT_PROTO_MIN_FRAGMENT_LENGTH) { // SECURITY: min length check is important since we do some C-style stuff below!
			if (reinterpret_cast<const uint8_t *>(data)[ZT_PACKET_FRAGMENT_IDX_FRAGMENT_INDICATOR] == ZT_PACKET_FRAGMENT_INDICATOR) {
				// Handle fragment ----------------------------------------------------
				Packet::Fragment fragment(data,len);
				const Address destination(fragment.destination());
				if (destination != RR->identity.address()) {
					if ( (!RR->topology->amUpstream()) && (!path->trustEstablished(now)) ) {
						return;
					}
					if (fragment.hops() < ZT_RELAY_MAX_HOPS) {
						fragment.incrementHops();
						// Note: we don't bother initiating NAT-t for fragments, since heads will set that off.
						// It wouldn't hurt anything, just redundant and unnecessary.
						SharedPtr<Peer> relayTo = RR->topology->getPeer(tPtr,destination);
						if ((!relayTo)||(!relayTo->sendDirect(tPtr,fragment.data(),fragment.size(),now,false))) {
							// Don't know peer or no direct path -- so relay via someone upstream
							relayTo = RR->topology->getUpstreamPeer();
							if (relayTo) {
								relayTo->sendDirect(tPtr,fragment.data(),fragment.size(),now,true);
							}
						}
					}
				} else {
					// Fragment looks like ours
					const uint64_t fragmentPacketId = fragment.packetId();
					const unsigned int fragmentNumber = fragment.fragmentNumber();
					const unsigned int totalFragments = fragment.totalFragments();
					if ((totalFragments <= ZT_MAX_PACKET_FRAGMENTS)&&(fragmentNumber < ZT_MAX_PACKET_FRAGMENTS)&&(fragmentNumber > 0)&&(totalFragments > 1)) {
						// Fragment appears basically sane. Its fragment number must be
						// 1 or more, since a Packet with fragmented bit set is fragment 0.
						// Total fragments must be more than 1, otherwise why are we
						// seeing a Packet::Fragment?
						RXQueueEntry *const rq = _findRXQueueEntry(fragmentPacketId);
						Mutex::Lock rql(rq->lock);
						if (rq->packetId != fragmentPacketId) {
							// No packet found, so we received a fragment without its head.
							rq->flowId = flowId;
							rq->timestamp = now;
							rq->packetId = fragmentPacketId;
							rq->frags[fragmentNumber - 1] = fragment;
							rq->totalFragments = totalFragments; // total fragment count is known
							rq->haveFragments = 1 << fragmentNumber; // we have only this fragment
							rq->complete = false;
						} else if (!(rq->haveFragments & (1 << fragmentNumber))) {
							// We have other fragments and maybe the head, so add this one and check
							rq->frags[fragmentNumber - 1] = fragment;
							rq->totalFragments = totalFragments;
							if (Utils::countBits(rq->haveFragments |= (1 << fragmentNumber)) == totalFragments) {
								// We have all fragments -- assemble and process full Packet
								for(unsigned int f=1;f<totalFragments;++f) {
									rq->frag0.append(rq->frags[f - 1].payload(),rq->frags[f - 1].payloadLength());
								}
								if (rq->frag0.tryDecode(RR,tPtr,flowId)) {
									rq->timestamp = 0; // packet decoded, free entry
								} else {
									rq->complete = true; // set complete flag but leave entry since it probably needs WHOIS or something
								}
							}
						} // else this is a duplicate fragment, ignore
					}
				}
				// --------------------------------------------------------------------
			} else if (len >= ZT_PROTO_MIN_PACKET_LENGTH) { // min length check is important!
				// Handle packet head -------------------------------------------------
				const Address destination(reinterpret_cast<const uint8_t *>(data) + 8,ZT_ADDRESS_LENGTH);
				const Address source(reinterpret_cast<const uint8_t *>(data) + 13,ZT_ADDRESS_LENGTH);
				if (source == RR->identity.address()) {
					return;
				}
				if (destination != RR->identity.address()) {
					if ( (!RR->topology->amUpstream()) && (!path->trustEstablished(now)) && (source != RR->identity.address()) ) {
						return;
					}
					Packet packet(data,len);
					if (packet.hops() < ZT_RELAY_MAX_HOPS) {
						packet.incrementHops();
						SharedPtr<Peer> relayTo = RR->topology->getPeer(tPtr,destination);
						if ((relayTo)&&(relayTo->sendDirect(tPtr,packet.data(),packet.size(),now,false))) {
							if ((source != RR->identity.address())&&(_shouldUnite(now,source,destination))) {
								const SharedPtr<Peer> sourcePeer(RR->topology->getPeer(tPtr,source));
								if (sourcePeer) {
									relayTo->introduce(tPtr,now,sourcePeer);
								}
							}
						} else {
							relayTo = RR->topology->getUpstreamPeer();
							if ((relayTo)&&(relayTo->address() != source)) {
								if (relayTo->sendDirect(tPtr,packet.data(),packet.size(),now,true)) {
									const SharedPtr<Peer> sourcePeer(RR->topology->getPeer(tPtr,source));
									if (sourcePeer) {
										relayTo->introduce(tPtr,now,sourcePeer);
									}
								}
							}
						}
					}
				} else if ((reinterpret_cast<const uint8_t *>(data)[ZT_PACKET_IDX_FLAGS] & ZT_PROTO_FLAG_FRAGMENTED) != 0) {
					// Packet is the head of a fragmented packet series
					const uint64_t packetId = (
						(((uint64_t)reinterpret_cast<const uint8_t *>(data)[0]) << 56) |
						(((uint64_t)reinterpret_cast<const uint8_t *>(data)[1]) << 48) |
						(((uint64_t)reinterpret_cast<const uint8_t *>(data)[2]) << 40) |
						(((uint64_t)reinterpret_cast<const uint8_t *>(data)[3]) << 32) |
						(((uint64_t)reinterpret_cast<const uint8_t *>(data)[4]) << 24) |
						(((uint64_t)reinterpret_cast<const uint8_t *>(data)[5]) << 16) |
						(((uint64_t)reinterpret_cast<const uint8_t *>(data)[6]) << 8) |
						((uint64_t)reinterpret_cast<const uint8_t *>(data)[7])
					);
					RXQueueEntry *const rq = _findRXQueueEntry(packetId);
					Mutex::Lock rql(rq->lock);
					if (rq->packetId != packetId) {
						// If we have no other fragments yet, create an entry and save the head
						rq->flowId = flowId;
						rq->timestamp = now;
						rq->packetId = packetId;
						rq->frag0.init(data,len,path,now);
						rq->totalFragments = 0;
						rq->haveFragments = 1;
						rq->complete = false;
					} else if (!(rq->haveFragments & 1)) {
						// If we have other fragments but no head, see if we are complete with the head
						if ((rq->totalFragments > 1)&&(Utils::countBits(rq->haveFragments |= 1) == rq->totalFragments)) {
							// We have all fragments -- assemble and process full Packet
							rq->frag0.init(data,len,path,now);
							for(unsigned int f=1;f<rq->totalFragments;++f) {
								rq->frag0.append(rq->frags[f - 1].payload(),rq->frags[f - 1].payloadLength());
							}
							if (rq->frag0.tryDecode(RR,tPtr,flowId)) {
								rq->timestamp = 0; // packet decoded, free entry
							} else {
								rq->complete = true; // set complete flag but leave entry since it probably needs WHOIS or something
							}
						} else {
							// Still waiting on more fragments, but keep the head
							rq->frag0.init(data,len,path,now);
						}
					} // else this is a duplicate head, ignore
				} else {
					// Packet is unfragmented, so just process it
					IncomingPacket packet(data,len,path,now);
					if (!packet.tryDecode(RR,tPtr,flowId)) {
						RXQueueEntry *const rq = _nextRXQueueEntry();
						Mutex::Lock rql(rq->lock);
						rq->flowId = flowId;
						rq->timestamp = now;
						rq->packetId = packet.packetId();
						rq->frag0 = packet;
						rq->totalFragments = 1;
						rq->haveFragments = 1;
						rq->complete = true;
					}
				}
				// --------------------------------------------------------------------
			}
		}
	} catch ( ... ) {} // sanity check, should be caught elsewhere
}
void Switch::onLocalEthernet(void *tPtr,const SharedPtr<Network> &network,const MAC &from,const MAC &to,unsigned int etherType,unsigned int vlanId,const void *data,unsigned int len)
{
if (!network->hasConfig()) {
return;
}
// Check if this packet is from someone other than the tap -- i.e. bridged in
bool fromBridged;
if ((fromBridged = (from != network->mac()))) {
if (!network->config().permitsBridging(RR->identity.address())) {
RR->t->outgoingNetworkFrameDropped(tPtr,network,from,to,etherType,vlanId,len,"not a bridge");
return;
}
}
uint8_t qosBucket = ZT_AQM_DEFAULT_BUCKET;
/**
* A pseudo-unique identifier used by balancing and bonding policies to
* categorize individual flows/conversations for assignment to a specific
* physical path. This identifier is the XOR of the source port, the
* destination port, and the IP protocol number of the encapsulated frame.
*
* A flowId of ZT_QOS_NO_FLOW (-1) indicates that there is no preference for
* how this packet shall be sent. An example of this would be an ICMP packet.
*/
int32_t flowId = ZT_QOS_NO_FLOW;
if (etherType == ZT_ETHERTYPE_IPV4 && (len >= 20)) {
uint16_t srcPort = 0;
uint16_t dstPort = 0;
uint8_t proto = (reinterpret_cast<const uint8_t *>(data)[9]);
const unsigned int headerLen = 4 * (reinterpret_cast<const uint8_t *>(data)[0] & 0xf);
switch(proto) {
case 0x01: // ICMP
//flowId = 0x01;
break;
// All these start with 16-bit source and destination port in that order
case 0x06: // TCP
case 0x11: // UDP
case 0x84: // SCTP
case 0x88: // UDPLite
if (len > (headerLen + 4)) {
unsigned int pos = headerLen + 0;
srcPort = (reinterpret_cast<const uint8_t *>(data)[pos++]) << 8;
srcPort |= (reinterpret_cast<const uint8_t *>(data)[pos]);
pos++;
dstPort = (reinterpret_cast<const uint8_t *>(data)[pos++]) << 8;
dstPort |= (reinterpret_cast<const uint8_t *>(data)[pos]);
flowId = dstPort ^ srcPort ^ proto;
}
break;
}
}
if (etherType == ZT_ETHERTYPE_IPV6 && (len >= 40)) {
uint16_t srcPort = 0;
uint16_t dstPort = 0;
unsigned int pos = 0; // defensively initialized in case _ipv6GetPayload() leaves it untouched
unsigned int proto = 0;
_ipv6GetPayload((const uint8_t *)data, len, pos, proto);
switch(proto) {
case 0x3A: // ICMPv6
//flowId = 0x3A;
break;
// All these start with 16-bit source and destination port in that order
case 0x06: // TCP
case 0x11: // UDP
case 0x84: // SCTP
case 0x88: // UDPLite
if (len > (pos + 4)) {
srcPort = (reinterpret_cast<const uint8_t *>(data)[pos++]) << 8;
srcPort |= (reinterpret_cast<const uint8_t *>(data)[pos]);
pos++;
dstPort = (reinterpret_cast<const uint8_t *>(data)[pos++]) << 8;
dstPort |= (reinterpret_cast<const uint8_t *>(data)[pos]);
flowId = dstPort ^ srcPort ^ proto;
}
break;
default:
break;
}
}
if (to.isMulticast()) {
MulticastGroup multicastGroup(to,0);
if (to.isBroadcast()) {
if ( (etherType == ZT_ETHERTYPE_ARP) && (len >= 28) && ((((const uint8_t *)data)[2] == 0x08)&&(((const uint8_t *)data)[3] == 0x00)&&(((const uint8_t *)data)[4] == 6)&&(((const uint8_t *)data)[5] == 4)&&(((const uint8_t *)data)[7] == 0x01)) ) {
/* IPv4 ARP is one of the few special cases that we impose upon what is
* otherwise a straightforward Ethernet switch emulation. Vanilla ARP
* is dumb old broadcast and simply doesn't scale. ZeroTier multicast
* groups have an additional field called ADI (additional distinguishing
* information) which was added specifically for ARP though it could
* be used for other things too. We then take ARP broadcasts and turn
* them into multicasts by stuffing the IP address being queried into
* the 32-bit ADI field. In practice this uses our multicast pub/sub
* system to implement a kind of extended/distributed ARP table. */
multicastGroup = MulticastGroup::deriveMulticastGroupForAddressResolution(InetAddress(((const unsigned char *)data) + 24,4,0));
} else if (!network->config().enableBroadcast()) {
// Don't transmit broadcasts if this network doesn't want them
RR->t->outgoingNetworkFrameDropped(tPtr,network,from,to,etherType,vlanId,len,"broadcast disabled");
return;
}
} else if ((etherType == ZT_ETHERTYPE_IPV6)&&(len >= (40 + 8 + 16))) {
// IPv6 NDP emulation for certain very special patterns of private IPv6 addresses -- if enabled
if ((network->config().ndpEmulation())&&(reinterpret_cast<const uint8_t *>(data)[6] == 0x3a)&&(reinterpret_cast<const uint8_t *>(data)[40] == 0x87)) { // ICMPv6 neighbor solicitation
Address v6EmbeddedAddress;
const uint8_t *const pkt6 = reinterpret_cast<const uint8_t *>(data) + 40 + 8;
const uint8_t *my6 = (const uint8_t *)0;
// ZT-RFC4193 address: fdNN:NNNN:NNNN:NNNN:NN99:93DD:DDDD:DDDD / 88 (one /128 per actual host)
// ZT-6PLANE address: fcXX:XXXX:XXDD:DDDD:DDDD:####:####:#### / 40 (one /80 per actual host)
// (XX - lower 32 bits of network ID XORed with higher 32 bits)
// For these to work, we must have a ZT-managed address assigned in one of the
// above formats, and the query must match its prefix.
for(unsigned int sipk=0;sipk<network->config().staticIpCount;++sipk) {
const InetAddress *const sip = &(network->config().staticIps[sipk]);
if (sip->ss_family == AF_INET6) {
my6 = reinterpret_cast<const uint8_t *>(reinterpret_cast<const struct sockaddr_in6 *>(&(*sip))->sin6_addr.s6_addr);
const unsigned int sipNetmaskBits = Utils::ntoh((uint16_t)reinterpret_cast<const struct sockaddr_in6 *>(&(*sip))->sin6_port);
if ((sipNetmaskBits == 88)&&(my6[0] == 0xfd)&&(my6[9] == 0x99)&&(my6[10] == 0x93)) { // ZT-RFC4193 /88 ???
unsigned int ptr = 0;
while (ptr != 11) {
if (pkt6[ptr] != my6[ptr]) {
break;
}
++ptr;
}
if (ptr == 11) { // prefix match!
v6EmbeddedAddress.setTo(pkt6 + ptr,5);
break;
}
} else if (sipNetmaskBits == 40) { // ZT-6PLANE /40 ???
const uint32_t nwid32 = (uint32_t)((network->id() ^ (network->id() >> 32)) & 0xffffffff);
if ( (my6[0] == 0xfc) && (my6[1] == (uint8_t)((nwid32 >> 24) & 0xff)) && (my6[2] == (uint8_t)((nwid32 >> 16) & 0xff)) && (my6[3] == (uint8_t)((nwid32 >> 8) & 0xff)) && (my6[4] == (uint8_t)(nwid32 & 0xff))) {
unsigned int ptr = 0;
while (ptr != 5) {
if (pkt6[ptr] != my6[ptr]) {
break;
}
++ptr;
}
if (ptr == 5) { // prefix match!
v6EmbeddedAddress.setTo(pkt6 + ptr,5);
break;
}
}
}
}
}
if ((v6EmbeddedAddress)&&(v6EmbeddedAddress != RR->identity.address())) {
const MAC peerMac(v6EmbeddedAddress,network->id());
uint8_t adv[72];
adv[0] = 0x60;
adv[1] = 0x00;
adv[2] = 0x00;
adv[3] = 0x00;
adv[4] = 0x00;
adv[5] = 0x20;
adv[6] = 0x3a;
adv[7] = 0xff;
for(int i=0;i<16;++i) {
adv[8 + i] = pkt6[i];
}
for(int i=0;i<16;++i) {
adv[24 + i] = my6[i];
}
adv[40] = 0x88;
adv[41] = 0x00;
adv[42] = 0x00;
adv[43] = 0x00; // future home of checksum
adv[44] = 0x60;
adv[45] = 0x00;
adv[46] = 0x00;
adv[47] = 0x00;
for(int i=0;i<16;++i) {
adv[48 + i] = pkt6[i];
}
adv[64] = 0x02;
adv[65] = 0x01;
adv[66] = peerMac[0];
adv[67] = peerMac[1];
adv[68] = peerMac[2];
adv[69] = peerMac[3];
adv[70] = peerMac[4];
adv[71] = peerMac[5];
uint16_t pseudo_[36];
uint8_t *const pseudo = reinterpret_cast<uint8_t *>(pseudo_);
for(int i=0;i<32;++i) {
pseudo[i] = adv[8 + i];
}
pseudo[32] = 0x00;
pseudo[33] = 0x00;
pseudo[34] = 0x00;
pseudo[35] = 0x20;
pseudo[36] = 0x00;
pseudo[37] = 0x00;
pseudo[38] = 0x00;
pseudo[39] = 0x3a;
for(int i=0;i<32;++i) {
pseudo[40 + i] = adv[40 + i];
}
uint32_t checksum = 0;
for(int i=0;i<36;++i) {
checksum += Utils::hton(pseudo_[i]);
}
while ((checksum >> 16)) {
checksum = (checksum & 0xffff) + (checksum >> 16);
}
checksum = ~checksum;
adv[42] = (checksum >> 8) & 0xff;
adv[43] = checksum & 0xff;
RR->node->putFrame(tPtr,network->id(),network->userPtr(),peerMac,from,ZT_ETHERTYPE_IPV6,0,adv,72);
return; // NDP emulation done. We have forged a "fake" reply, so no need to send actual NDP query.
} // else no NDP emulation
} // else no NDP emulation
}
// Check this after NDP emulation, since that has to be allowed in exactly this case
if (network->config().multicastLimit == 0) {
RR->t->outgoingNetworkFrameDropped(tPtr,network,from,to,etherType,vlanId,len,"multicast disabled");
return;
}
/* Learn multicast groups for bridged-in hosts.
* Note that some OSes, most notably Linux, do this for you by learning
* multicast addresses on bridge interfaces and subscribing each slave.
* But in that case this does no harm, as the sets are just merged. */
if (fromBridged) {
network->learnBridgedMulticastGroup(tPtr,multicastGroup,RR->node->now());
}
// First pass sets noTee to false, but noTee is set to true in OutboundMulticast to prevent duplicates.
if (!network->filterOutgoingPacket(tPtr,false,RR->identity.address(),Address(),from,to,(const uint8_t *)data,len,etherType,vlanId,qosBucket)) {
RR->t->outgoingNetworkFrameDropped(tPtr,network,from,to,etherType,vlanId,len,"filter blocked");
return;
}
RR->mc->send(
tPtr,
RR->node->now(),
network,
Address(),
multicastGroup,
(fromBridged) ? from : MAC(),
etherType,
data,
len);
} else if (to == network->mac()) {
// Destination is this node, so just reinject it
RR->node->putFrame(tPtr,network->id(),network->userPtr(),from,to,etherType,vlanId,data,len);
} else if (to[0] == MAC::firstOctetForNetwork(network->id())) {
// Destination is another ZeroTier peer on the same network
Address toZT(to.toAddress(network->id())); // since in-network MACs are derived from addresses and network IDs, we can reverse this
SharedPtr<Peer> toPeer(RR->topology->getPeer(tPtr,toZT));
if (!network->filterOutgoingPacket(tPtr,false,RR->identity.address(),toZT,from,to,(const uint8_t *)data,len,etherType,vlanId,qosBucket)) {
RR->t->outgoingNetworkFrameDropped(tPtr,network,from,to,etherType,vlanId,len,"filter blocked");
return;
}
network->pushCredentialsIfNeeded(tPtr,toZT,RR->node->now());
if (!fromBridged) {
Packet outp(toZT,RR->identity.address(),Packet::VERB_FRAME);
outp.append(network->id());
outp.append((uint16_t)etherType);
outp.append(data,len);
// 1.4.8: disable compression for unicast as it almost never helps
//if (!network->config().disableCompression())
// outp.compress();
aqm_enqueue(tPtr,network,outp,true,qosBucket,flowId);
} else {
Packet outp(toZT,RR->identity.address(),Packet::VERB_EXT_FRAME);
outp.append(network->id());
outp.append((unsigned char)0x00);
to.appendTo(outp);
from.appendTo(outp);
outp.append((uint16_t)etherType);
outp.append(data,len);
// 1.4.8: disable compression for unicast as it almost never helps
//if (!network->config().disableCompression())
// outp.compress();
aqm_enqueue(tPtr,network,outp,true,qosBucket,flowId);
}
} else {
// Destination is bridged behind a remote peer
// We filter with a NULL destination ZeroTier address first. Filtering
// for each ZT destination is also done below. This is the same rationale
// and design as for multicast.
if (!network->filterOutgoingPacket(tPtr,false,RR->identity.address(),Address(),from,to,(const uint8_t *)data,len,etherType,vlanId,qosBucket)) {
RR->t->outgoingNetworkFrameDropped(tPtr,network,from,to,etherType,vlanId,len,"filter blocked");
return;
}
Address bridges[ZT_MAX_BRIDGE_SPAM];
unsigned int numBridges = 0;
/* Create an array of up to ZT_MAX_BRIDGE_SPAM recipients for this bridged frame. */
bridges[0] = network->findBridgeTo(to);
std::vector<Address> activeBridges(network->config().activeBridges());
if ((bridges[0])&&(bridges[0] != RR->identity.address())&&(network->config().permitsBridging(bridges[0]))) {
/* We have a known bridge route for this MAC, send it there. */
++numBridges;
} else if (!activeBridges.empty()) {
/* If there is no known route, spam to up to ZT_MAX_BRIDGE_SPAM active
* bridges. If someone responds, we'll learn the route. */
std::vector<Address>::const_iterator ab(activeBridges.begin());
if (activeBridges.size() <= ZT_MAX_BRIDGE_SPAM) {
// If there are <= ZT_MAX_BRIDGE_SPAM active bridges, spam them all
while (ab != activeBridges.end()) {
bridges[numBridges++] = *ab;
++ab;
}
} else {
// Otherwise pick a random set of them
while (numBridges < ZT_MAX_BRIDGE_SPAM) {
if (ab == activeBridges.end()) {
ab = activeBridges.begin();
}
if (((unsigned long)RR->node->prng() % (unsigned long)activeBridges.size()) == 0) {
bridges[numBridges++] = *ab;
++ab;
} else {
++ab;
}
}
}
}
for(unsigned int b=0;b<numBridges;++b) {
if (network->filterOutgoingPacket(tPtr,true,RR->identity.address(),bridges[b],from,to,(const uint8_t *)data,len,etherType,vlanId,qosBucket)) {
Packet outp(bridges[b],RR->identity.address(),Packet::VERB_EXT_FRAME);
outp.append(network->id());
outp.append((uint8_t)0x00);
to.appendTo(outp);
from.appendTo(outp);
outp.append((uint16_t)etherType);
outp.append(data,len);
// 1.4.8: disable compression for unicast as it almost never helps
//if (!network->config().disableCompression())
// outp.compress();
aqm_enqueue(tPtr,network,outp,true,qosBucket,flowId);
} else {
RR->t->outgoingNetworkFrameDropped(tPtr,network,from,to,etherType,vlanId,len,"filter blocked (bridge replication)");
}
}
}
}
void Switch::aqm_enqueue(void *tPtr, const SharedPtr<Network> &network, Packet &packet,bool encrypt,int qosBucket,int32_t flowId)
{
if(!network->qosEnabled()) {
send(tPtr, packet, encrypt, flowId);
return;
}
NetworkQoSControlBlock *nqcb = _netQueueControlBlock[network->id()];
if (!nqcb) {
nqcb = new NetworkQoSControlBlock();
_netQueueControlBlock[network->id()] = nqcb;
// Initialize ZT_AQM_NUM_BUCKETS queues and place them in the INACTIVE list
// These queues will be shuffled between the new/old/inactive lists by the enqueue/dequeue algorithm
for (int i=0; i<ZT_AQM_NUM_BUCKETS; i++) {
nqcb->inactiveQueues.push_back(new ManagedQueue(i));
}
}
// Don't apply QoS scheduling to ZT protocol traffic
if (packet.verb() != Packet::VERB_FRAME && packet.verb() != Packet::VERB_EXT_FRAME) {
send(tPtr, packet, encrypt, flowId);
return; // already sent directly; do not enqueue it as well
}
_aqm_m.lock();
// Enqueue packet and move queue to appropriate list
const Address dest(packet.destination());
TXQueueEntry *txEntry = new TXQueueEntry(dest,RR->node->now(),packet,encrypt,flowId);
ManagedQueue *selectedQueue = nullptr;
for (size_t i=0; i<ZT_AQM_NUM_BUCKETS; i++) {
if (i < nqcb->oldQueues.size()) { // search old queues first (I think this is best since old would imply most recent usage of the queue)
if (nqcb->oldQueues[i]->id == qosBucket) {
selectedQueue = nqcb->oldQueues[i];
}
}
if (i < nqcb->newQueues.size()) { // search new queues (this would imply not often-used queues)
if (nqcb->newQueues[i]->id == qosBucket) {
selectedQueue = nqcb->newQueues[i];
}
}
if (i < nqcb->inactiveQueues.size()) { // search inactive queues
if (nqcb->inactiveQueues[i]->id == qosBucket) {
selectedQueue = nqcb->inactiveQueues[i];
// move queue to end of NEW queue list
selectedQueue->byteCredit = ZT_AQM_QUANTUM;
// DEBUG_INFO("moving q=%p from INACTIVE to NEW list", selectedQueue);
nqcb->newQueues.push_back(selectedQueue);
nqcb->inactiveQueues.erase(nqcb->inactiveQueues.begin() + i);
}
}
}
if (!selectedQueue) {
delete txEntry; // no queue matched this bucket, so free the entry instead of leaking it
_aqm_m.unlock();
return;
}
selectedQueue->q.push_back(txEntry);
selectedQueue->byteLength+=txEntry->packet.payloadLength();
nqcb->_currEnqueuedPackets++;
// DEBUG_INFO("nq=%2lu, oq=%2lu, iq=%2lu, nqcb.size()=%3d, bucket=%2d, q=%p", nqcb->newQueues.size(), nqcb->oldQueues.size(), nqcb->inactiveQueues.size(), nqcb->_currEnqueuedPackets, qosBucket, selectedQueue);
// Drop a packet if necessary
ManagedQueue *selectedQueueToDropFrom = nullptr;
if (nqcb->_currEnqueuedPackets > ZT_AQM_MAX_ENQUEUED_PACKETS) {
// DEBUG_INFO("too many enqueued packets (%d), finding packet to drop", nqcb->_currEnqueuedPackets);
int maxQueueLength = 0;
for (size_t i=0; i<ZT_AQM_NUM_BUCKETS; i++) {
if (i < nqcb->oldQueues.size()) {
if (nqcb->oldQueues[i]->byteLength > maxQueueLength) {
maxQueueLength = nqcb->oldQueues[i]->byteLength;
selectedQueueToDropFrom = nqcb->oldQueues[i];
}
}
if (i < nqcb->newQueues.size()) {
if (nqcb->newQueues[i]->byteLength > maxQueueLength) {
maxQueueLength = nqcb->newQueues[i]->byteLength;
selectedQueueToDropFrom = nqcb->newQueues[i];
}
}
if (i < nqcb->inactiveQueues.size()) {
if (nqcb->inactiveQueues[i]->byteLength > maxQueueLength) {
maxQueueLength = nqcb->inactiveQueues[i]->byteLength;
selectedQueueToDropFrom = nqcb->inactiveQueues[i];
}
}
}
if (selectedQueueToDropFrom) {
// DEBUG_INFO("dropping packet from head of largest queue (%d payload bytes)", maxQueueLength);
int sizeOfDroppedPacket = selectedQueueToDropFrom->q.front()->packet.payloadLength();
delete selectedQueueToDropFrom->q.front();
selectedQueueToDropFrom->q.pop_front();
selectedQueueToDropFrom->byteLength-=sizeOfDroppedPacket;
nqcb->_currEnqueuedPackets--;
}
}
_aqm_m.unlock();
aqm_dequeue(tPtr);
}
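// CoDel control law: schedule the next drop at t + interval / sqrt(count),
// so drops become more frequent the longer the queue stays above target.
uint64_t Switch::control_law(uint64_t t, int count)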
{
return (uint64_t)(t + ZT_AQM_INTERVAL / sqrt(count));
}
Switch::dqr Switch::dodequeue(ManagedQueue *q, uint64_t now)
{
dqr r;
r.ok_to_drop = false;
r.p = q->q.empty() ? nullptr : q->q.front(); // front() on an empty list is undefined behavior
if (r.p == NULL) {
q->first_above_time = 0;
return r;
}
uint64_t sojourn_time = now - r.p->creationTime;
if (sojourn_time < ZT_AQM_TARGET || q->byteLength <= ZT_DEFAULT_MTU) {
// went below - stay below for at least interval
q->first_above_time = 0;
} else {
if (q->first_above_time == 0) {
// just went above from below. if still above at
// first_above_time, will say it's ok to drop.
q->first_above_time = now + ZT_AQM_INTERVAL;
} else if (now >= q->first_above_time) {
r.ok_to_drop = true;
}
}
return r;
}
Switch::TXQueueEntry * Switch::CoDelDequeue(ManagedQueue *q, bool isNew, uint64_t now)
{
dqr r = dodequeue(q, now);
if (q->dropping) {
if (!r.ok_to_drop) {
q->dropping = false;
}
while (now >= q->drop_next && q->dropping) {
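q->byteLength -= r.p->packet.payloadLength(); // keep byte accounting in sync with the drop
delete r.p; // free the dropped entry; it was allocated in aqm_enqueue()
q->q.pop_front(); // drop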
r = dodequeue(q, now);
if (!r.ok_to_drop) {
// leave dropping state
q->dropping = false;
} else {
++(q->count);
// schedule the next drop.
q->drop_next = control_law(q->drop_next, q->count);
}
}
} else if (r.ok_to_drop) {
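q->byteLength -= r.p->packet.payloadLength(); // keep byte accounting in sync with the drop
delete r.p; // free the dropped entry; it was allocated in aqm_enqueue()
q->q.pop_front(); // drop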
r = dodequeue(q, now);
q->dropping = true;
q->count = (q->count > 2 && now - q->drop_next < 8*ZT_AQM_INTERVAL)?
q->count - 2 : 1;
q->drop_next = control_law(now, q->count);
}
return r.p;
}
void Switch::aqm_dequeue(void *tPtr)
{
// Cycle through network-specific QoS control blocks
for(std::map<uint64_t,NetworkQoSControlBlock*>::iterator nqcb(_netQueueControlBlock.begin());nqcb!=_netQueueControlBlock.end();) {
if (!(*nqcb).second->_currEnqueuedPackets) {
++nqcb; // nothing queued for this network, move on to the next one
continue;
}
uint64_t now = RR->node->now();
TXQueueEntry *entryToEmit = nullptr;
std::vector<ManagedQueue*> *currQueues = &((*nqcb).second->newQueues);
std::vector<ManagedQueue*> *oldQueues = &((*nqcb).second->oldQueues);
std::vector<ManagedQueue*> *inactiveQueues = &((*nqcb).second->inactiveQueues);
_aqm_m.lock();
// Attempt dequeue from queues in NEW list
bool examiningNewQueues = true;
while (currQueues->size()) {
ManagedQueue *queueAtFrontOfList = currQueues->front();
if (queueAtFrontOfList->byteCredit < 0) {
queueAtFrontOfList->byteCredit += ZT_AQM_QUANTUM;
// Move to list of OLD queues
// DEBUG_INFO("moving q=%p from NEW to OLD list", queueAtFrontOfList);
oldQueues->push_back(queueAtFrontOfList);
currQueues->erase(currQueues->begin());
} else {
entryToEmit = CoDelDequeue(queueAtFrontOfList, examiningNewQueues, now);
if (!entryToEmit) {
// Move to end of list of OLD queues
// DEBUG_INFO("moving q=%p from NEW to OLD list", queueAtFrontOfList);
oldQueues->push_back(queueAtFrontOfList);
currQueues->erase(currQueues->begin());
} else {
int len = entryToEmit->packet.payloadLength();
queueAtFrontOfList->byteLength -= len;
queueAtFrontOfList->byteCredit -= len;
// Send the packet!
queueAtFrontOfList->q.pop_front();
send(tPtr, entryToEmit->packet, entryToEmit->encrypt, entryToEmit->flowId);
delete entryToEmit; // send() copies the packet if it has to queue it, so the entry can be freed
(*nqcb).second->_currEnqueuedPackets--;
}
if (queueAtFrontOfList) {
//DEBUG_INFO("dequeuing from q=%p, len=%lu in NEW list (byteCredit=%d)", queueAtFrontOfList, queueAtFrontOfList->q.size(), queueAtFrontOfList->byteCredit);
}
break;
}
}
// Attempt dequeue from queues in OLD list
examiningNewQueues = false;
currQueues = &((*nqcb).second->oldQueues);
while (currQueues->size()) {
ManagedQueue *queueAtFrontOfList = currQueues->front();
if (queueAtFrontOfList->byteCredit < 0) {
queueAtFrontOfList->byteCredit += ZT_AQM_QUANTUM;
oldQueues->push_back(queueAtFrontOfList);
currQueues->erase(currQueues->begin());
} else {
entryToEmit = CoDelDequeue(queueAtFrontOfList, examiningNewQueues, now);
if (!entryToEmit) {
//DEBUG_INFO("moving q=%p from OLD to INACTIVE list", queueAtFrontOfList);
// Move to inactive list of queues
inactiveQueues->push_back(queueAtFrontOfList);
currQueues->erase(currQueues->begin());
} else {
int len = entryToEmit->packet.payloadLength();
queueAtFrontOfList->byteLength -= len;
queueAtFrontOfList->byteCredit -= len;
queueAtFrontOfList->q.pop_front();
send(tPtr, entryToEmit->packet, entryToEmit->encrypt, entryToEmit->flowId);
delete entryToEmit; // send() copies the packet if it has to queue it, so the entry can be freed
(*nqcb).second->_currEnqueuedPackets--;
}
if (queueAtFrontOfList) {
//DEBUG_INFO("dequeuing from q=%p, len=%lu in OLD list (byteCredit=%d)", queueAtFrontOfList, queueAtFrontOfList->q.size(), queueAtFrontOfList->byteCredit);
}
break;
}
}
nqcb++;
_aqm_m.unlock();
}
}
void Switch::removeNetworkQoSControlBlock(uint64_t nwid)
{
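// Use find() rather than operator[] so a lookup miss does not insert a null entry
std::map<uint64_t,NetworkQoSControlBlock*>::iterator it(_netQueueControlBlock.find(nwid));
if (it != _netQueueControlBlock.end()) {
delete it->second;
_netQueueControlBlock.erase(it);
}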
}
void Switch::send(void *tPtr,Packet &packet,bool encrypt,int32_t flowId)
{
const Address dest(packet.destination());
if (dest == RR->identity.address()) {
return;
}
_recordOutgoingPacketMetrics(packet);
if (!_trySend(tPtr,packet,encrypt,flowId)) {
{
Mutex::Lock _l(_txQueue_m);
if (_txQueue.size() >= ZT_TX_QUEUE_SIZE) {
_txQueue.pop_front();
}
_txQueue.push_back(TXQueueEntry(dest,RR->node->now(),packet,encrypt,flowId));
}
if (!RR->topology->getPeer(tPtr,dest)) {
requestWhois(tPtr,RR->node->now(),dest);
}
}
}
void Switch::requestWhois(void *tPtr,const int64_t now,const Address &addr)
{
if (addr == RR->identity.address()) {
return;
}
{
Mutex::Lock _l(_lastSentWhoisRequest_m);
int64_t &last = _lastSentWhoisRequest[addr];
if ((now - last) < ZT_WHOIS_RETRY_DELAY) {
return;
} else {
last = now;
}
}
const SharedPtr<Peer> upstream(RR->topology->getUpstreamPeer());
if (upstream) {
int32_t flowId = ZT_QOS_NO_FLOW;
Packet outp(upstream->address(),RR->identity.address(),Packet::VERB_WHOIS);
addr.appendTo(outp);
send(tPtr,outp,true,flowId);
}
}
void Switch::doAnythingWaitingForPeer(void *tPtr,const SharedPtr<Peer> &peer)
{
{
Mutex::Lock _l(_lastSentWhoisRequest_m);
_lastSentWhoisRequest.erase(peer->address());
}
const int64_t now = RR->node->now();
for(unsigned int ptr=0;ptr<ZT_RX_QUEUE_SIZE;++ptr) {
RXQueueEntry *const rq = &(_rxQueue[ptr]);
Mutex::Lock rql(rq->lock);
if ((rq->timestamp)&&(rq->complete)) {
if ((rq->frag0.tryDecode(RR,tPtr,rq->flowId))||((now - rq->timestamp) > ZT_RECEIVE_QUEUE_TIMEOUT)) {
rq->timestamp = 0;
}
}
}
{
Mutex::Lock _l(_txQueue_m);
for(std::list< TXQueueEntry >::iterator txi(_txQueue.begin());txi!=_txQueue.end();) {
if (txi->dest == peer->address()) {
if (_trySend(tPtr,txi->packet,txi->encrypt,txi->flowId)) {
_txQueue.erase(txi++);
} else {
++txi;
}
} else {
++txi;
}
}
}
}
unsigned long Switch::doTimerTasks(void *tPtr,int64_t now)
{
const uint64_t timeSinceLastCheck = now - _lastCheckedQueues;
if (timeSinceLastCheck < ZT_WHOIS_RETRY_DELAY) {
return (unsigned long)(ZT_WHOIS_RETRY_DELAY - timeSinceLastCheck);
}
_lastCheckedQueues = now;
std::vector<Address> needWhois;
{
Mutex::Lock _l(_txQueue_m);
for(std::list< TXQueueEntry >::iterator txi(_txQueue.begin());txi!=_txQueue.end();) {
if (_trySend(tPtr,txi->packet,txi->encrypt,txi->flowId)) {
_txQueue.erase(txi++);
} else if ((now - txi->creationTime) > ZT_TRANSMIT_QUEUE_TIMEOUT) {
_txQueue.erase(txi++);
} else {
if (!RR->topology->getPeer(tPtr,txi->dest)) {
needWhois.push_back(txi->dest);
}
++txi;
}
}
}
for(std::vector<Address>::const_iterator i(needWhois.begin());i!=needWhois.end();++i) {
requestWhois(tPtr,now,*i);
}
for(unsigned int ptr=0;ptr<ZT_RX_QUEUE_SIZE;++ptr) {
RXQueueEntry *const rq = &(_rxQueue[ptr]);
Mutex::Lock rql(rq->lock);
if ((rq->timestamp)&&(rq->complete)) {
if ((rq->frag0.tryDecode(RR,tPtr,rq->flowId))||((now - rq->timestamp) > ZT_RECEIVE_QUEUE_TIMEOUT)) {
rq->timestamp = 0;
} else {
const Address src(rq->frag0.source());
if (!RR->topology->getPeer(tPtr,src)) {
requestWhois(tPtr,now,src);
}
}
}
}
{
Mutex::Lock _l(_lastUniteAttempt_m);
Hashtable< _LastUniteKey,uint64_t >::Iterator i(_lastUniteAttempt);
_LastUniteKey *k = (_LastUniteKey *)0;
uint64_t *v = (uint64_t *)0;
while (i.next(k,v)) {
if ((now - *v) >= (ZT_MIN_UNITE_INTERVAL * 8)) {
_lastUniteAttempt.erase(*k);
}
}
}
{
Mutex::Lock _l(_lastSentWhoisRequest_m);
Hashtable< Address,int64_t >::Iterator i(_lastSentWhoisRequest);
Address *a = (Address *)0;
int64_t *ts = (int64_t *)0;
while (i.next(a,ts)) {
if ((now - *ts) > (ZT_WHOIS_RETRY_DELAY * 2)) {
_lastSentWhoisRequest.erase(*a);
}
}
}
return ZT_WHOIS_RETRY_DELAY;
}
bool Switch::_shouldUnite(const int64_t now,const Address &source,const Address &destination)
{
Mutex::Lock _l(_lastUniteAttempt_m);
uint64_t &ts = _lastUniteAttempt[_LastUniteKey(source,destination)];
if ((now - ts) >= ZT_MIN_UNITE_INTERVAL) {
ts = now;
return true;
}
return false;
}
bool Switch::_trySend(void *tPtr,Packet &packet,bool encrypt,int32_t flowId)
{
SharedPtr<Path> viaPath;
const int64_t now = RR->node->now();
const Address destination(packet.destination());
const SharedPtr<Peer> peer(RR->topology->getPeer(tPtr,destination));
if (peer) {
if ((peer->bondingPolicy() == ZT_BOND_POLICY_BROADCAST)
&& (packet.verb() == Packet::VERB_FRAME || packet.verb() == Packet::VERB_EXT_FRAME)) {
const SharedPtr<Peer> relay(RR->topology->getUpstreamPeer());
Mutex::Lock _l(peer->_paths_m);
for(int i=0;i<ZT_MAX_PEER_NETWORK_PATHS;++i) {
if (peer->_paths[i].p && peer->_paths[i].p->alive(now)) {
uint16_t userSpecifiedMtu = peer->_paths[i].p->mtu();
_sendViaSpecificPath(tPtr,peer,peer->_paths[i].p, userSpecifiedMtu,now,packet,encrypt,flowId);
}
}
return true;
} else {
viaPath = peer->getAppropriatePath(now,false,flowId);
if (!viaPath) {
peer->tryMemorizedPath(tPtr,now); // periodically attempt memorized or statically defined paths, if any are known
const SharedPtr<Peer> relay(RR->topology->getUpstreamPeer());
if ( (!relay) || (!(viaPath = relay->getAppropriatePath(now,false,flowId))) ) {
if (!(viaPath = peer->getAppropriatePath(now,true,flowId))) {
return false;
}
}
}
if (viaPath) {
uint16_t userSpecifiedMtu = viaPath->mtu();
_sendViaSpecificPath(tPtr,peer,viaPath,userSpecifiedMtu,now,packet,encrypt,flowId);
return true;
}
}
}
return false;
}
void Switch::_sendViaSpecificPath(void *tPtr,SharedPtr<Peer> peer,SharedPtr<Path> viaPath,uint16_t userSpecifiedMtu, int64_t now,Packet &packet,bool encrypt,int32_t flowId)
{
unsigned int mtu = ZT_DEFAULT_PHYSMTU;
uint64_t trustedPathId = 0;
RR->topology->getOutboundPathInfo(viaPath->address(),mtu,trustedPathId);
if (userSpecifiedMtu > 0) {
mtu = userSpecifiedMtu;
}
unsigned int chunkSize = std::min(packet.size(),mtu);
packet.setFragmented(chunkSize < packet.size());
if (trustedPathId) {
packet.setTrusted(trustedPathId);
} else {
if (!packet.isEncrypted()) {
packet.armor(peer->key(),encrypt,peer->aesKeysIfSupported());
}
RR->node->expectReplyTo(packet.packetId());
}
peer->recordOutgoingPacket(viaPath, packet.packetId(), packet.payloadLength(), packet.verb(), flowId, now);
if (viaPath->send(RR,tPtr,packet.data(),chunkSize,now)) {
if (chunkSize < packet.size()) {
// Too big for one packet, fragment the rest
unsigned int fragStart = chunkSize;
unsigned int remaining = packet.size() - chunkSize;
unsigned int fragsRemaining = (remaining / (mtu - ZT_PROTO_MIN_FRAGMENT_LENGTH));
if ((fragsRemaining * (mtu - ZT_PROTO_MIN_FRAGMENT_LENGTH)) < remaining) {
++fragsRemaining;
}
const unsigned int totalFragments = fragsRemaining + 1;
for(unsigned int fno=1;fno<totalFragments;++fno) {
chunkSize = std::min(remaining,(unsigned int)(mtu - ZT_PROTO_MIN_FRAGMENT_LENGTH));
Packet::Fragment frag(packet,fragStart,chunkSize,fno,totalFragments);
viaPath->send(RR,tPtr,frag.data(),frag.size(),now);
fragStart += chunkSize;
remaining -= chunkSize;
}
}
}
}
void Switch::_recordOutgoingPacketMetrics(const Packet &p) {
switch (p.verb()) {
case Packet::VERB_NOP:
Metrics::pkt_nop_out++;
break;
case Packet::VERB_HELLO:
Metrics::pkt_hello_out++;
break;
case Packet::VERB_ERROR:
Metrics::pkt_error_out++;
break;
case Packet::VERB_OK:
Metrics::pkt_ok_out++;
break;
case Packet::VERB_WHOIS:
Metrics::pkt_whois_out++;
break;
case Packet::VERB_RENDEZVOUS:
Metrics::pkt_rendezvous_out++;
break;
case Packet::VERB_FRAME:
Metrics::pkt_frame_out++;
break;
case Packet::VERB_EXT_FRAME:
Metrics::pkt_ext_frame_out++;
break;
case Packet::VERB_ECHO:
Metrics::pkt_echo_out++;
break;
case Packet::VERB_MULTICAST_LIKE:
Metrics::pkt_multicast_like_out++;
break;
case Packet::VERB_NETWORK_CREDENTIALS:
Metrics::pkt_network_credentials_out++;
break;
case Packet::VERB_NETWORK_CONFIG_REQUEST:
Metrics::pkt_network_config_request_out++;
break;
case Packet::VERB_NETWORK_CONFIG:
Metrics::pkt_network_config_out++;
break;
case Packet::VERB_MULTICAST_GATHER:
Metrics::pkt_multicast_gather_out++;
break;
case Packet::VERB_MULTICAST_FRAME:
Metrics::pkt_multicast_frame_out++;
break;
case Packet::VERB_PUSH_DIRECT_PATHS:
Metrics::pkt_push_direct_paths_out++;
break;
case Packet::VERB_ACK:
Metrics::pkt_ack_out++;
break;
case Packet::VERB_QOS_MEASUREMENT:
Metrics::pkt_qos_out++;
break;
case Packet::VERB_USER_MESSAGE:
Metrics::pkt_user_message_out++;
break;
case Packet::VERB_REMOTE_TRACE:
Metrics::pkt_remote_trace_out++;
break;
case Packet::VERB_PATH_NEGOTIATION_REQUEST:
Metrics::pkt_path_negotiation_request_out++;
break;
}
}
} // namespace ZeroTier