The test TestJetStreamClusterLeafNodeSPOFMigrateLeaders, added at
some point, needed the remotes to stop (re)connecting. It reused the
existing leafNodeEnabled flag, which was meant for GW/Leaf interest
propagation races, to disable the reconnect, but that may not be
the best approach since it could affect users embedding servers
and adding leafnodes "dynamically".
So this PR introduces a boolean dedicated to that test.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
'Chaos' is a new group of tests that validate behavior in the presence of
random failures.
Overview:
- Introduce a 'Chaos Monkey' controller which can unleash a monkey
against a test cluster.
- Introduce a monkey of type 'ClusterBouncer' which stops and restarts
nodes according to some configuration
- Add 2 example tests that ensure a cluster can survive some number of
nodes bouncing
- Configure the build to skip chaos tests unless explicitly requested
- Add some test utility functions
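The 'ClusterBouncer' idea above can be illustrated with a minimal sketch. This is not the code from the PR: the `node`, `clusterBouncer`, `newCluster`, `newClusterBouncer`, `bounce`, `restore` and `runningCount` names are all hypothetical, and a real monkey would stop and restart actual servers rather than flip a flag.

```go
package main

import (
	"fmt"
	"math/rand"
)

// node is a hypothetical stand-in for one server in a test cluster.
type node struct {
	name    string
	running bool
}

// newCluster returns a cluster in which every node is running.
func newCluster(names ...string) []*node {
	cluster := make([]*node, 0, len(names))
	for _, name := range names {
		cluster = append(cluster, &node{name: name, running: true})
	}
	return cluster
}

// runningCount reports how many nodes are currently up.
func runningCount(cluster []*node) int {
	count := 0
	for _, n := range cluster {
		if n.running {
			count++
		}
	}
	return count
}

// clusterBouncer (hypothetical) stops at most maxDown nodes per round
// and restarts them at the start of the next round.
type clusterBouncer struct {
	maxDown int
	rng     *rand.Rand
	down    []*node
}

func newClusterBouncer(maxDown int, seed int64) *clusterBouncer {
	return &clusterBouncer{maxDown: maxDown, rng: rand.New(rand.NewSource(seed))}
}

// restore restarts every node stopped by the previous round.
func (b *clusterBouncer) restore() {
	for _, n := range b.down {
		n.running = true
	}
	b.down = b.down[:0]
}

// bounce restarts previously stopped nodes, then stops a random subset.
func (b *clusterBouncer) bounce(cluster []*node) {
	b.restore()
	k := b.maxDown
	if k > len(cluster) {
		k = len(cluster)
	}
	if k < 0 {
		k = 0
	}
	for _, i := range b.rng.Perm(len(cluster))[:k] {
		cluster[i].running = false
		b.down = append(b.down, cluster[i])
	}
}

func main() {
	cluster := newCluster("A", "B", "C")
	bouncer := newClusterBouncer(1, 42)
	for round := 0; round < 5; round++ {
		bouncer.bounce(cluster)
		fmt.Printf("round %d: %d/%d nodes running\n", round, runningCount(cluster), len(cluster))
	}
	bouncer.restore()
	fmt.Println("after restore:", runningCount(cluster), "running")
}
```

With `maxDown` set to 1 in a 3-node cluster, a majority stays up in every round, which is the kind of invariant a chaos test would assert while the monkey is running.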
This allows a soliciting leafnode config to request that any JetStream cluster assets that are currently a leader have the leader step down.
Signed-off-by: Derek Collison <derek@nats.io>
This could happen when a consumer had not sent anything to the
attached NATS subscription and there was a consumer leader
step down or server restart.
Signed-off-by: Derek Collison <derek@nats.io>
If a JS API request was received from a non-client connection, it
was processed in its own goroutine. To reduce the number of
such goroutines, we were limiting the number of outstanding routines
to 4096. However, in some situations, it was possible to issue
many requests at the same time, which would then cause some of
those requests to be dropped.
(An example was an MQTT benchmark tool that would create 5000
sessions, each with one QoS1 R1 consumer, using consumer_replicas=1.
On abrupt exit of the tool, the consumers and their sessions needed
to be deleted. This would cause a burst of incoming delete consumer
requests, and the original code would drop some of them.)
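The dropping behavior described above can be reproduced with a minimal sketch. It is not the server's code: the `dispatcher` type and its methods are hypothetical, and only the 4096 limit and the 5000-request burst come from the text. Handlers are held open on a channel to mimic requests that have not finished when the burst arrives.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// dispatcher (hypothetical) runs each request in its own goroutine,
// but drops requests once limit goroutines are outstanding.
type dispatcher struct {
	limit    int64
	inflight int64
	dropped  int64
	wg       sync.WaitGroup
}

func newDispatcher(limit int64) *dispatcher {
	return &dispatcher{limit: limit}
}

// dispatch returns false when the request was dropped.
func (d *dispatcher) dispatch(handle func()) bool {
	if atomic.AddInt64(&d.inflight, 1) > d.limit {
		// Over the limit: undo the reservation and drop the request.
		atomic.AddInt64(&d.inflight, -1)
		atomic.AddInt64(&d.dropped, 1)
		return false
	}
	d.wg.Add(1)
	go func() {
		defer d.wg.Done()
		defer atomic.AddInt64(&d.inflight, -1)
		handle()
	}()
	return true
}

// wait blocks until all accepted requests have finished.
func (d *dispatcher) wait() { d.wg.Wait() }

// droppedCount reports how many requests were dropped so far.
func (d *dispatcher) droppedCount() int64 { return atomic.LoadInt64(&d.dropped) }

func main() {
	d := newDispatcher(4096)    // limit mentioned in the text
	release := make(chan struct{}) // keeps handlers busy during the burst
	accepted := 0
	for i := 0; i < 5000; i++ { // burst size from the MQTT example
		if d.dispatch(func() { <-release }) {
			accepted++
		}
	}
	close(release)
	d.wait()
	fmt.Printf("accepted=%d dropped=%d\n", accepted, d.droppedCount())
	// prints "accepted=4096 dropped=904"
}
```

Because nothing finishes before the burst ends, every request past the limit is lost. That is the failure mode, not the fix.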
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
By inlining election timeout updates we doubled the lock contention and most likely introduced head-of-line issues for routes under heavy load.
Also slowed down heartbeats, given how many assets are being deployed in our user ecosystem, and moved the normal follower-to-candidate timing further out, similar to the lost quorum timing.
Note that the happy path transfer will still be very quick.
Signed-off-by: Derek Collison <derek@nats.io>
- Remove code coverage from Travis and add it to a GitHub Action
that will be run as a nightly.
- Use build tags to exclude some tests, such as the "norace" or
JS tests. Since "go test" does not support "negative" regexes, there
is no other way.
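A negated build constraint is one way to get the exclusion described above. The sketch below assumes a tag named `skip_js_tests`, and the `jsSuiteIncluded` function is hypothetical. In a real repository the constraint would sit at the top of a `_test.go` file holding the tests to exclude.

```go
//go:build !skip_js_tests

// This file is compiled unless the (assumed) skip_js_tests tag is set,
// for example with: go test -tags=skip_js_tests ./...
package main

import "fmt"

// jsSuiteIncluded is a hypothetical marker showing the file was built.
func jsSuiteIncluded() bool { return true }

func main() {
	fmt.Println("JS suite included:", jsSuiteIncluded())
}
```

A plain `go test ./...` still builds the file, while passing the tag leaves it out of the build entirely, which a `-run` pattern cannot express by negation.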
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>