Commit Graph

38 Commits

Author SHA1 Message Date
Ivan Kozlovic
a84ce61a93 [FIXED] Account resolver lock inversion
There was a lock inversion but low risk since it happened during
server initialization. Still fixed it and added the ordering
in locksordering.txt file.

Also fixed multiple lock inversions that were caused by tests.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-09-25 15:09:11 -06:00
Waldemar Quevedo
27245891f2 Add test for scaling replica with pull consumers
Signed-off-by: Waldemar Quevedo <wally@synadia.com>
2023-09-18 12:26:05 -07:00
Neil Twigg
904f4c388e De-flake TestJetStreamClusterAccountPurge by waiting for account to exist
Signed-off-by: Neil Twigg <neil@nats.io>
2023-09-14 11:40:30 +01:00
Derek Collison
e158c46884 Merge branch 'main' into dev 2023-04-30 17:37:47 -07:00
Derek Collison
c15cc0054a When a fleet of leafnodes are isolated (not routed but using same cluster) we could do better at optimizing how we update the other leafnodes.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-30 17:08:16 -07:00
Ivan Kozlovic
d6fe9d4c2d [ADDED] Support for route S2 compression
The new field `compression` in the `cluster{}` block specifies
which compression mode to use between servers.

It can be simply specified as a boolean or a string for the
simple modes, or as an object for the "s2_auto" mode where
a list of RTT thresholds can be specified.

By default, if no compression field is specified, the server
will use the s2_auto mode with default RTT thresholds of
10ms, 50ms and 100ms for the "uncompressed", "fast", "better"
and "best" modes.

```
cluster {
..
  # Possible values are "disabled", "off", "enabled", "on",
  # "accept", "s2_fast", "s2_better", "s2_best" or "s2_auto"
  compression: s2_fast
}
```

To specify a different list of thresholds for the s2_auto mode,
here is how it would look:
```
cluster {
..
  compression: {
    mode: s2_auto
    # This means that for RTT up to 5ms (included), then
    # the compression level will be "uncompressed", then
    # from 5ms+ to 15ms, the mode will switch to "s2_fast",
    # then from 15ms+ to 50ms, the level will switch to
    # "s2_better", and anything above 50ms will result
    # in the "s2_best" compression mode.
    rtt_thresholds: [5ms, 15ms, 50ms]
  }
}
```

Note that the "accept" mode means that a server will accept
compression from a remote and switch to that same compression
mode, but will otherwise not initiate compression. That is,
if 2 servers are configured with "accept", then compression
will actually be "off". If one of the servers had, say, s2_fast,
then they would both use this mode.

If a server has compression mode set (other than "off") but
connects to an older server, there will be no compression between
those 2 routes.
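
As a rough sketch of the "accept" and older-server rules stated above, the outcome for a pair of servers could be modeled like this. `negotiate` and `remoteIsOlder` are hypothetical names, the function only recognizes the canonical mode names "accept", "s2_fast", "s2_better" and "s2_best", and it reports `ok == false` for any combination this message does not describe.

```go
package main

import "fmt"

// negotiate is a hypothetical illustration, not the server's implementation.
// It returns the effective route compression for the cases described in this
// commit message, and ok == false for every other combination.
func negotiate(local, remote string, remoteIsOlder bool) (mode string, ok bool) {
	// An older server does not support route compression at all.
	if remoteIsOlder {
		return "off", true
	}
	// Modes that actively initiate compression (assumption: canonical names only).
	initiates := func(m string) bool {
		return m == "s2_fast" || m == "s2_better" || m == "s2_best"
	}
	switch {
	case local == "accept" && remote == "accept":
		// Neither side initiates, so nothing is compressed.
		return "off", true
	case local == "accept" && initiates(remote):
		return remote, true
	case remote == "accept" && initiates(local):
		return local, true
	}
	return "", false
}

func main() {
	fmt.Println(negotiate("accept", "accept", false))   // off true
	fmt.Println(negotiate("accept", "s2_fast", false))  // s2_fast true
	fmt.Println(negotiate("s2_fast", "accept", true))   // off true
	fmt.Println(negotiate("s2_fast", "s2_best", false)) // not covered here: "" false
}
```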

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-04-27 17:59:25 -06:00
Derek Collison
e97ddcd14f Tweak tests due to changes, make test timeouts uniform.
Signed-off-by: Derek Collison <derek@nats.io>
2023-03-29 12:43:59 -07:00
Derek Collison
52fbac644c Since we no longer store leaderTransfers, which is proper, some tests were getting an advantage on that after server restart.
This change speeds up raft layer more to avoid timeouts.

Signed-off-by: Derek Collison <derek@nats.io>
2023-03-29 12:43:57 -07:00
Derek Collison
5a16f98427 Fixed an off by one bug that under certain circumstances could cause large consumer replica states.
This could lead to instability in the system.

The bug would manifest in replicated consumers when certain messages could be acked out of order and the pending list would never go to zero.

Signed-off-by: Derek Collison <derek@nats.io>
2023-03-19 10:41:59 -07:00
Derek Collison
ad53d455f8 When migrating leaders off a server when the leafnode is not connected, also ensure leaders can not return until reconnected.
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-05 08:02:50 -08:00
Neil Twigg
14d0ba1c65 Fix some lint errors after move to golangci-lint 2022-12-30 20:00:08 +00:00
Marco Primi
f8a030bc4a Use testing.TempDir() where possible
Refactor tests to use Go's built-in temporary directory utility.

Also avoid binding to default port (which may be in use)
2022-12-12 13:18:44 -08:00
Ivan Kozlovic
90e9c89594 Added specific tests for using non system extended setup similar to NGS
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-10-17 10:42:03 -06:00
Derek Collison
2d737edba6 Allow discard new per subject for certain KV type scenarios. Requires general DiscardNewPolicy.
Signed-off-by: Derek Collison <derek@nats.io>
2022-09-22 06:38:29 -07:00
Marco Primi
e0821cfa3d Compose cluster URL without trailing comma
Related to: https://github.com/nats-io/nats.go/issues/1056
2022-08-31 14:52:45 -07:00
Marco Primi
f1883561ee Use testing.TB interface instead of *T
Using the interface allows reusing helper functions in benchmarks
2022-08-31 14:52:45 -07:00
Marco Primi
28d38c0213 Move bitset test utility
From Chaos helpers (excluded from build by default) to JetStream helpers
2022-08-31 14:52:45 -07:00
Derek Collison
98bf861a7a Updates to stream and consumer move logic.
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-30 16:11:35 -07:00
Marco Primi
02a34117e4 Add chaos tests for Ordered, Async, Pull, Durable consumers
Tests consist of a single client trying to consume a fixed number of messages in a stream
while the cluster is being bounced by a chaos monkey.
2022-08-16 09:52:48 -07:00
Ivan Kozlovic
a4bf4e87f6 Merge pull request #3326 from mfaizanse/health_endpoint_params
Added param options to /healthz endpoint
2022-08-09 08:49:22 -06:00
Muhammad Faizan
1634f33de7 Added param options to /healthz endpoint 2022-08-09 08:32:54 +02:00
Derek Collison
758b733d43 Attempt to improve long RTT catchup time during stream moves.
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-08 11:06:10 -06:00
Marco Primi
be460b7bf1 Exclude chaos tests from build by default
Before: build chaos tests unless `skip_js_chaos_tests` is set
After: exclude chaos tests unless `js_chaos_tests` is set
2022-08-05 15:20:09 -07:00
Ivan Kozlovic
3c9a7cc6e5 Move to Go 1.19, remove io/ioutil, fix data race and a flapper
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-05 09:55:37 -06:00
Ivan Kozlovic
d84d9f8288 Use specific boolean for a leaf test instead of using leafNodeEnabled
A test TestJetStreamClusterLeafNodeSPOFMigrateLeaders was added at
some point that needed the remotes to stop (re)connecting. It made
use of existing leafNodeEnabled that was used for GW/Leaf interest
propagation races to disable the reconnect, but that may not be
the best approach since it could affect users embedding servers
and adding leafnodes "dynamically".

So this PR introduces a boolean specific to that test.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-04 10:00:11 -06:00
Marco Primi
d83e0e2b25 Add 'chaos' test utility and 2 example tests
'Chaos' is a new group of tests that validate behavior in the presence of
random failures.

Overview:
 - Introduce a 'Chaos Monkey' controller which can unleash a monkey
against a test cluster.
 - Introduce a monkey of type 'ClusterBouncer' which stops and restarts
nodes according to some configuration
 - Add 2 example tests, they ensure a cluster can survive some amount of
nodes bouncing
 - Configure the build to skip chaos tests unless explicitly requested
 - Add some test utility functions
2022-08-03 14:01:56 -07:00
Derek Collison
8dc1e4b6de When compact would reclaim head of block space, we needed to update block key for counter for new writes.
Signed-off-by: Derek Collison <derek@nats.io>
2022-07-30 13:05:41 -07:00
Derek Collison
c14fda51e7 Direct access to JetStream resources would be affected if across a leafnode that was down.
This allows a soliciting leafnode config to ask that any JetStream cluster assets that are a current leader have the leader stepdown.

Signed-off-by: Derek Collison <derek@nats.io>
2022-07-05 12:35:09 -07:00
Derek Collison
9b7c81c37e Some tests improvements on non-standard JS cluster setup
Signed-off-by: Derek Collison <derek@nats.io>
2022-07-03 12:45:27 -07:00
Derek Collison
47bef915ed Allow all members of a replicated stream to participate in direct access.
We will wait until a non-leader replica is current to subscribe.

Signed-off-by: Derek Collison <derek@nats.io>
2022-07-03 11:08:24 -07:00
Derek Collison
e6479dafd2 Close leafnode connection when same cluster name detected
Signed-off-by: Derek Collison <derek@nats.io>
2022-06-30 15:34:22 -07:00
Derek Collison
4c8110c3ff Add in support for inactivity thresholds for durable consumers.
Signed-off-by: Derek Collison <derek@nats.io>
2022-06-14 06:51:00 -07:00
Derek Collison
e08f6d863d Allow for republish to be headers only
Signed-off-by: Derek Collison <derek@nats.io>
2022-05-30 12:05:17 -07:00
Ivan Kozlovic
53e3c53d96 [FIXED] JetStream: consumer with deliver new may miss messages
This could happen when a consumer had not sent anything to the
attached NATS subscription and there was a consumer leader
step down or server restart.

Signed-off-by: Derek Collison <derek@nats.io>
2022-05-23 12:01:48 -06:00
Ivan Kozlovic
4bf81420e2 [FIXED] Fast routed JetStream API requests were dropped
If a JS API request is received from a non client connection, it
was processed in its own go routine. To reduce the number of
such go routines, we were limiting the number of outstanding routines
to 4096. However, in some situations, it was possible to issue
many requests at the same time that would then cause those requests
to be dropped.

(an example was an MQTT benchmark tool that would create 5000
sessions, each with one QoS1 R1 consumer (with the use of consumer_replicas=1).
On abrupt exit of the tool, the consumers and their sessions needed
to be deleted. This would cause fast incoming delete consumer requests
which would cause the original code to drop some of them)
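
A minimal sketch of the dropping behavior described above, assuming a simple counter-based cap. The `limiter` type and `tryGo` method are hypothetical and only illustrate why a burst larger than the cap loses requests; this is not the server's code, and it does not show the fix.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// limiter is a hypothetical type that caps in-flight handler go routines
// and drops any work submitted while the cap is reached.
type limiter struct {
	inflight int64 // accessed atomically
	max      int64
	wg       sync.WaitGroup
}

// tryGo runs fn in its own go routine unless the cap is already reached,
// in which case it returns false and the request is dropped.
func (l *limiter) tryGo(fn func()) bool {
	if atomic.AddInt64(&l.inflight, 1) > l.max {
		atomic.AddInt64(&l.inflight, -1)
		return false
	}
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()
		defer atomic.AddInt64(&l.inflight, -1)
		fn()
	}()
	return true
}

func main() {
	l := &limiter{max: 4096}
	release := make(chan struct{})
	accepted, dropped := 0, 0
	// Simulate 5000 requests arriving before any handler finishes.
	for i := 0; i < 5000; i++ {
		if l.tryGo(func() { <-release }) {
			accepted++
		} else {
			dropped++
		}
	}
	close(release)
	l.wg.Wait()
	fmt.Printf("accepted=%d dropped=%d\n", accepted, dropped) // accepted=4096 dropped=904
}
```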

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-05-23 11:15:55 -06:00
Derek Collison
ef3eea4d73 Speed up raft for tests
Signed-off-by: Derek Collison <derek@nats.io>
2022-05-18 16:28:58 -07:00
Derek Collison
ccd2290355 With use cases bringing us more data I wanted to suggest these changes.
By inlining election timeout updates we doubled the lock contention and most likely introduced head of line issues for routes under heavy load.
This also slows down heartbeats, given how many assets are being deployed in our user ecosystem, and moves the normal follower to candidate timing further out, similar to the lost quorum.
Note that the happy path transfer will still be very quick.

Signed-off-by: Derek Collison <derek@nats.io>
2022-05-15 09:55:22 -07:00
Ivan Kozlovic
0e2ab5eeea Changes to tests that run on Travis
- Remove code coverage from Travis and add it to a GitHub Action
that will be run as a nightly.
- Use tag builds to exclude some tests, such as the "norace" or
JS tests. Since "go test" does not support "negative" regexes, there
is no other way.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-04-26 14:11:31 -06:00