Commit Graph

5068 Commits

Author SHA1 Message Date
Derek Collison
ed3f8be0c5 Bump version 2.10.0-beta.36
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-06 18:49:13 -07:00
Derek Collison
18244ea8cb Fix test that did not set ack policy to explicit
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-06 15:10:46 -07:00
Derek Collison
caa262513d Fix test that did not set ack policy which is needed
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-06 14:15:44 -07:00
Derek Collison
dbff40f2b6 Adopt same update from main
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-06 09:56:01 -07:00
Derek Collison
4175e4ee9c Merge branch 'main' into dev 2023-05-06 09:55:34 -07:00
Derek Collison
76f4358349 [IMPROVED] Optimizations for large single hub account leafnode fleets. (#4135)
Added a leafnode lock to allow better traversal without copying of large
leafnodes in a single hub account.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-06 09:53:08 -07:00
Derek Collison
80db7a22ab Optimizations for large single hub account leafnode fleets.
Added a leafnode lock to allow better traversal without copying of large leafnodes in a single hub account.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-05 13:14:49 -07:00
Waldemar Quevedo
b886fed2fb Stop using UTC for time for flushClients
In #1943, some timestamps were switched to use `UTC()`,
but an unintended side effect is that this strips
the monotonic clock reading, so subtracting times in other
areas of the code can be prone to clock skew.
e5646b23de
2023-05-04 15:50:45 -07:00
Tomasz Pietrek
69fb3db0f5 Optimize consumer messages sequences for multiple subjects (#4129)
If a consumer with multiple subjects encountered a sequence of messages in
a row from the same subject, it tried to load messages from other
subjects in some cases.
This checks for that scenario and optimizes it by early returning.

I added temporary instrumentation to count how many times fetching
new messages is called, and it appears to cut those calls as
expected. Because this is internal, it is really hard to show that
in a test.

Signed-off-by: Tomasz Pietrek <tomasz@nats.io>
2023-05-04 20:13:13 +02:00
Tomasz Pietrek
7c1c4ea5fb Optimize consumer messages sequences for multiple subjects
If a consumer with multiple subjects encountered a sequence
of messages from the same subject, it tried to load messages
from other subjects in some cases.
This checks for that scenario and optimizes it by early returning.

Signed-off-by: Tomasz Pietrek <tomasz@nats.io>
2023-05-04 16:02:19 +02:00
Derek Collison
9fa724cd7b Merge branch 'main' into dev 2023-05-03 21:00:35 -07:00
Derek Collison
da8aeac91b Fix flapper
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-03 21:00:17 -07:00
Derek Collison
68f6b59fc7 Merge branch 'main' into dev 2023-05-03 19:51:24 -07:00
Derek Collison
ae73e6a573 Bump to 2.9.17-beta.5
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-03 19:50:21 -07:00
Derek Collison
21239022bd Protect against usage drift for any unforeseen reason and, if detected, correct it.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-03 17:09:06 -07:00
Ivan Kozlovic
311e3feb5f Merge branch 'main' into dev 2023-05-03 17:38:40 -06:00
Ivan Kozlovic
8a4ead22bc Updates based on code review
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-05-03 16:14:51 -06:00
Ivan Kozlovic
7afe76caf8 Fixed Sublist.RemoveBatch to remove subs present, even if one isn't
I have seen cases, possibly due to a previous configuration reload
issue that would miss subscriptions in the sublist because
of the sublist swap, where we would attempt to remove subscriptions
by batch but some were not present. I would have expected that
all present subscriptions would still be removed, even if the
call overall returned an error.
This is now fixed and a test has been added demonstrating that
even on error, we remove all subscriptions that were present.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-05-03 15:21:26 -06:00
Ivan Kozlovic
95e4f2dfe1 Fixed accounts configuration reload
Issues could manifest with subscription interest not properly
propagated.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-05-03 14:35:06 -06:00
Ivan Kozlovic
840c264f45 Cleanup use of s.opts and fixed some lock (deadlock/inversion) issues
One should not access s.opts directly but instead use s.getOpts().
Also, server lock needs to be released when performing an account
lookup (since this may result in server lock being acquired).
A function was calling s.LookupAccount under the client lock, which
technically creates a lock inversion situation.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-05-03 14:09:02 -06:00
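The rule in 840c264f45 about releasing the server lock before an account lookup can be illustrated with a small sketch. The `server` type, its fields, and both methods below are hypothetical simplifications, not the real server code.

```go
package main

import (
	"fmt"
	"sync"
)

// server is a hypothetical, heavily simplified stand-in for this sketch.
type server struct {
	mu       sync.Mutex
	accounts map[string]string
	sysName  string
}

// lookupAccount acquires the server lock itself, so callers must not hold it.
func (s *server) lookupAccount(name string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	acc, ok := s.accounts[name]
	return acc, ok
}

// systemAccount copies what it needs under the lock, then releases the lock
// before the lookup. Calling lookupAccount with s.mu still held would
// deadlock, because sync.Mutex is not reentrant.
func (s *server) systemAccount() (string, bool) {
	s.mu.Lock()
	name := s.sysName
	s.mu.Unlock()
	return s.lookupAccount(name)
}

func main() {
	s := &server{
		accounts: map[string]string{"SYS": "system account"},
		sysName:  "SYS",
	}
	acc, ok := s.systemAccount()
	fmt.Println(acc, ok)
}
```

The same reasoning applies to the client lock case mentioned in the commit: if one path takes the client lock and then the server lock while another takes them in the opposite order, the two can block each other.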
Derek Collison
b61e411b44 Fix race in reload and gateway sublist check (#4127)
Fixes the following race: during reload, the account sublist can be changed:
2699465596/server/reload.go (L1598-L1610)
so this can become a race while checking interest in the gateway code
here:
79de3302be/server/gateway.go (L2683)

```
=== RUN   TestJetStreamSuperClusterPeerReassign
==================
WARNING: DATA RACE
Write at 0x00c0010854f0 by goroutine 15595:
  github.com/nats-io/nats-server/v2/server.(*Server).reloadAuthorization.func2()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:1610 +0x486
  sync.(*Map).Range()
      /home/travis/.gimme/versions/go1.19.8.linux.amd64/src/sync/map.go:354 +0x225
  github.com/nats-io/nats-server/v2/server.(*Server).reloadAuthorization()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:1594 +0x35d
  github.com/nats-io/nats-server/v2/server.(*Server).applyOptions()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:1454 +0xf4
  github.com/nats-io/nats-server/v2/server.(*Server).reloadOptions()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:908 +0x204
  github.com/nats-io/nats-server/v2/server.(*Server).ReloadOptions()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:847 +0x4a4
  github.com/nats-io/nats-server/v2/server.(*Server).Reload()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:782 +0x125
  github.com/nats-io/nats-server/v2/server.(*cluster).removeJetStream()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_helpers_test.go:1498 +0x310
  github.com/nats-io/nats-server/v2/server.TestJetStreamSuperClusterPeerReassign()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_super_cluster_test.go:395 +0xa38
  testing.tRunner()
      /home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      /home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1493 +0x47
Previous read at 0x00c0010854f0 by goroutine 15875:
  github.com/nats-io/nats-server/v2/server.(*Server).gatewayHandleSubjectNoInterest()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/gateway.go:2683 +0x12d
  github.com/nats-io/nats-server/v2/server.(*client).processInboundGatewayMsg()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/gateway.go:2980 +0x595
  github.com/nats-io/nats-server/v2/server.(*client).processInboundMsg()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3532 +0xc7
  github.com/nats-io/nats-server/v2/server.(*client).parse()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/parser.go:497 +0x34f9
  github.com/nats-io/nats-server/v2/server.(*client).readLoop()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:1284 +0x17e8
  github.com/nats-io/nats-server/v2/server.(*Server).createGateway.func1()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/gateway.go:858 +0x37
Goroutine 15595 (running) created at:
  testing.(*T).Run()
      /home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1493 +0x75d
  testing.runTests.func1()
      /home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1846 +0x99
  testing.tRunner()
      /home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1446 +0x216
  testing.runTests()
      /home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1844 +0x7ec
  testing.(*M).Run()
      /home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1726 +0xa84
  github.com/nats-io/nats-server/v2/server.TestMain()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/sublist_test.go:1577 +0x292
  main.main()
      _testmain.go:3615 +0x324
Goroutine 15875 (running) created at:
  github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3098 +0x88
  github.com/nats-io/nats-server/v2/server.(*Server).createGateway()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/gateway.go:858 +0xfc4
  github.com/nats-io/nats-server/v2/server.(*Server).startGatewayAcceptLoop.func1()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/gateway.go:553 +0x48
  github.com/nats-io/nats-server/v2/server.(*Server).acceptConnections.func1()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:2184 +0x58
==================
    testing.go:1319: race detected during execution of test
--- FAIL: TestJetStreamSuperClusterPeerReassign (2.08s)
```
2023-05-02 18:12:56 -07:00
Waldemar Quevedo
938ffcba20 Fix race in reload and gateway sublist check
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-05-02 17:51:53 -07:00
Derek Collison
ae73f7be55 Small raft improvements.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-02 16:44:27 -07:00
Derek Collison
e7b01c4154 Merge branch 'main' into dev 2023-05-02 16:30:00 -07:00
Derek Collison
9ef71893db Bump to 2.9.17-beta.4
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-02 09:43:11 -07:00
Derek Collison
4a58feff27 When removing a msg and we need to load the msg block and incur IO, unlock fs lock to avoid stalling other activity on other blocks.
E.g., removing and adding msgs at the same time.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-02 08:56:43 -07:00
Derek Collison
eb1eb3c49e Merge branch 'main' into dev 2023-05-01 16:29:35 -07:00
Derek Collison
f098c253aa Make sure we adjust accounting reservations when deleting a stream with any issues.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-01 15:54:37 -07:00
Ivan Kozlovic
0a02f2121c [ADDED] LeafNode: TLSHandshakeFirst option
A new field in `tls{}` blocks forces the server to perform the TLS handshake
before sending the INFO protocol.
```
leafnodes {
   port: 7422
   tls {
      cert_file: ...
      ...
      handshake_first: true
   }
   remotes [
       {
         url: tls://host:7423
         tls {
            ...
            handshake_first: true
         }
       }
   ]
}
```
Note that if `handshake_first` is set on the "accept" side (the
first `tls{}` block in the example above), a server trying to
create a LeafNode connection to this server needs to have
`handshake_first` set to true inside the `tls{}` block of
the corresponding remote.

Configuration reload of leafnodes is generally not supported,
but TLS certificates can be reloaded and the support for this
new field was also added.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-05-01 16:41:51 -06:00
Derek Collison
f5ac5a4da0 Fix for a bug that could leave a raft node running when stopping a stream.
This can happen when we reset a stream internally and the stream had a prior snapshot.

Also make sure to always release resources back to the account regardless of whether the store is still present.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-01 13:22:06 -07:00
Derek Collison
1eed0e8c75 Bump to 2.9.17-beta.3
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-30 17:43:59 -07:00
Derek Collison
e158c46884 Merge branch 'main' into dev 2023-04-30 17:37:47 -07:00
Derek Collison
c15cc0054a When a fleet of leafnodes is isolated (not routed but using same cluster) we could do better at optimizing how we update the other leafnodes.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-30 17:08:16 -07:00
Derek Collison
0321eb6484 Merge branch 'main' into dev 2023-04-29 19:52:57 -07:00
Derek Collison
b27ce6de80 Add in a few more places to check on jetstream shutting down.
Add in a helper method.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 11:27:18 -07:00
Derek Collison
db972048ce Detect when we are shutting down or if a consumer is already closed when removing a stream.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 11:18:10 -07:00
Derek Collison
4eb4e5496b Do health check on startup once we have processed existing state.
Also do health checks in a separate goroutine.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 09:36:35 -07:00
Derek Collison
fac5658966 If we fail to create a consumer, make sure to clean up any raft nodes in meta layer and to shutdown the consumer if created but we encountered an error.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 08:15:33 -07:00
Derek Collison
546dd0c9ab Make sure we can recover an underlying node being stopped.
Do not return healthy if the node is closed, and wait a bit longer for forward progress.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 07:42:23 -07:00
Derek Collison
85f6bfb2ac Check healthz periodically
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:58:45 -07:00
Derek Collison
ac27fd046a Fix data race
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:57:03 -07:00
Derek Collison
d107ba3549 Under certain scenarios we have witnessed healthz() never returning healthy due to a stream or consumer being missing or stopped.
This will now allow the healthy call to attempt to restart those assets.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:11:08 -07:00
Ivan Kozlovic
349f01e86a Change the absence of compression setting to default to "accept"
In that mode, a server accepts compression and will switch to the same
compression level as the remote (if one is set), but will not initiate
compression. So if no server in a cluster has a compression setting,
each defaults to "accept", which means that compression is "off".

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-04-28 15:33:17 -06:00
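The "accept" semantics from 349f01e86a can be written down as a tiny function. This is a sketch under stated assumptions: `resolveAccept` is a hypothetical name, modes are modeled as plain strings, and treating a remote set to "off" as resulting in no compression is an inference from "will not initiate compression" rather than something the commit states.

```go
package main

import "fmt"

// resolveAccept is a hypothetical helper. It returns the compression mode
// used on a route when the local server is in "accept" mode, given the
// mode the remote server is configured with.
func resolveAccept(remote string) string {
	switch remote {
	case "accept", "off":
		// Neither side initiates compression, so the route stays uncompressed.
		// (The "off" case is an assumption of this sketch.)
		return "off"
	default:
		// Follow whatever the remote asked for, e.g. "s2_fast".
		return remote
	}
}

func main() {
	for _, remote := range []string{"accept", "s2_fast"} {
		fmt.Printf("local=accept remote=%s -> %s\n", remote, resolveAccept(remote))
	}
}
```

With every server left unconfigured, both ends resolve to "off", which matches the commit's statement that the new default effectively disables compression.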
Ivan Kozlovic
5b8c9ee364 Changes based on code review
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-04-28 14:34:32 -06:00
Ivan Kozlovic
70af04a63f Other flappers.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-04-28 11:22:04 -06:00
Ivan Kozlovic
73ed55ae5b Fixed flapper
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-04-28 10:55:32 -06:00
Ivan Kozlovic
8d2683a062 Fixed data race
Reverts changes made in PR#4001: 105237cba8 (diff-1322a81c43dfdd05284ae128c43d9ea51c1a3b677587686561ef6de47024e14aR1340)

Since a fix was made here: b78ec39b1f
the changes made in that PR need to be reverted. The test
TestRoutePoolAndPerAccountWithServiceLatencyNoDataRace now passes.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-04-28 10:18:14 -06:00
Ivan Kozlovic
d6fe9d4c2d [ADDED] Support for route S2 compression
The new field `compression` in the `cluster{}` block allows
specifying which compression mode to use between servers.

It can be simply specified as a boolean or a string for the
simple modes, or as an object for the "s2_auto" mode where
a list of RTT thresholds can be specified.

By default, if no compression field is specified, the server
will use the s2_auto mode with default RTT thresholds of
10ms, 50ms and 100ms for the "uncompressed", "fast", "better"
and "best" modes.

```
cluster {
..
  # Possible values are "disabled", "off", "enabled", "on",
  # "accept", "s2_fast", "s2_better", "s2_best" or "s2_auto"
  compression: s2_fast
}
```

To specify a different list of thresholds for the s2_auto mode,
it would look like this:
```
cluster {
..
  compression: {
    mode: s2_auto
    # This means that for RTT up to 5ms (included), then
    # the compression level will be "uncompressed", then
    # from 5ms+ to 15ms, the mode will switch to "s2_fast",
    # then from 15ms+ to 50ms, the level will switch to
    # "s2_better", and anything above 50ms will result
    # in the "s2_best" compression mode.
    rtt_thresholds: [5ms, 15ms, 50ms]
  }
}
```

Note that the "accept" mode means that a server will accept
compression from a remote and switch to that same compression
mode, but will otherwise not initiate compression. That is,
if 2 servers are configured with "accept", then compression
will actually be "off". If one of the servers had, say, s2_fast,
then they would both use this mode.

If a server has compression mode set (other than "off") but
connects to an older server, there will be no compression between
those 2 routes.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-04-27 17:59:25 -06:00
Marco Primi
82eade93b4 Merge JS Chaos tests into a single file 2023-04-27 14:56:55 -07:00
Marco Primi
7908d8c05c Merge JS benchmarks into a single file 2023-04-27 14:56:55 -07:00