Ivan Kozlovic
95e4f2dfe1
Fixed accounts configuration reload
...
Issues could manifest as subscription interest not being properly
propagated.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-05-03 14:35:06 -06:00
Ivan Kozlovic
840c264f45
Cleanup use of s.opts and fixed some lock (deadlock/inversion) issues
...
One should not access s.opts directly but instead use s.getOpts().
Also, the server lock must be released when performing an account
lookup (since the lookup may itself acquire the server lock).
A function was calling s.LookupAccount under the client lock, which
technically creates a lock inversion.
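The safe ordering can be sketched with a minimal, hypothetical example (the types, fields, and method names below are simplified stand-ins, not the server's actual ones): release the client lock before any call that may take the server lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical, simplified stand-ins for the real Server and client types.
type Server struct {
	mu       sync.Mutex
	accounts map[string]string
}

// LookupAccount acquires the server lock, which is why callers must not
// hold the client lock when invoking it.
func (s *Server) LookupAccount(name string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.accounts[name]
}

type client struct {
	mu  sync.Mutex
	srv *Server
	acc string
}

// Safe ordering: grab what we need under the client lock, release it,
// perform the lookup (which takes the server lock), then re-acquire the
// client lock to store the result.
func (c *client) registerWithAccount(name string) {
	c.mu.Lock()
	srv := c.srv
	c.mu.Unlock()

	acc := srv.LookupAccount(name) // no client lock held here

	c.mu.Lock()
	c.acc = acc
	c.mu.Unlock()
}

func main() {
	s := &Server{accounts: map[string]string{"A": "account-A"}}
	c := &client{srv: s}
	c.registerWithAccount("A")
	fmt.Println(c.acc) // account-A
}
```

Holding the client lock across the lookup would order the locks client-then-server, while other code paths order them server-then-client, which is the inversion the commit describes.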
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-05-03 14:09:02 -06:00
Derek Collison
b61e411b44
Fix race in reload and gateway sublist check (#4127)
...
Fixes the following race: during reload the account sublist can be changed:
2699465596/server/reload.go (L1598-L1610)
so this can become a race while checking interest in the gateway code
here:
79de3302be/server/gateway.go (L2683)
```
=== RUN TestJetStreamSuperClusterPeerReassign
==================
WARNING: DATA RACE
Write at 0x00c0010854f0 by goroutine 15595:
github.com/nats-io/nats-server/v2/server.(*Server).reloadAuthorization.func2()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:1610 +0x486
sync.(*Map).Range()
/home/travis/.gimme/versions/go1.19.8.linux.amd64/src/sync/map.go:354 +0x225
github.com/nats-io/nats-server/v2/server.(*Server).reloadAuthorization()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:1594 +0x35d
github.com/nats-io/nats-server/v2/server.(*Server).applyOptions()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:1454 +0xf4
github.com/nats-io/nats-server/v2/server.(*Server).reloadOptions()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:908 +0x204
github.com/nats-io/nats-server/v2/server.(*Server).ReloadOptions()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:847 +0x4a4
github.com/nats-io/nats-server/v2/server.(*Server).Reload()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/reload.go:782 +0x125
github.com/nats-io/nats-server/v2/server.(*cluster).removeJetStream()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_helpers_test.go:1498 +0x310
github.com/nats-io/nats-server/v2/server.TestJetStreamSuperClusterPeerReassign()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_super_cluster_test.go:395 +0xa38
testing.tRunner()
/home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1446 +0x216
testing.(*T).Run.func1()
/home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1493 +0x47
Previous read at 0x00c0010854f0 by goroutine 15875:
github.com/nats-io/nats-server/v2/server.(*Server).gatewayHandleSubjectNoInterest()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/gateway.go:2683 +0x12d
github.com/nats-io/nats-server/v2/server.(*client).processInboundGatewayMsg()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/gateway.go:2980 +0x595
github.com/nats-io/nats-server/v2/server.(*client).processInboundMsg()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3532 +0xc7
github.com/nats-io/nats-server/v2/server.(*client).parse()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/parser.go:497 +0x34f9
github.com/nats-io/nats-server/v2/server.(*client).readLoop()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:1284 +0x17e8
github.com/nats-io/nats-server/v2/server.(*Server).createGateway.func1()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/gateway.go:858 +0x37
Goroutine 15595 (running) created at:
testing.(*T).Run()
/home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1493 +0x75d
testing.runTests.func1()
/home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1846 +0x99
testing.tRunner()
/home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1446 +0x216
testing.runTests()
/home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1844 +0x7ec
testing.(*M).Run()
/home/travis/.gimme/versions/go1.19.8.linux.amd64/src/testing/testing.go:1726 +0xa84
github.com/nats-io/nats-server/v2/server.TestMain()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/sublist_test.go:1577 +0x292
main.main()
_testmain.go:3615 +0x324
Goroutine 15875 (running) created at:
github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3098 +0x88
github.com/nats-io/nats-server/v2/server.(*Server).createGateway()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/gateway.go:858 +0xfc4
github.com/nats-io/nats-server/v2/server.(*Server).startGatewayAcceptLoop.func1()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/gateway.go:553 +0x48
github.com/nats-io/nats-server/v2/server.(*Server).acceptConnections.func1()
/home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:2184 +0x58
==================
testing.go:1319: race detected during execution of test
--- FAIL: TestJetStreamSuperClusterPeerReassign (2.08s)
```
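One way to make such a swap race-free is to publish the new sublist through an atomic pointer, so the reload-side write and the gateway-side read never touch the field unsynchronized. A minimal sketch with hypothetical, simplified types (not the server's actual `Account` or `Sublist`):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical, simplified sublist: just a set of subjects.
type Sublist struct{ subs []string }

type Account struct {
	sl atomic.Pointer[Sublist] // swapped on reload, loaded on reads
}

// reload publishes a freshly built sublist; concurrent readers either
// see the old pointer or the new one, never a torn write.
func (a *Account) reload(newSL *Sublist) { a.sl.Store(newSL) }

// hasInterest is the read path, analogous to the gateway interest check.
func (a *Account) hasInterest(subject string) bool {
	sl := a.sl.Load()
	if sl == nil {
		return false
	}
	for _, s := range sl.subs {
		if s == subject {
			return true
		}
	}
	return false
}

func main() {
	a := &Account{}
	fmt.Println(a.hasInterest("foo")) // false
	a.reload(&Sublist{subs: []string{"foo"}})
	fmt.Println(a.hasInterest("foo")) // true
}
```

Guarding both sides with the account's mutex would also fix the race; the atomic pointer just keeps the hot read path lock-free.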
2023-05-02 18:12:56 -07:00
Waldemar Quevedo
938ffcba20
Fix race in reload and gateway sublist check
...
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-05-02 17:51:53 -07:00
Derek Collison
8cb32930d9
Small raft improvements. (#4126)
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-02 17:29:34 -07:00
Derek Collison
ae73f7be55
Small raft improvements.
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-02 16:44:27 -07:00
Derek Collison
9ef71893db
Bump to 2.9.17-beta.4
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-02 09:43:11 -07:00
Derek Collison
188eea42cc
[IMPROVED] Do not hold filestore lock during remove that needs to do IO. (#4123)
...
When removing a msg and we need to load the msg block and incur IO,
unlock the fs lock to avoid stalling other activity on other blocks, e.g.
removing and adding msgs at the same time.
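The pattern can be sketched as follows with hypothetical, simplified types (`loadBlock` stands in for the disk IO); note the re-check after re-acquiring the lock, since state may have changed while it was released:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical, simplified filestore; loadBlock stands in for the disk IO
// we do not want to perform while holding the store-wide lock.
type fileStore struct {
	mu    sync.Mutex
	cache map[int][]byte
}

func loadBlock(n int) []byte {
	// Simulated disk IO.
	return []byte(fmt.Sprintf("block-%d", n))
}

func (fs *fileStore) removeMsg(blk int) []byte {
	fs.mu.Lock()
	b, ok := fs.cache[blk]
	if !ok {
		// Release the lock across the IO so activity on other blocks
		// is not stalled, then re-acquire and re-check, since another
		// goroutine may have populated the cache in the meantime.
		fs.mu.Unlock()
		loaded := loadBlock(blk)
		fs.mu.Lock()
		if b, ok = fs.cache[blk]; !ok {
			fs.cache[blk] = loaded
			b = loaded
		}
	}
	fs.mu.Unlock()
	return b
}

func main() {
	fs := &fileStore{cache: map[int][]byte{}}
	fmt.Println(string(fs.removeMsg(7))) // block-7
}
```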
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-02 09:42:38 -07:00
Derek Collison
4a58feff27
When removing a msg and we need to load the msg block and incur IO, unlock fs lock to avoid stalling other activity on other blocks.
...
E.g. removing and adding msgs at the same time.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-02 08:56:43 -07:00
Derek Collison
ff6c80350b
[FIXED] A stream raft node could stay running after a stop(). (#4118)
...
This can happen when we reset a stream internally and the stream had a
prior snapshot.
Also make sure to always release resources back to the account,
regardless of whether the store is still present.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-01 16:23:03 -07:00
Derek Collison
f098c253aa
Make sure we adjust accounting reservations when deleting a stream with any issues.
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-01 15:54:37 -07:00
Derek Collison
f5ac5a4da0
Fix for a bug that could leave a raft node running when stopping a stream.
...
This can happen when we reset a stream internally and the stream had a prior snapshot.
Also make sure to always release resources back to the account, regardless of whether the store is still present.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-01 13:22:06 -07:00
Derek Collison
1eed0e8c75
Bump to 2.9.17-beta.3
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-30 17:43:59 -07:00
Derek Collison
7ad2dd2510
[IMPROVED] Updating of a large fleet of leafnodes. (#4117)
...
When a fleet of leafnodes is isolated (not routed but using the same
cluster) we could do better at optimizing how we update the other
leafnodes: if they are all in the same cluster and we know we are
isolated, we can skip those updates.
We can improve further in 2.10.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-30 17:32:14 -07:00
Derek Collison
c15cc0054a
When a fleet of leafnodes is isolated (not routed but using the same cluster) we could do better at optimizing how we update the other leafnodes.
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-30 17:08:16 -07:00
Derek Collison
91607d8459
[IMPROVED] Health repair (#4116)
...
Under certain scenarios we have witnessed healthz() never returning
healthy due to a stream or consumer being missing or stopped.
This now allows the healthz() call to attempt to restart those
assets.
We will also periodically call this in clustered mode from the
monitorCluster routine.
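The repair idea can be sketched with a hypothetical, simplified health check (`asset`, `restart`, and `healthz` below are illustrative names, not the server's API): instead of only reporting an unhealthy asset, attempt a restart first and fail only if the asset is still not running.

```go
package main

import "fmt"

// Hypothetical asset: a stream or consumer that may have stopped.
type asset struct {
	name    string
	running bool
}

// healthz attempts to restart any asset that is missing or stopped
// rather than reporting unhealthy forever.
func healthz(assets []*asset, restart func(*asset)) bool {
	healthy := true
	for _, a := range assets {
		if !a.running {
			restart(a) // try to repair instead of just failing
			if !a.running {
				healthy = false
			}
		}
	}
	return healthy
}

func main() {
	assets := []*asset{
		{name: "stream-1", running: true},
		{name: "consumer-1", running: false},
	}
	ok := healthz(assets, func(a *asset) { a.running = true })
	fmt.Println(ok) // true: the stopped consumer was restarted
}
```

In the server, per the commit above, this check is also invoked periodically from the cluster-monitoring routine so repairs happen without an external healthz probe.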
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 18:02:12 -07:00
Derek Collison
b27ce6de80
Add in a few more places to check on jetstream shutting down.
...
Add in a helper method.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 11:27:18 -07:00
Derek Collison
db972048ce
Detect when we are shutting down or if a consumer is already closed when removing a stream.
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 11:18:10 -07:00
Derek Collison
4eb4e5496b
Do health check on startup once we have processed existing state.
...
Also do health checks in separate go routine.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 09:36:35 -07:00
Derek Collison
fac5658966
If we fail to create a consumer, make sure to clean up any raft nodes in the meta layer and to shut down the consumer if it was created but we encountered an error.
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 08:15:33 -07:00
Derek Collison
546dd0c9ab
Make sure we can recover an underlying node being stopped.
...
Do not return healthy if the node is closed, and wait a bit longer for forward progress.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 07:42:23 -07:00
Derek Collison
85f6bfb2ac
Check healthz periodically
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:58:45 -07:00
Derek Collison
ac27fd046a
Fix data race
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:57:03 -07:00
Derek Collison
d107ba3549
Under certain scenarios we have witnessed healthz() that never returns healthy due to a stream or consumer being missing or stopped.
...
This now allows the healthz() call to attempt to restart those assets.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:11:08 -07:00
Derek Collison
c75127b966
Benchmarks for stream limits, combine tests and benchmarks into fewer files (#4098)
...
- [ ] Link to issue, e.g. `Resolves #NNN`
- [ ] Documentation added (if applicable)
- [x] Tests added
- [x] Branch rebased on top of current main (`git pull --rebase origin
main`)
- [x] Changes squashed to a single commit (described
[here](http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html))
- [x] Build is green in Travis CI
- [x] You have certified that the contribution is your original work and
that you license the work to the project under the [Apache 2
license](https://github.com/nats-io/nats-server/blob/main/LICENSE)
### Changes proposed in this pull request:
- Add JetStream benchmark that measures publish throughput with
different limits (MaxBytes, MaxMessages, MaxPerSubject, MaxAge, ...)
- Merge `jetstream_chaos_*_test.go` and
`jetstream_benchmark_*_test.go` into two files (down from 8)
Example output (filtered subset):
```
$ go test -v -bench 'BenchmarkJetStreamInterestStreamWithLimit/.*R=1.*/Storage=Memory/.*' -run NONE -benchtime 200000x -count 1
goos: darwin
goarch: arm64
pkg: github.com/nats-io/nats-server/v2/server
BenchmarkJetStreamInterestStreamWithLimit
jetstream_benchmark_interest_stream_limit_test.go:44: BatchSize: 100, MsgSize: 256, Subjects: 2500, Publishers: 4, Random Message: true
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/unlimited
jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [unlimited], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/unlimited-8 200000 9743 ns/op 26.28 MB/s 0 %error
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxMsg=1000
jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [MaxMsg=1000], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxMsg=1000-8 200000 9800 ns/op 26.12 MB/s 0 %error
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxMsg=10
jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [MaxMsg=10], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxMsg=10-8 200000 9717 ns/op 26.35 MB/s 0 %error
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxPerSubject=10
jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [MaxPerSubject=10], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxPerSubject=10-8 200000 78796 ns/op 3.25 MB/s 0 %error
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxAge=1s
jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [MaxAge=1s], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxAge=1s-8 200000 9648 ns/op 26.53 MB/s 0 %error
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxBytes=1MB
jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [MaxBytes=1MB], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxBytes=1MB-8 200000 9706 ns/op 26.38 MB/s 0 %error
PASS
ok github.com/nats-io/nats-server/v2/server 26.359s
```
cc: @jnmoyne
2023-04-27 16:53:31 -07:00
Marco Primi
82eade93b4
Merge JS Chaos tests into a single file
2023-04-27 14:56:55 -07:00
Marco Primi
7908d8c05c
Merge JS benchmarks into a single file
2023-04-27 14:56:55 -07:00
Marco Primi
df552351ec
Benchmark for interest-based stream with limits
...
Measure publish throughput with different limits (MaxBytes, MaxMessages,
MaxPerSubject, MaxAge, ...)
2023-04-27 14:56:55 -07:00
Derek Collison
f972165b0e
Bump to 2.9.17-beta.2
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-27 14:30:19 -07:00
Derek Collison
c3b07df86f
The server's Start() used to block but no longer does. (#4111)
...
This updates tests and the function comment.
Signed-off-by: Derek Collison <derek@nats.io>
Resolves #4110
2023-04-27 09:50:03 -07:00
Derek Collison
59e2107435
Fix test flapper
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-27 07:19:56 -07:00
Derek Collison
a66ac8cb9b
The server's Start() used to block but no longer does. This updates tests and the function comment.
...
Fix for #4110
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-27 06:55:03 -07:00
Neil
3feb9f73b9
Add op type to panics (#4108)
...
This updates the `panic` messages on applying meta entries to include
the faulty op type, so that we can better work out what's going on in
these cases.
Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-27 13:02:28 +01:00
Neil Twigg
e30ea34625
Add op type to panics
...
Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-27 11:38:52 +01:00
Derek Collison
f584df4b4a
[IMPROVED] Clustered consumer improvements (#4107)
...
Consumer state from Jsz() would not be consistent for a leader vs
follower.
ConsumerFileStore could encode an empty state or update an empty state
on startup.
We needed to make sure at the lowest level that the state was read from
disk and not depend on the upper layer consumer.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-26 17:22:32 -07:00
Derek Collison
9999f63853
ConsumerFileStore could encode an empty state or update an empty state on startup.
...
We needed to make sure at the lowest level that the state was read from disk and not depend on the upper layer consumer.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-26 15:48:10 -07:00
Derek Collison
7f06d6f5a7
When Jsz() was asked for consumer details, it would report incorrect data if not a consumer leader.
...
This is due to the way state is maintained for leaders vs followers for consumers.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-26 15:03:15 -07:00
Derek Collison
aea4a4115d
Stream migration update (#4104)
...
I noticed that stream migration could be delayed due to transferring
leadership while the new leader was still paused for an upper layer
catchup, resulting in downgrading to a normal lost quorum vote. This
allows a leadership transfer to move ahead once the upper layer resumes.
Also check quicker, but slow down if the state we need is not
there yet.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-26 08:14:46 -07:00
Derek Collison
83293f86ff
Reduce threshold for compressing messages during a catchup
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-25 19:01:06 -07:00
Derek Collison
3c964a12d7
Migration could be delayed due to transferring leadership while the new leader was still paused.
...
Also check quicker, but slow down if the state we need is not there yet.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-25 18:58:49 -07:00
Neil
08d341801f
Restore outbound queue coalescing (#4093)
...
This PR effectively reverts part of #4084, which removed the coalescing
from the outbound queues because I initially thought it was the source of a
race condition.
Further investigation has proven that not only was that untrue (the race
actually came from the WebSocket code; all coalescing operations happen
under the client lock), but removing the coalescing also worsens
performance.
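The technique can be sketched with a hypothetical, simplified outbound queue (the `outbound` type, `queue` method, and buffer cap below are illustrative, not the server's actual code): small pending writes are merged into the tail buffer, so a later flush issues fewer, larger writes; appends are assumed to happen under the owning client's lock, which is what makes the coalescing safe.

```go
package main

import "fmt"

// Hypothetical outbound queue of pending write buffers.
type outbound struct {
	pending [][]byte
}

// queue coalesces p into the tail buffer while it stays under a small
// cap; otherwise it starts a new buffer (copying p, so the caller may
// reuse its slice).
func (o *outbound) queue(p []byte) {
	const maxBuf = 1024
	if n := len(o.pending); n > 0 && len(o.pending[n-1])+len(p) <= maxBuf {
		o.pending[n-1] = append(o.pending[n-1], p...)
		return
	}
	o.pending = append(o.pending, append([]byte(nil), p...))
}

func main() {
	var o outbound
	o.queue([]byte("PUB foo 2\r\nhi\r\n"))
	o.queue([]byte("PUB bar 2\r\nok\r\n"))
	// Both protocol messages were coalesced into one pending buffer,
	// so a flush would need only one write call.
	fmt.Println(len(o.pending)) // 1
}
```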
Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-25 15:53:00 +01:00
Derek Collison
70b635e337
Test that makes sure that assets can be scaled after a cluster change. (#4101)
...
This is specifically when a cluster is reconfigured and the servers are
restarted with a new cluster name.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-25 07:45:46 -07:00
Neil Twigg
2206f9e468
Re-add coalescing to outbound queues
...
Originally I thought there was a race condition happening here,
but it turns out it is safe after all and the race condition I
was seeing was due to other problems in the WebSocket code.
Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-25 12:15:11 +01:00
Derek Collison
e25f89dc4d
Do not fail healthz in single server mode on failed snapshot restore. (#4100)
...
In single server mode healthz could mistake a snapshot staging
directory during a restore as an account.
If the restore took a long time, stalled, or was aborted, it would cause
healthz to fail.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-24 22:49:55 -07:00
Derek Collison
47c6bfded4
Update server/jetstream_test.go
...
Fix spelling
Co-authored-by: Tomasz Pietrek <tomasz@nats.io>
2023-04-24 22:29:05 -07:00
Derek Collison
3340179b97
Fix flapper
...
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-24 22:22:27 -07:00
Derek Collison
c2649beded
fix some comments (#4099)
2023-04-24 22:17:30 -07:00
Derek Collison
cae91b8cad
In single server mode healthz could mistake a snapshot staging directory during a restore as an account.
...
If the restore took a long time, stalled, or was aborted, it would cause healthz to fail.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-24 22:14:04 -07:00
cui fliter
f1f5a59e9b
fix some comments
...
Signed-off-by: cui fliter <imcusg@gmail.com>
2023-04-25 11:28:59 +08:00
Derek Collison
c0f5b71a8f
Test that makes sure that assets that have been created under a certain cluster can be upgraded to a new cluster.
...
This is specifically when a cluster is reconfigured and the servers are restarted with a new cluster name.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-24 20:06:20 -07:00