Commit Graph

7115 Commits

Author SHA1 Message Date
Derek Collison
4eb4e5496b Do health check on startup once we have processed existing state.
Also do health checks in separate go routine.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 09:36:35 -07:00
Derek Collison
fac5658966 If we fail to create a consumer, make sure to clean up any raft nodes in meta layer and to shutdown the consumer if created but we encountered an error.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 08:15:33 -07:00
Derek Collison
546dd0c9ab Make sure we can recover an underlying node being stopped.
Do not return healthy if the node is closed, and wait a bit longer for forward progress.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 07:42:23 -07:00
Derek Collison
85f6bfb2ac Check healthz periodically
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:58:45 -07:00
Derek Collison
ac27fd046a Fix data race
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:57:03 -07:00
Derek Collison
d107ba3549 Under certain scenarios we have witnessed healthz() calls that never return healthy due to a stream or consumer being missing or stopped.
This will now allow the healthy call to attempt to restart those assets.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:11:08 -07:00
Derek Collison
c75127b966 Benchmarks for stream limits, combine tests and benchmarks into fewer files (#4098)
- [ ] Link to issue, e.g. `Resolves #NNN`
- [ ] Documentation added (if applicable)
- [x] Tests added
- [x] Branch rebased on top of current main (`git pull --rebase origin main`)
- [x] Changes squashed to a single commit (described [here](http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html))
- [x] Build is green in Travis CI
- [x] You have certified that the contribution is your original work and that you license the work to the project under the [Apache 2 license](https://github.com/nats-io/nats-server/blob/main/LICENSE)

### Changes proposed in this pull request:

- Add JetStream benchmark that measures publish throughput with
different limits (MaxBytes, MaxMessages, MaxPerSubject, MaxAge, ...)
- Merge `jetstream_chaos_*_test.go` and
`jetstream_benchmark_*_test.go` into two files (down from 8)

Example output (filtered subset):

```
$ go test -v -bench 'BenchmarkJetStreamInterestStreamWithLimit/.*R=1.*/Storage=Memory/.*' -run NONE -benchtime 200000x -count 1

goos: darwin
goarch: arm64
pkg: github.com/nats-io/nats-server/v2/server
BenchmarkJetStreamInterestStreamWithLimit
    jetstream_benchmark_interest_stream_limit_test.go:44: BatchSize: 100, MsgSize: 256, Subjects: 2500, Publishers: 4, Random Message: true
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/unlimited
    jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [unlimited], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/unlimited-8              200000              9743 ns/op          26.28 MB/s             0 %error
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxMsg=1000
    jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [MaxMsg=1000], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxMsg=1000-8            200000              9800 ns/op          26.12 MB/s             0 %error
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxMsg=10
    jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [MaxMsg=10], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxMsg=10-8              200000              9717 ns/op          26.35 MB/s             0 %error
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxPerSubject=10
    jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [MaxPerSubject=10], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxPerSubject=10-8       200000             78796 ns/op           3.25 MB/s             0 %error
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxAge=1s
    jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [MaxAge=1s], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxAge=1s-8              200000              9648 ns/op          26.53 MB/s             0 %error
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxBytes=1MB
    jetstream_benchmark_interest_stream_limit_test.go:230: Stream: {clusterSize:1 replicas:1}, Storage: [Memory] Limit: [MaxBytes=1MB], Ops: 200000
BenchmarkJetStreamInterestStreamWithLimit/N=1,R=1/Storage=Memory/MaxBytes=1MB-8           200000              9706 ns/op          26.38 MB/s             0 %error
PASS
ok      github.com/nats-io/nats-server/v2/server        26.359s
```

cc: @jnmoyne
2023-04-27 16:53:31 -07:00
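The MB/s column in the example output of c75127b966 is consistent with dividing the reported `MsgSize: 256` by the ns/op figure. The following is a minimal sketch of that arithmetic, not code from the benchmark itself; the function name `throughputMBps` is hypothetical, and it assumes each operation publishes one 256-byte message and that 1 MB means 1e6 bytes.

```go
package main

import "fmt"

// throughputMBps converts a per-operation latency into MB/s (1 MB = 1e6 bytes),
// assuming every operation moves exactly bytesPerOp bytes.
// This helper is illustrative and is not part of the nats-server benchmarks.
func throughputMBps(bytesPerOp int, nsPerOp float64) float64 {
	bytesPerSecond := float64(bytesPerOp) / nsPerOp * 1e9
	return bytesPerSecond / 1e6
}

func main() {
	const msgSize = 256 // taken from "MsgSize: 256" in the benchmark header

	// ns/op values copied from the "unlimited" and "MaxPerSubject=10" rows.
	for _, nsPerOp := range []float64{9743, 78796} {
		fmt.Printf("%.0f ns/op -> %.2f MB/s\n", nsPerOp, throughputMBps(msgSize, nsPerOp))
	}
}
```

Under these assumptions the two rows work out to roughly 26.28 MB/s and 3.25 MB/s, matching the reported values, which suggests the MaxPerSubject=10 case is about eight times slower per published message than the other limits in this run.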
Marco Primi
82eade93b4 Merge JS Chaos tests into a single file
2023-04-27 14:56:55 -07:00
Marco Primi
7908d8c05c Merge JS benchmarks into a single file
2023-04-27 14:56:55 -07:00
Marco Primi
df552351ec Benchmark for interest-based stream with limits
Measure publish throughput with different limits (MaxBytes, MaxMessages,
MaxPerSubject, MaxAge, ...)
2023-04-27 14:56:55 -07:00
Derek Collison
f972165b0e Bump to 2.9.17-beta.2
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-27 14:30:19 -07:00
Derek Collison
c3b07df86f The server's Start() used to block but no longer does. (#4111)
This updates tests and the function comment.

Signed-off-by: Derek Collison <derek@nats.io>

Resolves #4110
2023-04-27 09:50:03 -07:00
Derek Collison
59e2107435 Fix test flapper
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-27 07:19:56 -07:00
Derek Collison
a66ac8cb9b The server's Start() used to block but no longer does. This updates tests and function comment.
Fix for #4110

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-27 06:55:03 -07:00
Neil
3feb9f73b9 Add op type to panics (#4108)
This updates the `panic` messages on applying meta entries to include
the faulty op type, so that we can better work out what's going on in
these cases.

Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-27 13:02:28 +01:00
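The idea behind 3feb9f73b9, putting the offending op type into the panic message, can be sketched as follows. This is an illustrative example only: `entryOp`, its constants, and `applyMetaEntry` are hypothetical names and do not reflect the actual nats-server types or apply logic.

```go
package main

import "fmt"

// entryOp is a hypothetical operation code carried by a replicated meta entry.
type entryOp uint8

const (
	opAssignStream entryOp = iota
	opRemoveStream
)

// applyMetaEntry is a stand-in for an apply loop. When it meets an op it
// cannot handle, the panic message names the op so the failure is diagnosable.
func applyMetaEntry(op entryOp) string {
	switch op {
	case opAssignStream:
		return "assign stream"
	case opRemoveStream:
		return "remove stream"
	default:
		panic(fmt.Sprintf("unknown meta entry op type: %v", op))
	}
}

func main() {
	fmt.Println(applyMetaEntry(opAssignStream))

	// Recover only to display the message in this demo.
	defer func() { fmt.Println("recovered:", recover()) }()
	applyMetaEntry(entryOp(42))
}
```

Without the op value in the message, a crash report would only show that applying an entry failed, not which kind of entry triggered it.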
Neil Twigg
e30ea34625 Add op type to panics
Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-27 11:38:52 +01:00
Derek Collison
f584df4b4a [IMPROVED] Clustered consumer improvements (#4107)
Consumer state from Jsz() would not be consistent for a leader vs
follower.

ConsumerFileStore could encode an empty state or update an empty state
on startup.
We needed to make sure at the lowest level that the state was read from
disk and not depend on the upper layer consumer.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-26 17:22:32 -07:00
Derek Collison
9999f63853 ConsumerFileStore could encode an empty state or update an empty state on startup.
We needed to make sure at the lowest level that the state was read from disk and not depend on upper layer consumer.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-26 15:48:10 -07:00
Derek Collison
7f06d6f5a7 When Jsz() was asked for consumer details, it would report incorrect data if the server was not the consumer leader.
This is due to the way state is maintained for leaders vs followers for consumers.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-26 15:03:15 -07:00
Derek Collison
aea4a4115d Stream migration update (#4104)
I noticed that stream migration could be delayed due to transferring
leadership while the new leader was still paused for an upper layer
catchup, resulting in downgrading to a normal lost quorum vote. This
allows a leadership transfer to move ahead once the upper layer resumes.
Also check quicker but slow down if the state we need to have is not
there yet.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-26 08:14:46 -07:00
Derek Collison
83293f86ff Reduce threshold for compressing messages during a catchup
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-25 19:01:06 -07:00
Derek Collison
3c964a12d7 Migration could be delayed due to transferring leadership while the new leader was still paused.
Also check quicker but slow down if the state we need to have is not there yet.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-25 18:58:49 -07:00
Neil
08d341801f Restore outbound queue coalescing (#4093)
This PR effectively reverts part of #4084 which removed the coalescing
from the outbound queues as I initially thought it was the source of a
race condition.

Further investigation has proven that not only was that untrue (the race
actually came from the WebSocket code, all coalescing operations happen
under the client lock) but removing the coalescing also worsens
performance.

Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-25 15:53:00 +01:00
Derek Collison
70b635e337 Test that makes sure that assets can be scaled after a cluster change. (#4101)
This is specifically when a cluster is reconfigured and the servers are
restarted with a new cluster name.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-25 07:45:46 -07:00
Neil Twigg
2206f9e468 Re-add coalescing to outbound queues
Originally I thought there was a race condition happening here,
but it turns out it is safe after all and the race condition I
was seeing was due to other problems in the WebSocket code.

Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-25 12:15:11 +01:00
Derek Collison
e25f89dc4d Do not fail healthz in single server mode on failed snapshot restore. (#4100)
In single server mode healthz could mistake a snapshot staging
directory during a restore for an account.
If the restore took a long time, stalled, or was aborted, this would cause
healthz to fail.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-24 22:49:55 -07:00
Derek Collison
47c6bfded4 Update server/jetstream_test.go
Fix spelling

Co-authored-by: Tomasz Pietrek <tomasz@nats.io>
2023-04-24 22:29:05 -07:00
Derek Collison
3340179b97 Fix flapper
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-24 22:22:27 -07:00
Derek Collison
c2649beded fix some comments (#4099)
2023-04-24 22:17:30 -07:00
Derek Collison
cae91b8cad In single server mode healthz could mistake a snapshot staging directory during a restore for an account.
If the restore took a long time, stalled, or was aborted, this would cause healthz to fail.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-24 22:14:04 -07:00
cui fliter
f1f5a59e9b fix some comments
Signed-off-by: cui fliter <imcusg@gmail.com>
2023-04-25 11:28:59 +08:00
Derek Collison
c0f5b71a8f Test that makes sure that assets that have been created under a certain cluster can be upgraded to a new cluster.
This is specifically when a cluster is reconfigured and the servers are restarted with a new cluster name.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-24 20:06:20 -07:00
Derek Collison
b0d98df759 fix formatting of raft debug log (#4090)
Fixes this entry showing up:

```
[36571] 2023/04/22 06:53:37.278744 [DBG] RAFT [2DfzK1X6 - C-R3F-bzb7ShZ0] Stepping down from %!s(uint64=66), detected higher term: 65 vs %!d(string=leader)
```
2023-04-21 22:33:08 -07:00
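The `%!s(uint64=66)` and `%!d(string=leader)` fragments in the log line quoted in b0d98df759 are what Go's `fmt` package emits when a verb does not match the type of its argument. The sketch below reproduces that symptom; the format string and argument order are assumptions chosen to match the quoted output, not the actual nats-server source, and the names `stepDownFormat`, `mismatched`, and `matched` are hypothetical.

```go
package main

import "fmt"

// stepDownFormat is a hypothetical format string. It lives in a variable so
// the deliberately wrong call below still compiles and passes static checks.
var stepDownFormat = "Stepping down from %s, detected higher term: %d vs %d"

// mismatched passes the arguments in an order that does not line up with the
// verbs: a uint64 reaches %s and a string reaches the last %d.
func mismatched(state string, newTerm, oldTerm uint64) string {
	return fmt.Sprintf(stepDownFormat, newTerm, oldTerm, state)
}

// matched passes each argument to the verb intended for it.
func matched(state string, newTerm, oldTerm uint64) string {
	return fmt.Sprintf(stepDownFormat, state, newTerm, oldTerm)
}

func main() {
	fmt.Println(mismatched("leader", 66, 65))
	fmt.Println(matched("leader", 66, 65))
}
```

With a constant format string, `go vet` normally flags this kind of mismatch at build time; a format held in a variable, as here, is not checked.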
Waldemar Quevedo
d9cc8b0363 fix formatting of raft debug log
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-04-22 07:07:08 +02:00
Derek Collison
0e2eccd188 Update to the compress lib (#4088)
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-21 14:36:26 -07:00
Derek Collison
acc8e69f23 Update to the compress lib
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-21 14:18:54 -07:00
Derek Collison
04908962a1 Swap out flate from std library for faster one from compress. (#4087)
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-21 14:02:43 -07:00
Derek Collison
50522f117d New version of flate needed more payload at best speed to kick in
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-21 13:18:25 -07:00
Derek Collison
f9f4bf5c40 Run a check for ack floor drift. (#4086)
Also periodically check. If all is normal, this will be very cheap.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-21 12:56:53 -07:00
Derek Collison
da9a17fd68 Spelling
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-21 12:40:19 -07:00
Derek Collison
57d06abbc9 Swap out flate from std for faster one
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-21 12:12:16 -07:00
Derek Collison
8b7c2d12aa Run a check for ack floor drift when taking over as a leader and the ack go routine is spun up.
Also periodically check. If all is normal, this will be very cheap.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-21 11:59:35 -07:00
Derek Collison
01041ca1a6 Outbound queue improvements (#4084)
This extends the previous work in #3733 with the following:

1. Remove buffer coalescing, as this could result in a race condition
during the `writev` syscall in rare circumstances
2. Add a third buffer size, to ensure that we aren't allocating more
than we need to without coalescing
3. Refactor buffer handling in the WebSocket code to reduce allocations
and ensure owned buffers aren't incorrectly being pooled resulting in
further race conditions

Fixes nats-io/nats.ws#194.

Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-21 10:06:39 -07:00
Neil Twigg
5f884349db Remove TestClientOutboundQueueCoalesce as no longer needed
Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-21 15:40:49 +01:00
Neil Twigg
2ece00b08f Buffer re-use in WebSocket code, fix race conditions
Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-21 15:33:48 +01:00
Neil Twigg
bf286744dd Remove coalescing as it races with the writev syscall
Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-20 23:29:36 +01:00
Derek Collison
e96ae0bf79 A simpler stream state to detect change for snapshots. (#4074)
A stream could have a complicated state with interior deletes.
This is a simpler way to determine if we need to consider a snapshot
that involves much less time and CPU and memory.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-19 07:16:20 -07:00
Derek Collison
f6195a5ee3 A stream could have a complicated state with interior deletes.
This is a simpler way to determine if we need to consider a snapshot that involves much less time and CPU and memory.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-18 19:11:49 -07:00
Derek Collison
c43c216415 Bump to 2.9.17-beta.1
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-18 18:55:13 -07:00
Waldemar Quevedo
f84ca24fcc 2.9.16 release (#4065)
Don't mind the "bullet" character changes.. that was my Prettier
auto-formatting.
2023-04-17 07:50:08 -07:00