Commit Graph

84 Commits

Author SHA1 Message Date
Derek Collison
da8aeac91b Fix flapper
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-03 21:00:17 -07:00
Derek Collison
21239022bd Protect against usage drift for any unforseen reason and if detected correct.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-03 17:09:06 -07:00
Derek Collison
f098c253aa Make sure we adjust accounting reservations when deleting a stream with any issues.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-01 15:54:37 -07:00
Derek Collison
f5ac5a4da0 Fix for a bug that could leave a raft node running when stopping a stream.
This can happen when we reset a stream internally and the stream had a prior snapshot.

Also make sure to always release resources back to the account regardless if the store is no longer present.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-01 13:22:06 -07:00
Derek Collison
546dd0c9ab Make sure we can recover an underlying node being stopped.
Do not return healthy if the node is closed, and wait a bit longer for forward progress.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 07:42:23 -07:00
Derek Collison
d107ba3549 Under certain scenarios we have witnessed healthz() that never retrun healthy due to a stream or consumer being missing or stopped.
This will now allow the healthy call to attempt to restart those assets.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:11:08 -07:00
Derek Collison
7f06d6f5a7 When Jsz() was asked for consumer details, would report incorrect data if not a consumer leader.
This is due to the way state is maintained for leaders vs followers for consumers.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-26 15:03:15 -07:00
Derek Collison
c0f5b71a8f Test that makes sure that assets that have been created under a certain cluster can be upgraded to a new cluster.
This is specifically when a cluster is reconfigured and the servers are restarted with a new cluster name.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-24 20:06:20 -07:00
Derek Collison
8b7c2d12aa Run a check for ack floor drift when taking over as a leader and the ack go routine is spun up.
Also periodically check. If all normal will be very cheap.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-21 11:59:35 -07:00
Derek Collison
7d3ec51d79 Fix for flapping test
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-03 14:46:59 -07:00
Ivan Kozlovic
a4df4f8727 Fixed some tests
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-03-30 15:02:59 -06:00
Derek Collison
4646f4af5d Do not allow any JetStream leaders to be placed on a lameduck server
Signed-off-by: Derek Collison <derek@nats.io>
2023-03-29 20:15:41 -07:00
Derek Collison
02702e4620 [IMPROVEMENT] General stability and bug fixes. (#3999)
This PR has general improvements and fixes to filestore, raft, and the
clustering layer.

Summary

1. Additional support for preAck handling for interest based streams
when replicated acks arrive before the message itself.
2. Better handling when checking state to determine whether to remove an
interest based message.
3. Improved StepDown() and leadership transfer handling after restarts.
4. Improved voting logic for high load systems.
5. Various improvements and fixes for filestore Compact(), which is used
heavily in the raft layer when updating snapshots and the raft wal.

Signed-off-by: Derek Collison <derek@nats.io>
2023-03-29 17:09:44 -07:00
Derek Collison
182bf6cbae Bug fixes and general stability improvements.
1. If reset ignore Applied() that are greater then our commit.
2. Improved StepDown() by placing at back of queue if preferred.
3. Improved handling of leadership transfer during StepDown().
4. Do not store EntryLeaderTransfer records on disk.
5. Remove un-needed processing of older terms.
6. If append entry has higher term, also inherit pterm.
7. Only inherit a candidate's term if we decide to vote for them.

Signed-off-by: Derek Collison <derek@nats.io>
2023-03-29 12:43:46 -07:00
Neil Twigg
8d5519356e Shut down RAFT groups when disabling JetStream
Signed-off-by: Neil Twigg <neil@nats.io>
2023-03-23 16:54:01 +00:00
Derek Collison
9ccd7abdf8 Test for preAcks
Signed-off-by: Derek Collison <derek@nats.io>
2023-03-21 12:08:24 -07:00
Derek Collison
5a16f98427 Fixed an off by one bug that under certain circumstances could cause large consumer replica states.
This could lead to instability in the system.

The bug would manifest in replicated consumers when certain messages could be acked out of order, and, the pending list would never go to zero.

Signed-off-by: Derek Collison <derek@nats.io>
2023-03-19 10:41:59 -07:00
Derek Collison
f0e1585490 Fix flapping test
Signed-off-by: Derek Collison <derek@nats.io>
2023-03-17 13:14:43 -07:00
Derek Collison
5bb6f167b9 Make sure to cleanup messages on a follower consumer for an interest based stream when the consumer leader sends a state snapshot.
Signed-off-by: Derek Collison <derek@nats.io>
2023-03-15 20:11:16 -07:00
Derek Collison
8dbfbbe577 Fix test
Signed-off-by: Derek Collison <derek@nats.io>
2023-03-15 17:23:51 -07:00
Derek Collison
5a1878b015 Fix for workqueue stream scaling up and not removing acked messages.
Make sure when scaling up streams that are workqueue or interest policy that consumers scale as well.

Signed-off-by: Derek Collison <derek@nats.io>
2023-03-13 17:13:49 -07:00
Derek Collison
724160ebac Fix flapping tests
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-28 14:30:23 -08:00
Derek Collison
6078706544 Fixup test for new parameters
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-27 18:56:55 -08:00
Tomasz Pietrek
02ba78454d Fix new replicas late MaxAge expiry
This commit fixes the issue when scaling Stream with MaxAge
and some older messages stored. Until now, old messages were not properly
expired on new replicas, because new replicas first expiry timer
was set to MaxAge duration.
This commit adds a check if received messages expiry happens before
MaxAge, meaning they're messages older than the replica.
https://github.com/nats-io/nats-server/issues/3848

Signed-off-by: Tomasz Pietrek <tomasz@nats.io>
2023-02-24 00:46:02 +01:00
Neil Twigg
cfea34c80c Install snapshot and compact when WAL grows, even when no state changes occur 2023-02-22 20:00:57 +00:00
Tomasz Pietrek
337a9f2cbd Improve test for consumer with inactivity threshold
Signed-off-by: Tomasz Pietrek <tomasz@nats.io>
2023-02-19 17:57:09 +01:00
Derek Collison
06fd81d096 Fixed a bug where a named consumer under interest policy was spinning up inactive threshold timers in all replicas not just the leader.
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-19 06:08:43 -08:00
Derek Collison
6a4c61e1a3 Merge branch 'main' into bad-consumer-delete 2023-02-18 11:09:56 -08:00
Derek Collison
01fa89a0b4 Fix for deleting consumers on restarts and non-fatal update errors.
If there was a spurious error on restart, or possibly on an update, we could delete a consumer which was the incorrect behavior.

Signed-off-by: Derek Collison <derek@nats.io>
2023-02-18 09:46:52 -08:00
Derek Collison
efa3bcc49d Parallel consumer creation could drop responses (create and info) and could also run monitorConsumer twice.
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-18 05:16:05 -08:00
Waldemar Quevedo
4452f64d73 Fix TestJetStreamParallelConsumerCreation race
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-02-15 17:23:48 -08:00
Derek Collison
390fd02918 Updates to tests for updated Go client changes
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-31 09:47:36 -08:00
Derek Collison
f4e6481ce7 Allow report cycles between source streams if subjects truly form a cycle.
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-27 13:03:24 -08:00
Derek Collison
c7a75c5a6d Merge pull request #3817 from nats-io/force-consumer-replicas
[FIXED] Force consumer replicas to match for interest policy streams
2023-01-26 09:39:15 -08:00
Derek Collison
3d78459ad1 Fixup for bad merge
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-26 09:09:30 -08:00
Neil Twigg
83932b4be6 Don't mark a clustered stream as unhealthy if making forward progress, add TestJetStreamClusterCurrentVsHealth 2023-01-26 16:57:34 +00:00
Derek Collison
d0a7a8169a Merge branch 'main' into force-consumer-replicas 2023-01-26 08:35:49 -08:00
Derek Collison
e15eb22ca6 When we create a consumer with less replicas then the stream, make sure to select from online peers.
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-25 20:08:04 -08:00
Derek Collison
f62d929018 Consumer must match replica of parent stream if interest based policy.
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-23 20:16:42 -08:00
Derek Collison
f4ee6530a0 When updating a stream to Direct Gets we were not spinning up subscription endpoint properly.
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-23 16:51:07 -08:00
Ivan Kozlovic
79ca0c1787 Move test to "norace_test.go"
The test TestJetStreamClusterConsumerListPaging was in the
jetstream_cluster_3_test.go and because of `-race` flag would
take more than 440 seconds (7+ minutes) as seen here:

https://app.travis-ci.com/github/nats-io/nats-server/jobs/593984385#L335

Without the `-race` flag, this test takes ~17 seconds.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-01-23 17:05:18 -07:00
Derek Collison
f5d939ec24 Added test for #3636
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-05 10:56:52 -08:00
Derek Collison
6c5f0a669d Ensure we add in new consumers from a meta snapshot from the leader.
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-04 22:18:31 -08:00
Neil Twigg
14d0ba1c65 Fix some lint errors after move to golangci-lint 2022-12-30 20:00:08 +00:00
Todd Beets
47c87eb71c fix and test for clustered mem store asset no-quorum if leader restarted 2022-12-14 16:16:08 -08:00
Derek Collison
dbc81b9c8b Merge pull request #3700 from mprimi/tests_temp_dir_cleanup
Temporary test files cleanup
2022-12-13 12:27:26 -08:00
Derek Collison
c2188a40ac Merge pull request #3709 from nats-io/zero-bug
Fix for regression in which peer re-assign to a former RG would zero state
2022-12-13 10:03:56 -08:00
Todd Beets
c0ca398b83 use jsz instead of struct direct in final state test 2022-12-12 20:00:14 -08:00
Marco Primi
f8a030bc4a Use testing.TempDir() where possible
Refactor tests to use go built-in temporary directory utility for tests.

Also avoid binding to default port (which may be in use)
2022-12-12 13:18:44 -08:00
Byron Ruth
566d1adfa7 Fix /healthz?js-enabled=true behavior
When js-enabled is set to true, the condition was only checked if
the `getJetStream()` call returned `nil`. However, if it non-nil,
all remaining checks were executed, including assessing the health
of the assets (streams and consumers).

This change addresses two issues:

- Switch to use `js.isEnabled()` which will check whether the value
  is nil OR `js.disabled = true` which can occur if the subsystem
  is temporarily disabled (insufficient resources).
- Correctly exit the check after the assertion and before meta and
  asset checks are performed.

In addition, the option has been renamed to `js-enabled-only` to align
with the `js-server-only` naming. The previous `js-enabled` name still
works, but is mapped to this new option. A warning is emitted noting
the previous option is deprecated.

Fix #3703

Signed-off-by: Byron Ruth <b@devel.io>
2022-12-10 07:34:32 -05:00