399 Commits

Author SHA1 Message Date
Derek Collison
4fa0ea32c3 [FIXED] If a truncate for a raft WAL failed we could spin.
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-25 19:07:27 -08:00
Derek Collison
ea2bfad8ea Fixed bug where snapshot would not compact through applied. This meant a subsequent request for exactly applied would return only that entry, not the full state snapshot.
Fixed bug where we would not snapshot when we should.

Signed-off-by: Derek Collison <derek@nats.io>
2023-02-23 22:19:37 -08:00
Derek Collison
d347cb116a When becoming leader optionally send current snapshot to followers if caught up.
This can help sync on restarts and improve ghost ephemerals. Also added more code to suppress responses and API audits when we know we are recovering.

Signed-off-by: Derek Collison <derek@nats.io>
2023-02-23 10:30:36 -08:00
Neil Twigg
cfea34c80c Install snapshot and compact when WAL grows, even when no state changes occur
2023-02-22 20:00:57 +00:00
Neil Twigg
68961ffedd Refactor ipQueue to use generics, reduce allocations
2023-02-21 14:50:09 +00:00
Derek Collison
3c64d07691 Warn of consumer state update failures.
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-20 17:28:11 -08:00
Derek Collison
6a62ac4560 Fix for merge conflict
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-18 11:12:15 -08:00
Derek Collison
6a4c61e1a3 Merge branch 'main' into bad-consumer-delete
2023-02-18 11:09:56 -08:00
Derek Collison
01fa89a0b4 Fix for deleting consumers on restarts and non-fatal update errors.
If there was a spurious error on restart, or possibly on an update, we could delete a consumer, which was incorrect behavior.

Signed-off-by: Derek Collison <derek@nats.io>
2023-02-18 09:46:52 -08:00
Derek Collison
efa3bcc49d Parallel consumer creation could drop responses (create and info) and could also run monitorConsumer twice.
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-18 05:16:05 -08:00
Derek Collison
6a2063f5b3 Revert logic
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-06 22:14:37 +04:00
Derek Collison
e9a983c802 Do not let !NeedSnapshot() avoid snapshots and compaction.
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-01 22:05:25 -07:00
Derek Collison
e0798d26eb Merge pull request #3831 from nats-io/snapshots
Minor fixes and optimizations for snapshots.
2023-01-30 19:53:22 -08:00
Derek Collison
6058056e3b Minor fixes and optimizations for snapshots.
We were snapshotting more than needed, so double check that we should be doing this at the stream and consumer level.
At the raft level, we should have always been compacting the WAL to last+1, so made that consistent. Also fixed bug that would not skip last if more items behind the snapshot.

Signed-off-by: Derek Collison <derek@nats.io>
2023-01-30 17:54:18 -08:00
Waldemar Quevedo
13372508e2 Fix for isGroupLeaderless when JS not available (due to shutdown)
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-01-30 15:29:42 -08:00
Derek Collison
52a78c0352 Small optimizations.
1. Only snapshot with minSnap time window like consumers and meta. Make it consistent for all to 5s.
2. Only snapshot at the end of processing all entries pending vs inside the loop.
3. Use fast state when calculating sync request, do not need deleted details there.

Signed-off-by: Derek Collison <derek@nats.io>
2023-01-29 10:58:00 -08:00
Neil Twigg
83932b4be6 Don't mark a clustered stream as unhealthy if making forward progress, add TestJetStreamClusterCurrentVsHealth
2023-01-26 16:57:34 +00:00
Derek Collison
461aad17a5 Merge pull request #3820 from nats-io/issue-3791
[FIXED] Select consumer peer(s) from active peers only.
2023-01-26 08:27:11 -08:00
Derek Collison
e15eb22ca6 When we create a consumer with fewer replicas than the stream, make sure to select from online peers.
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-25 20:08:04 -08:00
Derek Collison
a5cbd0b029 Fixed a bug that would not properly process updates on a stream on restart.
During restart if the stream existed but was also in a meta-snapshot delivered by the leader we would not process the update properly.

Signed-off-by: Derek Collison <derek@nats.io>
2023-01-25 18:16:33 -08:00
Neil Twigg
1baa1fbda8 Use highwayhash for last stream, consumer and cluster snapshots
2023-01-12 16:16:14 +00:00
Derek Collison
6c5f0a669d Ensure we add in new consumers from a meta snapshot from the leader.
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-04 22:18:31 -08:00
Neil Twigg
14d0ba1c65 Fix some lint errors after move to golangci-lint
2022-12-30 20:00:08 +00:00
Todd Beets
c463b398db Validate no overlapping stream subscriptions on update config (non-clustered jetstream)
2022-12-16 12:58:59 -08:00
Derek Collison
5f9a69e4f9 Make sure js is non-nil.
Signed-off-by: Derek Collison <derek@nats.io>
2022-12-13 16:37:00 -08:00
Derek Collison
fa67c50bec Make sure we clear the old raft node from our stream assignment.
This would not allow a re-assignment of a peer to work correctly.

Signed-off-by: Derek Collison <derek@nats.io>
2022-12-12 12:51:08 -05:00
Derek Collison
2f27438230 Make stream removal from a server consistent.
Signed-off-by: Derek Collison <derek@nats.io>
2022-12-06 17:11:43 -08:00
Todd Beets
3fdfb8a12f Merge branch 'main' into ut-replacepeer
# Conflicts:
#	server/jetstream_cluster_3_test.go
2022-12-06 10:51:22 -08:00
Todd Beets
ef27d4d534 tag policies not honored in reassignment after peer remove
2022-12-04 20:39:11 -08:00
Derek Collison
5f7c8e21a2 Fixed issues with multiple concurrent stream create requests.
First issue was applications not getting any response.
However, there was also a more serious issue that would create multiple raft groups for each concurrent request.
The servers would only run one stream monitor loop, however they would update the state to the new raft group's name, so on server restart the stream would be using a different raft group than existing servers.

Signed-off-by: Derek Collison <derek@nats.io>
2022-12-04 19:13:51 -08:00
Derek Collison
36ef788112 When determining whether we need an ack, no need to copy since under consumer lock.
Signed-off-by: Derek Collison <derek@nats.io>
2022-11-14 11:47:31 -08:00
Ivan Kozlovic
304744ce08 Merge pull request #3615 from nats-io/js_acc_max_streams_consumers
[FIXED] JetStream: Account max streams/consumers not always honoured
2022-11-09 18:02:51 -07:00
Ivan Kozlovic
1b892837cb [FIXED] JetStream: Account max streams/consumers not always honoured
This could happen during concurrent requests where the assignments
are not yet fully processed.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-11-09 17:29:20 -07:00
Derek Collison
e008e015b3 Make sure to enforce HA asset limits during peer processing as well as assignment.
Signed-off-by: Derek Collison <derek@nats.io>
2022-11-09 16:24:54 -08:00
Ivan Kozlovic
ca237bdfa0 [FIXED] JetStream: Stream scale down while it has no quorum
If an R2 stream had one of its servers network-partitioned and at
that time the stream was edited to be scaled down to an R1, it
would cause the stream to no longer have quorum even after the
network partition was resolved.

Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-11-04 09:08:31 -06:00
Derek Collison
56919ebc97 On stream proposal failures we could accidentally warn on high stream lag.
We were not taking the clfs into account.

Signed-off-by: Derek Collison <derek@nats.io>
2022-11-02 14:40:31 -07:00
Ivan Kozlovic
ab4470ccdc [FIXED] JetStream: possible panic on some rare cases
Very difficult to reproduce. Had to run TestJetStreamSuperClusterMoveCancel
in covermode=atomic on a slow machine to hit the condition where
the monitorConsumer go routine is started but the RAFT node is nil,
which caused the warning message to produce the panic (since n is nil).

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-11-02 10:02:09 -06:00
Ivan Kozlovic
55e651c118 [FIXED] JetStream: processing of snapshot with expired messages
The issue was that a "first sequence mismatch" during processing of
a snapshot was causing the state to be reset and caused a lot
of catchup from the follower. An attempt to fix that in PR #3567
caused an issue that was addressed in PR #3589. However, this was
then causing the follower to sometimes never be able to catch up, or
to take a very long time.
This PR, we believe, addresses the original and subsequent issues.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-11-01 12:58:45 -06:00
Derek Collison
121bf6ebb5 Move to past check for nil
Signed-off-by: Derek Collison <derek@nats.io>
2022-10-27 17:30:07 -07:00
Derek Collison
2241ad089e Make local error since non-fatal for now.
Signed-off-by: Derek Collison <derek@nats.io>
2022-10-25 16:56:10 -07:00
Derek Collison
aa52c2fecf Added warning for high message lag into a clustered stream.
Signed-off-by: Derek Collison <derek@nats.io>
2022-10-25 16:11:35 -07:00
Derek Collison
db13766f18 Merge pull request #3576 from nats-io/signal-pull-consumers
Removed ephemeral consumer migration.
2022-10-25 17:35:35 -05:00
Derek Collison
f0afa49b9f Make sure to stop raft nodes on all monitor exits.
Signed-off-by: Derek Collison <derek@nats.io>
2022-10-25 14:48:28 -07:00
Derek Collison
ff2cd1d7f9 Fixed test and bug that would override consumer replicas.
Signed-off-by: Derek Collison <derek@nats.io>
2022-10-25 14:35:20 -07:00
Ivan Kozlovic
7ca85e0e80 [FIXED] JetStream: Update of an R1 consumer would not get a response
The update was accepted but the server would not respond to the
client/CLI.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-10-25 09:04:35 -06:00
Ivan Kozlovic
f8aa3ac11d [FIXED] JetStream: "first sequence mismatch" error on catchup with message expiration
When a server was restarted and expired messages, but the leader had a snapshot that
still had the old messages, we would reset the complete follower stream state. This fix
just skips over the expired messages as we prepare the request to the leader.

Resolves #3516

Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-10-17 17:02:08 -06:00
Ivan Kozlovic
9bd11580e3 [FIXED] JetStream: User-defined ephemeral Name not used in cluster mode
If the user sends a CONSUMER.CREATE request with a configuration that
specifies the name that the user wants for the ephemeral consumer,
this would not work in cluster mode, that is, the server would still
pick a name instead of using the provided one.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-10-10 13:48:38 -06:00
Ivan Kozlovic
3472f6aec2 [FIXED] JetStream: unresponsiveness while creating raft group
Originally, createRaftGroup() would not hold the jetstream's lock
for the whole duration. But some race reports made us change
this function to keep the lock for the whole duration. A test
called TestJetStreamClusterRaceOnRAFTCreate() was demonstrating
the race between "consumer info" request handling and createRaftGroup
code. Since then, the race has been fixed, so this PR restores
the more fine-grained locking inside createRaftGroup.

Resolves #3516

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-10-04 17:27:36 -06:00
Derek Collison
52b5cd12bb Allow meta layer to snapshot on a clean shutdown.
Signed-off-by: Derek Collison <derek@nats.io>
2022-09-29 09:17:12 -06:00
Ivan Kozlovic
e151cfcd57 [FIXED] JetStream: Scale down of consumer to R1 would not get a response
Updating a consumer configuration from, say, R3 to R1 would work,
but no response was received by the client sending the request.

Resolves #3493

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-09-27 10:02:31 -06:00