This can help sync on restarts and improve ghost ephemerals. Also added more code to suppress respnses and API audits when we know we are recovering.
Signed-off-by: Derek Collison <derek@nats.io>
If there was a spurious error on restart, or possibly on an update, we could delete a consumer which was the incorrect behavior.
Signed-off-by: Derek Collison <derek@nats.io>
We were snappshotting more then needed, so double check that we should be doing this at the stream and consumer level.
At the raft level, we should have always been compacting the WAL to last+1, so made that consistent. Also fixed bug that would not skip last if more items behind the snapshot.
Signed-off-by: Derek Collison <derek@nats.io>
1. Only snapshot with minSnap time window like consumers and meta. Make it consistent for all to 5s.
2. Only snapshot at the end of processing all entries pending vs inside the loop.
3. Use fast state when calculating sync request, do not need deleted details there.
Signed-off-by: Derek Collison <derek@nats.io>
During restart if the stream existed but was also in a meta-snapshot delivered by the leader we would not process the update properly.
Signed-off-by: Derek Collison <derek@nats.io>
First issue was applications not getting any response.
However, there was also a more serious issue that would create multiple raft groups for each concurrent request.
The servers would only run one stream monitor loop, however they would update the state to the new raft group's name, so on server restart the stream would be using a different raft group then existing servers.
Signed-off-by: Derek Collison <derek@nats.io>
If a stream R2 had one of its server network-partitioned and at
that time the stream was edited to be scaled down to an R1 it
would cause the stream to no longer have quorum even when the
network partition is resolved.
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Very difficult to reproduce. Had to run TestJetStreamSuperClusterMoveCancel
in covermode=atomic on a slow machine to hit the condition where
the monitorConsumer go routine is started by RAFT node is nil,
which caused the warning message to produce the panic (since n is nil)
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
The issue that a "first sequence mismatch" during processing of
a snapshot was causing the state to be reset and caused a lot
of catchup from the follower. An attempt to fix that in PR #3567
caused an issue that was addressed in PR #3589. However, this was
then causing the follower to sometime never able to catchup or
took a very long time.
This PR - we believe - addresses the original and subsequent issues.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
When a server was restarted and expired messages, but the leader had a snapshot that
still had the old messages we would reset complete follower stream state, this fix
just skips over the expired as we prepare the request to the leader.
Resolves#3516
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
If the user sends a CONSUMER.CREATE request with a configuration that
specifies the name that the user wants for the ephemeral consumer,
this would not work on cluster mode, that is, the server would still
pick a name instead of using the provided one.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Originally, createRaftGroup() would not hold the jetstream's lock
for the whole duration. But some race reports made us change
this function to keep the lock for the whole duration. A test
called TestJetStreamClusterRaceOnRAFTCreate() was demonstrating
the race between "consumer info" request handling and createRaftGroup
code. Since then, the race has been fixed, so this PR restores
the more fine-grained locking inside createRaftGroup.
Resolves#3516
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Updating a consumer configuration from say R3 to R1 would work
but no response was received by the client sending the request.
Resolves#3493
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>