This can help sync on restarts and improve the handling of ghost ephemerals. Also added more code to suppress responses and API audits when we know we are recovering.
Signed-off-by: Derek Collison <derek@nats.io>
If there was a spurious error on restart, or possibly on an update, we could delete a consumer, which was incorrect behavior.
Signed-off-by: Derek Collison <derek@nats.io>
We were snapshotting more than needed, so double-check that we should be doing this at the stream and consumer level.
At the raft level, we should have always been compacting the WAL to last+1, so made that consistent. Also fixed a bug that would not skip last if more items were behind the snapshot.
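As a rough illustration of what "compacting to last+1" means, here is a minimal sketch over an in-memory slice. The `entry` type and `compact` function are hypothetical and are not the server's WAL implementation; the sketch only assumes that compacting to an index drops everything below it.

```go
package main

import "fmt"

// entry is a hypothetical WAL record, used only for this illustration.
type entry struct {
	index uint64
	data  string
}

// compact drops every entry with index < first and keeps the rest in order.
func compact(wal []entry, first uint64) []entry {
	var kept []entry
	for _, e := range wal {
		if e.index >= first {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	wal := []entry{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}}
	last := uint64(3) // assumed: last index covered by the snapshot

	// Compacting to last+1 also removes the entry at `last` itself,
	// since the snapshot already covers it.
	wal = compact(wal, last+1)
	fmt.Println(len(wal), wal[0].index)
}
```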
Signed-off-by: Derek Collison <derek@nats.io>
1. Only snapshot with a minSnap time window, like consumers and meta. Make the window a consistent 5s for all.
2. Only snapshot at the end of processing all pending entries, not inside the loop.
3. Use fast state when calculating the sync request; deleted details are not needed there.
Signed-off-by: Derek Collison <derek@nats.io>
During restart, if the stream existed but was also in a meta-snapshot delivered by the leader, we would not process the update properly.
Signed-off-by: Derek Collison <derek@nats.io>
The first issue was applications not getting any response.
However, there was also a more serious issue that would create multiple raft groups for each concurrent request.
The servers would only run one stream monitor loop; however, they would update the state to the new raft group's name, so on server restart the stream would be using a different raft group than the existing servers.
Signed-off-by: Derek Collison <derek@nats.io>
If an R2 stream had one of its servers network-partitioned and, at
that time, the stream was edited to scale down to R1, the stream
would no longer have quorum, even after the network partition was
resolved.
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Very difficult to reproduce. Had to run TestJetStreamSuperClusterMoveCancel
in covermode=atomic on a slow machine to hit the condition where
the monitorConsumer go routine is started but the RAFT node is nil,
which caused the warning message to panic (since n is nil).
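The general shape of the guard can be sketched as below. This is not the actual change: `raftNode`, `node` and `monitorWarning` are hypothetical stand-ins and the message text is invented for the example. The point is only that the nil check must come before any method call on `n`.

```go
package main

import "fmt"

// raftNode is a hypothetical stand-in for the server's RAFT node interface.
type raftNode interface {
	Group() string
}

// node is a trivial implementation used only for this illustration.
type node struct{ group string }

func (n *node) Group() string { return n.group }

// monitorWarning builds a warning message without ever calling a method
// on a nil node, which is what would otherwise panic.
func monitorWarning(n raftNode, consumer string) string {
	if n == nil {
		return fmt.Sprintf("consumer %q: no RAFT node, monitor not started", consumer)
	}
	return fmt.Sprintf("consumer %q: monitor already running for group %q", consumer, n.Group())
}

func main() {
	var n raftNode // nil interface, as in the race described above
	fmt.Println(monitorWarning(n, "c1"))
	fmt.Println(monitorWarning(&node{group: "C-R3F-abc"}, "c1"))
}
```

Note that this check only catches a nil interface value; a typed nil pointer stored in the interface would still pass `n == nil` as false.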
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
The issue was that a "first sequence mismatch" during processing of
a snapshot caused the state to be reset, which led to a lot
of catchup by the follower. An attempt to fix that in PR #3567
caused an issue that was addressed in PR #3589. However, this
then caused the follower to sometimes never be able to catch up,
or to take a very long time to do so.
This PR, we believe, addresses the original and subsequent issues.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>