We were snapshotting more than needed, so double check that we should be doing this at the stream and consumer level.
At the raft level, we should have always been compacting the WAL to last+1, so that is now consistent. Also fixed a bug that would not skip last if more items were behind the snapshot.
Signed-off-by: Derek Collison <derek@nats.io>
1. Only snapshot within the minSnap time window, like consumers and meta. Make the window a consistent 5s for all.
2. Only snapshot at the end of processing all pending entries, not inside the loop.
3. Use fast state when calculating the sync request; the deleted details are not needed there.
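
Points 1 and 2 can be sketched roughly as follows. This is a minimal illustration and not the server's code: `snapshotter`, `maybeSnapshot` and `applyAll` are hypothetical names, and only the 5s minimum window comes from the notes above.

```go
package main

import (
	"fmt"
	"time"
)

// minSnapDelta is the minimum time between two snapshots (5s per the notes above).
const minSnapDelta = 5 * time.Second

// snapshotter is a hypothetical helper that throttles snapshot installs.
type snapshotter struct {
	lastSnap time.Time // time of the last installed snapshot, zero if none yet
	installs int       // number of snapshots installed so far
}

// maybeSnapshot installs a snapshot only if none was installed yet or if
// at least minSnapDelta has elapsed since the previous one.
func (s *snapshotter) maybeSnapshot(now time.Time) bool {
	if !s.lastSnap.IsZero() && now.Sub(s.lastSnap) < minSnapDelta {
		return false
	}
	s.lastSnap = now
	s.installs++
	return true
}

// applyAll processes every pending entry first and considers a single
// snapshot at the end, instead of snapshotting inside the loop.
func (s *snapshotter) applyAll(entries []string, apply func(string), now time.Time) bool {
	for _, e := range entries {
		apply(e)
	}
	return s.maybeSnapshot(now)
}

func main() {
	s := &snapshotter{}
	start := time.Now()
	applied := 0
	apply := func(string) { applied++ }

	fmt.Println(s.applyAll([]string{"a", "b", "c"}, apply, start))
	fmt.Println(s.applyAll([]string{"d"}, apply, start.Add(time.Second)))
	fmt.Println(s.applyAll([]string{"e"}, apply, start.Add(6*time.Second)))
	fmt.Println("applied:", applied, "snapshots:", s.installs)
}
```

In this sketch the second batch arrives one second after the first snapshot, so it is applied but not snapshotted.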
Signed-off-by: Derek Collison <derek@nats.io>
During restart, if the stream existed but was also in a meta-snapshot delivered by the leader, we would not process the update properly.
Signed-off-by: Derek Collison <derek@nats.io>
First issue was applications not getting any response.
However, there was also a more serious issue that would create multiple raft groups for each concurrent request.
The servers would only run one stream monitor loop; however, they would update the state to the new raft group's name, so on server restart the stream would be using a different raft group than existing servers.
Signed-off-by: Derek Collison <derek@nats.io>
If an R2 stream had one of its servers network-partitioned and, at
that time, the stream was edited to scale down to R1, the
stream would no longer have quorum even after the
network partition was resolved.
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Very difficult to reproduce. Had to run TestJetStreamSuperClusterMoveCancel
in covermode=atomic on a slow machine to hit the condition where
the monitorConsumer go routine is started but the RAFT node is nil,
which caused the warning message to panic (since n is nil).
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
The issue was that a "first sequence mismatch" during processing of
a snapshot caused the state to be reset, which led to a lot
of catchup from the follower. An attempt to fix that in PR #3567
caused an issue that was addressed in PR #3589. However, that
then caused the follower to sometimes never catch up, or
to take a very long time to do so.
This PR, we believe, addresses the original and subsequent issues.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
When a server was restarted and had expired messages, but the leader had a snapshot that
still had the old messages, we would reset the complete follower stream state. This fix
just skips over the expired messages as we prepare the request to the leader.
Resolves #3516
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
If the user sends a CONSUMER.CREATE request with a configuration that
specifies the name that the user wants for the ephemeral consumer,
this would not work in cluster mode, that is, the server would still
pick a name instead of using the provided one.
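
The intended behavior can be sketched as below. This is only an illustration under stated assumptions: `consumerConfig`, `ephemeralName` and the `generate` callback are hypothetical names, not the server's types or its CONSUMER.CREATE handling.

```go
package main

import "fmt"

// consumerConfig is a hypothetical, trimmed-down consumer configuration.
type consumerConfig struct {
	Name string // name requested by the user, empty if none was given
}

// ephemeralName returns the name to use for an ephemeral consumer:
// the user-provided name when present, otherwise a generated one.
func ephemeralName(cfg consumerConfig, generate func() string) string {
	if cfg.Name != "" {
		return cfg.Name
	}
	return generate()
}

func main() {
	// generate stands in for whatever name generator the server uses.
	generate := func() string { return "generated-name" }

	fmt.Println(ephemeralName(consumerConfig{Name: "my-consumer"}, generate))
	fmt.Println(ephemeralName(consumerConfig{}, generate))
}
```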
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Originally, createRaftGroup() would not hold the jetstream's lock
for the whole duration. But some race reports made us change
this function to keep the lock for the whole duration. A test
called TestJetStreamClusterRaceOnRAFTCreate() was demonstrating
the race between "consumer info" request handling and createRaftGroup
code. Since then, the race has been fixed, so this PR restores
the more fine-grained locking inside createRaftGroup.
Resolves #3516
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Updating a consumer configuration from say R3 to R1 would work
but no response was received by the client sending the request.
Resolves #3493
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
A request to `$SYS.REQ.SERVER.PING.JSZ` would now return something
like this:
```
...
"meta_cluster": {
  "name": "local",
  "leader": "A",
  "peer": "NUmM6cRx",
  "replicas": [
    {
      "name": "B",
      "current": true,
      "active": 690369000,
      "peer": "b2oh2L6w"
    },
    {
      "name": "Server name unknown at this time (peerID: jZ6RvVRH)",
      "current": false,
      "offline": true,
      "active": 0,
      "peer": "jZ6RvVRH"
    }
  ],
  "cluster_size": 3
}
```
Note the new "peer" field following the "leader" field. The "leader"
field contains the server name; the new field is the node ID, which
is a hash of the server name.
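
A client could decode this part of the response along the following lines. This is a minimal sketch: the struct and function names are hypothetical, they mirror only the fields visible in the sample above, and treating "active" as a nanosecond duration is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// replica mirrors one entry of the "replicas" array in the sample.
type replica struct {
	Name    string        `json:"name"`
	Current bool          `json:"current"`
	Offline bool          `json:"offline"`
	Active  time.Duration `json:"active"` // assumed to be nanoseconds
	Peer    string        `json:"peer"`   // node ID of the replica
}

// metaCluster mirrors the "meta_cluster" object in the sample.
type metaCluster struct {
	Name        string    `json:"name"`
	Leader      string    `json:"leader"` // server name of the leader
	Peer        string    `json:"peer"`   // node ID of the leader
	Replicas    []replica `json:"replicas"`
	ClusterSize int       `json:"cluster_size"`
}

// jszFragment wraps the fragment so it forms a complete JSON document.
type jszFragment struct {
	Meta metaCluster `json:"meta_cluster"`
}

// sample is the fragment from the text, wrapped in an enclosing object.
const sample = `{
  "meta_cluster": {
    "name": "local",
    "leader": "A",
    "peer": "NUmM6cRx",
    "replicas": [
      {"name": "B", "current": true, "active": 690369000, "peer": "b2oh2L6w"},
      {"name": "Server name unknown at this time (peerID: jZ6RvVRH)",
       "current": false, "offline": true, "active": 0, "peer": "jZ6RvVRH"}
    ],
    "cluster_size": 3
  }
}`

// parseMeta decodes the "meta_cluster" object out of the given JSON.
func parseMeta(data []byte) (metaCluster, error) {
	var f jszFragment
	if err := json.Unmarshal(data, &f); err != nil {
		return metaCluster{}, err
	}
	return f.Meta, nil
}

func main() {
	mc, err := parseMeta([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("leader=%s peer=%s\n", mc.Leader, mc.Peer)
	for _, r := range mc.Replicas {
		fmt.Printf("replica peer=%s offline=%v active=%v\n", r.Peer, r.Offline, r.Active)
	}
}
```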
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This was discovered by the new test TestJetStreamClusterRemovePeerByID.
I saw this on Travis, and by repeating the test locally with -count=10
I was able to reproduce it. The issue is cc.meta being nil while
cc.meta.ID() is accessed directly.
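
The general shape of such a guard is sketched below. The `raftNode` and `cluster` types and the `metaID` helper are hypothetical stand-ins for illustration; only the `cc.meta` and `ID()` names come from the description above.

```go
package main

import "fmt"

// raftNode is a hypothetical stand-in for the meta group's RAFT node.
type raftNode struct{ id string }

// ID returns the node ID. It dereferences n, so calling it on a nil node panics.
func (n *raftNode) ID() string { return n.id }

// cluster is a hypothetical stand-in for the object holding the meta node.
type cluster struct{ meta *raftNode }

// metaID returns the meta node ID, or false when the node is not set,
// instead of calling cc.meta.ID() unconditionally.
func (cc *cluster) metaID() (string, bool) {
	if cc == nil || cc.meta == nil {
		return "", false
	}
	return cc.meta.ID(), true
}

func main() {
	withMeta := &cluster{meta: &raftNode{id: "NUmM6cRx"}}
	withoutMeta := &cluster{}

	if id, ok := withMeta.metaID(); ok {
		fmt.Println("meta ID:", id)
	}
	if _, ok := withoutMeta.metaID(); !ok {
		fmt.Println("meta node not available")
	}
}
```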
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
The CLI will now be able to display the peer IDs in MetaGroupInfo
if it chooses to do so, and possibly help the user select the peer ID
from a list with a new command to remove by peer ID instead of
by server name.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
* changed format of JSClusterNoPeers error
This error was introduced in #3342 and reveals too much information.
This change gets rid of cluster names and peer counts.
All other counts were changed to booleans,
which are only included in the output when the filter was hit.
In addition, the set of non-matching tags is included.
Furthermore, the static error description in server/errors.json
is moved into selectPeerError.
sample errors:
1) no suitable peers for placement, tags not matched ['cloud:GCP', 'country:US']
2) no suitable peers for placement, insufficient storage
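
A message in this format could be assembled roughly as shown below. This is a simplified sketch, not the server's implementation: the struct here is a reduced, hypothetical version of selectPeerError with only two illustrative fields, and sorting the tags is an assumption made to keep the output stable.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// selectPeerError is a reduced, hypothetical version of the error type:
// booleans record which filters were hit, plus the tags that did not match.
type selectPeerError struct {
	noStorage   bool     // true when a peer was excluded for insufficient storage
	noMatchTags []string // placement tags that no peer matched
}

// Error builds the message, including only the filters that were hit.
func (e *selectPeerError) Error() string {
	var b strings.Builder
	b.WriteString("no suitable peers for placement")
	if len(e.noMatchTags) > 0 {
		// Sort a copy so the output is deterministic (an assumption of this sketch).
		tags := append([]string(nil), e.noMatchTags...)
		sort.Strings(tags)
		quoted := make([]string, 0, len(tags))
		for _, t := range tags {
			quoted = append(quoted, "'"+t+"'")
		}
		fmt.Fprintf(&b, ", tags not matched [%s]", strings.Join(quoted, ", "))
	}
	if e.noStorage {
		b.WriteString(", insufficient storage")
	}
	return b.String()
}

func main() {
	fmt.Println(&selectPeerError{noMatchTags: []string{"country:US", "cloud:GCP"}})
	fmt.Println(&selectPeerError{noStorage: true})
}
```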
Signed-off-by: Matthias Hanel <mh@synadia.com>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Co-authored-by: Ivan Kozlovic <ivan@synadia.com>