This updates the `panic` messages on applying meta entries to include
the faulty op type, so that we can better work out what's going on in
these cases.
Signed-off-by: Neil Twigg <neil@nats.io>
Consumer state from Jsz() would not be consistent between a leader and a
follower.
ConsumerFileStore could encode an empty state or update an empty state
on startup.
We needed to make sure at the lowest level that the state was read from
disk and not depend on the upper layer consumer.
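A rough sketch of the approach, with illustrative names only (not the actual ConsumerFileStore fields): the lowest-level store loads state from disk on first access, so it never depends on the upper-layer consumer having populated it.

```go
package main

import "fmt"

// ConsumerState is a trimmed stand-in for the real consumer state.
type ConsumerState struct {
	Delivered uint64
	AckFloor  uint64
}

// fileStore sketches the fix: State() always makes sure the state has
// been read from disk, rather than trusting the upper-layer consumer to
// have loaded or written it (which could encode an empty state).
type fileStore struct {
	readFromDisk bool
	cached       ConsumerState
	loadDisk     func() ConsumerState // stand-in for the real disk read
}

func (fs *fileStore) State() ConsumerState {
	if !fs.readFromDisk {
		// The lowest level owns recovery: read once from disk on first access.
		fs.cached = fs.loadDisk()
		fs.readFromDisk = true
	}
	return fs.cached
}

func main() {
	fs := &fileStore{loadDisk: func() ConsumerState {
		return ConsumerState{Delivered: 22, AckFloor: 11}
	}}
	// Even before the upper layer touches the store, state comes from "disk".
	fmt.Println(fs.State())
}
```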
Signed-off-by: Derek Collison <derek@nats.io>
I noticed that stream migration could be delayed due to transferring
leadership while the new leader was still paused for an upper layer
catchup, resulting in downgrading to a normal lost quorum vote. This
allows a leadership transfer to move ahead once the upper layer resumes.
Also check more quickly, but slow down if the state we need is not there
yet.
Signed-off-by: Derek Collison <derek@nats.io>
This PR effectively reverts part of #4084, which removed the coalescing
from the outbound queues, as I initially thought it was the source of a
race condition.
Further investigation has proven that not only was that untrue (the race
actually came from the WebSocket code, all coalescing operations happen
under the client lock) but removing the coalescing also worsens
performance.
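The restored behaviour can be sketched roughly as below, with illustrative names and a made-up size threshold (not the server's actual queue code). The point is that coalescing happens entirely under the client lock, so it is race-free, and it reduces the number of vectors eventually handed to `writev`.

```go
package main

import (
	"fmt"
	"sync"
)

// outboundQueue sketches coalescing of queued write buffers. Every
// operation happens under the lock, which is why the coalescing itself
// cannot race; the 4096-byte threshold is an assumption for illustration.
type outboundQueue struct {
	mu      sync.Mutex
	pending [][]byte
}

func (q *outboundQueue) enqueue(b []byte) {
	q.mu.Lock()
	defer q.mu.Unlock()
	// Coalesce small writes into the tail buffer when there is room,
	// so many small protocol messages become one larger write vector.
	if n := len(q.pending); n > 0 && len(q.pending[n-1])+len(b) <= 4096 {
		q.pending[n-1] = append(q.pending[n-1], b...)
		return
	}
	q.pending = append(q.pending, append([]byte(nil), b...))
}

func (q *outboundQueue) vectors() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.pending)
}

func main() {
	q := &outboundQueue{}
	for i := 0; i < 100; i++ {
		q.enqueue([]byte("PUB foo 2\r\nhi\r\n"))
	}
	// 100 small messages end up coalesced into a single buffer.
	fmt.Println(q.vectors())
}
```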
Signed-off-by: Neil Twigg <neil@nats.io>
This is specifically when a cluster is reconfigured and the servers are
restarted with a new cluster name.
Signed-off-by: Derek Collison <derek@nats.io>
Originally I thought there was a race condition happening here,
but it turns out it is safe after all and the race condition I
was seeing was due to other problems in the WebSocket code.
Signed-off-by: Neil Twigg <neil@nats.io>
In single server mode healthz could mistake a snapshot staging
directory during a restore for an account.
If the restore took a long time, stalled, or was aborted, this would
cause healthz to fail.
Signed-off-by: Derek Collison <derek@nats.io>
This extends the previous work in #3733 with the following:
1. Remove buffer coalescing, as this could result in a race condition
during the `writev` syscall in rare circumstances
2. Add a third buffer size, to ensure that we aren't allocating more
than we need to without coalescing
3. Refactor buffer handling in the WebSocket code to reduce allocations
and ensure owned buffers aren't incorrectly being pooled resulting in
further race conditions
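Points 2 and 3 can be sketched roughly as follows, with illustrative size classes and names (the server's actual sizes and pooling code differ). Buffers come from one of three pools so non-coalesced writes don't over-allocate, and a buffer whose ownership has been handed off is never returned to a pool.

```go
package main

import (
	"fmt"
	"sync"
)

// Three size classes so we do not over-allocate now that buffers are
// not coalesced; the exact sizes here are assumptions for illustration.
const (
	smallBufSize = 512
	medBufSize   = 4096
	largeBufSize = 65536
)

var bufPools = [...]*sync.Pool{
	{New: func() any { return make([]byte, 0, smallBufSize) }},
	{New: func() any { return make([]byte, 0, medBufSize) }},
	{New: func() any { return make([]byte, 0, largeBufSize) }},
}

// poolIndex picks the smallest size class that fits n bytes.
func poolIndex(n int) int {
	switch {
	case n <= smallBufSize:
		return 0
	case n <= medBufSize:
		return 1
	default:
		return 2
	}
}

// getBuf returns a pooled buffer with capacity for at least n bytes.
func getBuf(n int) []byte {
	return bufPools[poolIndex(n)].Get().([]byte)[:0]
}

// putBuf returns a buffer to its pool. Buffers whose ownership was
// transferred elsewhere (e.g. queued into a WebSocket frame still being
// read) must NOT be pooled, or a later writer can race with the reader.
func putBuf(b []byte, owned bool) {
	if owned {
		return // owned elsewhere; pooling it would reintroduce the race
	}
	bufPools[poolIndex(cap(b))].Put(b[:0])
}

func main() {
	b := getBuf(1000) // falls into the medium size class
	fmt.Println(cap(b))
	putBuf(b, false)
}
```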
Fixes nats-io/nats.ws#194.
Signed-off-by: Neil Twigg <neil@nats.io>
A stream could have a complicated state with interior deletes.
This is a simpler way to determine whether we need to consider a
snapshot, and it takes much less time, CPU, and memory.
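One way the check can work, sketched with illustrative names (not the actual stream state struct): if the sequence span does not match the live message count, there must be interior deletes, which is an O(1) comparison instead of walking delete ranges.

```go
package main

import "fmt"

// streamState is a trimmed stand-in for the stream's tracked state.
type streamState struct {
	FirstSeq uint64
	LastSeq  uint64
	Msgs     uint64
}

// hasInteriorDeletes avoids scanning the stream's delete structures:
// when the span from FirstSeq to LastSeq holds more sequences than the
// live message count, something inside that range was deleted.
func hasInteriorDeletes(s streamState) bool {
	if s.Msgs == 0 {
		return false
	}
	return s.LastSeq-s.FirstSeq+1 != s.Msgs
}

func main() {
	// Span of 10 sequences but only 7 live messages: interior deletes.
	fmt.Println(hasInteriorDeletes(streamState{FirstSeq: 1, LastSeq: 10, Msgs: 7}))
}
```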
Signed-off-by: Derek Collison <derek@nats.io>
Reset our WAL on edge conditions instead of trying to recover.
Also, if we are timing out and trying to become a candidate while still
doing a catchup, check whether we are stalled.
Signed-off-by: Derek Collison <derek@nats.io>
When per-subject tracking information went from >1 back to only 1, we
needed to make sure we cleared firstNeedsUpdate.
Thanks to @scottf for finding it.
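The shape of the fix, sketched with simplified, illustrative types (the real per-subject tracking in the filestore is more involved): once the count drops back to exactly one message, the first sequence is known precisely, so the stale marker must be cleared rather than left set.

```go
package main

import "fmt"

// subjInfo is a sketch of per-subject tracking: the first sequence for
// the subject, the message count, and whether `first` is stale and
// would need a scan to recompute.
type subjInfo struct {
	first            uint64
	msgs             uint64
	firstNeedsUpdate bool
}

// removeMsg updates per-subject info when the message at seq is
// removed; lastSeq is the sequence of the subject's remaining last
// message. When the count goes from >1 back to 1, first is known
// exactly, so firstNeedsUpdate must be cleared (leaving it set was the
// bug: it forced a needless, and possibly wrong, rescan later).
func (si *subjInfo) removeMsg(seq, lastSeq uint64) {
	si.msgs--
	if si.msgs == 1 {
		// Only one message left: its sequence is known exactly.
		si.first = lastSeq
		si.firstNeedsUpdate = false // the fix: clear the stale marker
		return
	}
	if seq == si.first {
		// Removed the first message; defer the scan for the new first.
		si.firstNeedsUpdate = true
	}
}

func main() {
	si := &subjInfo{first: 1, msgs: 3}
	si.removeMsg(1, 5) // removed the first; new first not known yet
	si.removeMsg(2, 5) // down to one message: first known, flag cleared
	fmt.Println(si.first, si.firstNeedsUpdate)
}
```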
Signed-off-by: Derek Collison <derek@nats.io>