Currently `UpdateKnownPeers` doesn't send a peer state when a single
peer add operation is taking place, but it seems like this can
potentially race when there are lots of changes to the replica count
happening in rapid succession. Sending the peer state in all cases seems
to fix this issue and, so far in my testing, fixes the failground stream
update replicas test.
Signed-off-by: Neil Twigg <neil@nats.io>
When consumers were R1 and the same name was reused, server restarts
could try to clean up old ones and affect the new ones. These changes
make consumer name reuse more robust across server restarts.
Signed-off-by: Derek Collison <derek@nats.io>
- [ ] Link to issue, e.g. `Resolves #NNN`
- [ ] Documentation added (if applicable)
- [ ] Tests added
- [ ] Branch rebased on top of current main (`git pull --rebase origin
main`)
- [ ] Changes squashed to a single commit (described
[here](http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html))
- [x] Build is green in Travis CI
- [x] You have certified that the contribution is your original work and
that you license the work to the project under the [Apache 2
license](https://github.com/nats-io/nats-server/blob/main/LICENSE)
Resolves panics in the code.
### Changes proposed in this pull request:
- This PR fixes some of the panics in the code
PR https://github.com/nats-io/nats-server/pull/4212 fixed the issue I
reported in https://github.com/nats-io/nats-server/issues/4196.
However, I believe there might be a bug when both `sequence` and `keep`
are set during recovery.
In the `PurgeEx` the following check is done (for both `filestore.go`
and `memstore.go`):
```go
if sequence > 1 && keep > 0 {
return 0, ErrPurgeArgMismatch
}
```
The `TestJetStreamClusterPurgeExReplayAfterRestart` also triggers this
case, meaning that during the test this error is returned but it
succeeds because the purge was already performed. Is this intended
behaviour?
To elaborate a bit more, I believe the following happens:
- when running the purge normally it will properly run the `keep` (since
it's not combined with `sequence` yet)
- when replaying the purge though, the `sequence` is added alongside the
`keep`, which trips the `if` check above and returns the error
This means that during normal operation all will be well, but purges
with `keep` will be ignored upon replaying.
I'm proposing to remove the `sequence > 1 && keep > 0` check and the
subsequent error, which, for reference, were introduced in
https://github.com/nats-io/nats-server/pull/3121.
Hoping this ensures that during recovery, purges that haven't executed
yet will still be executed.
An alternative approach that wouldn't remove the error: disallow
combining `sequence` and `keep` during normal operation and only allow it
during recovery. This would preserve the current behaviour and still
correctly apply `sequence+keep` during recovery. However, I'm not sure
whether it's possible to know if we're in "recovery mode" from within `PurgeEx`.
Resolves https://github.com/nats-io/nats-server/issues/4196
This is an extension to the excellent work by @MauriceVanVeen and his
original PR #4197 to fully resolve for all use cases.
Signed-off-by: Derek Collison <derek@nats.io>
Resolves #4196
When a server was killed on restart before an encrypted stream was
recovered, the key file had already been removed, which could leave the
stream unrecoverable.
We only needed to delete the key file when converting ciphers and right
before we add the stream itself.
Signed-off-by: Derek Collison <derek@nats.io>
Resolves #4195
When we were optimizing for single cluster and large numbers of
leafnodes we inadvertently broke a daisy chained scenario where a server
was a spoke and a hub within a single hub server.
So interest on D would not propagate properly to server A as a
publisher.
```
      B
     / \
    A   C -- D (SUB)
    |
   PUB
```
This adds a workflow to mark issues and PRs stale after the configured
period of time, followed by closing the issue/PR after a subsequent
period of time if there was no additional activity.
The `debug-only` option is currently enabled, so even when merged, it will do
a dry-run and not perform any actions. Once we inspect the initial logs
of the effect of an initial run (impacting existing issues), we can
adjust accordingly and then follow-up with making it active.
For the debug logs to be enabled, we do need to add a repository secret
named `ACTIONS_STEP_DEBUG` with a value set to `true` per [this
instruction](https://github.com/marketplace/actions/close-stale-issues#debugging).
1. When catching up, do not try forever and, if needed, reset cluster state.
2. When checking if a stream is healthy, check for node drift.
3. When restarting a stream, make sure the current node is stopped.
Signed-off-by: Derek Collison <derek@nats.io>
Make sure to wait properly until we believe we are caught up to enable
direct gets on followers.
Signed-off-by: Derek Collison <derek@nats.io>
Resolves #4162