Commit Graph

7220 Commits

Author SHA1 Message Date
Derek Collison
4ac45ff6f3 When consumers were R1 and the same name was reused, server restarts could try to clean up old ones and affect the new ones.
These changes allow consumer names to be reused reliably across server restarts.

Signed-off-by: Derek Collison <derek@nats.io>
2023-06-05 12:48:18 -07:00
Derek Collison
64e3bf82ed Fix PurgeEx replay with sequence & keep succeeds (#4213)
PR https://github.com/nats-io/nats-server/pull/4212 fixed the issue I
reported in https://github.com/nats-io/nats-server/issues/4196.

However, I believe there might be a bug when both `sequence` and `keep`
are set during recovery.
In `PurgeEx`, the following check is done (for both `filestore.go`
and `memstore.go`):
```go
	if sequence > 1 && keep > 0 {
		return 0, ErrPurgeArgMismatch
	}
```

The `TestJetStreamClusterPurgeExReplayAfterRestart` also triggers this
case, meaning that during the test this error is returned but it
succeeds because the purge was already performed. Is this intended
behaviour?

To elaborate a bit more, I believe the following happens:
- when running the purge normally, it properly applies `keep` (since
it's not combined with `sequence` yet)
- when replaying the purge, however, `sequence` is added alongside
`keep`, which errors out on the check above

This means that during normal operation all will be well, but purges
with `keep` will be ignored on replay.

I'm proposing to remove the `sequence > 1 && keep > 0` check and the
subsequent error, which, for reference, were introduced in
https://github.com/nats-io/nats-server/pull/3121.
This should ensure that purges that haven't executed yet are still
executed during recovery.

An alternative approach, which wouldn't remove the error: disallow
combining `sequence` and `keep` normally and only allow it during
recovery. This would preserve the current behaviour and still correctly
apply `sequence+keep` during recovery. However, I'm not sure it's
possible to know whether we're in "recovery mode" from within `PurgeEx`.

Resolves https://github.com/nats-io/nats-server/issues/4196
2023-06-04 13:29:53 -07:00
Maurice van Veen
132567de39 Fix PurgeEx replay with sequence & keep succeeds 2023-06-04 11:56:28 +02:00
Derek Collison
e1f8064e9e [FIXED] Make sure to process extended purge operations correctly when being replayed. (#4212)
This is an extension to the excellent work by @MauriceVanVeen and his
original PR #4197 to fully resolve for all use cases.

Signed-off-by: Derek Collison <derek@nats.io>

Resolves #4196
2023-06-03 18:12:22 -07:00
Derek Collison
dee532495d Make sure to process extended purge operations correctly when being replayed on a restart.
Signed-off-by: Derek Collison <derek@nats.io>
2023-06-03 17:49:45 -07:00
Derek Collison
eb09ddd73a [FIXED] Killed server on restart could render encrypted stream unrecoverable (#4210)
When a server was killed on restart before an encrypted stream was
recovered, the keyfile was removed, which could leave the stream
unrecoverable.

We only needed to delete the key file when converting ciphers and right
before we add the stream itself.

Signed-off-by: Derek Collison <derek@nats.io>

Resolves #4195
2023-06-03 17:36:10 -07:00
Derek Collison
449b429b58 [FIXED] Data races detected in internal testing (#4211)
Signed-off-by: Derek Collison <derek@nats.io>
2023-06-03 16:20:56 -07:00
Derek Collison
238282d974 Fix some data races detected in internal testing
Signed-off-by: Derek Collison <derek@nats.io>
2023-06-03 13:58:15 -07:00
Derek Collison
4c1b93d023 Make sure to put the keyfile back if we did not recover the stream.
Signed-off-by: Derek Collison <derek@nats.io>
2023-06-03 11:21:58 -07:00
Derek Collison
d5ae96f54d When a server was killed on restart before an encrypted stream was recovered, the keyfile was removed, which could leave the stream unrecoverable.
We only needed to delete the key file when converting ciphers and right before we add the stream itself.

Signed-off-by: Derek Collison <derek@nats.io>
2023-06-03 11:21:47 -07:00
Derek Collison
22c97d67ff [FIXED] Daisy chained leafnodes sometimes would not propagate interest (#4207)
When we were optimizing for single cluster and large numbers of
leafnodes, we inadvertently broke a daisy-chained scenario where a
server was both a spoke and a hub within a single hub cluster.

So interest on D would not propagate properly to server A, the
publisher.

```
     B
   /    \
A       C -- D (SUB)
 |
PUB
```
2023-06-02 16:43:21 -07:00
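The hop-by-hop propagation the diagram above depends on can be modeled with a toy graph. This is a hypothetical sketch, not the server's leafnode code: a SUB on D must be forwarded D → C → B → A so the publisher on A sees the interest, and the regression meant a node acting as both spoke and hub dropped that forward:

```go
package main

import "fmt"

// node is a toy stand-in for a server in the daisy chain.
type node struct {
	name      string
	neighbors []*node
	interest  map[string]bool
}

func newNode(name string) *node {
	return &node{name: name, interest: make(map[string]bool)}
}

func link(a, b *node) {
	a.neighbors = append(a.neighbors, b)
	b.neighbors = append(b.neighbors, a)
}

// propagate floods interest to every neighbor except the one it came from.
func (n *node) propagate(subj string, from *node) {
	if n.interest[subj] {
		return
	}
	n.interest[subj] = true
	for _, nb := range n.neighbors {
		if nb != from {
			nb.propagate(subj, n)
		}
	}
}

func main() {
	a, b, c, d := newNode("A"), newNode("B"), newNode("C"), newNode("D")
	link(a, b)
	link(b, c)
	link(c, d)
	d.propagate("orders.new", nil)        // SUB on D
	fmt.Println(a.interest["orders.new"]) // true: publisher on A sees interest
}
```

If any intermediate node fails to forward, `a.interest` stays empty and messages published on A are never routed toward D.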
Derek Collison
b2ac621212 Bump to 2.9.18-beta (#4182) 2023-06-02 16:40:23 -07:00
Derek Collison
1bce79750e When we were optimizing for single cluster but large numbers of leafnodes, we inadvertently broke a daisy-chained scenario where a server was a spoke and a hub with a single hub cluster.
Signed-off-by: Derek Collison <derek@nats.io>
2023-06-02 15:16:36 -07:00
Derek Collison
25ad3cd4af Only check ack floor if we are interest policy based. (#4206)
Saw a performance issue with a user's limits-based stream with a large
number of consumers.

Signed-off-by: Derek Collison <derek@nats.io>
2023-06-02 12:43:06 -07:00
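The guard described above can be sketched as a small predicate. The policy names below mirror JetStream's retention modes but are illustrative types, not the server's, and treating work-queue retention as ack-dependent is an assumption:

```go
package main

import "fmt"

// RetentionPolicy is a hypothetical stand-in for JetStream's retention modes.
type RetentionPolicy int

const (
	LimitsPolicy RetentionPolicy = iota
	InterestPolicy
	WorkQueuePolicy
)

// shouldCheckAckFloor sketches the commit's guard: the ack-floor scan
// only matters when message retention depends on consumer acks, so
// limits-based streams with many consumers can skip the work entirely.
func shouldCheckAckFloor(p RetentionPolicy) bool {
	return p == InterestPolicy || p == WorkQueuePolicy
}

func main() {
	fmt.Println(shouldCheckAckFloor(LimitsPolicy))   // false
	fmt.Println(shouldCheckAckFloor(InterestPolicy)) // true
}
```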
Derek Collison
27bbfb7a85 Only check ack floor if we are interest policy based.
Signed-off-by: Derek Collison <derek@nats.io>
2023-06-02 11:04:00 -07:00
Byron Ruth
b24f0f393a Bump to 2.9.18-beta
Signed-off-by: Byron Ruth <byron@nats.io>
2023-05-18 14:22:22 -04:00
Waldemar Quevedo
4f2c9a5184 Prepare v2.9.17 release (#4181)
Include fix with GoReleaser for nightly.
2023-05-18 11:19:34 -07:00
Byron Ruth
f3dac91d2a Prepare v2.9.17 release
Include fix with GoReleaser for nightly.

Signed-off-by: Byron Ruth <byron@nats.io>
2023-05-18 13:57:40 -04:00
Derek Collison
25d9762ce2 [IMPROVED] Make health checks more consistent with stream health checks. (#4180)
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-18 09:18:12 -07:00
Derek Collison
7e3f3f4908 Make health checks more consistent with stream health checks.
Check for closed state on leader change for consumers.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-18 08:18:53 -07:00
Derek Collison
f63d63fbce [IMPROVED] Stepdown on catchup request for something newer than our state (#4179)
When we receive a catchup request for an item beyond our current state,
we should stepdown.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-17 19:25:05 -07:00
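The stepdown rule above amounts to a one-line comparison. This is a minimal sketch with hypothetical names; the exact boundary condition is a simplifying assumption, not the server's implementation:

```go
package main

import "fmt"

// shouldStepdown sketches the rule: if a follower asks to be caught up
// starting past our last applied sequence, it holds state we do not,
// so we cannot serve the catchup and should stop acting as leader.
// A request at exactly ourLastSeq+1 just means the follower is caught up.
func shouldStepdown(requestSeq, ourLastSeq uint64) bool {
	return requestSeq > ourLastSeq+1
}

func main() {
	fmt.Println(shouldStepdown(110, 100)) // true: follower is ahead of our state
	fmt.Println(shouldStepdown(90, 100))  // false: a normal catchup we can serve
}
```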
Derek Collison
4fbc0ee563 Update to Go 1.19.9 (#4178) 2023-05-17 18:01:58 -07:00
Byron Ruth
3a152a0e40 Update to Go 1.19.9
Signed-off-by: Byron Ruth <byron@nats.io>
2023-05-17 20:57:10 -04:00
Derek Collison
8e825001d2 When we receive a catchup request for an item beyond our current state, we should stepdown.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-17 17:30:35 -07:00
Derek Collison
7dfe5e528e Bump to 2.9.17-RC.3
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-17 16:46:10 -07:00
Derek Collison
93eaf8c814 Add workflow for stale issues (#4161)
This adds a workflow to mark issues and PRs stale after the configured
period of time, followed by closing the issue/PR after a subsequent
period of time if there was no additional activity.

The `debug-only` option is currently set, so even when merged, it will
do a dry-run and not perform any actions. Once we inspect the logs of
an initial run (impacting existing issues), we can adjust accordingly
and then follow up by making it active.

For the debug logs to be enabled, we do need to add a repository secret
named `ACTIONS_STEP_DEBUG` with a value set to `true` per [this
instruction](https://github.com/marketplace/actions/close-stale-issues#debugging).
2023-05-17 16:45:17 -07:00
Derek Collison
94457e2d55 [IMPROVED] Reset logic for streams (#4177)
When we detect conditions to reset streams, make sure we properly clean
up old NRG nodes etc.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-17 16:45:00 -07:00
Derek Collison
b856bba285 [FIXED] Avoid deadlock with usage lock for an account during checkAndSyncUsage() (#4176)
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-17 16:44:44 -07:00
Derek Collison
a8d7d3886e Make sure to delete the stream assignment node here
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-17 16:19:39 -07:00
Derek Collison
44a5875968 Avoid deadlock with usage lock for an account during checkAndSyncUsage().
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-17 16:05:46 -07:00
Derek Collison
f3553791b1 Updates to stream reset logic.
1. When catching up, do not try forever; if needed, reset cluster state.
2. In checking if a stream is healthy check for node drift.
3. When restarting a stream make sure the current node is stopped.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-17 13:14:33 -07:00
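Point 1 of the commit above can be sketched as a bounded retry loop. The helper name, the attempt limit, and the callback shape are all hypothetical, chosen only to show the structure of "don't retry forever, then reset":

```go
package main

import "fmt"

// maxCatchupAttempts is an illustrative bound, not the server's value.
const maxCatchupAttempts = 3

// catchupWithReset retries catch-up a bounded number of times and, once
// the budget is exhausted, falls back to resetting cluster state.
func catchupWithReset(tryCatchup func() bool, resetClusterState func()) bool {
	for i := 0; i < maxCatchupAttempts; i++ {
		if tryCatchup() {
			return true
		}
	}
	resetClusterState()
	return false
}

func main() {
	attempts, resets := 0, 0
	ok := catchupWithReset(
		func() bool { attempts++; return false }, // always fails
		func() { resets++ },
	)
	fmt.Println(ok, attempts, resets) // false 3 1
}
```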
Derek Collison
5db57fb053 Bump to 2.9.17-RC.2
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-16 14:02:29 -07:00
Derek Collison
ac68a19530 [IMPROVED] Restart consumer behavior during healthz() checks. (#4172)
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-16 13:58:47 -07:00
Derek Collison
a06e1c9b43 Make sure to also stop nodes when dealing with consumer after stream restart
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-16 13:16:47 -07:00
Derek Collison
3752a6c500 Make sure to stop the node on a consumer restart if still running
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-16 12:49:46 -07:00
Derek Collison
734895ae47 Fix test flapper
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-16 12:20:18 -07:00
Derek Collison
87f17fcff4 [FIXED] Avoid stale KV reads on server restart for replicated KV stores. (#4171)
Make sure to wait properly until we believe we are caught up to enable
direct gets on followers.

Signed-off-by: Derek Collison <derek@nats.io>

Resolves #4162
2023-05-16 11:29:37 -07:00
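The gate described above can be reduced to a sketch comparing replication progress. The function and parameter names are illustrative assumptions, not the server's API:

```go
package main

import "fmt"

// canServeDirectGets sketches the idea: a replicated KV follower should
// answer direct gets only once its applied index has reached the commit
// point it believes the leader holds; before that, a client could read
// state from before the restart (a stale read).
func canServeDirectGets(applied, committed uint64) bool {
	return applied >= committed
}

func main() {
	fmt.Println(canServeDirectGets(95, 100))  // false: still catching up
	fmt.Println(canServeDirectGets(100, 100)) // true: caught up, safe to serve
}
```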
Derek Collison
b0340ce598 Make sure to wait properly until we believe we are caught up to enable direct gets.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-16 11:02:06 -07:00
Derek Collison
4feb7b95a3 CHANGED - typo err (#4169)
Resolves # typo err

### Changes proposed in this pull request:

 - jetstream_api.go typo err
 - accounts.go typo err
 
/cc @nats-io/core
2023-05-16 07:20:19 -07:00
Savion
cd192f0e03 CHANGED - typo err 2023-05-16 16:41:52 +08:00
Derek Collison
bca7b4ea44 Bump to 2.9.17-RC.1
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-15 15:45:48 -07:00
Derek Collison
9434110c05 [FIXED] Additional fix for #3734. (#4166)
When the first block was truncated and missing any index info, we would
not properly rebuild the state.

Signed-off-by: Derek Collison <derek@nats.io>

Resolves #3734
2023-05-15 15:40:12 -07:00
Waldemar Quevedo
ee38f8bbc5 monitor: change account detail info back to utc when served (#4163)
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-05-15 15:33:57 -07:00
Derek Collison
3602ff5146 Additional fix for #3734.
When the first block was truncated and missing any index info, we would not properly rebuild the state.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-15 15:30:55 -07:00
Derek Collison
584ea85d75 [FIXED] Protect against out of bounds access on usage updates. (#4164)
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-15 14:58:05 -07:00
Derek Collison
832df1cdba Protect against out of bounds access on usage updates.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-15 14:38:26 -07:00
Derek Collison
fe71ef524c [FIXED] Service imports reporting for Accountz() when mapping to local subjects. (#4158)
Signed-off-by: Derek Collison <derek@nats.io>

Resolves #4144
2023-05-15 14:04:57 -07:00
Derek Collison
ea75beaeb1 [FIXED] Track all remote servers in a NATS system with different domains. (#4159)
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-15 13:47:06 -07:00
Byron Ruth
f764853e00 Disable auto-closing, increase stale threshold
Signed-off-by: Byron Ruth <byron@nats.io>
2023-05-15 09:14:31 -04:00
Waldemar Quevedo
3c4ed549a5 resolver: improve signaling for missing account lookups (#4151)
When using the NATS account resolver and a JWT is not found, the client
would often get an i/o timeout error because no response arrived before
the account resolver fetch request timed out. Now, instead of waiting
for the fetch request to time out, a resolver that does not hold the
JWT will also notify that it could not find a matching JWT, rather than
having the client wait on responses from all active servers.

Also included in this PR is some cleanup to the logs emitted by the
resolver.

Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-05-14 11:10:25 -07:00