Commit Graph

478 Commits

Author SHA1 Message Date
Derek Collison
f342f6a758 Merge branch 'main' into dev 2023-06-05 14:13:18 -07:00
Derek Collison
2e2ac33920 [IMPROVED] When R1 consumers were recreated with the same name when they became inactive. (#4216)
When consumers were R1 and the same name was reused, server restarts
could try to cleanup old ones and effect the new ones. These changes
allow consumer name reuse more effectively during server restarts.

Signed-off-by: Derek Collison <derek@nats.io>
2023-06-05 14:04:53 -07:00
Derek Collison
df5df3ce99 Panic fixes (#4214)
- [ ] Link to issue, e.g. `Resolves #NNN`
 - [ ] Documentation added (if applicable)
 - [ ] Tests added
- [ ] Branch rebased on top of current main (`git pull --rebase origin
main`)
- [ ] Changes squashed to a single commit (described
[here](http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html))
 - [x] Build is green in Travis CI
- [x] You have certified that the contribution is your original work and
that you license the work to the project under the [Apache 2
license](https://github.com/nats-io/nats-server/blob/main/LICENSE)

Resolves panics in the code.

### Changes proposed in this pull request:

 - This PR fixes some of the panics in the code
2023-06-05 13:02:05 -07:00
Derek Collison
4ac45ff6f3 When consumers were R1 and the same name was reused, server restarts could try to cleanup old ones and effect the new ones.
These changes allow consumer name reuse more effectively during server restarts.

Signed-off-by: Derek Collison <derek@nats.io>
2023-06-05 12:48:18 -07:00
Derek Collison
30d9dfd305 Merge branch 'main' into dev 2023-06-03 18:17:28 -07:00
Derek Collison
dee532495d Make sure to process extended purge operations correctly when being replayed on a restart.
Signed-off-by: Derek Collison <derek@nats.io>
2023-06-03 17:49:45 -07:00
Derek Collison
238282d974 Fix some data races detected in internal testing
Signed-off-by: Derek Collison <derek@nats.io>
2023-06-03 13:58:15 -07:00
Artem Seleznev
27a8b96ee3 different panic fixes
Signed-off-by: Artem Seleznev <seleznyov.artyom@gmail.com>
2023-06-02 13:19:22 +03:00
R.I.Pienaar
c24547eb4e Record the stream and consumer info timestamps (#4133)
This records the server time when info for streams and consumers are
created so that tools such as the nats cli can calculate time deltas for
last ack, last delivered and so forth in the context of the server
clock.

This will help aleviate problems with client devices experiencing clock
jitter that can show up in user interfaces as negative seconds since
last ack etc
2023-06-02 08:53:28 +03:00
Derek Collison
7e3f3f4908 Make health checks more consistent with stream health checks.
Check for closed state on leader change for consumers.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-18 08:18:53 -07:00
Derek Collison
a8d7d3886e Make sure to delete the stream assignment node here
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-17 16:19:39 -07:00
Derek Collison
f3553791b1 Updates to stream reset logic.
1. When catching up do not try forever and if needed reset cluster state.
2. In checking if a stream is healthy check for node drift.
3. When restarting a stream make sure the current node is stopped.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-17 13:14:33 -07:00
Derek Collison
a06e1c9b43 Make sure to also stop nodes when dealing with consumer after stream restart
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-16 13:16:47 -07:00
Derek Collison
3752a6c500 Make sure to stop the node on a consumer restart if still running
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-16 12:49:46 -07:00
Derek Collison
b0340ce598 Make sure to wait properly until we believe we are caught up to enable direct gets.
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-16 11:02:06 -07:00
Derek Collison
5e029d08d5 For older R1 streams created by previous servers we could have no cluster for the stream assignment group which would prevent scale up with newer servers.
This will inherit cluster if detected from placement tags or client cluster designation.

Signed-off-by: Derek Collison <derek@nats.io>
2023-05-10 17:59:28 -07:00
Derek Collison
b44beb4b54 Make sure to update peer set and remove old peers after new leader takes over
Signed-off-by: Derek Collison <derek@nats.io>
2023-05-09 15:15:02 -07:00
R.I.Pienaar
fb1d86d506 Record the stream and consumer info timestamps
This records the server time when info for streams and
consumers are created so that tools such as the nats cli
can calculate time deltas for last ack, last delivered and
so forth in the context of the server clock.

This will help aleviate problems with client devices experiencing
clock jitter that can show up in user interfaces as negative
seconds since last ack etc

Signed-off-by: R.I.Pienaar <rip@devco.net>
2023-05-05 18:53:09 +02:00
Ivan Kozlovic
840c264f45 Cleanup use of s.opts and fixed some lock (deadlock/inversion) issues
One should not access s.opts directly but instead use s.getOpts().
Also, server lock needs to be released when performing an account
lookup (since this may result in server lock being acquired).
A function was calling s.LookupAccount under the client lock, which
technically creates a lock inversion situation.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-05-03 14:09:02 -06:00
Derek Collison
0321eb6484 Merge branch 'main' into dev 2023-04-29 19:52:57 -07:00
Derek Collison
b27ce6de80 Add in a few more places to check on jetstream shutting down.
Add in a helper method.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 11:27:18 -07:00
Derek Collison
4eb4e5496b Do health check on startup once we have processed existing state.
Also do health checks in separate go routine.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 09:36:35 -07:00
Derek Collison
fac5658966 If we fail to create a consumer, make sure to clean up any raft nodes in meta layer and to shutdown the consumer if created but we encountered an error.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 08:15:33 -07:00
Derek Collison
546dd0c9ab Make sure we can recover an underlying node being stopped.
Do not return healthy if the node is closed, and wait a bit longer for forward progress.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-29 07:42:23 -07:00
Derek Collison
85f6bfb2ac Check healthz periodically
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:58:45 -07:00
Derek Collison
d107ba3549 Under certain scenarios we have witnessed healthz() that never retrun healthy due to a stream or consumer being missing or stopped.
This will now allow the healthy call to attempt to restart those assets.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:11:08 -07:00
Neil Twigg
e30ea34625 Add op type to panics
Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-27 11:38:52 +01:00
Derek Collison
4ebdb69daf Merge branch 'main' into dev 2023-04-26 11:34:37 -07:00
Derek Collison
83293f86ff Reduce threshold for compressing messages during a catchup
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-25 19:01:06 -07:00
Derek Collison
3c964a12d7 Migration could be delayed due to transferring leadership while the new leader was still paused.
Also check quicker but slow down if the state we need to have is not there yet.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-25 18:58:49 -07:00
Derek Collison
a93fd080f0 Merge branch 'main' into dev 2023-04-19 08:53:09 -07:00
Derek Collison
f6195a5ee3 A stream could have a complicated state with interior deletes.
This is a simpler way to determine if we need to consider a snapshot that involves much less time and CPU and memory.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-18 19:11:49 -07:00
Derek Collison
a319d24345 Merge branch 'main' into dev 2023-04-13 21:03:05 -07:00
Derek Collison
093564f7e0 The meta layer should snapshot if any oustanding entries are present, regardless of hash.
Fixes this test [TestJetStreamClusterDeleteAndRestoreAndRestart] which would flap since it would not snapshot since hash was same but had entries that would erase stream data.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-13 20:37:00 -07:00
Derek Collison
cc77d662bb Make sure to process consumer entries on recovery in case state was not committed.
Also sync other consumers when taking over as leader but no need to process snapshots when we are in fact the leader.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-13 18:40:17 -07:00
Derek Collison
2f4677d29e Delay a bit longer if we are not the actual leader, helpful for very large stream reports to avoid possible dupes
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-12 12:36:47 -07:00
Derek Collison
dfeac4a214 Merge branch 'main' into dev 2023-04-09 19:31:01 -07:00
Derek Collison
3b9cf1e381 Needed to do more in separate go routine to avoid deadlock
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-08 18:43:58 -07:00
Derek Collison
35bb7c1737 Pool CommittedEntries as well with a ReturnToPool() that will also recycle the Entry. Needs to integrate with upper layers
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-08 11:34:10 -07:00
Derek Collison
d02d59534f Fix data race
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-07 07:18:30 -07:00
Derek Collison
ce0d8514be Merge branch 'main' into dev 2023-04-07 05:32:05 -07:00
Derek Collison
c16915bff4 For checking the health of jetstream, do not hold the lock as we traverse the streams and consumers.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-06 11:56:55 -07:00
Derek Collison
03d5cf0871 Merge branch 'main' into dev 2023-04-04 15:14:41 -07:00
Neil Twigg
03a5a4deaf Possibly de-race sysRequest
Signed-off-by: Neil Twigg <neil@nats.io>
2023-04-04 10:30:59 +01:00
Derek Collison
c5e19e19e7 Merge branch 'main' into dev 2023-04-03 21:22:53 -07:00
Derek Collison
b0c3cf0dbd Only apply consumer entries if not recovering
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-03 17:22:50 -07:00
Derek Collison
1ae51b23a9 [ADDED] Multiple routes and ability to have per-account routes (#4001)
New configuration fields:
```
cluster {
   ...
   pool_size: 5
   accounts: ["A", "B"]
}
```

The configuration `pool_size` in the example above means that this
server will create 5 routes to a remote server, assuming that that
server has the same `pool_size` setting.

Accounts (which are not part of the `accounts[]` configuration)
are assigned a specific route in this pool, and this will be the
same route on all servers in the cluster.

Accounts that are defined in the `accounts` field will each have
a dedicated route connection. This will allow suppression of the
account name in some of the route protocols, reducing bytes transmitted
which may increase performance.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-04-03 15:33:46 -07:00
Derek Collison
59175c491f Fix for a datarace
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-03 14:46:57 -07:00
Derek Collison
9dd727034a Make sure to not stop raft layer when we detect we are already running the monitor
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-03 14:46:47 -07:00
Ivan Kozlovic
bd1b7b8d55 Cleanup use of s.opts and fixed some lock (deadlock/inversion) issues
One should not access s.opts directly but instead use s.getOpts().
Also, server lock needs to be released when performing an account
lookup (since this may result in server lock being acquired).
A function was calling s.LookupAccount under the client lock, which
technically creates a lock inversion situation.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-04-03 09:32:28 -06:00