Commit Graph

158 Commits

Author SHA1 Message Date
Derek Collison
e1c8f9fb55 This improves behavior when a server is under load or low on resources such as FDs and a user is trying to delete a stream with lots of consumers.
Signed-off-by: Derek Collison <derek@nats.io>
2022-06-04 16:49:17 -07:00
Derek Collison
ef3eea4d73 Speed up raft for tests
Signed-off-by: Derek Collison <derek@nats.io>
2022-05-18 16:28:58 -07:00
Derek Collison
ccd2290355 With use cases bringing us more data, I wanted to suggest these changes.
By inlining election timeout updates we doubled the lock contention and most likely introduced head-of-line issues for routes under heavy load.
Also slowed down heartbeats, given how many assets are being deployed in our user ecosystem, and moved the normal follower-to-candidate timing further out, similar to the lost-quorum timing.
Note that the happy-path transfer will still be very quick.

Signed-off-by: Derek Collison <derek@nats.io>
2022-05-15 09:55:22 -07:00
Ivan Kozlovic
2ce1dc1561 [FIXED] JetStream: possible lockup due to a return prior to unlock
This would happen in a situation where a node receives an append
entry with a term higher than the node's (current leader).

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-05-10 17:11:57 -06:00
Derek Collison
6f54b032d6 Raft and cluster improvements.
Signed-off-by: Derek Collison <derek@nats.io>
2022-05-03 15:20:46 -07:00
Ivan Kozlovic
2659b30113 [IMPROVED] JetStream: add file names for invalid checksums
On restart, we report when we find errors in checksums, but we
did not report the name of the file.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-04-18 13:35:08 -06:00
Derek Collison
4aaea8e4c4 Improvements to move semantics.
Signed-off-by: Derek Collison <derek@nats.io>
2022-04-16 07:55:05 -07:00
Derek Collison
2a8b123706 Don't quickly declare lost quorum after scale up
Signed-off-by: Derek Collison <derek@nats.io>
2022-04-15 13:28:34 -07:00
Ivan Kozlovic
4e7c72ab33 Update based on code review
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-04-14 11:00:33 -06:00
Ivan Kozlovic
bd61d51a1c [IMPROVED] JetStream: reduce unnecessary leader election
- Wait for some sort of routing to be in place before starting
the raft run loop
- Remove use of a lock in apiDispatch that was not necessary but
could have caused a route to block, causing memory growth, etc.

Unrelated: renamed some tests so that they start with TestJetStream
(and TestJetStreamCluster for cluster tests), fixed some flappers,
and ensured that tests that change RAFT timeouts put them back
to default values on exit.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-04-14 10:47:14 -06:00
Derek Collison
9748925f13 Improvements to stream and consumer move.
During an elected stepdown and transfer, allow the new leader to take over before we step down.
We could receive a leader change, so make sure to also check migration state.

Signed-off-by: Derek Collison <derek@nats.io>
2022-04-14 07:27:29 -07:00
Ivan Kozlovic
1ba617bba0 Fixed data race with RAFT node election timer
Got this race:
```
==================
WARNING: DATA RACE
Read at 0x00c001c880e8 by goroutine 342:
  github.com/nats-io/nats-server/v2/server.(*raft).resetElect()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:1525 +0x44
  github.com/nats-io/nats-server/v2/server.(*raft).resetElectionTimeout()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:1520 +0xa4
  github.com/nats-io/nats-server/v2/server.(*raft).handleAppendEntry()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:2537 +0x12e
  github.com/nats-io/nats-server/v2/server.(*raft).handleAppendEntry-fm()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:2525 +0xcc
...

Previous write at 0x00c001c880e8 by goroutine 587:
  github.com/nats-io/nats-server/v2/server.(*raft).resetElect()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:1526 +0x113
  github.com/nats-io/nats-server/v2/server.(*raft).resetElectionTimeout()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:1520 +0xa4
  github.com/nats-io/nats-server/v2/server.(*Server).startRaftNode()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:484 +0x20d1
  github.com/nats-io/nats-server/v2/server.(*jetStream).createRaftGroup()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:1497 +0x9ed
  github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateConsumer()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3063 +0xba4
...

==================
WARNING: DATA RACE
Read at 0x00c0006671f0 by goroutine 342:
  time.(*Timer).Stop()
      /usr/local/go/src/time/sleep.go:78 +0x84
  github.com/nats-io/nats-server/v2/server.(*raft).resetElect()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:1528 +0x58
  github.com/nats-io/nats-server/v2/server.(*raft).resetElectionTimeout()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:1520 +0xa4
  github.com/nats-io/nats-server/v2/server.(*raft).handleAppendEntry()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:2537 +0x12e
  github.com/nats-io/nats-server/v2/server.(*raft).handleAppendEntry-fm()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:2525 +0xcc
...

Previous write at 0x00c0006671f0 by goroutine 587:
  time.NewTimer()
      /usr/local/go/src/time/sleep.go:92 +0xb3
  github.com/nats-io/nats-server/v2/server.(*raft).resetElect()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:1526 +0x104
  github.com/nats-io/nats-server/v2/server.(*raft).resetElectionTimeout()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:1520 +0xa4
  github.com/nats-io/nats-server/v2/server.(*Server).startRaftNode()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/raft.go:484 +0x20d1
  github.com/nats-io/nats-server/v2/server.(*jetStream).createRaftGroup()
      /Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:1497 +0x9ed
...
```

Looked at all places where resetElect() or resetElectionTimeout() was invoked without
being protected by the raft node's lock, and added the missing locking.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-04-12 18:56:28 -06:00
Derek Collison
3bd8ee845e Fix description for Wipe
Signed-off-by: Derek Collison <derek@nats.io>
2022-04-12 16:18:38 -07:00
Ivan Kozlovic
50c3986863 [FIXED] JetStream stream catchup issues
- A stream could become leader when it should not, causing
messages to be lost.
- A catchup could stall because the server sending data
could bail out of the runCatchup routine but still send
the EOF signal.
- Deadlock with monitoring of Jsz

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Signed-off-by: Derek Collison <derek@nats.io>
2022-04-12 16:05:12 -06:00
Derek Collison
e330572cef Select next leader before truncating
Signed-off-by: Derek Collison <derek@nats.io>
2022-04-11 17:04:29 -07:00
Derek Collison
3ed1ecc032 Remove old code
Signed-off-by: Derek Collison <derek@nats.io>
2022-04-11 12:00:29 -07:00
Derek Collison
37cbac99e7 Improvements to the raft layer for general stability and support of scale up and down and asset move.
Also fixed a bug that would allow a leadership transfer when catching up.

Signed-off-by: Derek Collison <derek@nats.io>
2022-04-10 08:59:39 -07:00
Derek Collison
7e38ebcb6e Allow assets such as streams and their associated consumers to migrate between clusters.
The system will allow an update to a stream, and subsequently all attached consumers, to be placed in another cluster either directly or via tag placement.
The meta layer will scale the underlying peerset appropriately to straddle the two clusters for both the stream and consumers, taking into account the consumer type.
Control will then pass to the current leaders of the assets who will monitor the catchup status of the new peers.
(Note we can optimize this later to only traverse once across a GW for any given asset, but for now this is simpler)
Once the original leaders have determined the assets are synced, they will pass leadership to a member of the new peerset.
Once the new leader has been elected, it will forward a request for the meta layer to shrink the peerset by removing the old peers.

Signed-off-by: Derek Collison <derek@nats.io>
2022-04-04 18:28:36 -07:00
Derek Collison
6b379329d8 Fix for #2955. When scaling up a stream with existing messages, the existing messages were not being replicated.
Also fixed a bug where we were incorrectly not spinning up the monitoring loop for a stream when going from 3->1->3.

Signed-off-by: Derek Collison <derek@nats.io>
2022-03-26 07:26:46 -07:00
Derek Collison
ef8f543ea5 Improve memory usage through JetStream storage layer.
Previously we relied more heavily on Go's garbage collector since, when we loaded a block for an underlying stream, we would pass references upward to avoid copies.
Now we always copy when passing back to the upper layers, which allows us not only to expire our cache blocks but to pool and reuse them.

The upper layers also had changes made to allow the pooling layer at that level to interoperate with the storage layer optionally.

Also fixed some flappers and a bug where de-dupe state might not be re-formed correctly.

Signed-off-by: Derek Collison <derek@nats.io>
2022-03-24 17:45:15 -06:00
Derek Collison
d7e1e5ae61 Make sure that we do not become a candidate/leader too soon or when we are not caught up.
Signed-off-by: Derek Collison <derek@nats.io>
2022-03-24 17:45:15 -06:00
Ivan Kozlovic
c3da392832 Changes to IPQueues
Removed the warnings; instead there is a sync.Map where the queues are
registered/unregistered and can be inspected with an undocumented
monitor page.
Added the notion of "in progress", which is the number of messages
that have been pop()'ed. When recycle() is invoked this count
goes down.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-03-17 17:53:06 -06:00
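A minimal sketch of the "in progress" accounting the commit describes, using a hypothetical simplified queue (the server's actual IPQueue has more machinery): items popped in a batch count as in progress until the batch is recycled.

```go
package main

import "sync"

// ipQueue is a hypothetical sketch of an in-process queue that tracks an
// "in progress" count: items popped but not yet handed back via recycle().
type ipQueue struct {
	mu         sync.Mutex
	items      []interface{}
	inProgress int
}

func (q *ipQueue) push(v interface{}) {
	q.mu.Lock()
	q.items = append(q.items, v)
	q.mu.Unlock()
}

// pop takes the whole pending batch and counts it as in progress until
// the caller returns it with recycle().
func (q *ipQueue) pop() []interface{} {
	q.mu.Lock()
	defer q.mu.Unlock()
	batch := q.items
	q.items = nil
	q.inProgress += len(batch)
	return batch
}

// recycle marks a popped batch as fully processed, so the count reflects
// only work that is genuinely outstanding.
func (q *ipQueue) recycle(batch []interface{}) {
	q.mu.Lock()
	q.inProgress -= len(batch)
	q.mu.Unlock()
}

func main() {
	q := &ipQueue{}
	q.push(1)
	q.push(2)
	b := q.pop()
	q.recycle(b)
}
```

This is why the count "goes down when recycle() is invoked": pending length alone cannot distinguish a drained queue from one whose consumer is stuck mid-batch.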
Ivan Kozlovic
b4128693ed Ensure file path is correct during stream restore
Also had to change all references from `path.` to `filepath.` when
dealing with files, so that it works properly on Windows.

Also fixed lots of tests to defer the shutdown of the server
until after the removal of the storage, and fixed some config file
directories to use the single quote `'` to surround the file path,
again to work on Windows.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-03-09 13:31:51 -07:00
Ivan Kozlovic
b7917e1d05 Various changes
- Limit IPQueue logging
- Add a rand per raft node. This is because in some situations, when
running 3 servers at the same time, we would end up with identical
inboxes for different subjects on the different nodes, which would
cause panics
- Move the creation of internal subscriptions after the tracking
of the node and its peers.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-02-22 16:22:32 -07:00
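The per-node rand point above can be sketched as follows; the struct and inbox format here are illustrative stand-ins, not the server's actual types:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const inboxAlphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// node carries its own *rand.Rand so that several nodes created in one
// process (e.g. 3 servers inside a test binary) do not draw identical
// inbox suffixes from a shared or identically-seeded source.
type node struct {
	prand *rand.Rand
}

func newNode(seed int64) *node {
	return &node{prand: rand.New(rand.NewSource(seed))}
}

// newInbox builds an inbox-style subject from this node's private source.
func (n *node) newInbox() string {
	b := make([]byte, 16)
	for i := range b {
		b[i] = inboxAlphabet[n.prand.Intn(len(inboxAlphabet))]
	}
	return "_INBOX." + string(b)
}

func main() {
	a := newNode(time.Now().UnixNano())
	b := newNode(time.Now().UnixNano() + 1)
	fmt.Println(a.newInbox() != b.newInbox())
}
```

With a single shared source and identical seeding, nodes started at the same instant could mint the same "unique" inbox for different subjects, which is the collision the commit fixes.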
Derek Collison
fb15dfd9b7 Allow replica updates during stream update.
Also add in HAAssets count to Jsz.

Signed-off-by: Derek Collison <derek@nats.io>
2022-02-13 19:33:46 -08:00
Ivan Kozlovic
29c40c874c Adding logger for IPQueue
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:14:00 -07:00
Ivan Kozlovic
48fd559bfc Reworked RAFT's leader change channel
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:12:11 -07:00
Ivan Kozlovic
fc7a4047a5 Renamed variables, removing the "c" that indicated it was a channel
2022-01-13 13:11:05 -07:00
Ivan Kozlovic
62a07adeb9 Replaced catchup and stream restore channels
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:09:49 -07:00
Ivan Kozlovic
645a9a14b7 Replaced RAFT's stepdown channel
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:09:01 -07:00
Ivan Kozlovic
2ad95f7e52 Replaced RAFT's vote request and response channels
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:08:05 -07:00
Ivan Kozlovic
d74dba2df9 Replaced RAFT's append entry response channel
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:06:48 -07:00
Ivan Kozlovic
b5979294db Replaced RAFT's append entry channel
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:06:29 -07:00
Ivan Kozlovic
ceb06d6a13 Replaced RAFT's apply channel
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:06:10 -07:00
Ivan Kozlovic
ffe50d8573 [IMPROVED] JetStream clustering with lots of streams/consumers
Some operations could cause the route to block due to lock being
held during store operations. On macOS, having lots of streams/consumers
and restarting the cluster would cause lots of concurrent IO that
would cause lock to be held for too long, causing head-of-line
blocking in processing of messages from a route.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-12 20:37:00 -07:00
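The fix above is an instance of a general pattern: take a snapshot under the lock, then do the slow store I/O with the lock released, so route message processing is never queued behind disk writes. A hedged sketch with hypothetical names:

```go
package main

import (
	"fmt"
	"sync"
)

// asset is a hypothetical stand-in for a stream/consumer whose state is
// shared with the route-processing path.
type asset struct {
	mu   sync.Mutex
	data []byte
}

// persist copies the state under the lock, then runs the (potentially
// slow) store function outside it, so other lockers are not blocked for
// the duration of the I/O.
func (a *asset) persist(store func([]byte) error) error {
	a.mu.Lock()
	snap := append([]byte(nil), a.data...) // copy while holding the lock
	a.mu.Unlock()
	return store(snap) // slow I/O happens with the lock released
}

func main() {
	a := &asset{data: []byte("hello")}
	_ = a.persist(func(b []byte) error {
		fmt.Println(len(b))
		return nil
	})
}
```

Holding the lock across the store call instead is what produced the head-of-line blocking the commit describes when lots of concurrent I/O made each write slow.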
Ivan Kozlovic
299b6b53eb [FIXED] JetStream: stream blocked recovering snapshot
If a node fell behind, then when catching up with the rest of the
cluster, it is possible that a lot of append entries accumulate
and the server would print warnings such as:
```
[WRN] RAFT [jZ6RvVRH - S-R3F-CQw2ImK6] <some number> append entries pending
```
It would then continuously print the following warning:
```
AppendEntry failed to be placed on internal channel
```
When that happens, this node would always be shown to be running the
same number of operations behind (using `nats s info`) if there are
no new messages added to the stream, or an increasing number of
operations if there is still activity.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-12-20 11:41:34 -07:00
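The repeated "failed to be placed on internal channel" warning is the signature of a non-blocking send against a bounded channel that is full and never drained; once the consumer stalls, every subsequent attempt fails. A minimal illustration of that mechanic (not the server's actual code):

```go
package main

import "fmt"

// trySend attempts a non-blocking send; it returns false when the
// channel's buffer is full, which is the condition behind the warning.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan int, 2)
	fmt.Println(trySend(ch, 1)) // true
	fmt.Println(trySend(ch, 2)) // true
	fmt.Println(trySend(ch, 3)) // false: buffer full, nothing draining
}
```

With the snapshot-recovery path blocked, nothing ever reads from the channel, so the node stays the same number of operations behind forever.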
Matthias Hanel
3e8b66286d Js leaf deny (#2693)
Along a leaf node connection, unless the system account is shared AND the JetStream domain name is identical, default JetStream traffic (without a domain set) will be denied.

As a consequence, all clients that want to access a domain other than the one of the server they are connected to must specify a domain name.
Affected by this change are setups where a leaf node had no local JetStream, OR the server the leaf node connected to had no local JetStream; that is, one of the two accounts connected via a leaf node remote has no JetStream enabled.
The side that does not have JetStream enabled will lose JetStream access, and its clients must set `nats.Domain` manually.

For workarounds on how to restore the old behavior, look at:
https://github.com/nats-io/nats-server/pull/2693#issuecomment-996212582

New config values added:
`default_js_domain` is a mapping from account to domain, settable when JetStream is not enabled in an account.
`extension_hint` are hints for a non-clustered server to start in clustered mode (and be usable to extend)
`js_domain` is a way to set the JetStream domain to use for mqtt.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2021-12-16 16:53:20 -05:00
Ivan Kozlovic
9f30bf00e0 [FIXED] Corrupted headers receiving from consumer with meta-only
When a consumer is configured with the "meta-only" option and the
stream was backed by a memory store, a memory corruption could
happen, causing the application to receive corrupted headers.

Also replaced most usages of `append(a[:0:0], a...)` for making
copies. This was based on this wiki:
https://github.com/go101/go101/wiki/How-to-efficiently-clone-a-slice%3F

But since Go 1.15, it is actually faster to call make+copy instead.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-12-01 10:50:15 -07:00
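The make+copy replacement the commit describes can be sketched as a small helper; the function name here is illustrative:

```go
package main

import "fmt"

// cloneBytes copies a slice with make+copy, which (as the commit notes)
// has been faster than the append(a[:0:0], a...) idiom since Go 1.15.
// Crucially, the clone shares no backing array with the source, which is
// what prevents the header corruption described above.
func cloneBytes(a []byte) []byte {
	if a == nil {
		return nil
	}
	b := make([]byte, len(a))
	copy(b, a)
	return b
}

func main() {
	src := []byte("header")
	dst := cloneBytes(src)
	dst[0] = 'H'
	fmt.Println(string(src), string(dst)) // header Header
}
```

Returning an aliased slice from the memory store instead would let the consumer see later in-place mutations, which is how the corrupted headers surfaced.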
Derek Collison
1af3ab1b4e Fix for #2666
When encountering benign sequence mismatch errors, we were returning an error and not processing the rest of the entries.
This would lead to more severe sequence mismatches later on that would cause stream resets.

Also added code to deal with server restarts and the clfs fixup states which should have been reset properly.

Signed-off-by: Derek Collison <derek@nats.io>
2021-11-02 14:38:22 -07:00
Derek Collison
bbffd71c4a Improvements to meta raft layer around snapshots and recovery.
Signed-off-by: Derek Collison <derek@nats.io>
2021-10-12 05:53:52 -07:00
Derek Collison
1a4410a3f7 Added more robust checking for decoding append entries.
Allow a buffer to be passed in to relieve GC pressure.

Signed-off-by: Derek Collison <derek@nats.io>
2021-10-09 09:37:03 -07:00
Derek Collison
8223275c44 On cold start in mixed mode, if the number of JS servers was not greater than the number of non-JS servers, we could stall.
Signed-off-by: Derek Collison <derek@nats.io>
2021-09-27 16:59:42 -07:00
Derek Collison
63c242843c Avoid panic if the WAL was truncated out from underneath us.
If we were leader, step down as well.

Signed-off-by: Derek Collison <derek@nats.io>
2021-09-21 07:26:03 -07:00
Derek Collison
12bb46032c Fix RAFT WAL repair.
When we stored a message in the raft layer at the wrong position (state corrupt), we would panic, leaving the message there.
On restart we would truncate the WAL and try to repair, but we truncated to the wrong index for the bad entry.

This change also includes additional changes to truncateWAL and also reduces the conditional for panic on storeMsg.

Signed-off-by: Derek Collison <derek@nats.io>
2021-09-20 18:41:37 -07:00
Derek Collison
08b498fbda Log error on write errors
Signed-off-by: Derek Collison <derek@nats.io>
2021-09-19 12:14:31 -07:00
Derek Collison
3099327697 During peer removal, try to remap any stream or consumer assets.
Also, if we do not have room, trap the add-peer request and process it there.
Fixed a bug that treated ephemerals the same as durables during remapping after peer removal.

Signed-off-by: Derek Collison <derek@nats.io>
2021-09-06 17:29:45 -07:00
Derek Collison
f13fa767c2 Remove the swapping of accounts during processing of service imports.
When processing service imports we would swap out the accounts during processing.
With the addition of internal subscriptions and internal clients publishing in JetStream, we had an issue with the wrong account being used.
This was specific to delayed pull subscribers trying to unsubscribe, due to a max of 1, while other JetStream API calls were running concurrently.
2021-07-26 07:57:10 -07:00
Derek Collison
6eef31c0fc Fixed peer info reports that had large last-active values.
Also put in a safety check for lag going upside down as well.

Signed-off-by: Derek Collison <derek@nats.io>
2021-07-06 10:14:43 -07:00
Derek Collison
5ec0f291a6 When we got into certain situations where we were catching up and the first entry matched the index but not the term, we would not update the term.
This would cause CPU spikes and catchup cycles that could spin.

Signed-off-by: Derek Collison <derek@nats.io>
2021-06-11 15:02:46 -07:00
Derek Collison
9ccc843382 Removing peers should wait for RemovePeer entry replication.
Signed-off-by: Derek Collison <derek@nats.io>
2021-05-19 18:58:19 -07:00