Commit Graph

133 Commits

Author SHA1 Message Date
Ivan Kozlovic
29c40c874c Adding logger for IPQueue
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:14:00 -07:00
Ivan Kozlovic
48fd559bfc Reworked RAFT's leader change channel
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:12:11 -07:00
Ivan Kozlovic
fc7a4047a5 Renamed variables, removing the "c" that indicated it was a channel
2022-01-13 13:11:05 -07:00
Ivan Kozlovic
62a07adeb9 Replaced catchup and stream restore channels
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:09:49 -07:00
Ivan Kozlovic
645a9a14b7 Replaced RAFT's stepdown channel
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:09:01 -07:00
Ivan Kozlovic
2ad95f7e52 Replaced RAFT's vote request and response channels
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:08:05 -07:00
Ivan Kozlovic
d74dba2df9 Replaced RAFT's append entry response channel
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:06:48 -07:00
Ivan Kozlovic
b5979294db Replaced RAFT's append entry channel
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:06:29 -07:00
Ivan Kozlovic
ceb06d6a13 Replaced RAFT's apply channel
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-13 13:06:10 -07:00
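The channel-replacement commits above swap bounded Go channels for an intra-process queue. A minimal sketch of such an unbounded push/pop-all queue (hypothetical names and shape, not the server's actual IPQueue API):

```go
package main

import (
	"fmt"
	"sync"
)

// ipQueue is a minimal sketch of an unbounded intra-process queue.
// Unlike a fixed-size channel, push never blocks the producer.
type ipQueue struct {
	mu   sync.Mutex
	elts []interface{}
	ch   chan struct{} // capacity-1 signal that entries are pending
}

func newIPQueue() *ipQueue {
	return &ipQueue{ch: make(chan struct{}, 1)}
}

func (q *ipQueue) push(e interface{}) {
	q.mu.Lock()
	q.elts = append(q.elts, e)
	q.mu.Unlock()
	select {
	case q.ch <- struct{}{}: // signal the consumer; never blocks
	default: // already signaled
	}
}

// popAll drains everything queued so far in one call.
func (q *ipQueue) popAll() []interface{} {
	q.mu.Lock()
	elts := q.elts
	q.elts = nil
	q.mu.Unlock()
	return elts
}

func main() {
	q := newIPQueue()
	for i := 0; i < 5; i++ {
		q.push(i) // producers never stall, unlike a full channel
	}
	<-q.ch
	fmt.Println(len(q.popAll()))
}
```

The design choice this illustrates: a full bounded channel makes the producer block (or drop), whereas an append-under-mutex queue grows as needed, so slow consumers cannot stall producers.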
Ivan Kozlovic
ffe50d8573 [IMPROVED] JetStream clustering with lots of streams/consumers
Some operations could cause the route to block due to the lock being
held during store operations. On macOS, having lots of streams/consumers
and restarting the cluster would cause lots of concurrent IO, which
would hold the lock for too long, causing head-of-line blocking in the
processing of messages from a route.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-01-12 20:37:00 -07:00
Ivan Kozlovic
299b6b53eb [FIXED] JetStream: stream blocked recovering snapshot
If a node fell behind, then when catching up with the rest of the
cluster, it was possible for a lot of append entries to accumulate,
and the server would print warnings such as:
```
[WRN] RAFT [jZ6RvVRH - S-R3F-CQw2ImK6] <some number> append entries pending
```
It would then continuously print the following warning:
```
AppendEntry failed to be placed on internal channel
```
When that happens, this node would always be shown (using `nats s info`)
as running the same number of operations behind if no new messages were
added to the stream, or an increasing number of operations behind if
there was still activity.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-12-20 11:41:34 -07:00
Matthias Hanel
3e8b66286d Js leaf deny (#2693)
Along a leaf node connection, unless the system account is shared AND the JetStream domain name is identical, default JetStream traffic (without a domain set) will be denied.

As a consequence, any client that wants to access a domain other than that of the server it is connected to must specify a domain name.
Affected by this change are setups where either the leaf node had no local JetStream OR the server the leaf node connected to had no local JetStream;
that is, one of the two accounts connected via a leaf node remote has JetStream disabled.
The side that does not have JetStream enabled will lose JetStream access, and its clients must set `nats.Domain` manually.

For workarounds on how to restore the old behavior, look at:
https://github.com/nats-io/nats-server/pull/2693#issuecomment-996212582

New config values added:
`default_js_domain` is a mapping from account to domain, settable when JetStream is not enabled in an account.
`extension_hint` provides hints for a non-clustered server to start in clustered mode (so it can be extended).
`js_domain` sets the JetStream domain to use for MQTT.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2021-12-16 16:53:20 -05:00
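The new options in the commit above roughly map onto config blocks like the following sketch (placement and values are illustrative; see the PR linked above for authoritative examples):

```
# Illustrative values; see the PR referenced above for exact semantics.
jetstream {
    domain: "hub"
    # Hint for a non-clustered server to start in clustered mode
    # so that it can be extended later.
    extension_hint: will_extend
}

# Map an account without JetStream enabled to a default domain.
default_js_domain {
    ACC: "hub"
}

mqtt {
    port: 1883
    # JetStream domain used for MQTT state.
    js_domain: "hub"
}
```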
Ivan Kozlovic
9f30bf00e0 [FIXED] Corrupted headers received from consumer with meta-only
When a consumer is configured with "meta-only" option, and the
stream was backed by a memory store, a memory corruption could
happen causing the application to receive corrupted headers.

Also replaced most usages of `append(a[:0:0], a...)` for making
copies. This was based on this wiki:
https://github.com/go101/go101/wiki/How-to-efficiently-clone-a-slice%3F

But since Go 1.15, it is actually faster to call make+copy instead.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-12-01 10:50:15 -07:00
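The two clone idioms contrasted in the commit above can be sketched as follows (`cloneBytes` is an illustrative helper, not the server's actual code):

```go
package main

import "fmt"

// cloneBytes returns an independent copy of b. Since Go 1.15 the
// make+copy pair is generally faster than the append(b[:0:0], b...)
// idiom that the commit above replaces.
func cloneBytes(b []byte) []byte {
	if b == nil {
		return nil
	}
	c := make([]byte, len(b))
	copy(c, b)
	return c
}

func main() {
	hdr := []byte("NATS/1.0\r\n")
	cp := cloneBytes(hdr)
	hdr[0] = 'X' // mutating the original must not affect the copy
	fmt.Println(string(cp) == "NATS/1.0\r\n")
}
```

Either idiom yields a copy with its own backing array; the corruption fixed above comes from sharing a backing array, where a later mutation of the original shows through in data already handed to the application.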
Derek Collison
1af3ab1b4e Fix for #2666
When encountering benign sequence mismatch errors, we were returning an error and not processing the rest of the entries.
This would lead to more severe sequence mismatches later on that would cause stream resets.

Also added code to deal with server restarts and the clfs fixup states which should have been reset properly.

Signed-off-by: Derek Collison <derek@nats.io>
2021-11-02 14:38:22 -07:00
Derek Collison
bbffd71c4a Improvements to meta raft layer around snapshots and recovery.
Signed-off-by: Derek Collison <derek@nats.io>
2021-10-12 05:53:52 -07:00
Derek Collison
1a4410a3f7 Added more robust checking for decoding append entries.
Allow a buffer to be passed in to relieve GC pressure.

Signed-off-by: Derek Collison <derek@nats.io>
2021-10-09 09:37:03 -07:00
Derek Collison
8223275c44 On cold start in mixed mode, if the number of JS servers was not greater than the number of non-JS servers, we could stall.
Signed-off-by: Derek Collison <derek@nats.io>
2021-09-27 16:59:42 -07:00
Derek Collison
63c242843c Avoid panic if WAL was truncated out from underneath of us.
If we were leader stepdown as well.

Signed-off-by: Derek Collison <derek@nats.io>
2021-09-21 07:26:03 -07:00
Derek Collison
12bb46032c Fix RAFT WAL repair.
When we stored a message in the raft layer in the wrong position (corrupt state), we would panic, leaving the message there.
On restart we would truncate the WAL and try to repair, but we truncated to the wrong index for the bad entry.

This change also includes additional changes to truncateWAL and also reduces the conditional for panic on storeMsg.

Signed-off-by: Derek Collison <derek@nats.io>
2021-09-20 18:41:37 -07:00
Derek Collison
08b498fbda Log error on write errors
Signed-off-by: Derek Collison <derek@nats.io>
2021-09-19 12:14:31 -07:00
Derek Collison
3099327697 During peer removal, try to remap any stream or consumer assets.
Also, if we do not have room, trap the add-peer request and process it there.
Fixed a bug that would treat ephemerals same as durables during remapping after peer removal.

Signed-off-by: Derek Collison <derek@nats.io>
2021-09-06 17:29:45 -07:00
Derek Collison
f13fa767c2 Remove the swapping of accounts during processing of service imports.
When processing service imports we would swap out the accounts during processing.
With the addition of internal subscriptions and internal clients publishing in JetStream we had an issue with the wrong account being used.
This was specific to delayed pull subscribers trying to unsubscribe due to a max of 1 while other JetStream API calls were running concurrently.
2021-07-26 07:57:10 -07:00
Derek Collison
6eef31c0fc Fixed peer info reports that had large last active values.
Also put in a safeguard against the lag going negative.

Signed-off-by: Derek Collison <derek@nats.io>
2021-07-06 10:14:43 -07:00
Derek Collison
5ec0f291a6 In certain situations where we were catching up and the first entry matched the index but not the term, we would not update the term.
This would cause CPU spikes and catchup cycles that could spin.

Signed-off-by: Derek Collison <derek@nats.io>
2021-06-11 15:02:46 -07:00
Derek Collison
9ccc843382 Removing peers should wait for RemovePeer entry replication.
Signed-off-by: Derek Collison <derek@nats.io>
2021-05-19 18:58:19 -07:00
Derek Collison
57395bba02 Fixed bug that could cause raft group to spin trying to catchup.
This occurred when a raft group was trying to catch up a consumer, the log was empty, we had a snapshot, and the requested sequence was the first sequence.

Signed-off-by: Derek Collison <derek@nats.io>
2021-05-07 09:13:18 -07:00
Derek Collison
db402cc444 Under heavy load and a leader change we could warn about not processing responses.
This also adjusts the minimum election timeout to 2 seconds (vs 1 second) for very large networks.

Signed-off-by: Derek Collison <derek@nats.io>
2021-05-03 19:40:40 -07:00
scottf
486df98373 close tempfiles, fix path print
2021-04-22 12:47:21 -04:00
Waldemar Quevedo
c9ab7ce8a1 Fix for data race when disabling JS running out of resources
Signed-off-by: Waldemar Quevedo <wally@synadia.com>
2021-04-21 14:26:52 -07:00
Derek Collison
902b9dec12 Merge pull request #2131 from nats-io/updates
General Updates and Stability Improvements
2021-04-20 13:52:39 -07:00
Matthias Hanel
b73be52862 [fixed] only become observer if the leaf config has raft not restricted (#2125)
If a subject in the system account's leafnode deny_imports matches $NRG.>,
then JetStream is explicitly disconnected and the server can become
leader.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2021-04-19 13:10:49 -04:00
Derek Collison
1dd7e8c7d1 Increase apply channel size
Signed-off-by: Derek Collison <derek@nats.io>
2021-04-16 14:00:46 -07:00
Derek Collison
8e82f36c5b Track removed peers properly
Signed-off-by: Derek Collison <derek@nats.io>
2021-04-14 20:29:09 -07:00
Derek Collison
cf34514f9f Do not limit expansion of new peers
Signed-off-by: Derek Collison <derek@nats.io>
2021-04-14 18:47:11 -07:00
Derek Collison
755ef74855 When a cluster of leafnodes connects to a cluster or supercluster hub and they share the system account, make the leafnode servers observers.
Signed-off-by: Derek Collison <derek@nats.io>
2021-04-12 17:00:55 -07:00
Derek Collison
69269c5653 Merge pull request #2095 from nats-io/mixed
Mixed mode improvements.
2021-04-09 16:56:41 -07:00
Jaime Piña
27e9628c3a Run gofmt -s to simplify code
2021-04-09 15:18:06 -07:00
Derek Collison
e438d2f5fa Mixed mode improvements.
1. When in mixed mode and only running the global account, we will now check the account for JS.
2. Added code to decrease the cluster set size if we guessed wrong in mixed mode setup.

Signed-off-by: Derek Collison <derek@nats.io>
2021-04-09 14:58:35 -07:00
Derek Collison
14a826fb60 Check for entries going negative. Shutdown in place on server exit
Signed-off-by: Derek Collison <derek@nats.io>
2021-03-30 11:46:15 -07:00
Derek Collison
327d913ae1 Under rare scenarios we could fail to load, but this should not be a panic.
We should recover on the lines below.

Signed-off-by: Derek Collison <derek@nats.io>
2021-03-29 07:34:28 -07:00
Derek Collison
0f71c260fb Durable consumers with R>1 had performance challenges.
This code changes the way we handle raft based proposals for consumers.

Signed-off-by: Derek Collison <derek@nats.io>
2021-03-26 12:53:49 -07:00
Derek Collison
5d5de5925f Introduce a previous leader state in the raft layer to allow quicker responses when leaderless.
Signed-off-by: Derek Collison <derek@nats.io>
2021-03-25 17:08:29 -07:00
Derek Collison
e53caee5e8 Enforce server limits even when dynamic limits for accounts are in play.
We were not properly enforcing server limits. This commit will allow a server to enforce limits but still remain functional even at the JetStream level.
Also fixed a bug for RAFT replay that could cause instability.

Signed-off-by: Derek Collison <derek@nats.io>
2021-03-25 16:06:27 -07:00
Derek Collison
a75e8f8c80 Fix for an issue with multiple restarts that showed stalled and sometimes lost streams.
The issue was that when state was removed from a server and the server was restarted, it would catch up properly.
However, upon a cluster restart the system could exhibit strange behaviors. This was due to catchup
not properly creating a meta snapshot when one was received, leaving no meta state to recover.

Signed-off-by: Derek Collison <derek@nats.io>
2021-03-22 20:06:38 -07:00
Derek Collison
0f548edcc6 Reduce sliding window for direct consumers and catchup stream windows.
Remove another possible wire blocking operation in raft.

Signed-off-by: Derek Collison <derek@nats.io>
2021-03-21 09:24:27 -07:00
Derek Collison
04a9d51035 Fix for data race
Signed-off-by: Derek Collison <derek@nats.io>
2021-03-20 07:15:36 -07:00
Derek Collison
a205f8f2de Fix for updating peers and quorum sizes.
Signed-off-by: Derek Collison <derek@nats.io>
2021-03-14 15:31:29 -07:00
Ivan Kozlovic
5072649540 Make sure to properly add peer after failure
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-03-14 15:32:12 -06:00
Derek Collison
5f78a44191 Fixed several bugs.
1. With snapshots being installed under heavy load.
2. Running catchup and missing responses due to a bug in the channel size for catchup.
3. Various other tweaks.

Signed-off-by: Derek Collison <derek@nats.io>
2021-03-14 11:38:22 -07:00
Derek Collison
3c85df0a44 Truncate up to entry, no need for previous
Signed-off-by: Derek Collison <derek@nats.io>
2021-03-14 05:18:52 -07:00