nats-server

mirror of https://github.com/gogrlx/nats-server.git synced 2026-04-17 11:24:44 -07:00

Author	SHA1	Message	Date
Derek Collison	b806a8e7e7	Do not opt-out of normal processing for leadership transfers, but make sure they are only processed if explicitly new Signed-off-by: Derek Collison <derek@nats.io>	2023-04-03 14:46:55 -07:00
Derek Collison	58ca525b3b	Process replicated ack regardless of store update. Delay but still stepdown Signed-off-by: Derek Collison <derek@nats.io>	2023-04-02 03:53:16 -07:00
Derek Collison	874b2b2e02	Hold the lock while checking health since we could update catchup state. Do not stepdown right away when executing leadership transfer, wait for the commit. Signed-off-by: Derek Collison <derek@nats.io>	2023-04-02 03:53:08 -07:00
Derek Collison	4646f4af5d	Do not allow any JetStream leaders to be placed on a lameduck server Signed-off-by: Derek Collison <derek@nats.io>	2023-03-29 20:15:41 -07:00
Derek Collison	e274693490	On bad or corrupt message load during commit, reset WAL vs mark write error Signed-off-by: Derek Collison <derek@nats.io>	2023-03-29 14:07:14 -07:00
Derek Collison	35d1a7747a	Snapshots of no length can hold state as well Signed-off-by: Derek Collison <derek@nats.io>	2023-03-29 12:44:04 -07:00
Derek Collison	182bf6cbae	Bug fixes and general stability improvements. 1. If reset ignore Applied() that are greater than our commit. 2. Improved StepDown() by placing at back of queue if preferred. 3. Improved handling of leadership transfer during StepDown(). 4. Do not store EntryLeaderTransfer records on disk. 5. Remove un-needed processing of older terms. 6. If append entry has higher term, also inherit pterm. 7. Only inherit a candidate's term if we decide to vote for them. Signed-off-by: Derek Collison <derek@nats.io>	2023-03-29 12:43:46 -07:00
Derek Collison	ec89823e1c	Only process out of resources condition from raft layer if err matches condition Signed-off-by: Derek Collison <derek@nats.io>	2023-03-23 08:13:22 -07:00
Derek Collison	ed9de4b0a1	Improved publisher performance under some instances of asymmetric network latency clusters on interest based streams. Under asymmetric network latency based clusters, if a node in an R3 was replicating a consumer and the parent stream, but was the leader of neither, but the path from the stream leader was faster than the consumer leader, a replicated ack could arrive before the message itself. In this case we used to forward a delete message request to the stream leader which would then replicate that to all stream replicas, causing more work which could lead to increased publisher times on clients connected to the slow node. Signed-off-by: Derek Collison <derek@nats.io>	2023-03-20 20:53:45 -07:00
Derek Collison	0c1301ec14	Fix for data race Signed-off-by: Derek Collison <derek@nats.io>	2023-03-19 10:52:52 -07:00
Derek Collison	531fadd3e2	Don't warn if error is node closed. Signed-off-by: Derek Collison <derek@nats.io>	2023-03-15 16:45:33 -07:00
Derek Collison	2beca1a2a6	Partial cache errors are also not critical write errors Signed-off-by: Derek Collison <derek@nats.io>	2023-03-01 22:52:02 -08:00
Derek Collison	c586014477	General raft improvements under heavy corruption. Do not exit candidate state in place when stepping down, would cause double vote requests. When truncating our WAL make sure to adjust commit and applied as needed. On a miss where the index is less than ours, if we can not find the entry reset our state. For a vote, if last processed term is higher than ours always agree if no vote has been cast. If terms are equal make sure the requestor's index is at least as high as ours. If we decide not to vote for someone, and we have not voted and we are a better fit, move forward with a campaign. Signed-off-by: Derek Collison <derek@nats.io>	2023-03-01 22:06:50 -08:00
Derek Collison	fa8afba68f	Only warn on write errors if not closed in case they linger under pressure and blocking on dios Signed-off-by: Derek Collison <derek@nats.io>	2023-02-27 18:56:55 -08:00
Derek Collison	2711460b7b	Prevent benign spin between competing leaders with same index but different term. Remove lock from route processing for updating peers progress, already handled in trackPeer. Signed-off-by: Derek Collison <derek@nats.io>	2023-02-27 11:21:33 -08:00
Derek Collison	4fa0ea32c3	[FIXED] If a truncate for a raft WAL failed we could spin. Signed-off-by: Derek Collison <derek@nats.io>	2023-02-25 19:07:27 -08:00
Derek Collison	ea2bfad8ea	Fixed bug where snapshot would not compact through applied. This meant a subsequent request for exactly applied would return that entry only not the full state snapshot. Fixed bug where we would not snapshot when we should. Signed-off-by: Derek Collison <derek@nats.io>	2023-02-23 22:19:37 -08:00
Derek Collison	45859e6476	Make sure preferred peer for stepdown is healthy. Signed-off-by: Derek Collison <derek@nats.io>	2023-02-23 13:06:13 -08:00
Neil Twigg	68961ffedd	Refactor `ipQueue` to use generics, reduce allocations	2023-02-21 14:50:09 +00:00
Derek Collison	e028b7230a	Need to compact wal on snapshot to pindex+1 Signed-off-by: Derek Collison <derek@nats.io>	2023-02-20 14:37:37 -08:00
Derek Collison	9c02be2409	Various fixes for snapshots. Due to bug, in rare circumstances could write an empty snapshot for applied == 0. This would cause a spinning at the raft layer. 1. Allow Truncate() to also properly do a reset of the store when terms were only mismatch. 2. During testing fixed memstore truncate and also made sure per subject info was also cleaned up. 3. Then added fix to detect a bad snapshot on initialization and remove. 4. Do not allow snapshots for applied == 0. Signed-off-by: Derek Collison <derek@nats.io>	2023-02-04 13:46:06 -08:00
Derek Collison	e9a983c802	Do not let !NeedSnapshot() avoid snapshots and compaction. Signed-off-by: Derek Collison <derek@nats.io>	2023-02-01 22:05:25 -07:00
Derek Collison	6058056e3b	Minor fixes and optimizations for snapshots. We were snapshotting more than needed, so double check that we should be doing this at the stream and consumer level. At the raft level, we should have always been compacting the WAL to last+1, so made that consistent. Also fixed bug that would not skip last if more items behind the snapshot. Signed-off-by: Derek Collison <derek@nats.io>	2023-01-30 17:54:18 -08:00
Derek Collison	bf49f23bb1	Only hold on to so many pending in memory, will fetch from WAL Signed-off-by: Derek Collison <derek@nats.io>	2023-01-28 11:34:55 -08:00
Neil Twigg	83932b4be6	Don't mark a clustered stream as unhealthy if making forward progress, add `TestJetStreamClusterCurrentVsHealth`	2023-01-26 16:57:34 +00:00
Derek Collison	ad53d455f8	When migrating leaders off a server when the leafnode is not connected, also ensure leaders can not return until reconnected. Signed-off-by: Derek Collison <derek@nats.io>	2023-01-05 08:02:50 -08:00
Todd Beets	47c87eb71c	fix and test for clustered mem store asset no-quorum if leader restarted	2022-12-14 16:16:08 -08:00
Derek Collison	894115b82b	Fix for server panic when consumer state was not decoded correctly. The bug was when a timestamp for the pending state was exactly -1 which could happen based on timing of the redelivered pending items which would set pending.Timestamp into the future potentially and the timing on the encodeConsumerState call. Minor fixes to raft. Signed-off-by: Derek Collison <derek@nats.io>	2022-12-06 14:16:20 -08:00
Derek Collison	3ac6052b32	Updated pae threshold and reporting modulo to not spam logs as much. Signed-off-by: Derek Collison <derek@nats.io>	2022-11-11 16:08:58 -08:00
Derek Collison	98bf861a7a	Updates to stream and consumer move logic. Signed-off-by: Derek Collison <derek@nats.io>	2022-08-30 16:11:35 -07:00
Derek Collison	212adf5775	General improvements to clustered streams during server restart and KV/CAS scenarios. Signed-off-by: Derek Collison <derek@nats.io>	2022-08-22 18:36:15 -07:00
Ivan Kozlovic	7de4497815	Install consumer snapshot on clean exit and few other fixes - didRemove in applyMetaEntries() could be reset when processing multiple entries - change "no race" test names to include JetStream - separate raft nodes leader stepdown and stop in server shutdown process - in InstallSnapshot, call wal.Compact() with lastIndex+1 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-16 17:05:49 -06:00
Ivan Kozlovic	3c9a7cc6e5	Move to Go 1.19, remote io/util, fix data race and a flapper Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-05 09:55:37 -06:00
Ivan Kozlovic	37c923c28e	Downgrade a RAFT warning to debug This is related to PR #3307. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-02 18:06:39 -06:00
Derek Collison	5e98263de8	General stability improvements Signed-off-by: Derek Collison <derek@nats.io>	2022-07-29 16:02:31 -07:00
Derek Collison	27d87a68a4	Improvements to raft layer with snapshots on catchup. Signed-off-by: Derek Collison <derek@nats.io>	2022-07-29 09:01:03 -07:00
Matthias Hanel	04ffed48b0	fix peer tracking by removing peers before scaledown (#3289) in doRemovePeerAsLeader the leader also records the removed peer in the removed set Signed-off-by: Matthias Hanel <mh@synadia.com>	2022-07-26 22:01:03 +02:00
Ivan Kozlovic	1a6c5f1c90	[FIXED] JetStream: Some scaling up issues - Send snapshot only if leader - When processing snapshot, start with a smaller inactivity interval that will double up to 10sec or use 10sec directly once we get a message. Reason for that is that it is possible that the request for snapshot is sent while the leader has not yet setup the subscription that receives the requests (or subscription has not fully reached the cluster). - Don't remember snapfile on err. - Do not consider current if we have not had any activity. - Stabilize stream scale up under active heavy publishing. - Due to the publish pressure move the check for followers direct subs spinning up til after we stop publishing. Signed-off-by: Derek Collison <derek@nats.io> Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-07-25 18:44:18 -06:00
Matthias Hanel	51b6d5233f	Fix raft issue where pindex of follower was off by 1 (#3277) introduced by `57395bba02` Signed-off-by: Matthias Hanel <mh@synadia.com>	2022-07-21 00:51:26 +02:00
Derek Collison	e1c8f9fb55	This improves when a server is under load or low on resources like FDs and a user is trying to delete a stream with lots of consumers. Signed-off-by: Derek Collison <derek@nats.io>	2022-06-04 16:49:17 -07:00
Derek Collison	ef3eea4d73	Speed up raft for tests Signed-off-by: Derek Collison <derek@nats.io>	2022-05-18 16:28:58 -07:00
Derek Collison	ccd2290355	With use cases bringing us more data I wanted to suggest these changes. With inlining election timeout updates we double the lock contention and most likely introduced head of line issues for routes under heavy load. Also slowing down heartbeats with so many assets being deployed in our user ecosystem, also moved the normal follower to candidate timing further out, similar to the lost quorum. Note that the happy path transfer will still be very quick. Signed-off-by: Derek Collison <derek@nats.io>	2022-05-15 09:55:22 -07:00
Ivan Kozlovic	2ce1dc1561	[FIXED] JetStream: possible lockup due to a return prior to unlock This would happen in situation where a node receives an append entry with a term higher than the node's (current leader). Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-05-10 17:11:57 -06:00
Derek Collison	6f54b032d6	Raft and cluster improvements. Signed-off-by: Derek Collison <derek@nats.io>	2022-05-03 15:20:46 -07:00
Ivan Kozlovic	2659b30113	[IMPROVED] JetStream: add file names for invalid checksums On restart, we report when we find error in checksums, but we did not report the name of the file. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-04-18 13:35:08 -06:00
Derek Collison	4aaea8e4c4	Improvements to move semantics. Signed-off-by: Derek Collison <derek@nats.io>	2022-04-16 07:55:05 -07:00
Derek Collison	2a8b123706	Don't quickly declare lost quorum after scale up Signed-off-by: Derek Collison <derek@nats.io>	2022-04-15 13:28:34 -07:00
Ivan Kozlovic	4e7c72ab33	Update based on code review Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-04-14 11:00:33 -06:00
Ivan Kozlovic	bd61d51a1c	[IMPROVED] JetStream: reduce unnecessary leader election - Wait for some sort of routing to be in place before starting the raft run loop - Remove use of lock in apiDispatch that was not necessary but could have caused a route to block, causing memory growth, etc.. Unrelated rename of some tests so that they start with TestJetStream and TestJetStreamCluster for cluster tests, fixed some flappers and ensure that tests that change RAFT timeouts put them back to default values on exit. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-04-14 10:47:14 -06:00
Derek Collison	9748925f13	Improvements to stream and consumer move. During elected stepdown and transfer allow the new leader to take over before we stepdown. We could receive a leader change, so make sure to also check migration state. Signed-off-by: Derek Collison <derek@nats.io>	2022-04-14 07:27:29 -07:00

197 Commits