Commit Graph

341 Commits

Ivan Kozlovic
5573933034 Bump back the defaultMaxTotalCatchupOutBytes to 128MB
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-31 09:19:28 -06:00
Derek Collison
98bf861a7a Updates to stream and consumer move logic.
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-30 16:11:35 -07:00
Derek Collison
56e177c329 Allow stream msgs to be compressed within the raft layer and during upper layer catchups.
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-30 16:10:57 -07:00
Ivan Kozlovic
9a6a2c31ee [ADDED] JetStream: Ability to configure the per server max catchup bytes
The original values were hardcoded: 128MB per server and 32MB per
stream. The per-server limit is lowered to 32MB but is now
configurable with a new configuration parameter:
```
jetstream {
   max_catchup: 8MB
}
```

The per-stream limit was also lowered from 32MB/128,000 messages to
8MB/32,000 messages.

Tests have shown no difference in performance for fast links.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-30 13:46:13 -06:00
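A minimal sketch of the idea behind a per-server catch-up byte budget; `catchupBudget` and its fields are illustrative assumptions, not the server's actual internals.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// catchupBudget caps the bytes outstanding across catch-ups on one server.
// Hypothetical illustration of the max_catchup idea, not nats-server code.
type catchupBudget struct {
	max         int64 // e.g. 128 * 1024 * 1024 for a 128MB cap
	outstanding int64
}

// tryReserve reserves n bytes if the budget allows, returning false otherwise.
func (b *catchupBudget) tryReserve(n int64) bool {
	for {
		cur := atomic.LoadInt64(&b.outstanding)
		if cur+n > b.max {
			return false // caller should pause until acks free up space
		}
		if atomic.CompareAndSwapInt64(&b.outstanding, cur, cur+n) {
			return true
		}
	}
}

// release returns n bytes to the budget once the follower acks them.
func (b *catchupBudget) release(n int64) {
	atomic.AddInt64(&b.outstanding, -n)
}

func main() {
	b := &catchupBudget{max: 128 * 1024 * 1024}
	fmt.Println(b.tryReserve(64 * 1024 * 1024)) // true
	fmt.Println(b.tryReserve(96 * 1024 * 1024)) // false, would exceed the cap
	b.release(64 * 1024 * 1024)
	fmt.Println(b.tryReserve(96 * 1024 * 1024)) // true again
}
```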
Ivan Kozlovic
e609d12061 [FIXED] Stream info numbers may be 0 after cluster restart
This would happen after multiple replicas changes.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-30 08:49:39 -06:00
Ivan Kozlovic
8c23bfea5d Revert a change made in PR #3392
It seems to cause problems when upgrading from v2.7.4 to the main
branch.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-25 14:15:59 -06:00
Matthias Hanel
970491debc scale down happened too soon
The scale down was happening while currentCount != replicas.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-23 17:44:56 -07:00
Derek Collison
212adf5775 General improvements to clustered streams during server restart and KV/CAS scenarios.
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-22 18:36:15 -07:00
Ivan Kozlovic
5663bc2fa3 Reduce length of some clustering tests
Since PR #3381, the 2 tests modified here would take twice as
long (around 245 seconds) to complete.
Talking with Matthias, he suggested using a variable instead of
a const and setting it to 0 for those 2 tests, since they don't
really need that delay.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-22 12:35:37 -06:00
Ivan Kozlovic
b1822e1b4c Some more checks for cc.meta == nil
Missed those when re-running the previous test for a longer
period of time.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-22 11:06:04 -06:00
Ivan Kozlovic
c30445657f Fixed possible panic in monitorStream
Saw this panic in code coverage run:
```
=== RUN   TestJetStreamClusterPeerExclusionTag
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x88 pc=0x8acd55]

goroutine 97850 [running]:
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorStream(0xc002b94780, 0xc001ecb500, 0xc003229b00, 0x0)
	/home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:1653 +0x495
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateStream.func1()
	/home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2953 +0x3b
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
	/home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/server.go:3063 +0xa7
```

Was able to reproduce it; the reason was that `meta` was nil.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-22 09:52:05 -06:00
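To illustrate the class of fix, here is a minimal, self-contained sketch of guarding a monitoring goroutine against nil cluster state; the types are simplified stand-ins for those in the stack trace, not the server's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for the server types in the stack trace above.
type raftNode struct{ id string }

func (n *raftNode) ID() string { return n.id }

type jetStreamCluster struct{ meta *raftNode }

type jetStream struct {
	mu      sync.RWMutex
	cluster *jetStreamCluster
}

// monitorStream sketches the fix: snapshot cc.meta under the lock and bail
// out if it is nil (e.g. during shutdown) instead of dereferencing it.
func (js *jetStream) monitorStream() {
	js.mu.RLock()
	cc := js.cluster
	if cc == nil || cc.meta == nil {
		js.mu.RUnlock()
		fmt.Println("cluster meta not available, exiting monitor")
		return
	}
	ourID := cc.meta.ID()
	js.mu.RUnlock()
	fmt.Println("monitoring as peer", ourID)
}

func main() {
	(&jetStream{cluster: &jetStreamCluster{}}).monitorStream() // meta nil: no panic
	(&jetStream{cluster: &jetStreamCluster{meta: &raftNode{id: "jZ6RvVRH"}}}).monitorStream()
}
```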
Matthias Hanel
6bf50dbb77 induce delay prior to scale down (#3381)
This is to avoid a narrow race between adding servers and their
catching up, where they also register as current.

Also wait for all peers to be caught up.

This also avoids clearing the catchup marker once a catchup has
stalled. A stalled catchup would remove the marker, causing the
peer to register as current.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-18 13:47:40 -07:00
Matthias Hanel
9892a132e7 Improve StreamMoveInProgressError (#3376)
by adding progress indicators

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-17 15:12:32 -07:00
Derek Collison
9c9de656c6 We can't purge directories here since we're not 100% sure all state is in the snapshot.
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-17 14:57:19 -07:00
Ivan Kozlovic
7de4497815 Install consumer snapshot on clean exit and few other fixes
- didRemove in applyMetaEntries() could be reset when processing
multiple entries
- change "no race" test names to include JetStream
- separate raft node leader stepdown and stop in the server
shutdown process
- in InstallSnapshot, call wal.Compact() with lastIndex+1

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-16 17:05:49 -06:00
Matthias Hanel
c6e37cf7af Fix race between stream stop and monitorStream (#3350)
* Fix race between stream stop and monitorStream

monitorCluster stops the stream; when doing so, monitorStream
needs to be stopped as well to avoid miscounting the store size.
In a test, the stop and reset of the store size happened first,
followed by storing more messages via monitorStream.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-10 19:01:21 +02:00
Ivan Kozlovic
502e5b13f7 Declare some catchup static errors
Use `var .. = errors.New()`.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-08 17:51:31 -06:00
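A short sketch of the pattern the commit describes: declaring catchup errors once as package-level sentinels rather than allocating a new error at each failure site. The error names and messages are illustrative, not the server's actual identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

// Static sentinel errors, declared once with `var .. = errors.New()`.
// Names and messages are hypothetical examples.
var (
	errCatchupStalled         = errors.New("catchup stalled")
	errCatchupCorruptSnapshot = errors.New("corrupt catchup snapshot")
	errCatchupAbortedNoLeader = errors.New("catchup aborted, no leader")
)

func runCatchup(healthy bool) error {
	if !healthy {
		return errCatchupStalled // no allocation; reuses the sentinel
	}
	return nil
}

func main() {
	// Sentinels allow cheap, reliable identity checks with errors.Is.
	if err := runCatchup(false); errors.Is(err, errCatchupStalled) {
		fmt.Println("retrying after:", err)
	}
}
```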
Ivan Kozlovic
ecddb08469 [IMPROVED] JetStream catchup can be aborted and better flow control
If the leader sends messages but the follower for any reason aborts
or retries the snapshot process, the follower will now send the
error that caused this, and the leader can then abort the catchup
instead of waiting for its 5-second inactivity threshold.

Also delay the send of a batch until the number of "acks" reaches
1/2 of the batch size or 100ms has elapsed. This helps avoid
trickling of messages. Tested with the new test
TestJetStreamSuperClusterStreamCathupLongRTT(); batch sizes are
better and the overall time is smaller or similar, but not longer.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-08 17:19:36 -06:00
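A runnable sketch of the batching rule described above: hold the next batch until half of the previous one has been acked or 100ms has elapsed. The channel shape and batch sizes are assumptions for illustration, not the server's wire protocol.

```go
package main

import (
	"fmt"
	"time"
)

// sendBatches sends a few batches, delaying each subsequent send until 1/2
// of the previous batch is acked or 100ms has passed, whichever comes first.
func sendBatches(acks <-chan struct{}, batchSize int, send func(int)) {
	const maxWait = 100 * time.Millisecond
	for batch := 1; batch <= 3; batch++ {
		send(batch)
		timer := time.NewTimer(maxWait)
		acked := 0
		for acked < batchSize/2 {
			select {
			case <-acks:
				acked++
			case <-timer.C:
				acked = batchSize // stop waiting; send the next batch
			}
		}
		timer.Stop()
	}
}

func main() {
	acks := make(chan struct{}, 64)
	go func() { // simulate a follower acking messages
		for {
			time.Sleep(5 * time.Millisecond)
			acks <- struct{}{}
		}
	}()
	sendBatches(acks, 8, func(n int) { fmt.Println("sending batch", n) })
}
```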
Derek Collison
06112d6885 Reset activity interval on catchup to default vs ramp up. Tweak test.
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-08 11:06:10 -06:00
Derek Collison
758b733d43 Attempt to improve long RTT catchup time during stream moves.
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-08 11:06:10 -06:00
Derek Collison
e635de7526 Additional stability improvements for catchup.
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-08 11:06:10 -06:00
Derek Collison
5a050fc10b Improve handling when a snapshot represents state we no longer have.
We would send skip messages for a sync request that was completely below our current state, but this could be more traffic than we might want.
Now we only send EOF and the other side can detect the skip forward and adjust on a successful catchup.
We still send skips if we can partially fill the sync request.

Signed-off-by: Derek Collison <derek@nats.io>
2022-08-08 11:06:08 -06:00
Ivan Kozlovic
d96e801825 Change the report to something like this instead:
```
Replica: Server name unknown at this time (peerID: jZ6RvVRH), outdated, OFFLINE, not seen
```
After discussing with @ripienaar, this text better conveys
that this is a transient situation.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-08 09:30:37 -06:00
Ivan Kozlovic
267e6d1958 [IMPROVED] Replicas ordering and info regarding unknown in stream info
If a cluster is brought down and then partially restarted, the
replica information about the non-restarted node would be completely
missing. The CLI could report 3 replicas but show only the leader
and the running replicas, with nothing about the other node.
Since this node's server name is not known, this PR adds an entry
similar to this:
```
<unknown (peerID: jZ6RvVRH)>, outdated, OFFLINE, not seen
```

Also, the replicas array is now ordered, which will help when
using a watcher or repeating stream info commands: the output
will be stable with regard to the list of replicas.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-07 18:54:26 -06:00
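A small sketch of the ordering change: sorting the replicas slice by name so repeated stream info output is stable. `PeerInfo` here is a simplified stand-in for the response type.

```go
package main

import (
	"fmt"
	"sort"
)

// PeerInfo is a simplified stand-in for a replica entry in stream info.
type PeerInfo struct {
	Name    string
	Current bool
	Offline bool
}

func main() {
	replicas := []*PeerInfo{
		{Name: "S3", Current: true},
		{Name: "<unknown (peerID: jZ6RvVRH)>", Offline: true},
		{Name: "S1", Current: true},
	}
	// Sort by name so watchers and repeated commands see a stable order.
	sort.Slice(replicas, func(i, j int) bool {
		return replicas[i].Name < replicas[j].Name
	})
	for _, r := range replicas {
		fmt.Printf("%-35s current=%v offline=%v\n", r.Name, r.Current, r.Offline)
	}
}
```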
Matthias Hanel
52c4872666 better error when peer selection fails (#3342)
* better error when peer selection fails

It is pretty hard to diagnose what went wrong when not enough peers
for an operation were found. This change now returns counts of the
reasons why peers were discarded.

Changed the error to JSClusterNoPeers as it seems a more appropriate
error for that operation. Not having enough resources is one of the
conditions for a peer not being considered, but so is having a
non-matching tag, which is why JSClusterNoPeers seems more appropriate.
In addition, JSClusterNoPeers was already used as the error after one
call to selectPeerGroup.

Example:
```
no suitable peers for placement: peer selection cluster 'C' with 3 peers
offline: 0
excludeTag: 1
noTagMatch: 2
noSpace: 0
uniqueTag: 0
misc: 0
```

Example for MQTT:
```
mid:12 - "mqtt" - unable to connect: create sessions stream for account "$G":
no suitable peers for placement: peer selection cluster 'MQTT' with 3 peers
        offline: 0
        excludeTag: 0
        noTagMatch: 0
        noSpace: 0
        uniqueTag: 0
        misc: 0
         (10005)
```

Signed-off-by: Matthias Hanel <mh@synadia.com>

* review comment

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-06 00:17:01 +02:00
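A sketch of how such a discard-reason tally might be built and rendered as an error; the struct and field names are illustrative assumptions, not the server's actual types.

```go
package main

import "fmt"

// selectionReport tallies why candidate peers were discarded, mirroring the
// error output shown above. Hypothetical illustration only.
type selectionReport struct {
	cluster string
	peers   int

	offline, excludeTag, noTagMatch, noSpace, uniqueTag, misc int
}

// Error renders the counts, making selectionReport usable as an error value.
func (r *selectionReport) Error() string {
	return fmt.Sprintf(
		"no suitable peers for placement: peer selection cluster '%s' with %d peers\n"+
			"\toffline: %d\n\texcludeTag: %d\n\tnoTagMatch: %d\n"+
			"\tnoSpace: %d\n\tuniqueTag: %d\n\tmisc: %d",
		r.cluster, r.peers, r.offline, r.excludeTag, r.noTagMatch,
		r.noSpace, r.uniqueTag, r.misc)
}

func main() {
	var err error = &selectionReport{cluster: "C", peers: 3, excludeTag: 1, noTagMatch: 2}
	fmt.Println(err)
}
```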
Matthias Hanel
c56f3b9fbd Adding account purge operation (#3319)
* Adding account purge operation

The new request is available on the system account.
The subject to send the request to is $JS.API.ACCOUNT.PURGE.*,
with the name of the account to purge in place of the wildcard.

Also added directory cleanup code so that servers do not
end up with empty stream directories or account dirs that
only contain streams.

Also adds ACCOUNT to the leaf node domain rewrite table.

Addresses #3186 and #3306 by providing a way to
get rid of the streams of existing and non-existing accounts.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-05 18:24:19 +02:00
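Given the subject above, a purge request could be sent from a system-account connection roughly like this; the URL, credentials file, and account name are placeholders for your own deployment.

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect as a system-account user (placeholder URL and creds path).
	nc, err := nats.Connect("nats://127.0.0.1:4222",
		nats.UserCredentials("sys.creds"))
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// Replace the wildcard in $JS.API.ACCOUNT.PURGE.* with the account
	// to purge, per the commit message above.
	resp, err := nc.Request("$JS.API.ACCOUNT.PURGE.MY_ACCOUNT", nil, 2*time.Second)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(resp.Data))
}
```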
Derek Collison
5e98263de8 General stability improvements
Signed-off-by: Derek Collison <derek@nats.io>
2022-07-29 16:02:31 -07:00
Derek Collison
50a25881e2 Encrypt meta and raft states.
Signed-off-by: Derek Collison <derek@nats.io>
2022-07-29 08:10:57 -07:00
Ivan Kozlovic
5786d2d9d6 Changed "return" to "continue"
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-07-27 18:23:54 -06:00
Ivan Kozlovic
88203dd5d5 Fixed a panic when consumer is closed
Panic was:
```
=== RUN   TestJetStreamClusterDelete
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xcec8fb]
goroutine 1761 [running]:
github.com/nats-io/nats-server/v2/server.(*stream).config(0x0)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/stream.go:1192 +0x5b
github.com/nats-io/nats-server/v2/server.(*consumer).replica(0xc000101400)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3580 +0xea
github.com/nats-io/nats-server/v2/server.(*jetStream).monitorConsumer(0xc0001d2790, 0xc000101400, 0xc0004df0e0)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3733 +0xe06
github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateConsumer.func1()
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:3445 +0x4d
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3057 +0x85
FAIL	github.com/nats-io/nats-server/v2/server	9.911s
```

Seems to have been introduced in #3282.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-07-27 16:51:10 -06:00
Matthias Hanel
3358205de3 add implementation for consumer replica change (#3293)
* add implementation for consumer replica change

fixes #3262

also check peer list on every update

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-07-27 03:56:28 +02:00
Matthias Hanel
04ffed48b0 fix peer tracking by removing peers before scaledown (#3289)
In doRemovePeerAsLeader, the leader also records the removed peer in the removed set.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-07-26 22:01:03 +02:00
Matthias Hanel
6212087feb fix race by locking around o.isLeader (#3291)
Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-07-26 21:49:04 +02:00
Ivan Kozlovic
fe370955c8 Merge pull request #3288 from nats-io/debug_test_failure
[FIXED] JetStream: Some scaling up issues
2022-07-26 08:57:17 -06:00
Ivan Kozlovic
1a6c5f1c90 [FIXED] JetStream: Some scaling up issues
- Send snapshot only if leader.
- When processing a snapshot, start with a smaller inactivity interval
  that doubles up to 10sec, or use 10sec directly once we get a
  message. The reason is that it is possible that the request for the
  snapshot is sent while the leader has not yet set up the subscription
  that receives the requests (or the subscription has not fully
  propagated through the cluster).
- Don't remember snapfile on error.
- Do not consider the peer current if we have not had any activity.
- Stabilize stream scale-up under active heavy publishing.
- Due to the publish pressure, move the check for followers' direct
  subs spinning up until after we stop publishing.

Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-07-25 18:44:18 -06:00
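A sketch of the second bullet's backoff, with assumed values: start with a short inactivity interval, double it up to a 10-second cap, and jump straight to the cap once the first message arrives.

```go
package main

import (
	"fmt"
	"time"
)

// nextInterval doubles the inactivity interval up to a 10s cap, or returns
// the cap outright once a message has been received. Values are illustrative.
func nextInterval(cur time.Duration, gotMsg bool) time.Duration {
	const max = 10 * time.Second
	if gotMsg {
		return max
	}
	if cur *= 2; cur > max {
		cur = max
	}
	return cur
}

func main() {
	iv := 625 * time.Millisecond // hypothetical starting interval
	for i := 0; i < 6; i++ {
		fmt.Println(iv)
		iv = nextInterval(iv, false)
	}
	fmt.Println("after first message:", nextInterval(iv, true))
}
```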
Ivan Kozlovic
ebeca00e20 [FIXED] JetStream/Cluster: Stream names/infos would return bad response
If there are more stream names than the current limit of 1024, getting
the list of names would return them all instead of using pagination.

For "stream infos", the Total returned would be the API limit
instead of the actual number of streams.

Resolves https://github.com/nats-io/natscli/issues/541

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-07-25 14:41:05 -06:00
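A minimal sketch of offset-based pagination as described above: each response carries at most `limit` names plus the true total, so clients can page rather than receive everything at once. Function and variable names are illustrative.

```go
package main

import "fmt"

// pageNames returns one page of names and the true total count.
func pageNames(all []string, offset, limit int) (page []string, total int) {
	total = len(all)
	if offset >= total {
		return nil, total
	}
	end := offset + limit
	if end > total {
		end = total
	}
	return all[offset:end], total
}

func main() {
	names := make([]string, 2500) // more streams than the 1024 limit
	for i := range names {
		names[i] = fmt.Sprintf("STREAM_%04d", i)
	}
	const apiLimit = 1024
	for offset := 0; ; offset += apiLimit {
		page, total := pageNames(names, offset, apiLimit)
		if len(page) == 0 {
			break
		}
		fmt.Printf("offset=%d count=%d total=%d\n", offset, len(page), total)
	}
}
```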
Matthias Hanel
5a720d4977 down scale consumer before downscale of stream (#3282)
Now monitorStream waits to scale down the stream until all
monitorConsumer routines have scaled down their respective consumers.

Also update the consumer assignment for later use in monitorConsumer;
same for the stream assignment in monitorStream.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-07-22 19:54:13 +02:00
Ivan Kozlovic
a02a617c05 Merge pull request #3280 from nats-io/fix_3273
[IMPROVED] JetStream: stream already exists error description
2022-07-21 10:53:47 -06:00
Ivan Kozlovic
1da5ecfb96 [IMPROVED] JetStream: stream already exists error description
The `JSStreamNameExistErr` will now include in the description that
the stream exists with a different configuration, because that is
the error clients would get when trying to add a stream with a
different configuration (otherwise this is a no-op and clients
don't get an error).

Since that error was also used in the restore case, a new error is
added that uses the same description prefix "stream name already in
use" but adds ", cannot restore" to indicate that the restore failed
because the stream already exists.

Resolves #3273

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-07-21 10:20:07 -06:00
Derek Collison
f2abdaeb43 Make sure to protect against mset == nil
Signed-off-by: Derek Collison <derek@nats.io>
2022-07-21 06:53:26 -07:00
Matthias Hanel
89b5e872ac Move and cancel fixes (#3270)
The Move/Cancel/Downscale mechanism did not take into account that
the consumer's replica count can be set independently.

This also alters peer selection so it can skip the unique tag
prefix check for the server that will be replaced.
Say you have 3 AZs and want to add another server to az:1
in order to replace a server in the same zone.
Without this change, the uniqueTagPrefix check would filter out
the replacement server and cause a failure.

The cancel-move response could not be received due to
a wrong account name.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-07-18 18:42:03 +02:00
Matthias Hanel
023500e1da add the ability to cancel a move in progress (#3253)
* add the ability to cancel a move in progress

Move to individual subjects for move and cancel_move.

The new subjects are:
```
$JS.API.ACCOUNT.STREAM.MOVE.*.*
$JS.API.ACCOUNT.STREAM.CANCEL_MOVE.*.*
```

The last two tokens are the account and stream name.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-07-12 21:54:18 +02:00
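Composing the concrete subjects from the templates above; the token order (account before stream) and anything about the request payload are assumptions here, so treat this as an illustration rather than a reference.

```go
package main

import "fmt"

// Subject templates from the commit above; the two trailing tokens carry
// the account and stream name (order assumed for illustration).
const (
	moveT       = "$JS.API.ACCOUNT.STREAM.MOVE.%s.%s"
	cancelMoveT = "$JS.API.ACCOUNT.STREAM.CANCEL_MOVE.%s.%s"
)

func main() {
	account, stream := "MY_ACCOUNT", "ORDERS" // placeholders
	fmt.Printf(moveT+"\n", account, stream)
	fmt.Printf(cancelMoveT+"\n", account, stream)
}
```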
Derek Collison
85123861d4 Merge pull request #3249 from nats-io/catchup_eof
Fix for stalled catchup in endless cycle on EOF
2022-07-07 17:54:07 -07:00
Derek Collison
333e2fc2f1 Fix for stalled catchup in endless cycle on EOF trying to retrieve catchup msg.
A customer experienced an endless failure to have a stream catch up. The current leader was being asked for a message from a snapshot that was larger than what we had, resulting in an EOF which silently failed.
We now detect this and signal end of catchup and redo the bad snapshot if possible.

Signed-off-by: Derek Collison <derek@nats.io>
2022-07-07 13:42:41 -07:00
Matthias Hanel
f0ee56cf0a Fix unique_tag issue with stream replica increase
When increasing the replica count, unique tags for already existing
peers were ignored, which could lead to bad placement.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-07-07 21:22:55 +02:00
Derek Collison
c49d081341 Fix data race
Signed-off-by: Derek Collison <derek@nats.io>
2022-07-07 09:05:50 -07:00
Matthias Hanel
70be4b77f9 fixes peer removal, simplifies move, more tests
Make sure when processing a peer removal that the stream assignment agrees.
When a new leader takes over, it can resend a peer removal, and if the stream/consumer really was rescheduled, we could remove it by accident.

Also need to make sure that when we remove a stream we remove the node as part of the stream assignment.
If we didn't, and the same asset returned to this server, we would not start up the monitoring loop.

Simplify migration logic in monitorStream to be driven by the leader only.

Improved unit tests

Added failure when server not in peer list

Move command does not require server anymore

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-07-07 03:32:13 +02:00
Derek Collison
722ae548dd Fix data race
Signed-off-by: Derek Collison <derek@nats.io>
2022-07-06 09:11:22 -07:00
Derek Collison
47bef915ed Allow all members of a replicated stream to participate in direct access.
We will wait until a non-leader replica is current before subscribing.

Signed-off-by: Derek Collison <derek@nats.io>
2022-07-03 11:08:24 -07:00
Matthias Hanel
6bd14e1b7a removed commented out code (#3228)
Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-06-29 20:31:12 +02:00