nats-server

mirror of https://github.com/gogrlx/nats-server.git synced 2026-04-17 03:24:40 -07:00

Author	SHA1	Message	Date
Ivan Kozlovic	304744ce08	Merge pull request #3615 from nats-io/js_acc_max_streams_consumers [FIXED] JetStream: Account max streams/consumers not always honoured	2022-11-09 18:02:51 -07:00
Ivan Kozlovic	1b892837cb	[FIXED] JetStream: Account max streams/consumers not always honoured This could happen during concurrent requests where the assignments are not yet fully processed. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-11-09 17:29:20 -07:00
Derek Collison	e008e015b3	Make sure to enforce HA asset limits during peer processing as well as assignment. Signed-off-by: Derek Collison <derek@nats.io>	2022-11-09 16:24:54 -08:00
Ivan Kozlovic	ca237bdfa0	[FIXED] JetStream: Stream scale down while it has no quorum If an R2 stream had one of its servers network-partitioned and at that time the stream was edited to be scaled down to an R1, the stream would no longer have quorum even after the network partition was resolved. Signed-off-by: Derek Collison <derek@nats.io> Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-11-04 09:08:31 -06:00
Derek Collison	56919ebc97	On stream proposal failures we could accidentally warn on high stream lag. We were not taking the clfs into account. Signed-off-by: Derek Collison <derek@nats.io>	2022-11-02 14:40:31 -07:00
Ivan Kozlovic	ab4470ccdc	[FIXED] JetStream: possible panic in some rare cases Very difficult to reproduce. Had to run TestJetStreamSuperClusterMoveCancel in covermode=atomic on a slow machine to hit the condition where the monitorConsumer goroutine is started but the RAFT node is nil, which caused the warning message to produce the panic (since n is nil). Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-11-02 10:02:09 -06:00
Ivan Kozlovic	55e651c118	[FIXED] JetStream: processing of snapshot with expired messages The issue was that a "first sequence mismatch" during processing of a snapshot was causing the state to be reset and a lot of catchup from the follower. An attempt to fix that in PR #3567 caused an issue that was addressed in PR #3589. However, this then caused the follower to sometimes never be able to catch up, or to take a very long time. This PR - we believe - addresses the original and subsequent issues. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-11-01 12:58:45 -06:00
Derek Collison	121bf6ebb5	Move to past check for nil Signed-off-by: Derek Collison <derek@nats.io>	2022-10-27 17:30:07 -07:00
Derek Collison	2241ad089e	Make local error since non-fatal for now. Signed-off-by: Derek Collison <derek@nats.io>	2022-10-25 16:56:10 -07:00
Derek Collison	aa52c2fecf	Added warning for high message lag into a clustered stream. Signed-off-by: Derek Collison <derek@nats.io>	2022-10-25 16:11:35 -07:00
Derek Collison	db13766f18	Merge pull request #3576 from nats-io/signal-pull-consumers Removed ephemeral consumer migration.	2022-10-25 17:35:35 -05:00
Derek Collison	f0afa49b9f	Make sure to stop raft nodes on all monitor exits. Signed-off-by: Derek Collison <derek@nats.io>	2022-10-25 14:48:28 -07:00
Derek Collison	ff2cd1d7f9	Fixed test and bug that would override consumer replicas. Signed-off-by: Derek Collison <derek@nats.io>	2022-10-25 14:35:20 -07:00
Ivan Kozlovic	7ca85e0e80	[FIXED] JetStream: Update of an R1 consumer would not get a response The update was accepted but the server would not respond to the client/CLI. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-10-25 09:04:35 -06:00
Ivan Kozlovic	f8aa3ac11d	[FIXED] JetStream: "first sequence mismatch" error on catchup with message expiration When a server was restarted and expired messages, but the leader had a snapshot that still had the old messages, we would reset the complete follower stream state. This fix just skips over the expired messages as we prepare the request to the leader. Resolves #3516 Signed-off-by: Derek Collison <derek@nats.io> Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-10-17 17:02:08 -06:00
Ivan Kozlovic	9bd11580e3	[FIXED] JetStream: User-defined ephemeral Name not used in cluster mode If the user sends a CONSUMER.CREATE request with a configuration that specifies the name that the user wants for the ephemeral consumer, this would not work in cluster mode, that is, the server would still pick a name instead of using the provided one. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-10-10 13:48:38 -06:00
Ivan Kozlovic	3472f6aec2	[FIXED] JetStream: unresponsiveness while creating raft group Originally, createRaftGroup() would not hold the jetstream's lock for the whole duration. But some race reports made us change this function to keep the lock for the whole duration. A test called TestJetStreamClusterRaceOnRAFTCreate() was demonstrating the race between "consumer info" request handling and createRaftGroup code. Since then, the race has been fixed, so this PR restores the more fine-grained locking inside createRaftGroup. Resolves #3516 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-10-04 17:27:36 -06:00
Derek Collison	52b5cd12bb	Allow meta layer to snapshot on a clean shutdown. Signed-off-by: Derek Collison <derek@nats.io>	2022-09-29 09:17:12 -06:00
Ivan Kozlovic	e151cfcd57	[FIXED] JetStream: Scale down of consumer to R1 would not get a response Updating a consumer configuration from say R3 to R1 would work but no response was received by the client sending the request. Resolves #3493 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-09-27 10:02:31 -06:00
Ivan Kozlovic	170ff49837	[ADDED] JetStream: peer (the hash of server name) in statsz/jsz A request to `$SYS.REQ.SERVER.PING.JSZ` would now return something like this: ``` ... "meta_cluster": { "name": "local", "leader": "A", "peer": "NUmM6cRx", "replicas": [ { "name": "B", "current": true, "active": 690369000, "peer": "b2oh2L6w" }, { "name": "Server name unknown at this time (peerID: jZ6RvVRH)", "current": false, "offline": true, "active": 0, "peer": "jZ6RvVRH" } ], "cluster_size": 3 } ``` Note the "peer" field following the "leader" field that contains the server name. The new field is the node ID, which is a hash of the server name. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-09-16 15:31:37 -06:00
Ivan Kozlovic	378fed164d	[FIXED] JetStream: possible panic on peer remove on server shutdown This was discovered by new test TestJetStreamClusterRemovePeerByID. I saw this on Travis and repeating the test locally with -count=10 I was able to reproduce. The issue is cc.meta being nil but accessing cc.meta.ID() directly. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-09-16 15:06:58 -06:00
Ivan Kozlovic	f113163b9f	Change ByID boolean to Peer string and add Peer id in replicas output The CLI will now be able to display the peer IDs in MetaGroupInfo if it chooses to do so, and possibly help the user select the peer ID from a list with a new command to remove by peer ID instead of by server name. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-09-15 10:39:23 -06:00
Deepak	e9ce118c56	Fix peer randomisation when creating consumers groups for replica=1 Signed-off-by: Deepak <sah.sslpu@gmail.com>	2022-09-14 13:58:49 +05:30
Matthias Hanel	f7cb5b1f0d	changed format of JSClusterNoPeers error (#3459) * changed format of JSClusterNoPeers error This error was introduced in #3342 and reveals too much information. This change gets rid of cluster names and peer counts. All other counts were changed to booleans, which are only included in the output when the filter was hit. In addition, the set of non-matching tags is included. Furthermore, the static error description in server/errors.json is moved into selectPeerError. Sample errors: 1) no suitable peers for placement, tags not matched ['cloud:GCP', 'country:US'] 2) no suitable peers for placement, insufficient storage Signed-off-by: Matthias Hanel <mh@synadia.com> Signed-off-by: Ivan Kozlovic <ivan@synadia.com> Co-authored-by: Ivan Kozlovic <ivan@synadia.com>	2022-09-08 18:25:48 -07:00
Derek Collison	c3203a3bb5	Use lostQuorum default versus live for reporting. Signed-off-by: Derek Collison <derek@nats.io>	2022-09-07 13:56:38 -07:00
Derek Collison	b86e941ce4	tweak lost quorum reporting Signed-off-by: Derek Collison <derek@nats.io>	2022-09-07 10:57:01 -07:00
Derek Collison	fbf2233e4a	Only complain about leaderless group with previous leader if we know jetstream has been running for some threshold. Signed-off-by: Derek Collison <derek@nats.io>	2022-09-07 08:47:55 -07:00
Ivan Kozlovic	5573933034	Bump back the defaultMaxTotalCatchupOutBytes to 128MB Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-31 09:19:28 -06:00
Derek Collison	98bf861a7a	Updates to stream and consumer move logic. Signed-off-by: Derek Collison <derek@nats.io>	2022-08-30 16:11:35 -07:00
Derek Collison	56e177c329	Allow stream msgs to be compressed within the raft layer and during upper layer catchups. Signed-off-by: Derek Collison <derek@nats.io>	2022-08-30 16:10:57 -07:00
Ivan Kozlovic	9a6a2c31ee	[ADDED] JetStream: Ability to configure the per server max catchup bytes The original value was hardcoded to 128MB and 32MB per stream. The per-server limit is lowered to 32MB but is configurable with a new configuration parameter: ``` jetstream { max_catchup: 8MB } ``` The per-stream limit was also lowered from 32MB/128,000msgs to 8MB/32,000 messages. Tests have shown no difference in performance for fast links. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-30 13:46:13 -06:00
Ivan Kozlovic	e609d12061	[FIXED] Stream info numbers may be 0 after cluster restart This would happen after multiple replicas changes. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-30 08:49:39 -06:00
Ivan Kozlovic	8c23bfea5d	Revert a change made in PR #3392 It seems to cause problems when upgrading from a v2.7.4 to main branch. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-25 14:15:59 -06:00
Matthias Hanel	970491debc	scale down happened too soon when currentCount != replicas Signed-off-by: Matthias Hanel <mh@synadia.com>	2022-08-23 17:44:56 -07:00
Derek Collison	212adf5775	General improvements to clustered streams during server restart and KV/CAS scenarios. Signed-off-by: Derek Collison <derek@nats.io>	2022-08-22 18:36:15 -07:00
Ivan Kozlovic	5663bc2fa3	Reduce length of some clustering tests Since PR #3381, the 2 tests modified here would take twice as long (around 245 seconds) to complete. Talking with Matthias, he suggested using a variable instead of a const and set it to 0 for those 2 tests since they don't really need that to be set. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-22 12:35:37 -06:00
Ivan Kozlovic	b1822e1b4c	Some more checks for cc.meta == nil Missed those when re-running the previous test for longer period of time. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-22 11:06:04 -06:00
Ivan Kozlovic	c30445657f	Fixed possible panic in monitorStream Saw this panic in a code coverage run: ``` === RUN TestJetStreamClusterPeerExclusionTag panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x88 pc=0x8acd55] goroutine 97850 [running]: github.com/nats-io/nats-server/v2/server.(*jetStream).monitorStream(0xc002b94780, 0xc001ecb500, 0xc003229b00, 0x0) /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:1653 +0x495 github.com/nats-io/nats-server/v2/server.(*jetStream).processClusterCreateStream.func1() /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2953 +0x3b created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine /home/runner/work/nats-server/src/github.com/nats-io/nats-server/server/server.go:3063 +0xa7 ``` Was able to reproduce, and the reason was that `meta` was nil. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-22 09:52:05 -06:00
Matthias Hanel	6bf50dbb77	induce delay prior to scale down (#3381) This is to avoid a narrow race between adding servers and their catching up, where they also register as current. Also wait for all peers to be caught up. This also avoids clearing the catchup marker once catchup stalled. A stalled catchup would remove the marker, causing the peer to register as current. Signed-off-by: Matthias Hanel <mh@synadia.com>	2022-08-18 13:47:40 -07:00
Matthias Hanel	9892a132e7	Improve StreamMoveInProgressError (#3376) by adding progress indicators Signed-off-by: Matthias Hanel <mh@synadia.com>	2022-08-17 15:12:32 -07:00
Derek Collison	9c9de656c6	We can't purge directories here since not 100% sure all state is in snapshot. Signed-off-by: Derek Collison <derek@nats.io>	2022-08-17 14:57:19 -07:00
Ivan Kozlovic	7de4497815	Install consumer snapshot on clean exit and few other fixes - didRemove in applyMetaEntries() could be reset when processing multiple entries - change "no race" test names to include JetStream - separate raft nodes leader stepdown and stop in server shutdown process - in InstallSnapshot, call wal.Compact() with lastIndex+1 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-16 17:05:49 -06:00
Matthias Hanel	c6e37cf7af	Fix race between stream stop and monitorStream (#3350) * Fix race between stream stop and monitorStream. monitorCluster stops the stream; when doing so, monitorStream needs to be stopped to avoid miscounting of store size. In a test, stop and reset of store size happened first and was then followed by storing more messages via monitorStream. Signed-off-by: Matthias Hanel <mh@synadia.com>	2022-08-10 19:01:21 +02:00
Ivan Kozlovic	502e5b13f7	Declare some catchup static errors Use `var .. = errors.New()`. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-08 17:51:31 -06:00
Ivan Kozlovic	ecddb08469	[IMPROVED] JetStream catchup can be aborted and better flow control If the leader sends messages but the follower for any reason aborts or retries the snapshot process, it will now send the error that caused this, and the leader can then abort the catchup instead of waiting for its inactivity threshold of 5 seconds. Also, the send of a batch is now delayed for a bit until the number of "acks" reaches 1/2 of the batch size or 100ms have passed. This helps avoid trickling of messages. Tested with the new test TestJetStreamSuperClusterStreamCathupLongRTT() and saw better results in batch size, and overall time is smaller or similar but not longer. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-08 17:19:36 -06:00
Derek Collison	06112d6885	Reset activity interval on catchup to default vs ramp up. Tweak test. Signed-off-by: Derek Collison <derek@nats.io>	2022-08-08 11:06:10 -06:00
Derek Collison	758b733d43	Attempt to improve long RTT catchup time during stream moves. Signed-off-by: Derek Collison <derek@nats.io>	2022-08-08 11:06:10 -06:00
Derek Collison	e635de7526	Additional stability improvements for catchup. Signed-off-by: Derek Collison <derek@nats.io>	2022-08-08 11:06:10 -06:00
Derek Collison	5a050fc10b	Improve handling when a snapshot represents state we no longer have. We would send skip messages for a sync request that was completely below our current state, but this could be more traffic than we might want. Now we only send EOF and the other side can detect the skip forward and adjust on a successful catchup. We still send skips if we can partially fill the sync request. Signed-off-by: Derek Collison <derek@nats.io>	2022-08-08 11:06:08 -06:00
Ivan Kozlovic	d96e801825	Change the report to something like this instead: ``` Replica: Server name unknown at this time (peerID: jZ6RvVRH), outdated, OFFLINE, not seen ``` After discussing with @ripienaar, this text better conveys that this is a transient situation. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-08-08 09:30:37 -06:00


368 Commits