nats-server

mirror of https://github.com/gogrlx/nats-server.git synced 2026-04-02 03:38:42 -07:00

Author	SHA1	Message	Date
Phil Pennock	65be9706b3	WIP: socket stats At this point, we're collecting for gateways, we have the general framework in place, and we're populating unpublished expvars.	2020-10-13 18:26:28 -04:00
Ivan Kozlovic	2ad2bed170	[ADDED] Support for route hostname resolution We previously simply called DialTimeout() on a route's url when soliciting. If it resolved to the IP of the host, it would create a route to self, which the server detects, but then would not try again with other IPs that would have allowed it to form a cluster with other servers running on the other IPs. This PR keeps track of local IPs + cluster port and excludes them from the list of IPs returned by LookupHost API. This even prevents solicitation of routes to self. Only non-local IPs will be tried. Resolves #1586 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-09-08 13:40:17 -06:00
Phil Pennock	3c680eceb9	Inhibit Go's default TCP keepalive settings for NATS (#1562) Inhibit Go's default TCP keepalive settings for NATS Go 1.13 changed the semantics of the tuning parameters for TCP keepalives, including the default value. This affects all TCP listeners. The NATS protocol has its own L7 keepalive system (PING/PONG) and the Go defaults are not a good fit for some valid deployment scenarios, while Go doesn't directly expose a working API for tuning these. Rather than add a configuration knob and pull in another dependency (with portability issues) just disable TCP keepalives for all listeners used for speaking the NATS protocol. Change the tests so we test the same logic. Do not change HTTP monitoring, profiling, or the websocket API listeners. Change KeepAlive on client connections too.	2020-08-14 13:37:59 -04:00
Ivan Kozlovic	b9764db478	Renamed gossipURLs type and moved its declaration to util.go Also made the add/remove/getAsStringSlice receiver for this type. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-07-16 11:22:58 -06:00
Ivan Kozlovic	9b0967a5d1	[FIXED] Handling of gossiped URLs If some servers in the cluster have the same connect URLs (due to the use of client advertise), then it would be possible to have a server send the connect_urls INFO update to clients with missing URLs. Resolves #1515 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-07-15 17:39:12 -06:00
Ivan Kozlovic	4d495104de	Fixed no_responders use of sendProtoNow() The call sendProtoNow() should not normally be used (only when setting up a connection when the writeLoop is not yet started and the server needs to send something before being able to start the writeLoop). Instead, code should use enqueueProto(). For this particular case though, use queueOutbound() directly and add to the producer's pcd map. Also fixed other places where we were using queueOutbound() + flushSignal() which is what enqueueProto is doing. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-07-09 17:55:14 -06:00
Ivan Kozlovic	9288283d90	Fixed accept loops that could leave connections opened This was discovered with the test TestLeafNodeWithGatewaysServerRestart that was sometimes failing. Investigation showed that when cluster B was shutdown, one of the servers on A that had a connection from B that just broke tried to reconnect (as part of reconnect retries of implicit gateways) to a server in B that was in the process of shutting down. The connection had been accepted but createGateway not called because the server's running boolean had been set to false as part of the shutdown. However, the connection was not closed so the server on A had a valid connection to a dead server from cluster B. When the B cluster (now single server) was restarted and a LeafNode connection connected to it, then the gateway from B to A was created, that server on A did not create outbound connection to that B server because it already had one (the zombie one). So this PR strengthens the starting of accept loops and also makes sure that if a connection (all types of connections) is not accepted because the server is shutting down, that connection is properly closed. Since all accept loops had almost the same code, made a generic function that accepts functions to call specific create connection functions. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-07-06 17:03:19 -06:00
Derek Collison	120402241a	Fix for #1486 Signed-off-by: Derek Collison <derek@nats.io>	2020-06-18 21:04:34 -07:00
Derek Collison	4dee03b587	Allow mixed TLS and non-TLS on same port Signed-off-by: Derek Collison <derek@nats.io>	2020-06-05 18:04:11 -07:00
Ivan Kozlovic	5dba3cdd75	[FIXED] Race condition during implicit Gateway reconnection Say server in cluster A accepts a connection from a server in cluster B. The gateway is implicit, in that A does not have a configured remote gateway to B. Then the server in B is shutdown, which A detects and initiates a single reconnect attempt (since it is implicit and if the reconnect retries is not set). While this happens, a new server in B is restarted and connects to A. If that happens before the initial reconnect attempt failed, A will register that new inbound and does not attempt to solicit because it already has a remote entry for gateway B. At this point when the reconnect to old server B fails, then the remote GW entry is removed, and A will not create an outbound connection to the new B server. We fix that by checking if there is a registered inbound when we get to the point of removing the remote on a failed implicit reconnect. If there is one, we try the reconnection. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-05-22 13:01:17 -06:00
Derek Collison	0129a7fa09	Header support for GWs Signed-off-by: Derek Collison <derek@nats.io>	2020-05-19 14:33:56 -07:00
Derek Collison	ea5e5bd364	Services rewrite #2 This contains a rewrite to the services layer for exporting and importing. The code this merges to already had a first significant rewrite that moved from special interest processing to plain subscriptions. This code changes the prior version's dealing with reverse mapping which was based mostly on thresholds and manual pruning, with some sporadic timer usage. This version uses the jetstream branch's code that understands interest and failed deliveries. So this code is much more tuned to reacting to interest changes. It also removes thresholds and goes only by interest changes or expirations based around a new service export property, response thresholds. This allows a service provider to provide semantics on how long a response should take at a maximum. This commit also introduces formal support for service export streamed and chunked response types, which send an empty message to signify EOF. This commit also includes additions to the service latency tracking such that errors are now sent, not only successful interactions. We have added a Status field and an optional Error field to ServiceLatency. We support the following Status codes, these are directly from HTTP: 400 Bad Request (request did not have a reply subject); 408 Request Timeout (when system detects request interest went away, old request style to make dependable); 503 Service Unavailable (no service responders running); 504 Service Timeout (The new response threshold expired). Signed-off-by: Derek Collison <derek@nats.io>	2020-05-19 14:26:46 -07:00
Derek Collison	df774e44b0	Rework how service imports are handled to avoid performance hits Signed-off-by: Derek Collison <derek@nats.io>	2020-05-19 14:18:34 -07:00
Derek Collison	8d1f3cc7c2	Allow JetStream consumers to work across multi-server hops Signed-off-by: Derek Collison <derek@nats.io>	2020-05-19 14:16:03 -07:00
Derek Collison	0c2d539b06	Remote request API Signed-off-by: Derek Collison <derek@nats.io>	2020-05-19 14:13:22 -07:00
Derek Collison	0fb7ee32bc	Auto-expiration of ephemeral push based observables Signed-off-by: Derek Collison <derek@nats.io>	2020-05-19 14:07:02 -07:00
Ivan Kozlovic	fef94759ab	[FIXED] Update remote gateway URLs when node goes away in cluster If a node in the cluster goes away, an async INFO is sent to inbound gateway connections so they get a chance to update their list of remote gateway URLs. Same happens when a node is added to the cluster. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-04-20 13:48:47 -06:00
Derek Collison	82f585d83a	Updated to also resend leafnode connect on GW connect via first INFO Signed-off-by: Derek Collison <derek@nats.io>	2020-04-08 09:55:19 -07:00
Matthias Hanel	6a1c3fc29b	Moving inbound tracing to the caller (client.parse) Tracing for outgoing operations is always done while holding the client lock. Signed-off-by: Matthias Hanel <mh@synadia.com>	2020-03-04 17:31:18 -05:00
Matthias Hanel	fe373ac597	Incorporating comments. c -> client defer in oneliner argument order Signed-off-by: Matthias Hanel <mh@synadia.com>	2020-03-04 15:48:19 -05:00
Matthias Hanel	f5bd07b36c	[FIXED] trace/debug/sys_log reload will affect existing clients Fixed #1296, by altering client state on reload Detect a trace level change on reload and update all clients. To avoid data races, read client.trace while holding the lock, pass the value into functions that trace while not holding the lock. Delete unused client.debug. Signed-off-by: Matthias Hanel <mh@synadia.com>	2020-03-04 13:54:15 -05:00
Ivan Kozlovic	47b08335a4	[FIXED] Reset of tlsName only for x509.HostnameError For issue #1256, we cleared the possibly saved tlsName on Handshake failure. However, this meant that for normal use cases, if a reconnect failed for any reason we would not be able to reconnect if it is an IP until we get back to the URL that contained the hostname. We now clear only if the handshake error is of x509.HostnameError type, which includes errors such as: ``` "x509: Common Name is not a valid hostname: <x>" "x509: cannot validate certificate for <x> because it doesn't contain any IP SANs" "x509: certificate is not valid for any names, but wanted to match <x>" "x509: certificate is valid for <x>, not <y>" ``` Applied the same logic to solicited gateway connections, and fixed the fact that the tlsConfig should be cloned (since we set the ServerName). I have also made a change for leafnode connections similar to what we are doing for gateway connections, which is to use the saved tlsName only if tlsConfig.ServerName is empty, which may not be the case for users that embed NATS Server and pass directly tls configuration. In other words, if the option TLSConfig.ServerName is not empty, always use this value. Relates to #1256 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-01-28 13:16:38 -07:00
Ivan Kozlovic	c097357b52	[FIXED] More than expected switch to Interest-Only mode for account When an account is switched to interest-only mode due to no interest, it was not possible to switch that account more than once. But the function switchAccountToInterestMode() that triggers a switch could possibly do it more than once. This should not cause problems but increased the number of traces in a big super cluster. Also fixed some flappers and a data race. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-01-09 13:35:08 -07:00
Ivan Kozlovic	c73be88ac0	Updated based on comments Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-01-06 16:57:48 -07:00
Ivan Kozlovic	947798231b	[UPDATED] TCP Write and SlowConsumer handling - All writes will now be done by the writeLoop, unless when the writeLoop has not been started yet (likely in connection init). - Slow consumers for non CLIENT connections will be reported but not failed. The idea is that routes, gateway, etc.. connections should stay connected as much as possible. However if a flush operation times out and no data at all has been written, the connection will be closed (regardless of type). - Slow consumers due to max pending is only for CLIENT connections. This allows sending of SUBs through routes, etc.. to not have to be chunked. - The backpressure to CLIENT connections is increased (up to 1sec) based on the sub's connection pending bytes level. - Connection is flushed on close from the writeLoop as to not block the "fast path". Some tests have been fixed and adapted since now closeConnection() is not flushing/closing/removing connection in place. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-12-31 15:06:27 -07:00
Ivan Kozlovic	a22da91647	[FIXED] Closing of Gateway or Route TLS connection may hang This could happen if the remote server is running but not dequeueing from the socket. TLS connection Close() may send/read and so we need to protect with a deadline. For non client/leaf connection, do not call flushOutbound(). Set the write deadline regardless of handshakeComplete flag, and set it to a low value. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-12-04 17:27:00 -07:00
Ivan Kozlovic	a0f8bd112e	[FIXED] Prevent A- for account that has service reply subscription Prevent sending an A- for a given account if the server has this account registered and an internal service reply subscription. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-26 16:21:36 -07:00
Derek Collison	b2cbde2616	Match comment about hash size Signed-off-by: Derek Collison <derek@nats.io>	2019-11-16 17:56:06 -08:00
Ivan Kozlovic	9b837813b1	Process service replies in gateway inbound Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-16 17:43:44 -07:00
Derek Collison	747ba1dc09	Change , remove T placeholder, 8 to 6 on hash len Signed-off-by: Derek Collison <derek@nats.io>	2019-11-16 13:06:56 -08:00
Derek Collison	6ad8287bbe	Introduced wildcard handling of _R_ mapped replies. We had too much special processing, so reduced to a single wildcard which will propagate across routes and gateways and is consistent with gateway handling of globally routed subjects and timeouts. Signed-off-by: Derek Collison <derek@nats.io>	2019-11-16 12:50:53 -08:00
Ivan Kozlovic	d046f7945f	Bump defaultGatewayRecentSubExpiration and RC2 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-15 10:06:38 -07:00
Ivan Kozlovic	b561bde366	Alternate approach to GW reply mapping expiration Use centralized sync map to gather *client that have GW replies. Tested with concurrent receiving clients and perf is as good as with timer per client but reduces need of that timer per client object. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-11 13:36:24 -07:00
Ivan Kozlovic	8a8695d07c	Backward compatibility with previous servers Want to keep this commit separate so that we can easily remove when we no longer want to support both prefixes. - If this server receives a "$GR." message, it takes the subject and tries to process this locally. If there is no cluster race reply may be received ok (like before). - If this server sends a routed reply, it detects if sending to an older server (then uses $GR.) or not (then uses $GNR) - Gateway INFO has a new field that indicates if the server is using the new prefix. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-08 16:22:34 -07:00
Ivan Kozlovic	9b7dab0548	Updates based on code review - Add atomic in client to skip check in processInboundClientMsg() if value is 0. Avoids getting the lock in fast path if not needed. - Have a timer per client instead of the global server list that was expiring: noticed a lot of contention there when running some perf/profiling tests. The timer is also not reset for every timestamp that is not yet expired since this too affects performance. Instead it fires at regular intervals and is cleared when the map is empty after a cycle. - Move processing of gw map reply to its own function (in inbound msg). I have verified that this is inlined same way as when code was directly in processInboundClientMsg. - Use string(subj[]) for prefix detection: I have verified that it is actually faster. - Builds the RMSG with appends to local buffer in handleGatewayReply() instead of using fmt.Sprintf(). Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-08 15:56:28 -07:00
Ivan Kozlovic	aa843945c9	Work on Gateways reply mapping - New prefix that includes origin server for the request - Mapping done if request is service import or requestor has recent subscription - Subscription considered recent if less than 250ms - Destination server strips GW prefix before giving to client and restores it when getting a reply on that subject - Mapping removed after 250ms - Server rejects client publish on "$GNR." (the new prefix) - Cluster and server hash are now 8 chars long and from base 62 alphabets - Mapped replies need to be sent to leafnode servers due to race (cluster B sends RS+ on GW inbound then RMSG on outbound, the RS+ may be processed later and cluster A may have given message to LN before RS+ on reply subject. So LN needs to accept the mapped reply but will strip to give to client and reassemble before sending it back) Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-06 16:06:49 -07:00
Ivan Kozlovic	75ec78c232	[FIXED] Explicit gateway not using discovered URLs If cluster A configures a gateway to cluster B, the server on A tries to connect to that server URL. If there is no server on B at that address, but a server on B with different address connects to server on cluster A, that server should be able to create its outbound connection in response. That was not the case because the configured URLs were snapshot before the loop of trying to connect. When accepting an inbound connection and updating the array, this new URL was not being used. The issue is only if the server on A had no outbound connection at that time. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-10-24 16:40:38 -06:00
Derek Collison	94f143ccce	Latency tracking updates. Will now breakout the internal NATS latency to show requestor client RTT, responder client RTT and any internal latency caused by hopping between servers, etc. Signed-off-by: Derek Collison <derek@nats.io>	2019-09-11 16:43:19 -07:00
Ivan Kozlovic	cd9f898eb0	Made a server's helper to set first ping timer Defaults to 1sec but will be opts.PingInterval if value is lower. All non client connections invoked this function for the first PING. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-08-26 10:21:43 -06:00
Ivan Kozlovic	c20afd4016	[FIXED] Connection could be closed twice This was introduced in PR#930. The first commit had the route's check if the flushOutbound() returned false, and if so would locally unlock/lock the connection's lock. Unfortunately, this was replaced in the second commit (`a6aeed3a6b`) to the flushOutbound() function itself. This causes the function closeConnection() to possibly unlock the connection while calling flushOutbound(), which if the connection is closed due to both a tls timeout for instance and explicitly, it would result in the connection being scheduled for a reconnect (if explicit gateway connection, possibly route). Added defensive code in Gateway to register a unique outbound gateway. Fixed a test that was now failing with newer Go version in which they fixed url.Parse() Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-08-13 20:11:03 -06:00
Ivan Kozlovic	0a72993d80	Add warning for TLS insecure setting on LeafNodes Also fix for #1071 in that we need to check remote gateways TLS config even if main gateway section is not configured with TLS. Related to #1071 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-07-12 17:22:57 -06:00
Ivan Kozlovic	9e09486e26	Use all caps for the production message Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-07-12 13:44:01 -06:00
Ivan Kozlovic	37d08a6c56	[FIXED] Allow TLS InsecureSkipVerify again This has an effect only on connections created by the server, so routes and gateways (explicit and implicit). Make sure that an explicit warning is printed if the insecure property is set, but otherwise allow it. Resolves #1062 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-07-12 12:10:28 -06:00
Derek Collison	8168aa1f81	Allow sublist cache to be disabled globally Signed-off-by: Derek Collison <derek@nats.io>	2019-07-02 07:34:02 -07:00
Derek Collison	d246359dc8	Merge pull request #1028 from nats-io/leaf_gw_si Bug fix for service import with leafnodes and gws	2019-05-31 11:29:33 -07:00
Derek Collison	3cf6f6a5d2	Bug fix for service import with leafnodes and gws Signed-off-by: Derek Collison <derek@nats.io>	2019-05-31 11:22:02 -07:00
Ivan Kozlovic	37f4e71246	Fixed race due to use of byte slice instead of string The go routine that is started during interest mode switch was using the accName (which was a byte slice) instead of account, which was a string copy of that byte slice. It meant that when printing the notice, the underlying buffer may have been overwritten by the readloop. Changing accName to a string - since we were doing a copy anyway, better change it at the function param level. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-05-30 18:43:01 -06:00
Ivan Kozlovic	37b3546e7b	Switch gateway to InterestMode only once When a leafnode connection is created, the server forces all gateway inbound connections to switch to InterestMode. Do this only once, regardless of how many times the LN (re)connects. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-05-30 17:21:15 -06:00
Ivan Kozlovic	66f5325cee	Merge pull request #1018 from nats-io/gw_log_interest_switch Added logging of account interest mode switch for gateways	2019-05-28 15:33:06 -06:00
Ivan Kozlovic	f5991e8a2b	Merge pull request #1015 from nats-io/restore_conn_error_default_attempts_to_one Update to connect/reconnect error reports logic	2019-05-28 14:57:29 -06:00


98 Commits