nats-server

mirror of https://github.com/gogrlx/nats-server.git synced 2026-04-14 02:07:59 -07:00

Author	SHA1	Message	Date
Ivan Kozlovic	98c1f0ecb2	Fixed some data race and some flappers Got a data race: ``` ================== WARNING: DATA RACE Write at 0x00c001c736b0 by goroutine 605: runtime.mapassign_faststr() /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/runtime/map_faststr.go:202 +0x0 github.com/nats-io/nats-server/v2/server.(Account).addServiceImport() /home/travis/gopath/src/github.com/nats-io/nats-server/server/accounts.go:1868 +0xb7b github.com/nats-io/nats-server/v2/server.(Account).AddServiceImportWithClaim() ... Previous read at 0x00c001c736b0 by goroutine 301: runtime.mapaccess2_faststr() /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/runtime/map_faststr.go:107 +0x0 github.com/nats-io/nats-server/v2/server.(Server).registerSystemImports() /home/travis/gopath/src/github.com/nats-io/nats-server/server/events.go:1577 +0x284 github.com/nats-io/nats-server/v2/server.(Server).updateAccountClaimsWithRefresh() ... ``` Also, removed a condition in gateway.go on how we were checking if a subject was a service reply, which was causing a test to flap. Finally, used AckSync() in a test (instead of m.Respond(nil)) to prevent it from flapping. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-03-29 19:02:41 -06:00
Ivan Kozlovic	63c750e295	[CHANGED] Gateway: Detect duplicate names between clusters Gateway connection will be closed and error reported if a remote has a name that is a duplicate of the local cluster. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-03-15 15:00:13 -06:00
Ivan Kozlovic	b4128693ed	Ensure file path is correct during stream restore Also had to change all references from `path.` to `filepath.` when dealing with files, so that it works properly on Windows. Fixed also lots of tests to defer the shutdown of the server after the removal of the storage, and fixed some config files directories to use the single quote `'` to surround the file path, again to work on Windows. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2022-03-09 13:31:51 -07:00
Ivan Kozlovic	5fc9e0e1cc	[FIXED] Gateway URLs gossip and `/varz` report issues - When detecting duplicate route, it was possible that a server would lose track of the peer's gateway URL, which would prevent it from gossiping that URL to inbound gateway connections - When a server has gateways enabled and has as a remote its own gateway, the monitoring endpoint `/varz` would include it but without the "urls" array. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2021-10-28 12:05:30 -06:00
Phil Pennock	fc6df0fbbc	Redact URLs before logging or returning in error (#2643 ) * Redact URLs before logging or returning in error This does not affect strings which failed to parse, and in such a scenario there's a mix of "which evil" to accept; we can't sanely find what should be redacted in those cases, so we leave them alone for debugging. The JWT library returns some errors for Operator URLs, but it rejects URLs which contain userinfo, so there can't be passwords in those and they're safe. Fixes #2597 * Test the URL redaction auxiliary functions * End-to-end tests for secrets in debug/trace Create internal/testhelper and move DummyLogger there, so it can be used from the test/ sub-dir too. Let DummyLogger optionally accumulate all log messages, not just retain the last-seen message. Confirm no passwords logged by TestLeafNodeBasicAuthFailover. Change TestNoPasswordsFromConnectTrace to check all trace messages, not just the most recent. Validate existing trace redaction in TestRouteToSelf. * Test for password in solicited route reconnect debug	2021-10-27 12:44:59 -04:00
Ivan Kozlovic	a025ce7472	Set defaultServerOptions port to -1 for random Updated some tests based on this change but also missing defer connection close or server shutdown. Fixed how the OCSP run go routine would shutdown, which would never complete because grWG was not decremented by this go routine prior to invoking s.Shutdown() Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2021-09-02 14:22:56 -06:00
Ivan Kozlovic	2881e4a1f0	[FIXED] MQTT fixes and improvements Some issues that have been fixed would manifest by timeouts on connect, unexpected memory usage on high publish message rate. Some details: - Replies were not always GW routed properly because we were looking at the wrong connection's rsubs - GW routed replies would not be found because they were tracked in the subscription's client object, which may not be the same used to send the reply - Increased the mqtt timeout to wait for JS replies since in some tests it was sometimes taking more than the original 2 seconds - Incoming gateway messages destined for an MQTT internal subscription may have been rejected as a no interest if the account had service imports - Don't use time.After(), instead create explicit timer so it can be stopped when not timing out. - Unnecessary copy of a slice since we were converting to a string anyway. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2021-05-04 20:48:14 -06:00
Jaime Piña	27e9628c3a	Run gofmt -s to simplify code	2021-04-09 15:18:06 -07:00
Jaime Piña	d929ee1348	Check errors when removing test directories and files Currently in tests, we have calls to os.Remove and os.RemoveAll where we don't check the returned error. This hides useful error messages when tests fail to run, such as "too many open files". This change checks for more filesystem related errors and calls t.Fatal if there is an error.	2021-04-07 11:09:47 -07:00
Ivan Kozlovic	cbcff97244	[CHANGED] Move Gateway interest-only mode switch from INF to DBG Also fixed a test that would sometimes fail depending on timing. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2021-03-14 11:34:36 -06:00
Ivan Kozlovic	8598de6dbe	[FIXED] Gateway's implicit connection not using global user/pass If a gateway is configured with an authorization block containing username and password and accepts an unknown Gateway connection, when initiating the outbound connection, it should use the gateway authorization's user/pass information. Resolves #1912 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2021-02-16 10:06:06 -07:00
Derek Collison	f0cdf89c61	JetStream Clustering WIP Signed-off-by: Derek Collison <derek@nats.io>	2021-01-14 01:14:52 -08:00
Ivan Kozlovic	d24e9b75b3	Fixed GW implicit reconnection PR #1412 had a fix for races during implicit GW reconnection. However, the fix was a bit too simplistic in that it was checking only if there was any inbound gateway to decide to try to reconnect an implicit disconnected GW. We need to check the name, not only presence of inbound GW connections. Related to #1412 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-12-28 12:28:55 -07:00
Ivan Kozlovic	fc1521636c	[FIXED] Config reload for gateways/leaf remote TLS configurations Presence of TLS config in any remote gateway or leafnode would cause the config reload to fail (because TLS config internal content may change which fails the DeepEqual check). This PR excludes the TLS configs in such case to check for changes in gateways and leafnodes. Although GW and LN config reload is technically supported, this PR updates the internal remotes' TLS configuration so that changes/updates to TLS certificates would take effect after a configuration reload. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-12-11 16:56:25 -07:00
Ivan Kozlovic	ffd476357e	[CHANGED] Gateway connections now always send PINGs Connections normally suppress sending PINGs if there was some activity. We now force GATEWAY connections to send PINGs at the configured interval or 15 seconds, whichever is the smallest. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-11-03 13:13:09 -07:00
Ivan Kozlovic	df9d5f5fd9	Accepting route warns if remote server has same name Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-10-08 17:59:33 -06:00
Phil Pennock	3c680eceb9	Inhibit Go's default TCP keepalive settings for NATS (#1562 ) Inhibit Go's default TCP keepalive settings for NATS Go 1.13 changed the semantics of the tuning parameters for TCP keepalives, including the default value. This affects all TCP listeners. The NATS protocol has its own L7 keepalive system (PING/PONG) and the Go defaults are not a good fit for some valid deployment scenarios, while Go doesn't directly expose a working API for tuning these. Rather than add a configuration knob and pull in another dependency (with portability issues) just disable TCP keepalives for all listeners used for speaking the NATS protocol. Change the tests so we test the same logic. Do not change HTTP monitoring, profiling, or the websocket API listeners. Change KeepAlive on client connections too.	2020-08-14 13:37:59 -04:00
Ivan Kozlovic	9288283d90	Fixed accept loops that could leave connections opened This was discovered with the test TestLeafNodeWithGatewaysServerRestart that was sometimes failing. Investigation showed that when cluster B was shut down, one of the servers on A that had a connection from B that just broke tried to reconnect (as part of reconnect retries of implicit gateways) to a server in B that was in the process of shutting down. The connection had been accepted but createGateway was not called because the server's running boolean had been set to false as part of the shutdown. However, the connection was not closed so the server on A had a valid connection to a dead server from cluster B. When the B cluster (now single server) was restarted and a LeafNode connection connected to it, then the gateway from B to A was created, that server on A did not create an outbound connection to that B server because it already had one (the zombie one). So this PR strengthens the starting of accept loops and also makes sure that if a connection (of any type) is not accepted because the server is shutting down, that connection is properly closed. Since all accept loops had almost the same code, made a generic function that accepts functions to call specific create connection functions. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-07-06 17:03:19 -06:00
Derek Collison	2b9e3e5b15	Merge pull request #1476 from nats-io/cluster_name Cluster names are now required.	2020-06-15 10:07:30 -07:00
Derek Collison	146d8f5dcb	Updates based on feedback, sped up some slow tests Signed-off-by: Derek Collison <derek@nats.io>	2020-06-12 17:26:43 -07:00
Derek Collison	dd61535e5a	Cluster names are now required. Added cluster names as required for prep work for clustered JetStream. System can dynamically pick a cluster name and settle on one even in large clusters. Signed-off-by: Derek Collison <derek@nats.io>	2020-06-12 15:48:38 -07:00
Ivan Kozlovic	d6de05f49a	Fixed a test with data race Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-06-12 13:04:05 -06:00
Ivan Kozlovic	b9bd5c2d35	Fixed flappers Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-06-09 15:34:52 -06:00
Derek Collison	2bd7553c71	System Account on by default. Most of the changes are to turn it off for tests that were watching subscriptions and such. Signed-off-by: Derek Collison <derek@nats.io>	2020-05-29 17:56:45 -07:00
Ivan Kozlovic	e976e63099	Fixing some flappers Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-05-25 06:58:23 -07:00
Ivan Kozlovic	5dba3cdd75	[FIXED] Race condition during implicit Gateway reconnection Say server in cluster A accepts a connection from a server in cluster B. The gateway is implicit, in that A does not have a configured remote gateway to B. Then the server in B is shut down, which A detects and initiates a single reconnect attempt (since it is implicit and if the reconnect retries is not set). While this happens, a new server in B is restarted and connects to A. If that happens before the initial reconnect attempt failed, A will register that new inbound and will not attempt to solicit because it already has a remote entry for gateway B. At this point when the reconnect to old server B fails, then the remote GW entry is removed, and A will not create an outbound connection to the new B server. We fix that by checking if there is a registered inbound when we get to the point of removing the remote on a failed implicit reconnect. If there is one, we try the reconnection. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-05-22 13:01:17 -06:00
Ivan Kozlovic	9715848a8e	[ADDED] Websocket support Websocket support can be enabled with a new websocket configuration block: ``` websocket { # Specify a host and port to listen for websocket connections # listen: "host:port" # It can also be configured with individual parameters, # namely host and port. # host: "hostname" # port: 4443 # This will optionally specify what host:port for websocket # connections to be advertised in the cluster # advertise: "host:port" # TLS configuration is required tls { cert_file: "/path/to/cert.pem" key_file: "/path/to/key.pem" } # If same_origin is true, then the Origin header of the # client request must match the request's Host. # same_origin: true # This list specifies the only accepted values for # the client's request Origin header. The scheme, # host and port must match. By convention, the # absence of port for an http:// scheme will be 80, # and for https:// will be 443. # allowed_origins [ # "http://www.example.com" # "https://www.other-example.com" # ] # This enables support for compressed websocket frames # in the server. For compression to be used, both server # and client have to support it. # compression: true # This is the total time allowed for the server to # read the client request and write the response back # to the client. This include the time needed for the # TLS handshake. # handshake_timeout: "2s" } ``` Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-05-20 11:14:39 -06:00
Derek Collison	0129a7fa09	Header support for GWs Signed-off-by: Derek Collison <derek@nats.io>	2020-05-19 14:33:56 -07:00
Derek Collison	a5dbc20e94	Fix for test failure Signed-off-by: Derek Collison <derek@nats.io>	2020-05-19 14:26:46 -07:00
Derek Collison	ea5e5bd364	Services rewrite #2 This contains a rewrite to the services layer for exporting and importing. The code this merges to already had a first significant rewrite that moved from special interest processing to plain subscriptions. This code changes the prior version's dealing with reverse mapping which was based mostly on thresholds and manual pruning, with some sporadic timer usage. This version uses the jetstream branch's code that understands interest and failed deliveries. So this code is much more tuned to reacting to interest changes. It also removes thresholds and goes only by interest changes or expirations based around a new service export property, response thresholds. This allows a service provider to provide semantics on how long a response should take at a maximum. This commit also introduces formal support for service export streamed and chunked response types send an empty message to signify EOF. This commit also includes additions to the service latency tracking such that errors are now sent, not only successful interactions. We have added a Status field and an optional Error fields to ServiceLatency. We support the following Status codes, these are directly from HTTP. 400 Bad Request (request did not have a reply subject) 408 Request Timeout (when system detects request interest went away, old request style to make dependable).. 503 Service Unavailable (no service responders running) 504 Service Timeout (The new response threshold expired) Signed-off-by: Derek Collison <derek@nats.io>	2020-05-19 14:26:46 -07:00
Derek Collison	df774e44b0	Rework how service imports are handled to avoid performance hits Signed-off-by: Derek Collison <derek@nats.io>	2020-05-19 14:18:34 -07:00
Ivan Kozlovic	fef94759ab	[FIXED] Update remote gateway URLs when node goes away in cluster If a node in the cluster goes away, an async INFO is sent to inbound gateway connections so they get a chance to update their list of remote gateway URLs. Same happens when a node is added to the cluster. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-04-20 13:48:47 -06:00
Ivan Kozlovic	8e4b449119	Fixed flappers Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-02-19 13:19:08 -07:00
Ivan Kozlovic	bd28a015b1	[FIXED] Sublist isSubsetMatch to handle empty tokens If a subject has empty tokens, returns false. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-01-14 18:28:14 -07:00
Ivan Kozlovic	c097357b52	[FIXED] More than expected switch to Interest-Only mode for account When an account is switched to interest-only mode due to no interest, it was not possible to switch that account more than once. But the function switchAccountToInterestMode() that triggers a switch could possibly do it more than once. This should not cause problems but increased the number of traces in a big super cluster. Also fixed some flappers and a data race. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2020-01-09 13:35:08 -07:00
Ivan Kozlovic	947798231b	[UPDATED] TCP Write and SlowConsumer handling - All writes will now be done by the writeLoop, unless when the writeLoop has not been started yet (likely in connection init). - Slow consumers for non CLIENT connections will be reported but not failed. The idea is that routes, gateway, etc.. connections should stay connected as much as possible. However if a flush operation times out and no data at all has been written, the connection will be closed (regardless of type). - Slow consumers due to max pending is only for CLIENT connections. This allows sending of SUBs through routes, etc.. to not have to be chunked. - The backpressure to CLIENT connections is increased (up to 1sec) based on the sub's connection pending bytes level. - Connection is flushed on close from the writeLoop as to not block the "fast path". Some tests have been fixed and adapted since now closeConnection() is not flushing/closing/removing connection in place. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-12-31 15:06:27 -07:00
Ivan Kozlovic	a22da91647	[FIXED] Closing of Gateway or Route TLS connection may hang This could happen if the remote server is running but not dequeueing from the socket. TLS connection Close() may send/read and so we need to protect with a deadline. For non client/leaf connection, do not call flushOutbound(). Set the write deadline regardless of handshakeComplete flag, and set it to a low value. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-12-04 17:27:00 -07:00
Ivan Kozlovic	a0f8bd112e	[FIXED] Prevent A- for account that has service reply subscription Prevent sending an A- for a given account if the server has this account registered and an internal service reply subscription. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-26 16:21:36 -07:00
Ivan Kozlovic	63138509f7	Tune some code/test for Windows Running the test suite on a Windows VM, I noticed several failures. Updated the computation of the RTT to be at least 1ns. I think that this is just an issue with the VM I am running, but that change will have no impact for normal situations (since it sets the rtt to the very minimum duration (1ns) instead of 0) and will prevent some tests from failing. Because of those same timer granularity issues, I had to add some delays between some actions in order for time.Sub()/Since() to actually report something more than 0. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-21 14:32:46 -07:00
Derek Collison	6ad8287bbe	Introduced wildcard handling of _R_ mapped replies. We had too much special processing, so reduced to a single wildcard which will propagate across routes and gateways and is consistent with gateway handling of globally routed subjects and timeouts. Signed-off-by: Derek Collison <derek@nats.io>	2019-11-16 12:50:53 -08:00
Ivan Kozlovic	b561bde366	Alternate approach to GW reply mapping expiration Use centralized sync map to gather *client that have GW replies. Tested with concurrent receiving clients and perf is as good as with timer per client but reduces need of that timer per client object. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-11 13:36:24 -07:00
Ivan Kozlovic	cacfb4a08c	Fix some gateway tests Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-08 19:07:57 -07:00
Ivan Kozlovic	9b7dab0548	Updates based on code review - Add atomic in client to skip check in processInboundClientMsg() if value is 0. Avoids getting the lock in fast path if not needed. - Have a timer per client instead of the global server list that was expiring: noticed a lot of contention there when running some perf/profiling tests. The timer is also not reset for every timestamp that is not yet expired since this too affects performance. Instead it fires at regular intervals and is cleared when the map is empty after a cycle. - Move processing of gw mapped reply into its own function (in inbound msg). I have verified that this is inlined the same way as when the code was directly in processInboundClientMsg. - Use string(subj[]) for prefix detection: I have verified that it is actually faster. - Builds the RMSG with appends to local buffer in handleGatewayReply() instead of using fmt.Sprintf(). Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-08 15:56:28 -07:00
Ivan Kozlovic	aa843945c9	Work on Gateways reply mapping - New prefix that includes origin server for the request - Mapping done if request is service import or requestor has recent subscription - Subscription considered recent if less than 250ms - Destination server strips GW prefix before giving to client and restores it when getting a reply on that subject - Mapping removed after 250ms - Server rejects client publish on "$GNR." (the new prefix) - Cluster and server hash are now 8 chars long and from the base 62 alphabet - Mapped replies need to be sent to leafnode servers due to race (cluster B sends RS+ on GW inbound then RMSG on outbound, the RS+ may be processed later and cluster A may have given message to LN before RS+ on reply subject. So LN needs to accept the mapped reply but will strip to give to client and reassemble before sending it back) Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-11-06 16:06:49 -07:00
Ivan Kozlovic	75ec78c232	[FIXED] Explicit gateway not using discovered URLs If cluster A configures a gateway to cluster B, the server on A tries to connect to that server URL. If there is no server on B at that address, but a server on B with different address connects to server on cluster A, that server should be able to create its outbound connection in response. That was not the case because the configured URLs were snapshot before the loop of trying to connect. When accepting an inbound connection and updating the array, this new URL was not being used. The issue is only if the server on A had no outbound connection at that time. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-10-24 16:40:38 -06:00
Ivan Kozlovic	77c63dbce1	Fix flappers Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-08-20 17:07:22 -06:00
Ivan Kozlovic	2f48ad5150	Fixed subscription close I noticed that TestNoRaceRoutedQueueAutoUnsubscribe started to fail a lot on Travis. Running locally I could see a 45 to 50% failure rate. After investigation I realized that the issue was that we had wrongly re-used `subscription.nm` and set it to -1 on unsubscribe. However, I believe that it was possible that when the subscription was closed, the server may have already picked that consumer for a delivery, which then caused nm==-1 to be bumped to 0, which was wrong. Commenting out the subscription.close() that sets nm to -1, I could not get the test to fail on macOS but would still get 7% failure on Linux VM. Adding the check to see if sub is closed in deliverMsg() completely erased the failures, even on Linux VM. We could still use `nm` set to -1 but check on deliverMsg(), the same way I use the closed int32 now. Fixed some flappers. Updated .travis.yml to failfast if one of the commands in the `script` fails. Use `set -e` and `set +e` as recommended in https://github.com/travis-ci/travis-ci/issues/1066 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-08-20 14:39:23 -06:00
Ivan Kozlovic	c20afd4016	[FIXED] Connection could be closed twice This was introduced in PR#930. The first commit had the route's check if the flushOutbound() returned false, and if so would locally unlock/lock the connection's lock. Unfortunately, this was replaced in the second commit (`a6aeed3a6b`) to the flushOutbound() function itself. This causes the function closeConnection() to possibly unlock the connection while calling flushOutbound(), which if the connection is closed due to both a tls timeout for instance and explicitly, it would result in the connection being scheduled for a reconnect (if explicit gateway connection, possibly route). Added defensive code in Gateway to register a unique outbound gateway. Fixed a test that was now failing with newer Go version in which they fixed url.Parse() Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-08-13 20:11:03 -06:00
Ivan Kozlovic	ed1901c792	Update go.mod to satisfy v2 requirements Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-06-03 19:45:47 -06:00
Ivan Kozlovic	37b3546e7b	Switch gateway to InterestMode only once When a leafnode connection is created, the server forces all gateway inbound connections to switch to InterestMode. Do this only once, regardless of how many times the LN (re)connects. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>	2019-05-30 17:21:15 -06:00

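Commit bd28a015b1 in the log above says that `isSubsetMatch` now returns false when a subject has empty tokens. A minimal sketch of such an empty-token check is shown below. It assumes dot-separated subjects, and `hasEmptyToken` is a hypothetical helper, not the function used in the server's sublist code.

```go
package main

import (
	"fmt"
	"strings"
)

// hasEmptyToken (hypothetical helper) reports whether a dot-separated
// subject contains an empty token, as in "foo..bar", ".foo" or "foo.".
// An empty subject is treated as a single empty token.
func hasEmptyToken(subject string) bool {
	for _, tok := range strings.Split(subject, ".") {
		if tok == "" {
			return true
		}
	}
	return false
}

func main() {
	for _, s := range []string{"foo.bar", "foo..bar", ".foo", "foo.", ""} {
		fmt.Printf("%q has empty token: %v\n", s, hasEmptyToken(s))
	}
}
```

A matcher could call a check like this first and return false early, which is the behavior the commit message describes.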

83 Commits