Commit Graph

123 Commits

Author SHA1 Message Date
Derek Collison
f13fa767c2 Remove the swapping of accounts during processing of service imports.
When processing service imports we would swap out the accounts during processing.
With the addition of internal subscriptions and internal clients publishing in JetStream we had an issue with the wrong account being used.
This was specific to delyaed pull subscribers trying to unsubscribe due to max of 1 while other JetStream API calls were running concurrently.
2021-07-26 07:57:10 -07:00
Derek Collison
1270977322 When receiving a response across a gateway that has headers and a globally routed subject (_GR_) we were dropping header information.
Signed-off-by: Derek Collison <derek@nats.io>
2021-06-10 14:29:33 -07:00
Matthias Hanel
b1dee292e6 [changed] pinned certs to check the server connected to as well (#2247)
* [changed] pinned certs to check the server connected to as well

on reload clients with removed pinned certs will be disconnected.
The check happens only on tls handshake now.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2021-05-24 17:28:32 -04:00
Matthias Hanel
6f6f22e9a7 [added] pinned_cert option to tls block hex(sha256(spki)) (#2233)
* [added] pinned_cert option to tls block hex(sha256(spki))

When read form config, the values are automatically lower cased.
The check when seeing the values programmatically requires 
lower case to avoid having to alter the map at this point.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2021-05-20 17:00:09 -04:00
Ivan Kozlovic
2881e4a1f0 [FIXED] MQTT fixes and improvements
Some issues that have been fixed would manifest by timeouts on
connect, unexpected memory usage on high publish message rate.

Some details:
- Replies were not always GW routed properly because we were looking
at the wrong connection's rsubs
- GW routed replies would not be found because they were tracked
in the subscription's client object, which may not be the same used
to send the reply
- Increased the mqtt timeout to wait for JS replies since in some
tests it was sometimes taking more than the original 2 seconds
- Incoming gateway messages destined for an MQTT internal subscription
may have been rejected as a no interest if the account had service imports
- Don't use time.After(), instead create explicit timer so it can
be stopped when not timing out.
- Unnecessary copy of a slice since we were converting to a string anyway.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-05-04 20:48:14 -06:00
Jaime Piña
e12181cb83 Return not ready for connection reason
Currently, we use ReadyForConnections in server tests to wait for the
server to be ready. However, when this fails we don't get a clue about
why it failed.

This change adds a new unexported method called readyForConnections that
returns an error describing which check failed. The exported
ReadyForConnections version works exactly as before. The unexported
version gets used in internal tests only.
2021-04-20 11:45:08 -07:00
Ivan Kozlovic
56d0d9ec87 Do not propagate service import interest across GW and ROUTES
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-04-15 11:34:36 -06:00
Derek Collison
8eefff2b3b Make sure the jetstream accounts use the name as the key to the map.
This prevents possible double adds under reload or restart scenarios.

Signed-off-by: Derek Collison <derek@nats.io>
2021-03-18 17:29:26 -07:00
Ivan Kozlovic
cbcff97244 [CHANGED] Move Gateway interest-only mode switch from INF to DBG
Also fixed a test that would sometimes fail depending on timing.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-03-14 11:34:36 -06:00
Ivan Kozlovic
27f51d4028 Fix ephemeral consumer delete in single cluster
Also remove retry of sources/mirror in the setSourceConsumer() itself
when not getting a response.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-03-10 15:16:31 -07:00
Ivan Kozlovic
e7e756034a Switch Gateway JS accounts to interest-only mode + some other fixes
- Fixed the close of a TLS connection which starting Go 1.16
set the deadline to 5 seconds.

- Fixed an issue with setHeader that was causing these error messages
```
=== RUN   TestServiceImportReplyMatchCycleMultiHops
nats: message could not decode headers on connection [4] for subscription on "foo"
--- PASS: TestServiceImportReplyMatchCycleMultiHops (0.04s)
```

- Fixed names of tests in norace_test.go since they must start with
TestNoRace in order to make sure that we execute them in Travis:
```
go test -v -run=TestNoRace --failfast -p=1 ./...
```

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-03-03 19:15:28 -07:00
Matthias Hanel
c50ee2a1c6 [Changed] all times exposed will be computed in UTC (#1943)
This also applies to times that end up in that json.
Where applicable moved time.Now() to where it is used.
Moved calls to .UTC() to where time is created it that time is converted
later anyway.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2021-03-02 21:37:42 -05:00
Ivan Kozlovic
1652fe62ef Updates to when do snapshot
Remove panic on runAsLeader when not able to subscribe (which happens
on shutdown)
Gateway name access does not need lock since it is immutable. Will
prevent deadlocks in some situations.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-02-23 19:06:07 -07:00
Derek Collison
bb58d455f6 Revert switching to interest only mode
Signed-off-by: Derek Collison <derek@nats.io>
2021-02-23 18:00:47 -08:00
Derek Collison
6d6a6c07ff Don't send empty subjects, always put system account in interest only
Signed-off-by: Derek Collison <derek@nats.io>
2021-02-23 10:57:12 -08:00
Derek Collison
fa8a74ceb5 Allow placement directives for metacontroller stepdown to allow placement to new clusters.
Signed-off-by: Derek Collison <derek@nats.io>
2021-02-19 10:55:22 -08:00
Ivan Kozlovic
61bd1b8d86 MQTT clustering
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-02-19 08:50:00 -07:00
Ivan Kozlovic
8598de6dbe [FIXED] Gateway's implicit connection not using global user/pass
If a gateway is configured with an authorization block containing
username and password and accepts an unknown Gateway connection,
when initiating the outbound connection, it should use the
gateway authorization's user/pass information.

Resolves #1912

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-02-16 10:06:06 -07:00
Derek Collison
6d32c307ef Remove pretty indent for json.
Signed-off-by: Derek Collison <derek@nats.io>
2021-02-06 20:09:44 -08:00
Ivan Kozlovic
2b8c6e0124 Support for Websocket Leafnode connections
Added two options in the remote leaf node configuration

- compress, for websocket only at the moment
- ws_masking, to force remote leafnode connections to mask websocket
frames (default is no masking since it is communication between
server to server)

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-01-28 13:13:11 -07:00
Ivan Kozlovic
131be1cb33 Make TLS client/server handshake helpers function
This reduces code duplication

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-01-28 13:13:11 -07:00
Ivan Kozlovic
ef38abe75b Fixed gateway reply mapping following changes in JetStream clustering
Those changes are required to maintain backward compatibility.
Since the replies are "_G_.<gateway name hash>.<server ID hash>"
and the hash were 6 characters long, changing to 8 the hash function
would break things.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-01-15 17:32:04 -07:00
Derek Collison
f0cdf89c61 JetStream Clustering WIP
Signed-off-by: Derek Collison <derek@nats.io>
2021-01-14 01:14:52 -08:00
Ivan Kozlovic
d24e9b75b3 Fixed GW implicit reconnection
PR #1412 had a fix for races during implicit GW reconnection.
However, the fix was a bit too simplistic in that it was checking
only if there was any inbound gateway to decide to try to reconnect
an implicit disconnected GW. We need to check the name, not only
presence of inbound GW connections.

Related to #1412

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-12-28 12:28:55 -07:00
Ivan Kozlovic
fc1521636c [FIXED] Config reload for gateways/leaf remote TLS configurations
Presence of TLS config in any remote gateway or leafnode would
cause the config reload to fail (because TLS config internal
content may change which fails the DeepEqual check).

This PR excludes the TLS configs in such case to check for
changes in gateways and leafnodes.

Although GW and LN config reload is technically supported, this
PR updates the internal remotes' TLS configuration so that
changes/updates to TLS certificates would take effect after
a configuration reload.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-12-11 16:56:25 -07:00
Ivan Kozlovic
ffd476357e [CHANGED] Gateway connections now always send PINGs
Connections normally suppress sending PINGs if there was some
activity. We now force GATEWAY connections to send PINGs at the
configured interval or 15 seconds, whichever is the smallest.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-11-03 13:13:09 -07:00
Ivan Kozlovic
2ad2bed170 [ADDED] Support for route hostname resolution
We previously simply called DialTimeout() on a route's url when
soliciting. If it resolved to the IP of the host, it would create
a route to self, which server detects, but then would not try again
with other IPs that would have allowed to form a cluster with
other servers running on the other IPs.

This PR keeps track of local IPs + cluster port and exclude them
from the list of IPs returned by LookupHost API. This even prevent
solicitation of routes to self. Only non-local IPs will be tried.

Resolves #1586

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-09-08 13:40:17 -06:00
Phil Pennock
3c680eceb9 Inhibit Go's default TCP keepalive settings for NATS (#1562)
Inhibit Go's default TCP keepalive settings for NATS

Go 1.13 changed the semantics of the tuning parameters for TCP keepalives, including the default value.  This affects all TCP listeners.  The NATS protocol has its own L7 keepalive system (PING/PONG) and the Go defaults are not a good fit for some valid deployment scenarios, while Go doesn't directly expose a working API for tuning these.

Rather than add a configuration knob and pull in another dependency (with portability issues) just disable TCP keepalives for all listeners used for speaking the NATS protocol.

Change the tests so we test the same logic.  Do not change HTTP monitoring, profiling, or the websocket API listeners.

Change KeepAlive on client connections too.
2020-08-14 13:37:59 -04:00
Ivan Kozlovic
b9764db478 Renamed gossipURLs type and moved its declaration to util.go
Also made the add/remove/getAsStringSlice receiver for this type.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-16 11:22:58 -06:00
Ivan Kozlovic
9b0967a5d1 [FIXED] Handling of gossiped URLs
If some servers in the cluster have the same connect URLs (due
to the use of client advertise), then it would be possible to
have a server sends the connect_urls INFO update to clients with
missing URLs.

Resolves #1515

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-15 17:39:12 -06:00
Ivan Kozlovic
4d495104de Fixed no_responders use of sendProtoNow()
The call sendProtoNow() should not normally be used (only when
setting up a connection when the writeloop is not yet started and
server needs to send something before being able to start the
writeLoop.

Instead, code should use enqueueProto(). For this particular
case though, use queueOutbound() directly and add to the
producer's pcd map.

Also fixed other places where we were using queueOutbound() +
flushSignal() which is what enqueueProto is doing.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-09 17:55:14 -06:00
Ivan Kozlovic
9288283d90 Fixed accept loops that could leave connections opened
This was discovered with the test TestLeafNodeWithGatewaysServerRestart
that was sometimes failing. Investigation showed that when cluster B
was shutdown, one of the server on A that had a connection from B
that just broke tried to reconnect (as part of reconnect retries of
implicit gateways) to a server in B that was in the process of shuting down.
The connection had been accepted but createGateway not called because
the server's running boolean had been set to false as part of the shutdown.
However, the connection was not closed so the server on A had a valid
connection to a dead server from cluster B. When the B cluster (now single
server) was restarted and a LeafNode connection connected to it, then
the gateway from B to A was created, that server on A did not create outbound
connection to that B server because it already had one (the zombie one).

So this PR strengthens the starting of accept loops and also make sure
that if a connection (all type of connections) is not accepted because
the server is shuting down, that connection is properly closed.

Since all accept loops had almost same code, made a generic function
that accept functions to call specific create connection functions.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-06 17:03:19 -06:00
Derek Collison
120402241a Fix for #1486
Signed-off-by: Derek Collison <derek@nats.io>
2020-06-18 21:04:34 -07:00
Derek Collison
4dee03b587 Allow mixed TLS and non-TLS on same port
Signed-off-by: Derek Collison <derek@nats.io>
2020-06-05 18:04:11 -07:00
Ivan Kozlovic
5dba3cdd75 [FIXED] Race condition during implicit Gateway reconnection
Say server in cluster A accepts a connection from a server in
cluster B.
The gateway is implicit, in that A does not have a configured
remote gateway to B.
Then the server in B is shutdown, which A detects and initiate
a single reconnect attempt (since it is implicit and if the
reconnect retries is not set).
While this happens, a new server in B is restarted and connects
to A. If that happens before the initial reconnect attempt
failed, A will register that new inbound and do not attempt to
solicit because it has already a remote entry for gateway B.
At this point when the reconnect to old server B fails, then
the remote GW entry is removed, and A will not create an outbound
connection to the new B server.

We fix that by checking if there is a registered inbound when
we get to the point of removing the remote on a failed implicit
reconnect. If there is one, we try the reconnection.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-05-22 13:01:17 -06:00
Derek Collison
0129a7fa09 Header support for GWs
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:33:56 -07:00
Derek Collison
ea5e5bd364 Services rewrite #2
This contains a rewrite to the services layer for exporting and importing. The code this merges to already had a first significant rewrite that moved from special interest processing to plain subscriptions.

This code changes the prior version's dealing with reverse mapping which was based mostly on thresholds and manual pruning, with some sporadic timer usage. This version uses the jetstream branch's code that understands interest and failed deliveries. So this code is much more tuned to reacting to interest changes. It also removes thresholds and goes only by interest changes or expirations based around a new service export property, response thresholds. This allows a service provider to provide semantics on how long a response should take at a maximum.

This commit also introduces formal support for service export streamed and chunked response types send an empty message to signify EOF.

This commit also includes additions to the service latency tracking such that errors are now sent, not only successful interactions. We have added a Status field and an optional Error fields to ServiceLatency.

We support the following Status codes, these are directly from HTTP.

400 Bad Request (request did not have a reply subject)
408 Request Timeout (when system detects request interest went away, old request style to make dependable)..
503 Service Unavailable (no service responders running)
504 Service Timeout (The new response threshold expired)

Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:26:46 -07:00
Derek Collison
df774e44b0 Rework how service imports are handled to avoid performance hits
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:18:34 -07:00
Derek Collison
8d1f3cc7c2 Allow JetStream consumers to work across multi-server hops
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:16:03 -07:00
Derek Collison
0c2d539b06 Remote request API
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:13:22 -07:00
Derek Collison
0fb7ee32bc Auto-expiration of ephemeral push based observables
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:07:02 -07:00
Ivan Kozlovic
fef94759ab [FIXED] Update remote gateway URLs when node goes away in cluster
If a node in the cluster goes away, an async INFO is sent to
inbound gateway connections so they get a chance to update their
list of remote gateway URLs. Same happens when a node is added
to the cluster.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-04-20 13:48:47 -06:00
Derek Collison
82f585d83a Updated to also resend leafnode connect on GW connect via first INFO
Signed-off-by: Derek Collison <derek@nats.io>
2020-04-08 09:55:19 -07:00
Matthias Hanel
6a1c3fc29b Moving inbound tracing to the caller (client.parse)
Tracing for outgoing operations is always done while
holding the client lock.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2020-03-04 17:31:18 -05:00
Matthias Hanel
fe373ac597 Incorporating comments.
c -> client
defer in oneliner
argument order

Signed-off-by: Matthias Hanel <mh@synadia.com>
2020-03-04 15:48:19 -05:00
Matthias Hanel
f5bd07b36c [FIXED] trace/debug/sys_log reload will affect existing clients
Fixed #1296, by altering client state on reload

Detect a trace level change on reload and update all clients.
To avoid data races, read client.trace while holding the lock,
pass the value into functionis that trace while not holding the lock.
Delete unused client.debug.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2020-03-04 13:54:15 -05:00
Ivan Kozlovic
47b08335a4 [FIXED] Reset of tlsName only for x509.HostnameError
For issue #1256, we cleared the possibly saved tlsName on Hanshake failure.
However, this meant that for normal use cases, if a reconnect failed for
any reason we would not be able to reconnect if it is an IP until we get
back to the URL that contained the hostname.

We now clear only if the handshake error is of x509.HostnameError type,
which include errors such as:
```
"x509: Common Name is not a valid hostname: <x>"
"x509: cannot validate certificate for <x> because it doesn't contain any IP SANs"
"x509: certificate is not valid for any names, but wanted to match <x>"
"x509: certificate is valid for <x>, not <y>"
```

Applied the same logic to solicited gateway connections, and fixed the fact
that the tlsConfig should be cloned (since we set the ServerName).

I have also made a change for leafnode connections similar to what we are
doing for gateway connections, which is to use the saved tlsName only if
tlsConfig.ServerName is empty, which may not be the case for users that
embed NATS Server and pass directly tls configuration. In other words,
if the option TLSConfig.ServerName is not empty, always use this value.

Relates to #1256

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-28 13:16:38 -07:00
Ivan Kozlovic
c097357b52 [FIXED] More than expected switch to Interest-Only mode for account
When an account is switched to interest-only mode due to no interest,
it was not possible to switch that account more than once. But the
function switchAccountToInterestMode() that triggers a switch could
possibly doing it more than once. This should not cause problems
but increased the number of traces in a big super cluster.

Also fixed some flappers and a data race.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-09 13:35:08 -07:00
Ivan Kozlovic
c73be88ac0 Updated based on comments
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-06 16:57:48 -07:00
Ivan Kozlovic
947798231b [UPDATED] TCP Write and SlowConsumer handling
- All writes will now be done by the writeLoop, unless when the
  writeLoop has not been started yet (likely in connection init).
- Slow consumers for non CLIENT connections will be reported but
  not failed. The idea is that routes, gateway, etc.. connections
  should stay connected as much as possible. However if a flush
  operation times out and no data at all has been written, the
  connection will be closed (regardless of type).
- Slow consumers due to max pending is only for CLIENT connections.
  This allows sending of SUBs through routes, etc.. to not have
  to be chunked.
- The backpressure to CLIENT connections is increased (up to 1sec)
  based on the sub's connection pending bytes level.
- Connection is flushed on close from the writeLoop as to not block
  the "fast path".

Some tests have been fixed and adapted since now closeConnection()
is not flushing/closing/removing connection in place.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-31 15:06:27 -07:00