Commit Graph

232 Commits

Author SHA1 Message Date
Ivan Kozlovic
0f53bf6580 Fixed data race with nodeInfo
Took the approach of storing a struct instead of a pointer. Of course,
when changing the offline bool from false to true, it means that
we need to call Store again (with the same key).

This is based on the assumption that those Load/Store calls are not too
frequent. Otherwise, we may need to use locking (and keep *nodeInfo).

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-03-03 13:28:45 -07:00
Matthias Hanel
4f2db7d187 Fixed linter issues
Signed-off-by: Matthias Hanel <mh@synadia.com>
2021-03-02 20:21:44 -05:00
Derek Collison
1c79d96de8 Use single node info struct
Signed-off-by: Derek Collison <derek@nats.io>
2021-02-06 20:10:29 -08:00
Derek Collison
a1e0f7dc1a First pass at supercluster enablement.
This allows metacontrollers to span superclusters. Also includes placement directives for streams. By default they select the request origin cluster.

Signed-off-by: Derek Collison <derek@nats.io>
2021-02-03 17:28:13 -08:00
Ivan Kozlovic
2b8c6e0124 Support for Websocket Leafnode connections
Added two options in the remote leaf node configuration

- compress, for websocket only at the moment
- ws_masking, to force remote leafnode connections to mask websocket
frames (default is no masking since it is communication between
server to server)

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-01-28 13:13:11 -07:00
Ivan Kozlovic
131be1cb33 Make TLS client/server handshake helper functions
This reduces code duplication

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-01-28 13:13:11 -07:00
Derek Collison
a1730f1b31 Report on RAFT group information.
This adds optional reporting to stream and consumer info when running in clustered mode.

Signed-off-by: Derek Collison <derek@nats.io>
2021-01-20 11:58:31 -08:00
Ivan Kozlovic
42dcdd2eb2 Simplify sendSubsToRoute()
Since we were creating subs on the fly, sub.im would always be nil.
We were passing a client because it was needed in sendRouteSubOrUnSubProtos().

This PR simply fills the buffer with each account's subscriptions.
There is also no need to have subs sent from a different goroutine
based on some threshold. Routes are no longer subject to max pending.

Some code has been moved into a function so that it can be shared
by sendSubsToRoute() and sendRouteSubOrUnSubProtos(). The function
simply appends the RS+/- protocol to the given buffer.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-01-19 14:01:43 -07:00
Ivan Kozlovic
ef38abe75b Fixed gateway reply mapping following changes in JetStream clustering
Those changes are required to maintain backward compatibility.
Since the replies are "_G_.<gateway name hash>.<server ID hash>"
and the hashes were 6 characters long, changing the hash function to
produce 8 characters would break things.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-01-15 17:32:04 -07:00
Derek Collison
f0cdf89c61 JetStream Clustering WIP
Signed-off-by: Derek Collison <derek@nats.io>
2021-01-14 01:14:52 -08:00
Ivan Kozlovic
67425d23c8 Add c.isMqtt() and c.isWebsocket()
This hides the check on "c.mqtt != nil" or "c.ws != nil".
Added some tests.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-12-02 15:52:06 -07:00
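A small sketch of the accessor pattern from 67425d23c8. The `client`, `mqttState` and `wsState` types below are trimmed-down, hypothetical stand-ins; only the idea of hiding the nil checks behind methods comes from the commit message.

```go
package main

import "fmt"

// Hypothetical, reduced types for illustration only.
type mqttState struct{}
type wsState struct{}

type client struct {
	mqtt *mqttState // non-nil only for MQTT connections
	ws   *wsState   // non-nil only for websocket connections
}

// isMqtt hides the "c.mqtt != nil" check from callers.
func (c *client) isMqtt() bool { return c.mqtt != nil }

// isWebsocket hides the "c.ws != nil" check from callers.
func (c *client) isWebsocket() bool { return c.ws != nil }

func main() {
	c := &client{ws: &wsState{}}
	fmt.Println(c.isMqtt(), c.isWebsocket()) // prints "false true"
}
```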
Ivan Kozlovic
77aead807c Send LS- without origin to route
When cluster origin code was added, a server may send LS+ with
an origin cluster name in the protocol. Parsing code from a ROUTER
connection was adjusted to understand this LS+ protocol.
However, the server was also sending an LS- with origin but the
parsing code was not able to understand that. When the unsub was
for a queue subscription, this would cause the parser to error out
and close the route connection.

This PR sends an LS- without the origin in this case (so that tracing
makes sense in terms of LS+/LS- sent to a route). The receiving side
then traces the appropriate LS- but processes it as a normal RS-.

Resolves #1751

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-11-30 13:31:32 -07:00
Ivan Kozlovic
13df1a55fd Changed warning message
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-10-09 09:36:30 -06:00
Ivan Kozlovic
df9d5f5fd9 Accepting route warns if remote server has same name
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-10-08 17:59:33 -06:00
Matthias Hanel
634ce9f7c8 [Adding] Accountz monitoring endpoint and INFO monitoring req subject
Returned imports/exports are formatted like JWT exports/imports, even if
the originating account is from config.

Fixes #1604

Signed-off-by: Matthias Hanel <mh@synadia.com>
2020-09-23 16:22:48 -04:00
Ivan Kozlovic
2ad2bed170 [ADDED] Support for route hostname resolution
We previously simply called DialTimeout() on a route's URL when
soliciting. If it resolved to the IP of the host, it would create
a route to self, which the server detects, but it would then not try again
with other IPs that would have allowed it to form a cluster with
other servers running on the other IPs.

This PR keeps track of local IPs + cluster port and excludes them
from the list of IPs returned by the LookupHost API. This even prevents
solicitation of routes to self. Only non-local IPs will be tried.

Resolves #1586

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-09-08 13:40:17 -06:00
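The filtering step described in 2ad2bed170 could look roughly like the sketch below. `filterLocal` and the `localAddrs` set are hypothetical names; in a server the resolved list would come from `net.LookupHost` on the route's hostname, but it is passed in here so the example runs without DNS.

```go
package main

import (
	"fmt"
	"net"
)

// filterLocal turns resolved IPs into "ip:port" addresses and skips any
// that match one of the server's own cluster listen addresses.
// localAddrs is a hypothetical set keyed by "ip:port".
func filterLocal(resolved []string, port string, localAddrs map[string]struct{}) []string {
	var out []string
	for _, ip := range resolved {
		addr := net.JoinHostPort(ip, port)
		if _, isLocal := localAddrs[addr]; isLocal {
			continue // would be a route to self
		}
		out = append(out, addr)
	}
	return out
}

func main() {
	local := map[string]struct{}{"10.0.0.1:6222": {}}
	resolved := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	fmt.Println(filterLocal(resolved, "6222", local)) // prints "[10.0.0.2:6222 10.0.0.3:6222]"
}
```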
Phil Pennock
3c680eceb9 Inhibit Go's default TCP keepalive settings for NATS (#1562)
Go 1.13 changed the semantics of the tuning parameters for TCP keepalives, including the default value.  This affects all TCP listeners.  The NATS protocol has its own L7 keepalive system (PING/PONG) and the Go defaults are not a good fit for some valid deployment scenarios, while Go doesn't directly expose a working API for tuning these.

Rather than add a configuration knob and pull in another dependency (with portability issues) just disable TCP keepalives for all listeners used for speaking the NATS protocol.

Change the tests so we test the same logic.  Do not change HTTP monitoring, profiling, or the websocket API listeners.

Change KeepAlive on client connections too.
2020-08-14 13:37:59 -04:00
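For 3c680eceb9, one way to disable TCP keepalives with the standard library is a negative `KeepAlive` on `net.ListenConfig`. Whether the server uses exactly this form is an assumption; `listenNoKeepAlive` is a hypothetical helper.

```go
package main

import (
	"context"
	"fmt"
	"net"
)

// listenNoKeepAlive opens a TCP listener whose accepted connections do not
// get OS-level keepalives. A negative KeepAlive value disables them.
func listenNoKeepAlive(addr string) (net.Listener, error) {
	lc := net.ListenConfig{KeepAlive: -1}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	l, err := listenNoKeepAlive("127.0.0.1:0")
	if err != nil {
		fmt.Println("listen error:", err)
		return
	}
	defer l.Close()
	fmt.Println("listening on", l.Addr())
}
```

The same negative value on `net.Dialer.KeepAlive` covers outgoing connections.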
Ivan Kozlovic
c620175353 Rework closeConnection()
This change allows the removal of the connection and the update of
the server state to be done "in place" while still delaying the flushing
and closing of the TCP connection to the writeLoop. With ref counting
we ensure that the reconnect happens after the flushing but not
before the state has been updated.

Had to fix some places where we may have called closeConnection()
while holding the server lock, since it would now deadlock for sure.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-31 15:30:17 -06:00
Ivan Kozlovic
96ccf91566 [FIXED] Possible deadlock with solicited leafnodes when cluster conflict
We cannot call c.closeConnection() under the server lock because
closeConnection() can invoke server lock in some cases.

Created a test that should run without `-race` to reproduce the deadlock
(which it does) but sometimes would fail because the cluster would not be
formed. This uncovered an issue with conflict resolution which
test TestRouteClusterNameConflictBetweenStaticAndDynamic() can reproduce
easily. The issue was that we were not updating a dynamic name with
the remote's name if the remote's name was not dynamic.

Resolves #1543

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-30 18:45:36 -06:00
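The locking rule behind 96ccf91566 can be illustrated with a simplified sketch: collect the connections while holding the server lock, release it, and only then close them. The `server` and `client` types and the `closeAll` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical, simplified types.
type client struct {
	id  int
	srv *server
}

type server struct {
	mu      sync.Mutex
	clients map[int]*client
}

// closeConnection takes the server lock to unregister the client, so it
// must never be called while that lock is already held.
func (c *client) closeConnection() {
	c.srv.mu.Lock()
	delete(c.srv.clients, c.id)
	c.srv.mu.Unlock()
}

// closeAll snapshots the clients under the lock, releases it, then closes
// them. Calling closeConnection inside the locked section would deadlock
// because sync.Mutex is not reentrant.
func (s *server) closeAll() int {
	s.mu.Lock()
	toClose := make([]*client, 0, len(s.clients))
	for _, c := range s.clients {
		toClose = append(toClose, c)
	}
	s.mu.Unlock()

	for _, c := range toClose {
		c.closeConnection()
	}
	return len(toClose)
}

func main() {
	s := &server{clients: map[int]*client{}}
	for i := 1; i <= 3; i++ {
		s.clients[i] = &client{id: i, srv: s}
	}
	fmt.Println(s.closeAll(), len(s.clients)) // prints "3 0"
}
```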
Matthias Hanel
3da66ad80d Remove unnecessary account fetch from remote remove functions
Changed: removeReplySub, removeRemoteSubs and processRemoteUnsub

Signed-off-by: Matthias Hanel <mh@synadia.com>
2020-07-28 11:00:17 -04:00
Matthias Hanel
946e8415a0 Incorporating review comments
2020-07-27 19:19:43 -04:00
Matthias Hanel
00faefec06 Reduce usage of tmpAccounts to the only location where it is needed: imports
On import, handle it with priority, as in non-recursive situations it
won't be present.
2020-07-27 17:38:39 -04:00
Matthias Hanel
37692d2cf9 [Fixed] Skip fetch when a non config based account resolver is used
Resolves #1532

Instead of the fetched account we create a dummy account that is
expired. Any client connecting will trigger a fetch of the actual
account jwt.

This also avoids one fetch, thus the unit test was changed to reflect
this.
Unlike other resolvers, the memory resolver does not depend on external
systems. It is purely based on server configuration. Therefore, the fetch
can be done, and not finding an account means there is a configuration issue.
2020-07-27 17:36:55 -04:00
Ivan Kozlovic
9b0967a5d1 [FIXED] Handling of gossiped URLs
If some servers in the cluster have the same connect URLs (due
to the use of client advertise), then it would be possible for
a server to send the connect_urls INFO update to clients with
missing URLs.

Resolves #1515

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-15 17:39:12 -06:00
Ivan Kozlovic
4d495104de Fixed no_responders use of sendProtoNow()
The call sendProtoNow() should not normally be used (only when
setting up a connection, when the writeLoop is not yet started and
the server needs to send something before being able to start the
writeLoop).

Instead, code should use enqueueProto(). For this particular
case though, use queueOutbound() directly and add to the
producer's pcd map.

Also fixed other places where we were using queueOutbound() +
flushSignal() which is what enqueueProto is doing.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-09 17:55:14 -06:00
Ivan Kozlovic
9288283d90 Fixed accept loops that could leave connections opened
This was discovered with the test TestLeafNodeWithGatewaysServerRestart
that was sometimes failing. Investigation showed that when cluster B
was shut down, one of the servers on A that had a connection from B
that just broke tried to reconnect (as part of reconnect retries of
implicit gateways) to a server in B that was in the process of shutting down.
The connection had been accepted but createGateway not called because
the server's running boolean had been set to false as part of the shutdown.
However, the connection was not closed so the server on A had a valid
connection to a dead server from cluster B. When the B cluster (now single
server) was restarted and a LeafNode connection connected to it, the
gateway from B to A was created, but that server on A did not create an outbound
connection to that B server because it already had one (the zombie one).

So this PR strengthens the starting of accept loops and also makes sure
that if a connection (of any type) is not accepted because
the server is shutting down, that connection is properly closed.

Since all accept loops had almost the same code, made a generic function
that accepts functions to call specific create connection functions.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-06 17:03:19 -06:00
Derek Collison
6c805eebc7 Properly support leafnode clusters.
Leafnodes that formed clusters were partially supported. This adds proper support for origin cluster, subscription suppression and data message no echo for the origin cluster.

Signed-off-by: Derek Collison <derek@nats.io>
2020-06-26 09:03:22 -07:00
Derek Collison
c7e4d8b194 Avoid data race on cluster name
Signed-off-by: Derek Collison <derek@nats.io>
2020-06-18 13:17:50 -07:00
Derek Collison
1e52a1007b More updates based on feedback
Signed-off-by: Derek Collison <derek@nats.io>
2020-06-13 08:00:57 -07:00
Derek Collison
146d8f5dcb Updates based on feedback, sped up some slow tests
Signed-off-by: Derek Collison <derek@nats.io>
2020-06-12 17:26:43 -07:00
Derek Collison
dd61535e5a Cluster names are now required.
Added cluster names as required for prep work for clustered JetStream. System can dynamically pick a cluster name and settle on one even in large clusters.

Signed-off-by: Derek Collison <derek@nats.io>
2020-06-12 15:48:38 -07:00
Derek Collison
4dee03b587 Allow mixed TLS and non-TLS on same port
Signed-off-by: Derek Collison <derek@nats.io>
2020-06-05 18:04:11 -07:00
Ivan Kozlovic
dc0f688cbf [FIXED] LameDuckMode sends INFO to clients
Also send an INFO to routes so that the remotes can remove the
LDM server's client URLs and notify their own clients of this
change.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-05-21 12:15:20 -06:00
Ivan Kozlovic
9715848a8e [ADDED] Websocket support
Websocket support can be enabled with a new websocket
configuration block:

```
websocket {
    # Specify a host and port to listen for websocket connections
    # listen: "host:port"

    # It can also be configured with individual parameters,
    # namely host and port.
    # host: "hostname"
    # port: 4443

    # This optionally specifies the host:port for websocket
    # connections to be advertised in the cluster
    # advertise: "host:port"

    # TLS configuration is required
    tls {
      cert_file: "/path/to/cert.pem"
      key_file: "/path/to/key.pem"
    }

    # If same_origin is true, then the Origin header of the
    # client request must match the request's Host.
    # same_origin: true

    # This list specifies the only accepted values for
    # the client's request Origin header. The scheme,
    # host and port must match. By convention, the
    # absence of port for an http:// scheme will be 80,
    # and for https:// will be 443.
    # allowed_origins [
    #    "http://www.example.com"
    #    "https://www.other-example.com"
    # ]

    # This enables support for compressed websocket frames
    # in the server. For compression to be used, both server
    # and client have to support it.
    # compression: true

    # This is the total time allowed for the server to
    # read the client request and write the response back
    # to the client. This includes the time needed for the
    # TLS handshake.
    # handshake_timeout: "2s"
}
```

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-05-20 11:14:39 -06:00
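The `allowed_origins` rule from the configuration comments in 9715848a8e (scheme, host and port must match, with 80 and 443 as default ports) can be sketched as follows. `normalizeOrigin` and `originAllowed` are hypothetical helpers, not the server's functions, and host comparison here is a plain case-sensitive string match.

```go
package main

import (
	"fmt"
	"net/url"
)

// normalizeOrigin parses an Origin value and returns its scheme, host and
// port, filling in 80 for http and 443 for https when no port is given.
func normalizeOrigin(origin string) (scheme, host, port string, err error) {
	u, err := url.Parse(origin)
	if err != nil {
		return "", "", "", err
	}
	scheme, host, port = u.Scheme, u.Hostname(), u.Port()
	if port == "" {
		switch scheme {
		case "http":
			port = "80"
		case "https":
			port = "443"
		}
	}
	return scheme, host, port, nil
}

// originAllowed reports whether origin matches any allowed entry on
// scheme, host and port.
func originAllowed(origin string, allowed []string) bool {
	s, h, p, err := normalizeOrigin(origin)
	if err != nil {
		return false
	}
	for _, a := range allowed {
		as, ah, ap, err := normalizeOrigin(a)
		if err == nil && as == s && ah == h && ap == p {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"http://www.example.com", "https://www.other-example.com"}
	fmt.Println(originAllowed("http://www.example.com:80", allowed)) // prints "true"
	fmt.Println(originAllowed("https://www.example.com", allowed))   // prints "false"
}
```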
Derek Collison
019c105ca7 Updates based on feedback, more tests, few bug fixes
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:33:06 -07:00
Derek Collison
f5ceab339a Server support for headers between routes
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:33:06 -07:00
Derek Collison
ea5e5bd364 Services rewrite #2
This contains a rewrite to the services layer for exporting and importing. The code this merges to already had a first significant rewrite that moved from special interest processing to plain subscriptions.

This code changes the prior version's dealing with reverse mapping which was based mostly on thresholds and manual pruning, with some sporadic timer usage. This version uses the jetstream branch's code that understands interest and failed deliveries. So this code is much more tuned to reacting to interest changes. It also removes thresholds and goes only by interest changes or expirations based around a new service export property, response thresholds. This allows a service provider to provide semantics on how long a response should take at a maximum.

This commit also introduces formal support for service export streamed and chunked response types, which send an empty message to signify EOF.

This commit also includes additions to the service latency tracking such that errors are now sent, not only successful interactions. We have added a Status field and an optional Error field to ServiceLatency.

We support the following Status codes, which are taken directly from HTTP:

400 Bad Request (request did not have a reply subject)
408 Request Timeout (when system detects request interest went away, old request style to make dependable)
503 Service Unavailable (no service responders running)
504 Service Timeout (the new response threshold expired)

Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:26:46 -07:00
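As an illustration of the Status and Error fields mentioned in ea5e5bd364, the sketch below uses a reduced, hypothetical `serviceLatency` type. Filling Error with the status text is an assumption; the commit message only says the field is optional.

```go
package main

import "fmt"

// serviceLatency is a hypothetical, reduced record holding only the two
// fields named in the commit message.
type serviceLatency struct {
	Status int
	Error  string // optional, empty on success
}

// statusText holds the failure codes listed in the commit message.
var statusText = map[int]string{
	400: "Bad Request",
	408: "Request Timeout",
	503: "Service Unavailable",
	504: "Service Timeout",
}

// failed builds a latency record for a failed request.
func failed(code int) serviceLatency {
	return serviceLatency{Status: code, Error: statusText[code]}
}

func main() {
	fmt.Printf("%+v\n", failed(503)) // prints "{Status:503 Error:Service Unavailable}"
}
```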
Derek Collison
df774e44b0 Rework how service imports are handled to avoid performance hits
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:18:34 -07:00
Derek Collison
8d1f3cc7c2 Allow JetStream consumers to work across multi-server hops
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:16:03 -07:00
Ivan Kozlovic
fef94759ab [FIXED] Update remote gateway URLs when node goes away in cluster
If a node in the cluster goes away, an async INFO is sent to
inbound gateway connections so they get a chance to update their
list of remote gateway URLs. Same happens when a node is added
to the cluster.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-04-20 13:48:47 -06:00
Matthias Hanel
6a1c3fc29b Moving inbound tracing to the caller (client.parse)
Tracing for outgoing operations is always done while
holding the client lock.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2020-03-04 17:31:18 -05:00
Matthias Hanel
fe373ac597 Incorporating comments.
c -> client
defer in oneliner
argument order

Signed-off-by: Matthias Hanel <mh@synadia.com>
2020-03-04 15:48:19 -05:00
Matthias Hanel
f5bd07b36c [FIXED] trace/debug/sys_log reload will affect existing clients
Fixed #1296, by altering client state on reload

Detect a trace level change on reload and update all clients.
To avoid data races, read client.trace while holding the lock,
pass the value into functions that trace while not holding the lock.
Delete unused client.debug.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2020-03-04 13:54:15 -05:00
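The pattern from f5bd07b36c, reading the trace flag under the lock and passing the value on, might look like this in a simplified form. The `client` type, `setTrace`, `traceOp` and the trace string format are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical, simplified client whose trace flag can be
// changed by a configuration reload.
type client struct {
	mu    sync.Mutex
	trace bool // guarded by mu
}

// setTrace is what a reload would call for every existing client.
func (c *client) setTrace(on bool) {
	c.mu.Lock()
	c.trace = on
	c.mu.Unlock()
}

// traceOp never reads c.trace; it relies on the value passed in, so it is
// safe to call without holding the lock.
func traceOp(trace bool, op string) string {
	if !trace {
		return ""
	}
	return "<<- [" + op + "]"
}

// parse snapshots the flag under the lock, then traces outside of it.
func (c *client) parse(op string) string {
	c.mu.Lock()
	trace := c.trace
	c.mu.Unlock()
	return traceOp(trace, op)
}

func main() {
	c := &client{}
	fmt.Printf("%q\n", c.parse("PING")) // prints ""
	c.setTrace(true)
	fmt.Printf("%q\n", c.parse("PING")) // prints "<<- [PING]"
}
```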
Ivan Kozlovic
c73be88ac0 Updated based on comments
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-06 16:57:48 -07:00
Ivan Kozlovic
947798231b [UPDATED] TCP Write and SlowConsumer handling
- All writes will now be done by the writeLoop, except when the
  writeLoop has not been started yet (likely in connection init).
- Slow consumers for non CLIENT connections will be reported but
  not failed. The idea is that routes, gateways, etc. connections
  should stay connected as much as possible. However if a flush
  operation times out and no data at all has been written, the
  connection will be closed (regardless of type).
- Slow consumers due to max pending are only for CLIENT connections.
  This allows sending of SUBs through routes, etc. to not have
  to be chunked.
- The backpressure to CLIENT connections is increased (up to 1sec)
  based on the sub's connection pending bytes level.
- Connection is flushed on close from the writeLoop so as not to block
  the "fast path".

Some tests have been fixed and adapted since now closeConnection()
is not flushing/closing/removing connection in place.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-31 15:06:27 -07:00
Ivan Kozlovic
a22da91647 [FIXED] Closing of Gateway or Route TLS connection may hang
This could happen if the remote server is running but not dequeueing
from the socket. TLS connection Close() may send/read and so we
need to protect with a deadline.

For non client/leaf connections, do not call flushOutbound().
Set the write deadline regardless of handshakeComplete flag, and
set it to a low value.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-04 17:27:00 -07:00
Derek Collison
6ad8287bbe Introduced wildcard handling of _R_ mapped replies.
We had too much special processing, so reduced to a single wildcard
which will propagate across routes and gateways and is consistent
with gateway handling of globally routed subjects and timeouts.

Signed-off-by: Derek Collison <derek@nats.io>
2019-11-16 12:50:53 -08:00
Ivan Kozlovic
d85f9a9388 Fixed bug with duplicate route and GW replies
When a duplicate route is detected and closed, we need to clear
the route's hash in order to prevent the removal from the
server's routeByHash map.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-15 17:24:50 -07:00
Ivan Kozlovic
aa843945c9 Work on Gateways reply mapping
- New prefix that includes origin server for the request
- Mapping done if request is service import or requestor has
  recent subscription
- Subscription considered recent if less than 250ms
- Destination server strips GW prefix before giving to client
  and restores it when getting a reply on that subject
- Mapping removed after 250ms
- Server rejects client publish on "$GNR." (the new prefix)
- Cluster and server hash are now 8 chars long and from base 62
  alphabets
- Mapped replies need to be sent to leafnode servers due to race
  (cluster B sends RS+ on GW inbound then RMSG on outbound, the
  RS+ may be processed later and cluster A may have given message
  to LN before RS+ on reply subject. So LN needs to accept the
  mapped reply but will strip to give to client and reassemble
  before sending it back)

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-06 16:06:49 -07:00
Ivan Kozlovic
d20f76cbaa Merge pull request #1166 from nats-io/add_servername_to_routestat
[ADDED] Server name in the RouteStat for statsz
2019-10-28 13:19:53 -06:00