Took the approach of storing struct instead of pointer. Of course,
when changing the offline bool from false to true, it means that
we need to call Store again (with same key).
This is based on the assumption that those Load/Store are not too
frequent. Otherwise, we may need to use locking (and keep *nodeInfo)
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This allows metacontrollers to span superclusters. Also includes placement directives for streams. By default they select the request origin cluster.
Signed-off-by: Derek Collison <derek@nats.io>
Added two options in the remote leaf node configuration
- compress, for websocket only at the moment
- ws_masking, to force remote leafnode connections to mask websocket
frames (default is no masking since it is communication between
server to server)
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Since we were creating subs on the fly, sub.im would always be nil.
We were passing a client because it was needed in sendRouteSubOrUnSubProtos().
This PR simply fills the buffer with each account's subscriptions.
There is also no need to have subs sent from different go routine
based on some threshold. Routes are no longer subject to max pending.
Some code has been made into a function so that they can be shared
by sendSubsToRoute() and sendRouteSubOrUnSubProtos(). The function
is simply adding to given buffer the RS+/- protocol.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Those changes are required to maintain backward compatibility.
Since the replies are "_G_.<gateway name hash>.<server ID hash>"
and the hash were 6 characters long, changing to 8 the hash function
would break things.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
When cluster origin code was added, a server may send LS+ with
an origin cluster name in the protocol. Parsing code from a ROUTER
connection was adjusted to understand this LS+ protocol.
However, the server was also sending an LS- with origin but the
parsing code was not able to understand that. When the unsub was
for a queue subscription, this would cause the parser to error out
and close the route connection.
This PR sends an LS- without the origin in this case (so that tracing
makes sense in term of LS+/LS- sent to a route). The receiving side
then traces appropriate LS- but processes as a normal RS-.
Resolves#1751
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Returned imports/exports are formated like jwt exports imports, even if
they originating account is from config.
Fixes#1604
Signed-off-by: Matthias Hanel <mh@synadia.com>
We previously simply called DialTimeout() on a route's url when
soliciting. If it resolved to the IP of the host, it would create
a route to self, which server detects, but then would not try again
with other IPs that would have allowed to form a cluster with
other servers running on the other IPs.
This PR keeps track of local IPs + cluster port and exclude them
from the list of IPs returned by LookupHost API. This even prevent
solicitation of routes to self. Only non-local IPs will be tried.
Resolves#1586
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Inhibit Go's default TCP keepalive settings for NATS
Go 1.13 changed the semantics of the tuning parameters for TCP keepalives, including the default value. This affects all TCP listeners. The NATS protocol has its own L7 keepalive system (PING/PONG) and the Go defaults are not a good fit for some valid deployment scenarios, while Go doesn't directly expose a working API for tuning these.
Rather than add a configuration knob and pull in another dependency (with portability issues) just disable TCP keepalives for all listeners used for speaking the NATS protocol.
Change the tests so we test the same logic. Do not change HTTP monitoring, profiling, or the websocket API listeners.
Change KeepAlive on client connections too.
This change allows the removal of the connection and update of
the server state to be done "in place" but still delay the flushing
of and close of tcp connection to the writeLoop. With ref counting
we ensure that the reconnect happens after the flushing but not
before the state has been updated.
Had to fix some places where we may have called closeConnection()
from under the server lock since it now would deadlock for sure.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
We cannot call c.closeConnection() under the server lock because
closeConnection() can invoke server lock in some cases.
Created a test that should run without `-race` to reproduce the deadlock
(which it does) but sometimes would fail because cluster would not be
formed. This unconvered an issue with conflict resolution which
test TestRouteClusterNameConflictBetweenStaticAndDynamic() can reproduce
easily. The issue was that we were not updating a dynamic name with
the remote if the remote was non dynamic.
Resolves#1543
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Resolves#1532
Instead of the fetched account we create a dummy account that is
expired. Any client connecting will trigger a fetch of the actual
account jwt.
This also avoids one fetch, thus the unit test was changed to reflect
this.
Unlike other resolver the memory resolver does not depend on external
systems. It is purely based on server configuration. Therefore, fetch
can be done and not finding an account means there is a configuration issue.
If some servers in the cluster have the same connect URLs (due
to the use of client advertise), then it would be possible to
have a server sends the connect_urls INFO update to clients with
missing URLs.
Resolves#1515
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
The call sendProtoNow() should not normally be used (only when
setting up a connection when the writeloop is not yet started and
server needs to send something before being able to start the
writeLoop.
Instead, code should use enqueueProto(). For this particular
case though, use queueOutbound() directly and add to the
producer's pcd map.
Also fixed other places where we were using queueOutbound() +
flushSignal() which is what enqueueProto is doing.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This was discovered with the test TestLeafNodeWithGatewaysServerRestart
that was sometimes failing. Investigation showed that when cluster B
was shutdown, one of the server on A that had a connection from B
that just broke tried to reconnect (as part of reconnect retries of
implicit gateways) to a server in B that was in the process of shuting down.
The connection had been accepted but createGateway not called because
the server's running boolean had been set to false as part of the shutdown.
However, the connection was not closed so the server on A had a valid
connection to a dead server from cluster B. When the B cluster (now single
server) was restarted and a LeafNode connection connected to it, then
the gateway from B to A was created, that server on A did not create outbound
connection to that B server because it already had one (the zombie one).
So this PR strengthens the starting of accept loops and also make sure
that if a connection (all type of connections) is not accepted because
the server is shuting down, that connection is properly closed.
Since all accept loops had almost same code, made a generic function
that accept functions to call specific create connection functions.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Leafnodes that formed clusters were partially supported. This adds proper support for origin cluster, subscription suppression and data message no echo for the origin cluster.
Signed-off-by: Derek Collison <derek@nats.io>
Added cluster names as required for prep work for clustered JetStream. System can dynamically pick a cluster name and settle on one even in large clusters.
Signed-off-by: Derek Collison <derek@nats.io>
Also send an INFO to routes so that the remotes can remove the
LDM's server client URLs and notify their own clients of this
change.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Websocket support can be enabled with a new websocket
configuration block:
```
websocket {
# Specify a host and port to listen for websocket connections
# listen: "host:port"
# It can also be configured with individual parameters,
# namely host and port.
# host: "hostname"
# port: 4443
# This will optionally specify what host:port for websocket
# connections to be advertised in the cluster
# advertise: "host:port"
# TLS configuration is required
tls {
cert_file: "/path/to/cert.pem"
key_file: "/path/to/key.pem"
}
# If same_origin is true, then the Origin header of the
# client request must match the request's Host.
# same_origin: true
# This list specifies the only accepted values for
# the client's request Origin header. The scheme,
# host and port must match. By convention, the
# absence of port for an http:// scheme will be 80,
# and for https:// will be 443.
# allowed_origins [
# "http://www.example.com"
# "https://www.other-example.com"
# ]
# This enables support for compressed websocket frames
# in the server. For compression to be used, both server
# and client have to support it.
# compression: true
# This is the total time allowed for the server to
# read the client request and write the response back
# to the client. This include the time needed for the
# TLS handshake.
# handshake_timeout: "2s"
}
```
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This contains a rewrite to the services layer for exporting and importing. The code this merges to already had a first significant rewrite that moved from special interest processing to plain subscriptions.
This code changes the prior version's dealing with reverse mapping which was based mostly on thresholds and manual pruning, with some sporadic timer usage. This version uses the jetstream branch's code that understands interest and failed deliveries. So this code is much more tuned to reacting to interest changes. It also removes thresholds and goes only by interest changes or expirations based around a new service export property, response thresholds. This allows a service provider to provide semantics on how long a response should take at a maximum.
This commit also introduces formal support for service export streamed and chunked response types send an empty message to signify EOF.
This commit also includes additions to the service latency tracking such that errors are now sent, not only successful interactions. We have added a Status field and an optional Error fields to ServiceLatency.
We support the following Status codes, these are directly from HTTP.
400 Bad Request (request did not have a reply subject)
408 Request Timeout (when system detects request interest went away, old request style to make dependable)..
503 Service Unavailable (no service responders running)
504 Service Timeout (The new response threshold expired)
Signed-off-by: Derek Collison <derek@nats.io>
If a node in the cluster goes away, an async INFO is sent to
inbound gateway connections so they get a chance to update their
list of remote gateway URLs. Same happens when a node is added
to the cluster.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Fixed#1296, by altering client state on reload
Detect a trace level change on reload and update all clients.
To avoid data races, read client.trace while holding the lock,
pass the value into functionis that trace while not holding the lock.
Delete unused client.debug.
Signed-off-by: Matthias Hanel <mh@synadia.com>
- All writes will now be done by the writeLoop, unless when the
writeLoop has not been started yet (likely in connection init).
- Slow consumers for non CLIENT connections will be reported but
not failed. The idea is that routes, gateway, etc.. connections
should stay connected as much as possible. However if a flush
operation times out and no data at all has been written, the
connection will be closed (regardless of type).
- Slow consumers due to max pending is only for CLIENT connections.
This allows sending of SUBs through routes, etc.. to not have
to be chunked.
- The backpressure to CLIENT connections is increased (up to 1sec)
based on the sub's connection pending bytes level.
- Connection is flushed on close from the writeLoop as to not block
the "fast path".
Some tests have been fixed and adapted since now closeConnection()
is not flushing/closing/removing connection in place.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This could happen if the remote server is running but not dequeueing
from the socket. TLS connection Close() may send/read and so we
need to protect with a deadline.
For non client/leaf connection, do not call flushOutbound().
Set the write deadline regardless of handshakeComplete flag, and
set it to a low value.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
We had too much special processing, so reduced to a single wildcard
which will propagate across routes and gateways and is consistent
with gateway handling of globally routed subjects and timeouts.
Signed-off-by: Derek Collison <derek@nats.io>
When a duplicate route is detected and closed, we need to clear
the route's hash in order to prevent the removal from the
server's routeByHash map.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
- New prefix that includes origin server for the request
- Mapping done if request is service import or requestor has
recent subscription
- Subscription considered recent if less than 250ms
- Destination server strip GW prefix before giving to client
and restore when getting a reply on that subject
- Mapping removed aftert 250ms
- Server rejects client publish on "$GNR." (the new prefix)
- Cluster and server hash are now 8 chars long and from base 62
alphabets
- Mapped replies need to be sent to leafnode servers due to race
(cluster B sends RS+ on GW inbound then RMSG on outbound, the
RS+ may be processed later and cluster A may have given message
to LN before RS+ on reply subject. So LN needs to accept the
mapped reply but will strip to give to client and reassemble
before sending it back)
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>