We previously simply called DialTimeout() on a route's url when
soliciting. If it resolved to the IP of the host, it would create
a route to self, which server detects, but then would not try again
with other IPs that would have allowed to form a cluster with
other servers running on the other IPs.
This PR keeps track of local IPs + cluster port and exclude them
from the list of IPs returned by LookupHost API. This even prevent
solicitation of routes to self. Only non-local IPs will be tried.
Resolves#1586
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
The new option Websocket.NoTLS would have to be set to true
to disable the server check that enforces TLS configuration.
Resolves#1529
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
If some servers in the cluster have the same connect URLs (due
to the use of client advertise), then it would be possible to
have a server sends the connect_urls INFO update to clients with
missing URLs.
Resolves#1515
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This was discovered with the test TestLeafNodeWithGatewaysServerRestart
that was sometimes failing. Investigation showed that when cluster B
was shutdown, one of the server on A that had a connection from B
that just broke tried to reconnect (as part of reconnect retries of
implicit gateways) to a server in B that was in the process of shuting down.
The connection had been accepted but createGateway not called because
the server's running boolean had been set to false as part of the shutdown.
However, the connection was not closed so the server on A had a valid
connection to a dead server from cluster B. When the B cluster (now single
server) was restarted and a LeafNode connection connected to it, then
the gateway from B to A was created, that server on A did not create outbound
connection to that B server because it already had one (the zombie one).
So this PR strengthens the starting of accept loops and also make sure
that if a connection (all type of connections) is not accepted because
the server is shuting down, that connection is properly closed.
Since all accept loops had almost same code, made a generic function
that accept functions to call specific create connection functions.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Added cluster names as required for prep work for clustered JetStream. System can dynamically pick a cluster name and settle on one even in large clusters.
Signed-off-by: Derek Collison <derek@nats.io>
The grace period used to be hardcoded at 10 seconds.
This option allows the user to configure the amount of time the
server will wait before initiating the closing of client connections.
Note that the grace period needs to be strictly lower than the overall
lame_duck_duration. The server deducts the grace period from that
overall duration and spreads the closing of connections during
that time.
For instance, if there are 1000 connections and the lame duck
duration is set to 30 seconds and grace period to 10, then
the server will use 30-10 = 20 seconds to spread the closing
of those 1000 connections, so say roughly 50 clients per second.
Resolves#1459.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This is related to #1408.
Make sure that we close the websocket "accept loop" if configured
before proceeding with the lame duck mode.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
- A race test may have consumed a lot of fds going in TIME_WAIT
that could cause some issues for other tests
- Missing defer filestore.Stop() that would leave flushLoop()
routines
- A defer for the from server in a LeafNode test
- Rework [Re]ConnectErrorReports that was failing often for me
locally (probably due to exhaustion of fds - too many TIME_WAIT).
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Server was incorrectly processing a queue subscription removal
as both a plain sub and queue sub, which may have resulted in
drop of interest even when some queue subs remained.
Resolves#1421
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This will ensure that there is no race where clients are accepted
after the LDM INFO notification.
Also add to the test to make sure that we don't send INFO when
routes are disconnected due to internal closing of connections
during the shutdown process.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Also send an INFO to routes so that the remotes can remove the
LDM's server client URLs and notify their own clients of this
change.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
When no tag was set, os.Getenv("TRAVIS_TAG") would return empty string.
Travis now set TRAVIS_TAG to `''`. So check for both.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Updated all tests that use "async" clients.
- start the writeLoop (this is in preparation for changes in the
server that will not do send-in-place for some protocols, such
as PING, etc..)
- Added missing defers in several tests
- fixed an issue in client.go where test was wrong possibly causing
a panic.
- Had to skip a test for now since it would fail without server code
change.
The next step will be ensure that all protocols are sent through
the writeLoop and that the data is properly flushed on close (important
for -ERR for instance).
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Running test suite on a Windows VM, I notice several failures.
Updated the compute of the RTT to be at least 1ns. I think that
this is just an issue with the VM I am running, but that change
will have no impact for normal situations (since setting the rtt
to the very minimum duration (1ns) instead of 0) and will prevent
some tests from failing.
Because of those same timer granularity issues, I had to add some
delays between some actions in order for time.Sub()/Since() to
actually report something more than 0.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This adds a new config option server_name that
when set will be exposed in varz, events and more
as a descriptive name for the server.
If unset though the server_name will default to the pk
Signed-off-by: R.I.Pienaar <rip@devco.net>
Also fix for #1071 in that we need to check remote gateways TLS
config even if main gateway section is not configured with TLS.
Related to #1071
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This has an effect only on connections created by the server,
so routes and gateways (explicit and implicit).
Make sure that an explicit warning is printed if the insecure
property is set, but otherwise allow it.
Resolves#1062
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Changed the introduced new option and added a new one. The idea
is to be able to differentiate between never connected and reconnected
event. The never connected situation will be logged at first attempt
and every hour (by default, configurable).
However, once connected and if trying to reconnect, will report every
attempts by default, but this is configurable too.
These two options are supported for config reload.
Related to #1000
Related to #1001Resolves#969
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This is a continuation of #1000. Added a configuration to specify
the number of attempts at which the repeated error is reported.
The algo is now to print only the 1st attempt and when current
attempt % <this config param> == 0.
Resolves#969
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This applies to routes, gateways and leaf node connections.
The failed attempts will be printed at the first, after the first
minute and then every hour.
The connect/error statements now include the attempt number.
Note that in debug mode, all attempts are traced, so you may get
double trace (one for debug, one for info/error).
Resolves#969
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Based on @softkbot PR #913.
Removed the command line parameter, which then removes the need for Options.Cluster.TLSInsecure.
Added a test with config reload.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
There is a possibility that a partial write results in data
not being sent in a timely fashion to a subscription.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This is not complete solution and is a bit hacky but is a start
to be able to have service import work at least in some basic
cases.
Also fixed a bug where replySub would not be removed from
connection's list of subs after delivery.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
- Solve RS+ with wildcards
- Solve issue with messages not send to remote gateways queue subs
if there was a qsub on local server.
- Made rcache a perAccountCache since it is now used by routes and
gateways
- Order outbound gateways only on RTT updates
- Print a server's gateway name on startup
- Augment/add some tests
- Update TLS handling: when connecting, use hostname for ServerName
if url is not IP, otherwise use a hostname that we saved when
parsing/adding URLs for the remote gateway.
- Send big buffer in chunks if needed.
- Add caching for qsubs match
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This will allow to signal multiple servers at once to go in
that mode and not have their client reconnect to one of the
servers in the group.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Start the lame duck mode in a go routine in the signal handler
because I think we want to be able to shutdown the server while
in that mode.
Kept the closing as a loop in the lameDuckMode() function (did
not use a timer).
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
When receiving SIGUSR2 signal (or -sl ldm) the server stops
accepting new clients, closes routes connections and spread the
closing of client connections based on a config lame duck duration
(default is 30sec). This will help preventing a storm of client
reconnect when a server needs to be shutdown.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
- Increate WriteDeadline test that otherwise could cause a client
connect to fail
- Check failed NumRoutes() with retry
- Check that subs are propagated in route permissions test
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>