Commit Graph

94 Commits

Author SHA1 Message Date
Ivan Kozlovic
2ad2bed170 [ADDED] Support for route hostname resolution
We previously simply called DialTimeout() on a route's url when
soliciting. If it resolved to the IP of the host, it would create
a route to self, which server detects, but then would not try again
with other IPs that would have allowed to form a cluster with
other servers running on the other IPs.

This PR keeps track of local IPs + cluster port and exclude them
from the list of IPs returned by LookupHost API. This even prevent
solicitation of routes to self. Only non-local IPs will be tried.

Resolves #1586

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-09-08 13:40:17 -06:00
Ivan Kozlovic
20a67a5be8 Websocket: add option to disable TLS
The new option Websocket.NoTLS would have to be set to true
to disable the server check that enforces TLS configuration.

Resolves #1529

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-29 17:33:02 -06:00
Ivan Kozlovic
9b0967a5d1 [FIXED] Handling of gossiped URLs
If some servers in the cluster have the same connect URLs (due
to the use of client advertise), then it would be possible to
have a server sends the connect_urls INFO update to clients with
missing URLs.

Resolves #1515

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-15 17:39:12 -06:00
Ivan Kozlovic
9288283d90 Fixed accept loops that could leave connections opened
This was discovered with the test TestLeafNodeWithGatewaysServerRestart
that was sometimes failing. Investigation showed that when cluster B
was shutdown, one of the server on A that had a connection from B
that just broke tried to reconnect (as part of reconnect retries of
implicit gateways) to a server in B that was in the process of shuting down.
The connection had been accepted but createGateway not called because
the server's running boolean had been set to false as part of the shutdown.
However, the connection was not closed so the server on A had a valid
connection to a dead server from cluster B. When the B cluster (now single
server) was restarted and a LeafNode connection connected to it, then
the gateway from B to A was created, that server on A did not create outbound
connection to that B server because it already had one (the zombie one).

So this PR strengthens the starting of accept loops and also make sure
that if a connection (all type of connections) is not accepted because
the server is shuting down, that connection is properly closed.

Since all accept loops had almost same code, made a generic function
that accept functions to call specific create connection functions.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-06 17:03:19 -06:00
Ivan Kozlovic
27540ee255 Fixed some flappers
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-03 11:30:48 -06:00
Derek Collison
2b9e3e5b15 Merge pull request #1476 from nats-io/cluster_name
Cluster names are now required.
2020-06-15 10:07:30 -07:00
Derek Collison
146d8f5dcb Updates based on feedback, sped up some slow tests
Signed-off-by: Derek Collison <derek@nats.io>
2020-06-12 17:26:43 -07:00
Derek Collison
dd61535e5a Cluster names are now required.
Added cluster names as required for prep work for clustered JetStream. System can dynamically pick a cluster name and settle on one even in large clusters.

Signed-off-by: Derek Collison <derek@nats.io>
2020-06-12 15:48:38 -07:00
Ivan Kozlovic
67d2638859 [ADDED] Print the config file being used in startup banner
Resolves #1451

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-06-12 12:21:50 -06:00
Ivan Kozlovic
b9bd5c2d35 Fixed flappers
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-06-09 15:34:52 -06:00
Ivan Kozlovic
cd6d71deaa [ADDED] lame_duck_grace_period option
The grace period used to be hardcoded at 10 seconds.
This option allows the user to configure the amount of time the
server will wait before initiating the closing of client connections.

Note that the grace period needs to be strictly lower than the overall
lame_duck_duration. The server deducts the grace period from that
overall duration and spreads the closing of connections during
that time.
For instance, if there are 1000 connections and the lame duck
duration is set to 30 seconds and grace period to 10, then
the server will use 30-10 = 20 seconds to spread the closing
of those 1000 connections, so say roughly 50 clients per second.

Resolves #1459.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-06-08 11:43:25 -06:00
Ivan Kozlovic
98ea70a590 LameDuckMode takes into account websocket accept loop
This is related to #1408.
Make sure that we close the websocket "accept loop" if configured
before proceeding with the lame duck mode.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-06-02 17:49:38 -06:00
Derek Collison
05e38ae527 Merge branch 'master' into sys-acc 2020-06-01 11:53:14 -07:00
Derek Collison
2bd7553c71 System Account on by default.
Most of the changes are to turn it off for tests that were watching subscriptions and such.

Signed-off-by: Derek Collison <derek@nats.io>
2020-05-29 17:56:45 -07:00
Ivan Kozlovic
44e78a1fb6 Fixed some tests
- A race test may have consumed a lot of fds going in TIME_WAIT
that could cause some issues for other tests
- Missing defer filestore.Stop() that would leave flushLoop()
routines
- A defer for the from server in a LeafNode test
- Rework [Re]ConnectErrorReports that was failing often for me
locally (probably due to exhaustion of fds - too many TIME_WAIT).

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-05-29 17:47:08 -06:00
Ivan Kozlovic
e9805a3109 [FIXED] Possible removal of interest on queue subs with leaf nodes
Server was incorrectly processing a queue subscription removal
as both a plain sub and queue sub, which may have resulted in
drop of interest even when some queue subs remained.

Resolves #1421

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-05-28 10:21:51 -06:00
Ivan Kozlovic
8678a61e3e Move the send of INFO after client listener has been shutdown
This will ensure that there is no race where clients are accepted
after the LDM INFO notification.

Also add to the test to make sure that we don't send INFO when
routes are disconnected due to internal closing of connections
during the shutdown process.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-05-24 11:38:49 -06:00
Ivan Kozlovic
dc0f688cbf [FIXED] LameDuckMode sends INFO to clients
Also send an INFO to routes so that the remotes can remove the
LDM's server client URLs and notify their own clients of this
change.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-05-21 12:15:20 -06:00
Ivan Kozlovic
d1276ad038 Add TLS 1.3 (and new ciphers) in the tlsVersion output
Also changed unknown version to "0x.." to show that value is hexa.

Resolves #1313

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-03-18 10:09:23 -06:00
Ivan Kozlovic
5eebf02e5f Fixed TestVersionMatchesTag test
When no tag was set, os.Getenv("TRAVIS_TAG") would return empty string.
Travis now set TRAVIS_TAG to `''`. So check for both.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-03-09 10:13:32 -06:00
Ivan Kozlovic
156bf7b381 Updates based on code review
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-02-19 16:52:41 -07:00
Ivan Kozlovic
8e4b449119 Fixed flappers
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-02-19 13:19:08 -07:00
Ivan Kozlovic
1b2754475b Refactor async client tests
Updated all tests that use "async" clients.
- start the writeLoop (this is in preparation for changes in the
  server that will not do send-in-place for some protocols, such
  as PING, etc..)
- Added missing defers in several tests
- fixed an issue in client.go where test was wrong possibly causing
  a panic.
- Had to skip a test for now since it would fail without server code
  change.

The next step will be ensure that all protocols are sent through
the writeLoop and that the data is properly flushed on close (important
for -ERR for instance).

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-12 11:58:24 -07:00
Ivan Kozlovic
63138509f7 Tune some code/test for Windows
Running test suite on a Windows VM, I notice several failures.
Updated the compute of the RTT to be at least 1ns. I think that
this is just an issue with the VM I am running, but that change
will have no impact for normal situations (since setting the rtt
to the very minimum duration (1ns) instead of 0) and will prevent
some tests from failing.

Because of those same timer granularity issues, I had to add some
delays between some actions in order for time.Sub()/Since() to
actually report something more than 0.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-21 14:32:46 -07:00
R.I.Pienaar
bcf96fa1de Allows a descriptive server_name to be set
This adds a new config option server_name that
when set will be exposed in varz, events and more
as a descriptive name for the server.

If unset though the server_name will default to the pk

Signed-off-by: R.I.Pienaar <rip@devco.net>
2019-10-17 18:51:19 +02:00
Ivan Kozlovic
0a72993d80 Add warning for TLS insecure setting on LeafNodes
Also fix for #1071 in that we need to check remote gateways TLS
config even if main gateway section is not configured with TLS.

Related to #1071

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-07-12 17:22:57 -06:00
Ivan Kozlovic
37d08a6c56 [FIXED] Allow TLS InsecureSkipVerify again
This has an effect only on connections created by the server,
so routes and gateways (explicit and implicit).
Make sure that an explicit warning is printed if the insecure
property is set, but otherwise allow it.

Resolves #1062

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-07-12 12:10:28 -06:00
Derek Collison
10d4f1ab7a Convert leafnode solicited remotes to array
Signed-off-by: Derek Collison <derek@nats.io>
2019-07-10 11:53:34 -07:00
Derek Collison
0f20592fb3 Made leafnode connect a Debugf to be consistent, added first connect Noticef.
Signed-off-by: Derek Collison <derek@nats.io>
2019-06-29 19:11:02 -07:00
Derek Collison
257b670ae2 Cleaned up logging for leafnodes
Signed-off-by: Derek Collison <derek@nats.io>
2019-05-30 15:53:14 -07:00
Ivan Kozlovic
d2578f9e05 Update to connect/reconnect error reports logic
Changed the introduced new option and added a new one. The idea
is to be able to differentiate between never connected and reconnected
event. The never connected situation will be logged at first attempt
and every hour (by default, configurable).
However, once connected and if trying to reconnect, will report every
attempts by default, but this is configurable too.

These two options are supported for config reload.

Related to #1000
Related to #1001
Resolves #969

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-05-26 17:51:01 -06:00
Ivan Kozlovic
48c3f7f846 Fixed some flappers
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-05-24 09:53:35 -06:00
Ivan Kozlovic
7272e4e317 Make the error report attempts configurable
This is a continuation of #1000. Added a configuration to specify
the number of attempts at which the repeated error is reported.
The algo is now to print only the 1st attempt and when current
attempt % <this config param> == 0.

Resolves #969

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-05-20 16:28:48 -06:00
Ivan Kozlovic
03930ba0e4 [UPDATED] Reduce report of failed connection attempts
This applies to routes, gateways and leaf node connections.
The failed attempts will be printed at the first, after the first
minute and then every hour.
The connect/error statements now include the attempt number.

Note that in debug mode, all attempts are traced, so you may get
double trace (one for debug, one for info/error).

Resolves #969

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-05-20 10:13:56 -06:00
Derek Collison
d7140a0fd1 Update for client rename
Signed-off-by: Derek Collison <derek@nats.io>
2019-05-10 15:11:30 -07:00
Derek Collison
acfe372d63 Changes for rename from gnatsd -> nats-server
Signed-off-by: Derek Collison <derek@nats.io>
2019-05-06 15:04:24 -07:00
Alexei Volkov
83aefdc714 [ADDED] Cluster tls insecure configuration
Based on @softkbot PR #913.
Removed the command line parameter, which then removes the need for Options.Cluster.TLSInsecure.
Added a test with config reload.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-03-11 14:48:22 -06:00
Ivan Kozlovic
42f45ce5b6 [FIXED] Possible delays in delivering messages
There is a possibility that a partial write results in data
not being sent in a timely fashion to a subscription.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-02-02 18:42:50 -07:00
Ivan Kozlovic
111e050d32 Allow service import to work with Gateways
This is not complete solution and is a bit hacky but is a start
to be able to have service import work at least in some basic
cases.

Also fixed a bug where replySub would not be removed from
connection's list of subs after delivery.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-12-05 20:35:43 -07:00
Ivan Kozlovic
0ba587249a Fixing setting of default gateway TLS Timeout
Moved setting to the default value in setBaselineOptions()
so that config reload does not fail.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-12-03 18:20:15 -07:00
Ivan Kozlovic
52c724a83c Updates based on comments
- Solve RS+ with wildcards
- Solve issue with messages not send to remote gateways queue subs
  if there was a qsub on local server.
- Made rcache a perAccountCache since it is now used by routes and
  gateways
- Order outbound gateways only on RTT updates
- Print a server's gateway name on startup
- Augment/add some tests
- Update TLS handling: when connecting, use hostname for ServerName
  if url is not IP, otherwise use a hostname that we saved when
  parsing/adding URLs for the remote gateway.
- Send big buffer in chunks if needed.
- Add caching for qsubs match

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-11-27 19:39:41 -07:00
Ivan Kozlovic
10fd3ca0c6 Gateways [WIP]
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-11-27 19:00:03 -07:00
Ivan Kozlovic
eb17950971 Introduce some delay before closing clients in LameDuck mode.
This will allow to signal multiple servers at once to go in
that mode and not have their client reconnect to one of the
servers in the group.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-11-08 18:34:15 -07:00
Ivan Kozlovic
1817b354e3 Update tests on Travis with tweaked GC settings
Moved some tests to "no race" tests that are run separately.
Removing -v and adding -p=1.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-11-08 16:56:20 -07:00
Ivan Kozlovic
c173d55e2e Update based on comments
Start the lame duck mode in a go routine in the signal handler
because I think we want to be able to shutdown the server while
in that mode.

Kept the closing as a loop in the lameDuckMode() function (did
not use a timer).

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-10-22 16:27:30 -06:00
Ivan Kozlovic
0067c3bb04 Added support for lame duck mode
When receiving SIGUSR2 signal (or -sl ldm) the server stops
accepting new clients, closes routes connections and spread the
closing of client connections based on a config lame duck duration
(default is 30sec). This will help preventing a storm of client
reconnect when a server needs to be shutdown.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-10-19 19:07:37 -06:00
Derek Collison
e78d587083 Added support for maximum subscriptions per connection
Signed-off-by: Derek Collison <derek@nats.io>
2018-07-01 15:13:59 -07:00
Derek Collison
3b953ce838 Allow localhost to not be defined, only need 127.0.0.1
Signed-off-by: Derek Collison <derek@nats.io>
2018-06-28 16:10:19 -07:00
Ivan Kozlovic
aff1dcf089 Fix some tests
Add some helpers to check on some state.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-06-27 17:26:49 -06:00
Ivan Kozlovic
0e422812cd Tune some more tests
- Increate WriteDeadline test that otherwise could cause a client
  connect to fail
- Check failed NumRoutes() with retry
- Check that subs are propagated in route permissions test

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-06-26 18:52:56 -06:00