When a route connection is created, the server will keep track
of the client structure in a special map until the route protocol
completes. This is meant so that if the server is shutdown before
the route is registered in routes map, the server can kick out
the connection's readLoop.
The route connection was correctly removed on success, but was
not for route connections that were not registered and dropped.
This was not causing any issue, but for correctness, doing the
removal now when server removes a route connection.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
I noticed that when running the test suite, there would be a file
server/log1.txt left. This file is created by one of the config
reload test. Running this test individually was doing the proper
cleanup. I noticed that the Signal test that was checking
that files could be rotated was causing this side effect.
It turns out that none of the config reload tests were disabling
the signal handler (NoSigs=true), and since the go routine would
be left running, running the TestSignalToReOpenLogFile() test
would interact with an already finished test.
I put a thread dump in handleSignals() to track all tests that
were causing this function to start the go routine because NoSigs
was not set to true. I fixed all those tests. At this time, there
are only 2 tests that need to start the signal handler.
I have also fixed the code so that the signal handler routine select
on a server quitCh that is closed on shutdown so that this go routine
exit and is waiting on using the grWG wait group.
This PR is based out of #633. It imroves parsing QRSID so that the
TestRouteQueueSemantics test now passes (when dealing with malformed
QRSID).
A test similar to what is reported in #632 was also added. This
test however, uncovers a race condition that will be fixed in a
separate PR.
Resolves#632
This is the result of flapping tests in go-nats that were caused
by a defect (see PR https://github.com/nats-io/go-nats/pull/348).
However, during debugging, I realize that there were also things
that were not quite right in the server side. This change should
make it the notification of cluster topology changes to clients
more robust.
When a server accepts a route, it will keep track of that server
`connectURLs` array. However, if the server was creating a route
to that other server at the same time, it will promote the route
as a solicited one. The content of that array was not transfered,
which means that on a disconnect, it was possible that the cluster
topology change was not properly sent to clients.
Until now, a server would only notify clients of servers that join
the cluster. More than that, a server would send ot its clients only
information if new servers were added.
This PR changes this by sending to clients that support async INFO
the list of URLs for all servers in the cluster any time that there
is a change (joining or leaving the cluster).
As of now, clients will not be affected by the change (and will not
take benefit of this: removing servers from their server pool). This
will be addressed in each supported client once this is merged.
When the option Cluster.NoAdvertise is false, a server will send
an INFO protocol message to its client when a server has joined
the cluster.
Previously, the protocol would be sent only if the
joining server's "client URLs" (the addresses where clients connect
to) were new. It will now be sent regardless if the server joins
(for the first time) or rejoins the cluster.
Clients are still by default invoking the DiscoveredServersCB callback
only if they themselves detect that new URLs were added. A separate
PR may be filled to client libraries repo to be able to invoke
the callback anytime an async INFO protocol is received.
Based on @madgrenadier PR #597.
- Created a setter for the closed flag.
- Check if route is closed under lock and set a boolean if so,
so we don't check c.route outside of c's mutex.
- Ensure that we do not create a route on shutdown, which would
leave a connection hanging (was seen in some config reload tests).
When A connects to B and B connects to A (either based on static
configuration - explicit routes, or because of auto-discovery -
implicit routes), it is possible that each server initially
registers the route from the opposite TCP connection. It will
then result in each server dropping the connection.
We were previously setting a retry flag in the first accepted route
based on the name of servers, which means that regardless of
duplicate detection, the server with the "smaller" server name would
try to reconnect when the route connection was closed. For instance,
suppose that server B connects to server A, when B disconnects, A
would try to reconnect once to B. This became problematic in the
case of configuration reload, because removing the route from B to
A would still result in a route created from A to B.
Also, when a route attempts a reconnect, a random delay is added
to avoid repeated failure cycles that may occur in case where
A connects to B and B to A.
gnatsd currently uses a global logger. This can cause some problems
(especially around the config-reload work), but global variables are
also just an anti-pattern in general. The current behavior is
particularly surprising because the global logger is configured through
calls to the Server.
This addresses issue #500 by removing the global logger and making it a
field on Server.
When a server is told to connect to a server (with auto-discovery),
it tries to connect once. There have been a report where that
connection fails, but would probably succeed if tried again (#408).
This new parameter allows to configure the number of times a failed
implicit connect should be tried.
Resolves#408
The RunServer() function (and the various variants)
call Server.Start() in a go-routine, but do not return until
it has verified that the server is ready to accept connections.
To do so, it use GetListenEndpoint() to get a suitable connect
address (replacing "0.0.0.0" or "::" with localhost - important
on Windows). It then creates a raw TCP connection to ensure the
server is started, repeating the process in case of failure up
to 10 seconds.
This PR replaces this with a function that checks that client
listener, and route listener if configured, are set. This removes
the need to get a connect address and create test tcp connections.
The reason for this change is that NATS Streaming when starting
the NATS Server (unless configured to connect to a remote one)
calls RunServerWithAuth(), which when getting "localhost" from
GetListenEndpoint(), would fail trying to resolve it. This happened
for the NATS Streaming Docker image built with Go 1.7+.
Running `go vet ./...` with `go 1.7.3` would report the following:
```
server/route.go:342: assignment copies lock value to tlsConfig: crypto/tls.Config contains sync.Once contains sync.Mutex
server/server.go:479: assignment copies lock value to config: crypto/tls.Config contains sync.Once contains sync.Mutex
```
Add a “clone” function while waiting for this to be addressed
by the language itself (https://go-review.googlesource.com/#/c/28075/)
By default, a server is now sending to its clients the client URLs
of all servers in the cluster. This allows clients to be able
to reconnect to any server in the cluster even if those clients
were not configured with the list of servers in the cluster.
However, there may be cases where it would make sense to disable
this feature. This now can be done with this option/command line
parameter.
Resolves#322
Trying to use IPv6 address for the cluster host would fail.
Also, there were some unclosed channels in case of accept loop
setup failures.
Resolves#323
Clients that will be at the ClientProtoInfo protocol level (or above)
will now receive an asynchronous INFO protocol when the server
they connect to adds a *new* route. This means that when the cluster
adds a new server, all clients in the cluster should now be notified
of this new addition.
Ensure that all socket writes are protected with deadlines.
For connection Close(), also use deadlines since in case of TLS,
the Close() will send an alert (do a write) if the handshake was
completed. If the peer is not reading, this would cause the Close()
to hang.
Refactor the way client is initialized. We need to ensure that
clients are not added to the clients map and readLoop started if
the server is in the process of being shutdown otherwise there
is a chance that the server already gathered the list of connections
to close and this one would not be included, leaving a readLoop
running.
Same occurs for routes, with the complexity that the readLoop is
started well before the route connection is added to the server
routes' list. We need a temporary map that contains those connections
to be able to close them on server Shutdown.
Fixed some flapping tests.
We need to make sure that when Shutdown() returns, routes go routines
that try to connect or reconnect have returned. Otherwise, this may
affect tests running one after the other (a server from one test
may connect to a server in the next test).
-No need to store ip url string in c.route and resolve remote IP
when forwarding the INFO to known servers.
-When checking if a route is explicit, use strings.ToLower() once
for the url being checked.