92 Commits

Author SHA1 Message Date
Waldemar Quevedo
8b7dfe7d74 monitoring: track slow consumers per connection type
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-09 05:57:42 -07:00
Ivan Kozlovic
83c5c0177a Changes based on code review
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-04-03 09:32:28 -06:00
Ivan Kozlovic
105237cba8 [ADDED] Multiple routes and ability to have per-account routes
New configuration fields:
```
cluster {
   ...
   pool_size: 5
   accounts: ["A", "B"]
}
```

The configuration `pool_size` in the example above means that this
server will create 5 routes to a remote server, assuming that that
server has the same `pool_size` setting.

Accounts (which are not part of the `accounts[]` configuration)
are assigned a specific route in this pool, and this will be the
same route on all servers in the cluster.

Accounts that are defined in the `accounts` field will each have
a dedicated route connection. This will allow suppression of the
account name in some of the route protocols, reducing bytes transmitted
which may increase performance.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-04-03 09:32:25 -06:00
Marco Primi
f8a030bc4a Use testing.TempDir() where possible
Refactor tests to use go built-in temporary directory utility for tests.

Also avoid binding to default port (which may be in use)
2022-12-12 13:18:44 -08:00
Ivan Kozlovic
a98af54d92 Fix some flappers and update Go version for nightly code coverage
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-10-10 16:42:43 -06:00
Ivan Kozlovic
b69ffe244e Fixed some tests
Code change:
- Do not start the processMirrorMsgs and processSourceMsgs go routine
if the server has been detected to be shutdown. This would otherwise
leave some go routine running at the end of some tests.
- Pass the fch and qch to the consumerFileStore's flushLoop otherwise
in some tests this routine could be left running.

Tests changes:
- Added missing defer NATS connection close
- Added missing defer server shutdown

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-09-08 11:28:23 -06:00
Ivan Kozlovic
8c1c6951dc JetStream: R1 durables were incorrectly migrated on shutdown
This could happen for stream with R>1 but with a durable that
has an override of R=1.

Fixed a test to make sure assets have an elected leader.

Also fixed a gateway test that would cause a data race.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-09-07 19:49:16 -06:00
Ivan Kozlovic
f6c4e5fcee [CHANGED] Gateway: Switch all accounts to interest-only mode
We are phasing out the optimistic-only mode. Servers accepting
inbound gateway connections will switch the accounts to interest-only
mode.

The servers with outbound gateway connection will check interest
and ignore the "optimistic" mode if it is known that the corresponding
inbound is going to switch the account to interest-only. This is
done using a boolean in the gateway INFO protocol.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-19 16:41:44 -06:00
Ivan Kozlovic
5d3ee8ebf4 [FIXED] Gateway: possible panic if monitor endpoint inspected too soon
The monitoring http server is started early and the gateway setup
(when configured) may not be fully ready when the `/gatewayz`
endpoint is inspected and could cause a panic.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-17 13:30:58 -06:00
Ivan Kozlovic
98c1f0ecb2 Fixed some data race and some flappers
Got a data race:
```
==================
WARNING: DATA RACE
Write at 0x00c001c736b0 by goroutine 605:
  runtime.mapassign_faststr()
      /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/runtime/map_faststr.go:202 +0x0
  github.com/nats-io/nats-server/v2/server.(*Account).addServiceImport()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/accounts.go:1868 +0xb7b
  github.com/nats-io/nats-server/v2/server.(*Account).AddServiceImportWithClaim()
...
Previous read at 0x00c001c736b0 by goroutine 301:
  runtime.mapaccess2_faststr()
      /home/travis/.gimme/versions/go1.17.8.linux.amd64/src/runtime/map_faststr.go:107 +0x0
  github.com/nats-io/nats-server/v2/server.(*Server).registerSystemImports()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/events.go:1577 +0x284
  github.com/nats-io/nats-server/v2/server.(*Server).updateAccountClaimsWithRefresh()
...
```

Also, remove some condition in gateway.go on how we were checking
if a subject was a serviec reply, which was causing a test to flap.

Finally, used AckSync() in a rest (instead of m.Respond(nil)) to
prevent it from flapping.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-03-29 19:02:41 -06:00
Ivan Kozlovic
63c750e295 [CHANGED] Gateway: Detect duplicate names between clusters
Gateway connection will be closed and error reported if a remote
has a name that is a duplicate of the local cluster.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-03-15 15:00:13 -06:00
Ivan Kozlovic
b4128693ed Ensure file path is correct during stream restore
Also had to change all references from `path.` to `filepath.` when
dealing with files, so that it works properly on Windows.

Fixed also lots of tests to defer the shutdown of the server
after the removal of the storage, and fixed some config files
directories to use the single quote `'` to surround the file path,
again to work on Windows.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-03-09 13:31:51 -07:00
Ivan Kozlovic
5fc9e0e1cc [FIXED] Gateway URLs gossip and /varz report issues
- When detecting duplicate route, it was possible that a server
would lose track of the peer's gateway URL, which would prevent
it from gossiping that URL to inbound gateway connections
- When a server has gateways enabled and has as a remote its
own gateway, the monitoring endpoint `/varz` would include it
but without the "urls" array.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-10-28 12:05:30 -06:00
Phil Pennock
fc6df0fbbc Redact URLs before logging or returning in error (#2643)
* Redact URLs before logging or returning in error

This does not affect strings which failed to parse, and in such a scenario
there's a mix of "which evil" to accept; we can't sanely find what should be
redacted in those cases, so we leave them alone for debugging.

The JWT library returns some errors for Operator URLs, but it rejects URLs
which contain userinfo, so there can't be passwords in those and they're safe.

Fixes #2597

* Test the URL redaction auxiliary functions

* End-to-end tests for secrets in debug/trace

Create internal/testhelper and move DummyLogger there, so it can be used from
the test/ sub-dir too.

Let DummyLogger optionally accumulate all log messages, not just retain the
last-seen message.

Confirm no passwords logged by TestLeafNodeBasicAuthFailover.

Change TestNoPasswordsFromConnectTrace to check all trace messages, not just the
most recent.

Validate existing trace redaction in TestRouteToSelf.

* Test for password in solicited route reconnect debug
2021-10-27 12:44:59 -04:00
Ivan Kozlovic
a025ce7472 Set defaultServerOptions port to -1 for random
Updated some tests based on this change but also missing defer
connection close or server shutdown.

Fixed how the OCSP run go routine would shutdown, which would
never complete because grWG was not decremented by this go routine
prior to invoking s.Shutdown()

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-09-02 14:22:56 -06:00
Ivan Kozlovic
2881e4a1f0 [FIXED] MQTT fixes and improvements
Some issues that have been fixed would manifest by timeouts on
connect, unexpected memory usage on high publish message rate.

Some details:
- Replies were not always GW routed properly because we were looking
at the wrong connection's rsubs
- GW routed replies would not be found because they were tracked
in the subscription's client object, which may not be the same used
to send the reply
- Increased the mqtt timeout to wait for JS replies since in some
tests it was sometimes taking more than the original 2 seconds
- Incoming gateway messages destined for an MQTT internal subscription
may have been rejected as a no interest if the account had service imports
- Don't use time.After(), instead create explicit timer so it can
be stopped when not timing out.
- Unnecessary copy of a slice since we were converting to a string anyway.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-05-04 20:48:14 -06:00
Jaime Piña
27e9628c3a Run gofmt -s to simplify code 2021-04-09 15:18:06 -07:00
Jaime Piña
d929ee1348 Check errors when removing test directories and files
Currently in tests, we have calls to os.Remove and os.RemoveAll where we
don't check the returned error. This hides useful error messages when
tests fail to run, such as "too many open files".

This change checks for more filesystem related errors and calls t.Fatal
if there is an error.
2021-04-07 11:09:47 -07:00
Ivan Kozlovic
cbcff97244 [CHANGED] Move Gateway interest-only mode switch from INF to DBG
Also fixed a test that would sometimes fail depending on timing.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-03-14 11:34:36 -06:00
Ivan Kozlovic
8598de6dbe [FIXED] Gateway's implicit connection not using global user/pass
If a gateway is configured with an authorization block containing
username and password and accepts an unknown Gateway connection,
when initiating the outbound connection, it should use the
gateway authorization's user/pass information.

Resolves #1912

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-02-16 10:06:06 -07:00
Derek Collison
f0cdf89c61 JetStream Clustering WIP
Signed-off-by: Derek Collison <derek@nats.io>
2021-01-14 01:14:52 -08:00
Ivan Kozlovic
d24e9b75b3 Fixed GW implicit reconnection
PR #1412 had a fix for races during implicit GW reconnection.
However, the fix was a bit too simplistic in that it was checking
only if there was any inbound gateway to decide to try to reconnect
an implicit disconnected GW. We need to check the name, not only
presence of inbound GW connections.

Related to #1412

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-12-28 12:28:55 -07:00
Ivan Kozlovic
fc1521636c [FIXED] Config reload for gateways/leaf remote TLS configurations
Presence of TLS config in any remote gateway or leafnode would
cause the config reload to fail (because TLS config internal
content may change which fails the DeepEqual check).

This PR excludes the TLS configs in such case to check for
changes in gateways and leafnodes.

Although GW and LN config reload is technically supported, this
PR updates the internal remotes' TLS configuration so that
changes/updates to TLS certificates would take effect after
a configuration reload.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-12-11 16:56:25 -07:00
Ivan Kozlovic
ffd476357e [CHANGED] Gateway connections now always send PINGs
Connections normally suppress sending PINGs if there was some
activity. We now force GATEWAY connections to send PINGs at the
configured interval or 15 seconds, whichever is the smallest.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-11-03 13:13:09 -07:00
Ivan Kozlovic
df9d5f5fd9 Accepting route warns if remote server has same name
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-10-08 17:59:33 -06:00
Phil Pennock
3c680eceb9 Inhibit Go's default TCP keepalive settings for NATS (#1562)
Inhibit Go's default TCP keepalive settings for NATS

Go 1.13 changed the semantics of the tuning parameters for TCP keepalives, including the default value.  This affects all TCP listeners.  The NATS protocol has its own L7 keepalive system (PING/PONG) and the Go defaults are not a good fit for some valid deployment scenarios, while Go doesn't directly expose a working API for tuning these.

Rather than add a configuration knob and pull in another dependency (with portability issues) just disable TCP keepalives for all listeners used for speaking the NATS protocol.

Change the tests so we test the same logic.  Do not change HTTP monitoring, profiling, or the websocket API listeners.

Change KeepAlive on client connections too.
2020-08-14 13:37:59 -04:00
Ivan Kozlovic
9288283d90 Fixed accept loops that could leave connections opened
This was discovered with the test TestLeafNodeWithGatewaysServerRestart
that was sometimes failing. Investigation showed that when cluster B
was shutdown, one of the server on A that had a connection from B
that just broke tried to reconnect (as part of reconnect retries of
implicit gateways) to a server in B that was in the process of shuting down.
The connection had been accepted but createGateway not called because
the server's running boolean had been set to false as part of the shutdown.
However, the connection was not closed so the server on A had a valid
connection to a dead server from cluster B. When the B cluster (now single
server) was restarted and a LeafNode connection connected to it, then
the gateway from B to A was created, that server on A did not create outbound
connection to that B server because it already had one (the zombie one).

So this PR strengthens the starting of accept loops and also make sure
that if a connection (all type of connections) is not accepted because
the server is shuting down, that connection is properly closed.

Since all accept loops had almost same code, made a generic function
that accept functions to call specific create connection functions.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-06 17:03:19 -06:00
Derek Collison
2b9e3e5b15 Merge pull request #1476 from nats-io/cluster_name
Cluster names are now required.
2020-06-15 10:07:30 -07:00
Derek Collison
146d8f5dcb Updates based on feedback, sped up some slow tests
Signed-off-by: Derek Collison <derek@nats.io>
2020-06-12 17:26:43 -07:00
Derek Collison
dd61535e5a Cluster names are now required.
Added cluster names as required for prep work for clustered JetStream. System can dynamically pick a cluster name and settle on one even in large clusters.

Signed-off-by: Derek Collison <derek@nats.io>
2020-06-12 15:48:38 -07:00
Ivan Kozlovic
d6de05f49a Fixed a test with data race
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-06-12 13:04:05 -06:00
Ivan Kozlovic
b9bd5c2d35 Fixed flappers
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-06-09 15:34:52 -06:00
Derek Collison
2bd7553c71 System Account on by default.
Most of the changes are to turn it off for tests that were watching subscriptions and such.

Signed-off-by: Derek Collison <derek@nats.io>
2020-05-29 17:56:45 -07:00
Ivan Kozlovic
e976e63099 Fixing some flappers
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-05-25 06:58:23 -07:00
Ivan Kozlovic
5dba3cdd75 [FIXED] Race condition during implicit Gateway reconnection
Say server in cluster A accepts a connection from a server in
cluster B.
The gateway is implicit, in that A does not have a configured
remote gateway to B.
Then the server in B is shutdown, which A detects and initiate
a single reconnect attempt (since it is implicit and if the
reconnect retries is not set).
While this happens, a new server in B is restarted and connects
to A. If that happens before the initial reconnect attempt
failed, A will register that new inbound and do not attempt to
solicit because it has already a remote entry for gateway B.
At this point when the reconnect to old server B fails, then
the remote GW entry is removed, and A will not create an outbound
connection to the new B server.

We fix that by checking if there is a registered inbound when
we get to the point of removing the remote on a failed implicit
reconnect. If there is one, we try the reconnection.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-05-22 13:01:17 -06:00
Ivan Kozlovic
9715848a8e [ADDED] Websocket support
Websocket support can be enabled with a new websocket
configuration block:

```
websocket {
    # Specify a host and port to listen for websocket connections
    # listen: "host:port"

    # It can also be configured with individual parameters,
    # namely host and port.
    # host: "hostname"
    # port: 4443

    # This will optionally specify what host:port for websocket
    # connections to be advertised in the cluster
    # advertise: "host:port"

    # TLS configuration is required
    tls {
      cert_file: "/path/to/cert.pem"
      key_file: "/path/to/key.pem"
    }

    # If same_origin is true, then the Origin header of the
    # client request must match the request's Host.
    # same_origin: true

    # This list specifies the only accepted values for
    # the client's request Origin header. The scheme,
    # host and port must match. By convention, the
    # absence of port for an http:// scheme will be 80,
    # and for https:// will be 443.
    # allowed_origins [
    #    "http://www.example.com"
    #    "https://www.other-example.com"
    # ]

    # This enables support for compressed websocket frames
    # in the server. For compression to be used, both server
    # and client have to support it.
    # compression: true

    # This is the total time allowed for the server to
    # read the client request and write the response back
    # to the client. This include the time needed for the
    # TLS handshake.
    # handshake_timeout: "2s"
}
```

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-05-20 11:14:39 -06:00
Derek Collison
0129a7fa09 Header support for GWs
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:33:56 -07:00
Derek Collison
a5dbc20e94 Fix for test failure
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:26:46 -07:00
Derek Collison
ea5e5bd364 Services rewrite #2
This contains a rewrite to the services layer for exporting and importing. The code this merges to already had a first significant rewrite that moved from special interest processing to plain subscriptions.

This code changes the prior version's dealing with reverse mapping which was based mostly on thresholds and manual pruning, with some sporadic timer usage. This version uses the jetstream branch's code that understands interest and failed deliveries. So this code is much more tuned to reacting to interest changes. It also removes thresholds and goes only by interest changes or expirations based around a new service export property, response thresholds. This allows a service provider to provide semantics on how long a response should take at a maximum.

This commit also introduces formal support for service export streamed and chunked response types send an empty message to signify EOF.

This commit also includes additions to the service latency tracking such that errors are now sent, not only successful interactions. We have added a Status field and an optional Error fields to ServiceLatency.

We support the following Status codes, these are directly from HTTP.

400 Bad Request (request did not have a reply subject)
408 Request Timeout (when system detects request interest went away, old request style to make dependable)..
503 Service Unavailable (no service responders running)
504 Service Timeout (The new response threshold expired)

Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:26:46 -07:00
Derek Collison
df774e44b0 Rework how service imports are handled to avoid performance hits
Signed-off-by: Derek Collison <derek@nats.io>
2020-05-19 14:18:34 -07:00
Ivan Kozlovic
fef94759ab [FIXED] Update remote gateway URLs when node goes away in cluster
If a node in the cluster goes away, an async INFO is sent to
inbound gateway connections so they get a chance to update their
list of remote gateway URLs. Same happens when a node is added
to the cluster.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-04-20 13:48:47 -06:00
Ivan Kozlovic
8e4b449119 Fixed flappers
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-02-19 13:19:08 -07:00
Ivan Kozlovic
bd28a015b1 [FIXED] Sublist isSubsetMatch to handle empty tokens
If a subject has empty tokens, returns false.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-14 18:28:14 -07:00
Ivan Kozlovic
c097357b52 [FIXED] More than expected switch to Interest-Only mode for account
When an account is switched to interest-only mode due to no interest,
it was not possible to switch that account more than once. But the
function switchAccountToInterestMode() that triggers a switch could
possibly doing it more than once. This should not cause problems
but increased the number of traces in a big super cluster.

Also fixed some flappers and a data race.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-09 13:35:08 -07:00
Ivan Kozlovic
947798231b [UPDATED] TCP Write and SlowConsumer handling
- All writes will now be done by the writeLoop, unless when the
  writeLoop has not been started yet (likely in connection init).
- Slow consumers for non CLIENT connections will be reported but
  not failed. The idea is that routes, gateway, etc.. connections
  should stay connected as much as possible. However if a flush
  operation times out and no data at all has been written, the
  connection will be closed (regardless of type).
- Slow consumers due to max pending is only for CLIENT connections.
  This allows sending of SUBs through routes, etc.. to not have
  to be chunked.
- The backpressure to CLIENT connections is increased (up to 1sec)
  based on the sub's connection pending bytes level.
- Connection is flushed on close from the writeLoop as to not block
  the "fast path".

Some tests have been fixed and adapted since now closeConnection()
is not flushing/closing/removing connection in place.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-31 15:06:27 -07:00
Ivan Kozlovic
a22da91647 [FIXED] Closing of Gateway or Route TLS connection may hang
This could happen if the remote server is running but not dequeueing
from the socket. TLS connection Close() may send/read and so we
need to protect with a deadline.

For non client/leaf connection, do not call flushOutbound().
Set the write deadline regardless of handshakeComplete flag, and
set it to a low value.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-04 17:27:00 -07:00
Ivan Kozlovic
a0f8bd112e [FIXED] Prevent A- for account that has service reply subscription
Prevent sending an A- for a given account if the server has this
account registered and an internal service reply subscription.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-26 16:21:36 -07:00
Ivan Kozlovic
63138509f7 Tune some code/test for Windows
Running test suite on a Windows VM, I notice several failures.
Updated the compute of the RTT to be at least 1ns. I think that
this is just an issue with the VM I am running, but that change
will have no impact for normal situations (since setting the rtt
to the very minimum duration (1ns) instead of 0) and will prevent
some tests from failing.

Because of those same timer granularity issues, I had to add some
delays between some actions in order for time.Sub()/Since() to
actually report something more than 0.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-21 14:32:46 -07:00
Derek Collison
6ad8287bbe Introduced wildcard handling of _R_ mapped replies.
We had too much special processing, so reduced to a single wildcard
which will propagate across routes and gateways and is consistent
with gateway handling of globally routed subjects and timeouts.

Signed-off-by: Derek Collison <derek@nats.io>
2019-11-16 12:50:53 -08:00
Ivan Kozlovic
b561bde366 Alternate approach to GW reply mapping expiration
Use centralized sync map to gather *client that have GW replies.
Tested with concurrent receiving clients and perf is as good as
with timer per client but reduces need of that timer per client
object.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-11 13:36:24 -07:00