370 Commits

Author SHA1 Message Date
Ivan Kozlovic
27ae160f75 Use CID and LeafNodeURLs as an indicator connected to proper port
First, the test should be done only for the initial INFO and only
for solicited connections. Based on the content of INFO coming
from different "listen ports", use the CID and LeafNodeURLs for
the indication that we are connected to the proper port.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-29 14:43:41 -07:00
Waldemar Quevedo
ecb5008fe3 Add check to prevent leafnode connecting to client port
Signed-off-by: Waldemar Quevedo <wally@synadia.com>
2020-01-28 12:43:27 -08:00
Ivan Kozlovic
20768b72c3 [FIXED] Display of connections address when using IPv6
When the server logs information related to a connection, it uses
the connection IP and remote port as a prefix. When it was an IPv6
address, the square brackets would be missing.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-22 09:12:39 -07:00
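A minimal sketch of the bracket issue described in the commit above, not the server's actual code. In Go, `net.JoinHostPort` adds the square brackets that an IPv6 literal needs before the port. The helper name `connPrefix` is hypothetical.

```go
package main

import (
	"fmt"
	"net"
)

// connPrefix is a hypothetical helper that builds a log prefix from a
// remote IP and port. net.JoinHostPort wraps IPv6 literals in square
// brackets so the port separator stays unambiguous.
func connPrefix(ip, port string) string {
	return net.JoinHostPort(ip, port)
}

func main() {
	fmt.Println(connPrefix("127.0.0.1", "4222")) // 127.0.0.1:4222
	fmt.Println(connPrefix("::1", "4222"))       // [::1]:4222
}
```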
Ivan Kozlovic
e94f1b7afb Remove debug trace in writeLoop
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-17 12:09:39 -07:00
Ivan Kozlovic
bd28a015b1 [FIXED] Sublist isSubsetMatch to handle empty tokens
If a subject has empty tokens, returns false.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-14 18:28:14 -07:00
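As an illustration of the empty-token rule in the commit above, the check could look roughly like this. It is a sketch only: `hasEmptyToken` is a hypothetical name and the real sublist code may detect empty tokens differently.

```go
package main

import (
	"fmt"
	"strings"
)

// hasEmptyToken reports whether a dot-separated subject contains an
// empty token, as in "foo..bar", ".foo", "foo." or the empty subject.
// A subset-match function could return false early for such subjects.
func hasEmptyToken(subject string) bool {
	for _, tok := range strings.Split(subject, ".") {
		if tok == "" {
			return true
		}
	}
	return false
}

func main() {
	for _, s := range []string{"foo.bar", "foo..bar", ".foo", "foo."} {
		fmt.Printf("%q empty token: %v\n", s, hasEmptyToken(s))
	}
}
```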
Ivan Kozlovic
85a4c6d17a Updates based on code review
- Use const maxWait=3sec that is used to create and reset the timer
- Remove the lastReport check

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-10 15:58:58 -07:00
Ivan Kozlovic
0e4369cb6a Replace sync.Cond with go channel for writeLoop notification
Also make the wait bound to 3secs after which writeLoop will attempt
to flush. Will log if it timed out on the wait and entering with
fsp > 0. Limit the report to once every 10 minutes

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-10 14:19:10 -07:00
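The channel-based notification with a bounded wait from the commit above can be sketched as follows. This is an assumption-laden illustration, not the server's writeLoop: `waitForSignal` and `signal` are hypothetical names, and the demo uses a short wait instead of 3 seconds.

```go
package main

import (
	"fmt"
	"time"
)

// waitForSignal blocks until a value arrives on ch or maxWait elapses.
// It returns true when woken by a signal and false on timeout, which is
// where a write loop could decide to flush anyway and log the timeout.
func waitForSignal(ch <-chan struct{}, maxWait time.Duration) bool {
	t := time.NewTimer(maxWait)
	defer t.Stop()
	select {
	case <-ch:
		return true
	case <-t.C:
		return false
	}
}

// signal performs a non-blocking send. With a buffered channel of size 1,
// a pending wake-up is kept and extra signals are dropped, so producers
// never block on the notification.
func signal(ch chan<- struct{}) {
	select {
	case ch <- struct{}{}:
	default:
	}
}

func main() {
	ch := make(chan struct{}, 1)
	signal(ch)
	signal(ch) // dropped: a wake-up is already pending
	fmt.Println(waitForSignal(ch, 50*time.Millisecond)) // true
	fmt.Println(waitForSignal(ch, 50*time.Millisecond)) // false
}
```

Unlike `sync.Cond`, a channel can be combined with a timer in a `select`, which is what makes the bounded wait straightforward.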
Ivan Kozlovic
971d2350ed Fixed client stalled duration computation and added back Gosched()
This related to PR #1233.

The computation of the time to stall a fast producer was bogus. Fixed
that and added a unit test for the function computing this stalled
duration.

Also, in PR #1233, I had removed Gosched() when a call to flushOutbound()
realizes that the flag is already set. It was forgetting that readLoop
in some cases will call flushOutbound() in place. So there is still
value in unlock/gosched/lock again in that function.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-10 11:21:38 -07:00
Ivan Kozlovic
c097357b52 [FIXED] More than expected switch to Interest-Only mode for account
When an account is switched to interest-only mode due to no interest,
it was not possible to switch that account more than once. But the
function switchAccountToInterestMode() that triggers a switch could
possibly do it more than once. This should not cause problems
but increased the number of traces in a big super cluster.

Also fixed some flappers and a data race.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-09 13:35:08 -07:00
Ivan Kozlovic
c73be88ac0 Updated based on comments
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-06 16:57:48 -07:00
Ivan Kozlovic
947798231b [UPDATED] TCP Write and SlowConsumer handling
- All writes will now be done by the writeLoop, unless when the
  writeLoop has not been started yet (likely in connection init).
- Slow consumers for non CLIENT connections will be reported but
  not failed. The idea is that routes, gateway, etc.. connections
  should stay connected as much as possible. However if a flush
  operation times out and no data at all has been written, the
  connection will be closed (regardless of type).
- Slow consumers due to max pending is only for CLIENT connections.
  This allows sending of SUBs through routes, etc.. to not have
  to be chunked.
- The backpressure to CLIENT connections is increased (up to 1sec)
  based on the sub's connection pending bytes level.
- Connection is flushed on close from the writeLoop as to not block
  the "fast path".

Some tests have been fixed and adapted since now closeConnection()
is not flushing/closing/removing connection in place.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-31 15:06:27 -07:00
Derek Collison
a2ebf08593 Should allow multiple stream imports on same subject
Signed-off-by: Derek Collison <derek@nats.io>
2019-12-14 17:06:14 -08:00
Ivan Kozlovic
1b2754475b Refactor async client tests
Updated all tests that use "async" clients.
- start the writeLoop (this is in preparation for changes in the
  server that will not do send-in-place for some protocols, such
  as PING, etc..)
- Added missing defers in several tests
- fixed an issue in client.go where test was wrong possibly causing
  a panic.
- Had to skip a test for now since it would fail without server code
  change.

The next step will be to ensure that all protocols are sent through
the writeLoop and that the data is properly flushed on close (important
for -ERR for instance).

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-12 11:58:24 -07:00
Derek Collison
ffc3c0da70 Fixed #1144, qsub performance improvements
Signed-off-by: Derek Collison <derek@nats.io>
2019-12-09 22:08:59 +01:00
Ivan Kozlovic
b78ca2f63b Fixes for system events
- Call flushOutbound() for SYSTEM connections
- Flush in place in internalSendLoop when sending the shutdown event
- Fix some tests:
  - missing defer client connection Close()
  - ensure subs are registered and messages received before shutdown
    of leafnode server to check disconnected event's stats.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-04 20:55:55 -07:00
Ivan Kozlovic
a22da91647 [FIXED] Closing of Gateway or Route TLS connection may hang
This could happen if the remote server is running but not dequeueing
from the socket. TLS connection Close() may send/read and so we
need to protect with a deadline.

For non client/leaf connection, do not call flushOutbound().
Set the write deadline regardless of handshakeComplete flag, and
set it to a low value.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-04 17:27:00 -07:00
Ivan Kozlovic
1068857fd2 Update computeRTT to check for <= 0
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-21 18:09:27 -07:00
Ivan Kozlovic
63138509f7 Tune some code/test for Windows
Running the test suite on a Windows VM, I noticed several failures.
Updated the compute of the RTT to be at least 1ns. I think that
this is just an issue with the VM I am running, but that change
will have no impact for normal situations (since setting the rtt
to the very minimum duration (1ns) instead of 0) and will prevent
some tests from failing.

Because of those same timer granularity issues, I had to add some
delays between some actions in order for time.Sub()/Since() to
actually report something more than 0.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-21 14:32:46 -07:00
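The minimum-RTT adjustment described in the two commits above amounts to a clamp. A sketch under stated assumptions: `clampRTT` is a hypothetical name, and the real `computeRTT` may take different arguments.

```go
package main

import (
	"fmt"
	"time"
)

// clampRTT returns the measured round-trip time, raised to 1ns when
// coarse timer granularity makes the measurement zero or negative.
func clampRTT(measured time.Duration) time.Duration {
	if measured <= 0 {
		return time.Nanosecond
	}
	return measured
}

func main() {
	fmt.Println(clampRTT(0))                    // 1ns
	fmt.Println(clampRTT(2 * time.Millisecond)) // 2ms
}
```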
Derek Collison
6ad8287bbe Introduced wildcard handling of _R_ mapped replies.
We had too much special processing, so reduced to a single wildcard
which will propagate across routes and gateways and is consistent
with gateway handling of globally routed subjects and timeouts.

Signed-off-by: Derek Collison <derek@nats.io>
2019-11-16 12:50:53 -08:00
Derek Collison
c2d2307670 Some tweaks for performance
Signed-off-by: Derek Collison <derek@nats.io>
2019-11-14 14:19:50 -08:00
Ivan Kozlovic
3e5ede1d64 Relax check on reserved GW prefix for system clients
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-11 17:43:14 -07:00
Ivan Kozlovic
b561bde366 Alternate approach to GW reply mapping expiration
Use centralized sync map to gather *client that have GW replies.
Tested with concurrent receiving clients and perf is as good as
with timer per client but reduces need of that timer per client
object.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-11 13:36:24 -07:00
Ivan Kozlovic
9b7dab0548 Updates based on code review
- Add atomic in client to skip check in processInboundClientMsg()
  if value is 0. Avoids getting the lock in fast path if not needed.
- Have a timer per client instead of the global server list that
  was expiring: noticed a lot of contention there when running
  some perf/profiling tests. The timer is also not reset for
  every timestamp that is not yet expired since this too affects
  performance. Instead it fires at regular intervals and is cleared
  when map is empty after a cycle.
- Move processing of gw mapped reply into its own function (in inbound msg).
  I have verified that this is inlined same way as when code was
  directly in processInboundClientMsg.
- Use string(subj[]) for prefix detection: I have verified that
  it is actually faster.
- Builds the RMSG with appends to local buffer in handleGatewayReply()
  instead of using fmt.Sprintf().

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-08 15:56:28 -07:00
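The last bullet of the commit above replaces `fmt.Sprintf()` with appends to a local buffer. The technique can be sketched as below; the header layout here is purely illustrative and is not claimed to be the real RMSG format, and `appendHeader` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendHeader builds "<verb> <subject> <size>\r\n" by appending to buf.
// Appending avoids the format parsing and interface boxing done by
// fmt.Sprintf, and lets the caller reuse a stack or pooled buffer.
func appendHeader(buf []byte, verb, subject string, size int) []byte {
	buf = append(buf, verb...)
	buf = append(buf, ' ')
	buf = append(buf, subject...)
	buf = append(buf, ' ')
	buf = strconv.AppendInt(buf, int64(size), 10)
	buf = append(buf, "\r\n"...)
	return buf
}

func main() {
	var scratch [64]byte // local buffer, reused via scratch[:0]
	hdr := appendHeader(scratch[:0], "X", "foo.bar", 11)
	fmt.Printf("%q\n", hdr)
}
```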
Ivan Kozlovic
aa843945c9 Work on Gateways reply mapping
- New prefix that includes origin server for the request
- Mapping done if request is service import or requestor has
  recent subscription
- Subscription considered recent if less than 250ms
- Destination server strip GW prefix before giving to client
  and restore when getting a reply on that subject
- Mapping removed after 250ms
- Server rejects client publish on "$GNR." (the new prefix)
- Cluster and server hash are now 8 chars long and from base 62
  alphabets
- Mapped replies need to be sent to leafnode servers due to race
  (cluster B sends RS+ on GW inbound then RMSG on outbound, the
  RS+ may be processed later and cluster A may have given message
  to LN before RS+ on reply subject. So LN needs to accept the
  mapped reply but will strip to give to client and reassemble
  before sending it back)

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-06 16:06:49 -07:00
Derek Collison
f0f807f99a After speaking with Ivan we are taking a better approach for initial RTT.
Ivan had the idea of using the CONNECT to establish a first estimate of RTT
without additional PING/PONGs.

Signed-off-by: Derek Collison <derek@nats.io>
2019-10-31 14:01:55 -07:00
Derek Collison
13f217635f Wait on requestor RTT when tracking latency.
If a client RTT for a requestor is longer than a service RTT, the requestor latency was often zero.
We now wait for the RTT (if zero) before sending out the metric.

Signed-off-by: Derek Collison <derek@nats.io>
2019-10-31 08:02:45 -07:00
Ivan Kozlovic
eb1c2ab72a Merge pull request #1175 from nats-io/fix_1174
[FIXED] Server should not send RTT PING before sending initial PONG
2019-10-30 20:36:07 -06:00
Ivan Kozlovic
cbbc21ac25 Some update to leafnode subscription handling
- Send all subs in place if smap is small
- Skip sending update until after sendAllLeafSubs() is done

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-10-30 20:01:49 -06:00
Ivan Kozlovic
17a7d0d866 [FIXED] Server should not send RTT PING before sending initial PONG
As soon as server has processed a client CONNECT, it was possible
that if Connz() or other was requested, the server will send a
PING to compute the RTT. This would cause clients that expect
the first PONG as part of synchronous CONNECT logic to fail.

Make sure that we delay the first RTT ping to after sending the
first PONG, or if client does not send PING as part of the CONNECT,
after 2 seconds have elapsed since the tcp connection was accepted.

Resolves #1174

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-10-30 19:50:19 -06:00
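The rule in the commit above (delay the first RTT PING until the first PONG has been sent, or until 2 seconds after the connection was accepted) can be expressed as a small predicate. This is a sketch: the function and parameter names are hypothetical, and the server tracks this state differently.

```go
package main

import (
	"fmt"
	"time"
)

// maxNoRTTPingBeforeFirstPong is the grace period from the commit message.
const maxNoRTTPingBeforeFirstPong = 2 * time.Second

// canSendFirstRTTPing reports whether the server may send its first RTT
// PING: either the initial PONG already went out, or the client never
// sent a PING with CONNECT and the grace period since accept has elapsed.
func canSendFirstRTTPing(firstPongSent bool, sinceAccept time.Duration) bool {
	return firstPongSent || sinceAccept >= maxNoRTTPingBeforeFirstPong
}

func main() {
	fmt.Println(canSendFirstRTTPing(false, 500*time.Millisecond)) // false
	fmt.Println(canSendFirstRTTPing(true, 500*time.Millisecond))  // true
	fmt.Println(canSendFirstRTTPing(false, 3*time.Second))        // true
}
```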
Jaime Piña
78966fbfa4 Reduce
2019-09-27 16:38:43 -07:00
Jaime Piña
64664946e7 Add QueueSubscribe permissions.
```
users = [
  {
    user: "foo", permissions: {
      sub: {
        # Allow plain subscription foo, but only v1 groups or *.dev queue groups
        allow: ["foo", "foo v1", "foo v1.>", "foo *.dev"]

        # Prevent queue subscriptions on prod groups
        deny: ["> *.prod"]
      }
    }
  }
]
```

Signed-off-by: Jaime Piña <jaime@synadia.com>
Signed-off-by: Waldemar Quevedo <wally@synadia.com>
2019-09-27 16:08:24 -07:00
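In the config above, each permission entry is either a subject alone or a subject followed by a queue group, separated by whitespace. A hedged sketch of splitting such an entry; `splitSubPerm` is a hypothetical helper and the server's own parsing may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSubPerm splits a permission entry such as "foo v1.>" into its
// subject and optional queue group. An entry without a second field
// yields an empty queue, meaning a plain subscription.
func splitSubPerm(entry string) (subject, queue string) {
	fields := strings.Fields(entry)
	switch len(fields) {
	case 0:
		return "", ""
	case 1:
		return fields[0], ""
	default:
		return fields[0], fields[1]
	}
}

func main() {
	for _, e := range []string{"foo", "foo v1.>", "> *.prod"} {
		s, q := splitSubPerm(e)
		fmt.Printf("entry=%q subject=%q queue=%q\n", e, s, q)
	}
}
```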
Waldemar Quevedo
d0e36f3b88 Adjust to zero negative latency values
Signed-off-by: Waldemar Quevedo <wally@synadia.com>
2019-09-20 09:24:18 -07:00
Jaime Piña
ab24cddc06 Add latency config
Currently, the config file doesn't recognize the latency config block in
account exports. This change exposes those settings in the config file.

Signed-off-by: Jaime Piña <jaime@synadia.com>
Signed-off-by: Waldemar Quevedo <wally@synadia.com>
2019-09-18 13:20:26 -07:00
Derek Collison
52430c304a System level services for debugging.
This is the first pass at introducing exported services to the system account for general debugging of blackbox systems.
The first service reports number of subscribers for a given subject. The payload of the request is the subject, and optional queue group, and can contain wildcards.

Signed-off-by: Derek Collison <derek@nats.io>
2019-09-17 09:37:35 -07:00
Derek Collison
94f143ccce Latency tracking updates.
Will now break out the internal NATS latency to show requestor client RTT, responder client RTT and any internal latency caused by hopping between servers, etc.

Signed-off-by: Derek Collison <derek@nats.io>
2019-09-11 16:43:19 -07:00
Ivan Kozlovic
effa30ce4a [FIXED] MaxPending > MaxInt32 causes client to be disconnected
Changed some of client.outbound fields to int64.
Moved fields around to minimize size of struct (checked with
unsafe.Sizeof())
Checked benchmark results before/after
Added test

Resolves #1118

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-09-11 14:29:02 -06:00
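Why a limit above MaxInt32 can misbehave with 32-bit fields is shown below. This is an illustration of integer truncation in Go, assuming the comparison was done on int32 values; it is not the server's actual check, and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// overLimit32 mimics a check done with 32-bit fields: both values are
// truncated to int32 first, so a limit above MaxInt32 wraps negative.
func overLimit32(pending, maxPending int64) bool {
	return int32(pending) > int32(maxPending)
}

// overLimit64 performs the same check with 64-bit fields.
func overLimit64(pending, maxPending int64) bool {
	return pending > maxPending
}

func main() {
	limit := int64(math.MaxInt32) + 1
	fmt.Println(overLimit32(1, limit)) // true: the limit wrapped to a negative value
	fmt.Println(overLimit64(1, limit)) // false
}
```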
Derek Collison
67470911fe Prune remote reply tracking
Signed-off-by: Derek Collison <derek@nats.io>
2019-08-30 17:35:20 -07:00
Derek Collison
bb11f7bd2d Merge pull request #1111 from nats-io/latency
Track latency for exported services
2019-08-30 11:02:36 -07:00
Derek Collison
7989118c3f First pass latency tracking for exported services
Signed-off-by: Derek Collison <derek@nats.io>
2019-08-30 10:52:48 -07:00
Ivan Kozlovic
2a8973a62b Fixed flushOutbound
With Go 1.12 (strangely was not able to reproduce with Go 1.11)
the test TestRouteNoCrashOnAddingSubToRoute() would frequently
lock up and consume all avail CPUs on the machine. Running
this test with GOMAXPROCS=2 you would see server.test CPU usage
pegged at 200% (assuming you have at least 2 CPUs).
The reason was that the writeLoop was spinning because another
routine was already in flushOutbound() and stack trace would
show that it was stuck in system calls. It seems that although
the writeLoop does release the lock, grabbing it again right away was
not allowing the syscall to complete.

So decided to put the unlock/gosched/lock back in flushOutbound()
when flag is already set, but then protect the closeConnection()
with its own flag (similar to clearConnection) to not re-introduce
issue fixed in #1092.

Had to fix the benchmark test RoutedInterestGraph because after a
route is accepted, the initial PING will be sent after 1sec which
was breaking this test.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-08-29 12:59:27 -06:00
Ivan Kozlovic
2f48ad5150 Fixed subscription close
I noticed that TestNoRaceRoutedQueueAutoUnsubscribe started to
fail a lot on Travis. Running locally I could see a 45 to 50%
failures. After investigation I realized that the issue was that
we have wrongly re-used `subscription.nm` and set to -1 on unsubscribe
however, I believe that it was possible that when subscription was
closed, the server may have already picked that consumer for a delivery
which then causes nm==-1 to be bumped to 0, which was wrong.
Commenting out the subscription.close() that sets nm to -1, I could
not get the test to fail on macOS but would still get 7% failure on
Linux VM. Adding the check to see if sub is closed in deliverMsg()
completely erased the failures, even on Linux VM.

We could still use `nm` set to -1 but check on deliverMsg(), the
same way I use the closed int32 now.

Fixed some flappers.
Updated .travis.yml to failfast if one of the commands in the
`script` fails. Use `set -e` and `set +e` as recommended in
https://github.com/travis-ci/travis-ci/issues/1066

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-08-20 14:39:23 -06:00
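The fix in the commit above relies on a dedicated closed flag that is checked at delivery time. A simplified sketch follows; the type and function shapes are assumptions, and the real `deliverMsg()` runs under the client lock and does much more.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// subscription is a stripped-down stand-in for the server's type.
type subscription struct {
	closed int32 // set atomically to 1 on close
	nm     int64 // number of messages delivered
}

func (s *subscription) close()         { atomic.StoreInt32(&s.closed, 1) }
func (s *subscription) isClosed() bool { return atomic.LoadInt32(&s.closed) == 1 }

// deliverMsg refuses delivery to a closed subscription, so a consumer
// picked just before close cannot have its message count bumped.
func deliverMsg(s *subscription) bool {
	if s.isClosed() {
		return false
	}
	s.nm++ // the real code would hold a lock here
	return true
}

func main() {
	s := &subscription{}
	fmt.Println(deliverMsg(s)) // true
	s.close()
	fmt.Println(deliverMsg(s)) // false
	fmt.Println(s.nm)          // 1
}
```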
Waldemar Quevedo
5c776d4363 Fix typo
Signed-off-by: Waldemar Quevedo <wally@synadia.com>
2019-08-13 19:59:28 -07:00
Ivan Kozlovic
c20afd4016 [FIXED] Connection could be closed twice
This was introduced in PR#930. The first commit had the route's
check if the flushOutbound() returned false, and if so would
locally unlock/lock the connection's lock. Unfortunately, this
was replaced in the second commit (a6aeed3a6b)
to the flushOutbound() function itself.
This causes the function closeConnection() to possibly unlock
the connection while calling flushOutbound(), which if the
connection is closed due to both a tls timeout for instance
and explicitly, it would result in the connection being scheduled
for a reconnect (if explicit gateway connection, possibly route).

Added defensive code in Gateway to register a unique outbound gateway.

Fixed a test that was now failing with newer Go version in which
they fixed url.Parse()

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-08-13 20:11:03 -06:00
Derek Collison
8f5bc503e5 Add ability for cross account import services to return streams as well as singletons.
Take into account tracking of response maps that are created and do proper cleanup.
Also fixes #1089 which was discovered while working on this.

Signed-off-by: Derek Collison <derek@nats.io>
2019-08-06 14:15:40 -07:00
Ivan Kozlovic
b537f130cc Use goto to remove entry from cache
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-07-29 20:52:57 -06:00
Ivan Kozlovic
6fd6ac2821 Update based on comments
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-07-29 20:38:22 -06:00
Ivan Kozlovic
887e744d07 [FIXED] Reduce memory usage on routes
When a route receives a message, it uses a thread local cache to
find the account and subscriptions match for a given subject.
When not found, an entry is added to this cache. The problem is
that this cache will reference subscriptions that in turn
reference connections.
When the subscriptions/connections are closed, this thread local
cannot be purged from those closed subscriptions (since it is
thread local - no locking is used).
The real issue is that connection's buffer was not set to nil on
close, which then could cause more than expected memory to be
still referenced. Setting the buffer to nil will help reduce the
memory being used.
When an entry is added to the cache, the cache may reach a size
that will cause the server to prune some entries. From time to
time, the cache will be scanned to look for entries that contain
only closed subscriptions and remove those.

Resolves #1082

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-07-29 17:54:21 -06:00
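The periodic scan described in the commit above, which removes cache entries that reference only closed subscriptions, can be sketched like this. The types and the name `pruneClosed` are hypothetical; the real per-route cache holds more state.

```go
package main

import "fmt"

// sub is a minimal stand-in for a subscription.
type sub struct{ closed bool }

// pruneClosed deletes every cache entry whose subscriptions are all
// closed (entries with no subscriptions are removed too) and returns
// the number of entries removed. Deleting during range is safe in Go.
func pruneClosed(cache map[string][]*sub) int {
	removed := 0
	for subject, subs := range cache {
		allClosed := true
		for _, s := range subs {
			if !s.closed {
				allClosed = false
				break
			}
		}
		if allClosed {
			delete(cache, subject)
			removed++
		}
	}
	return removed
}

func main() {
	cache := map[string][]*sub{
		"foo": {{closed: true}, {closed: true}},
		"bar": {{closed: true}, {closed: false}},
	}
	fmt.Println(pruneClosed(cache), len(cache)) // 1 1
}
```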
Derek Collison
5bec08ac6a Added support for user and activation token revocation
Signed-off-by: Derek Collison <derek@nats.io>
2019-07-28 06:49:39 -07:00
Derek Collison
8bfe14bbfd check response perms more often, make sure we limit memory growth
Signed-off-by: Derek Collison <derek@nats.io>
2019-07-25 16:53:54 -07:00
Derek Collison
495a1a7ec3 Allow dynamic publish permissions based on reply subjects of received msgs
Signed-off-by: Derek Collison <derek@nats.io>
2019-07-25 13:17:26 -07:00