Commit Graph

51 Commits

Author SHA1 Message Date
Ivan Kozlovic
8e4b449119 Fixed flappers
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-02-19 13:19:08 -07:00
Ivan Kozlovic
bd28a015b1 [FIXED] Sublist isSubsetMatch to handle empty tokens
If a subject has empty tokens, returns false.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-14 18:28:14 -07:00
Ivan Kozlovic
c097357b52 [FIXED] More than expected switch to Interest-Only mode for account
When an account is switched to interest-only mode due to no interest,
it was not possible to switch that account more than once. But the
function switchAccountToInterestMode() that triggers a switch could
possibly doing it more than once. This should not cause problems
but increased the number of traces in a big super cluster.

Also fixed some flappers and a data race.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-01-09 13:35:08 -07:00
Ivan Kozlovic
947798231b [UPDATED] TCP Write and SlowConsumer handling
- All writes will now be done by the writeLoop, unless when the
  writeLoop has not been started yet (likely in connection init).
- Slow consumers for non CLIENT connections will be reported but
  not failed. The idea is that routes, gateway, etc.. connections
  should stay connected as much as possible. However if a flush
  operation times out and no data at all has been written, the
  connection will be closed (regardless of type).
- Slow consumers due to max pending is only for CLIENT connections.
  This allows sending of SUBs through routes, etc.. to not have
  to be chunked.
- The backpressure to CLIENT connections is increased (up to 1sec)
  based on the sub's connection pending bytes level.
- Connection is flushed on close from the writeLoop as to not block
  the "fast path".

Some tests have been fixed and adapted since now closeConnection()
is not flushing/closing/removing connection in place.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-31 15:06:27 -07:00
Ivan Kozlovic
a22da91647 [FIXED] Closing of Gateway or Route TLS connection may hang
This could happen if the remote server is running but not dequeueing
from the socket. TLS connection Close() may send/read and so we
need to protect with a deadline.

For non client/leaf connection, do not call flushOutbound().
Set the write deadline regardless of handshakeComplete flag, and
set it to a low value.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-04 17:27:00 -07:00
Ivan Kozlovic
a0f8bd112e [FIXED] Prevent A- for account that has service reply subscription
Prevent sending an A- for a given account if the server has this
account registered and an internal service reply subscription.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-26 16:21:36 -07:00
Ivan Kozlovic
63138509f7 Tune some code/test for Windows
Running test suite on a Windows VM, I notice several failures.
Updated the compute of the RTT to be at least 1ns. I think that
this is just an issue with the VM I am running, but that change
will have no impact for normal situations (since setting the rtt
to the very minimum duration (1ns) instead of 0) and will prevent
some tests from failing.

Because of those same timer granularity issues, I had to add some
delays between some actions in order for time.Sub()/Since() to
actually report something more than 0.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-21 14:32:46 -07:00
Derek Collison
6ad8287bbe Introduced wildcard handling of _R_ mapped replies.
We had too much special processing, so reduced to a single wildcard
which will propagate across routes and gateways and is consistent
with gateway handling of globally routed subjects and timeouts.

Signed-off-by: Derek Collison <derek@nats.io>
2019-11-16 12:50:53 -08:00
Ivan Kozlovic
b561bde366 Alternate approach to GW reply mapping expiration
Use centralized sync map to gather *client that have GW replies.
Tested with concurrent receiving clients and perf is as good as
with timer per client but reduces need of that timer per client
object.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-11 13:36:24 -07:00
Ivan Kozlovic
cacfb4a08c Fix some gateway tests
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-08 19:07:57 -07:00
Ivan Kozlovic
9b7dab0548 Updates based on code review
- Add atomic in client to skip check in processInboundClientMsg()
  if value is 0. Avoids getting the lock in fast path if not needed.
- Have a timer per client instead of the global server list that
  was expiring: noticed a lot of contention there when running
  some perf/profiling tests. The timer is also not reset for
  every timestamp that is not yet expired since this too affects
  performance. Instead fires are regular interval and cleared
  when map is empty after a cycle.
- Move processing of gw map rely on its own function (in inbound msg).
  I have verified that this is inlined same way as when code was
  directly in processInboundClientMsg.
- Use string(subj[]) for prefix detection: I have verified that
  it is actually faster.
- Builds the RMSG with appends to local buffer in handleGatewayReply()
  instead of using fmt.Sprintf().

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-08 15:56:28 -07:00
Ivan Kozlovic
aa843945c9 Work on Gateways reply mapping
- New prefix that includes origin server for the request
- Mapping done if request is service import or requestor has
  recent subscription
- Subscription considered recent if less than 250ms
- Destination server strip GW prefix before giving to client
  and restore when getting a reply on that subject
- Mapping removed aftert 250ms
- Server rejects client publish on "$GNR." (the new prefix)
- Cluster and server hash are now 8 chars long and from base 62
  alphabets
- Mapped replies need to be sent to leafnode servers due to race
  (cluster B sends RS+ on GW inbound then RMSG on outbound, the
  RS+ may be processed later and cluster A may have given message
  to LN before RS+ on reply subject. So LN needs to accept the
  mapped reply but will strip to give to client and reassemble
  before sending it back)

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-06 16:06:49 -07:00
Ivan Kozlovic
75ec78c232 [FIXED] Explicit gateway not using discovered URLs
If cluster A configures a gateway to cluster B, the server on A
tries to connect to that server URL. If there is no server on B
at that address, but a server on B with different address connects
to server on cluster A, that server should be able to create its
outbound connection in response.
That was not the case because the configured URLs were snapshot
before the loop of trying to connect. When accepting an inbound
connection and updating the array, this new URL was not being used.

The issue is only if the server on A had no outbound connection
at that time.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-10-24 16:40:38 -06:00
Ivan Kozlovic
77c63dbce1 Fix flappers
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-08-20 17:07:22 -06:00
Ivan Kozlovic
2f48ad5150 Fixed subscription close
I noticed that TestNoRaceRoutedQueueAutoUnsubscribe started to
fail a lot on Travis. Running locally I could see a 45 to 50%
failures. After investigation I realized that the issue was that
we have wrongly re-used `subscription.nm` and set to -1 on unsubscribe
however, I believe that it was possible that when subscription was
closed, the server may have already picked that consumer for a delivery
which then causes nm==-1 to be bumped to 0, which was wrong.
Commenting out the subscription.close() that sets nm to -1, I could
not get the test to fail on macOS but would still get 7% failure on
Linux VM. Adding the check to see if sub is closed in deliverMsg()
completely erase the failures, even on Linux VM.

We could still use `nm` set to -1 but check on deliverMsg(), the
same way I use the closed int32 now.

Fixed some flappers.
Updated .travis.yml to failfast if one of the command in the
`script` fails. User `set -e` and `set +e` as recommended in
https://github.com/travis-ci/travis-ci/issues/1066

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-08-20 14:39:23 -06:00
Ivan Kozlovic
c20afd4016 [FIXED] Connection could be closed twice
This was introduced in PR#930. The first commit had the route's
check if the flushOutbound() returned false, and if so would
locally unlock/lock the connection's lock. Unfortunately, this
was replaced in the second commit (a6aeed3a6b)
to the flushOutbound() function itself.
This causes the function closeConnection() to possibly unlock
the connection while calling flushOutbound(), which if the
connection is closed due to both a tls timeout for instance
and explicitly, it would result in the connection being scheduled
for a reconnect (if explicit gateway connection, possibly route).

Added defensive code in Gateway to register a unique outbound gateway.

Fixed a test that was now failing with newer Go version in which
they fixed url.Parse()

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-08-13 20:11:03 -06:00
Ivan Kozlovic
ed1901c792 Update go.mod to satisfy v2 requirements
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-06-03 19:45:47 -06:00
Ivan Kozlovic
37b3546e7b Switch gateway to InterestMode only once
When a leafnode connection is created, the server forces all
gateway inbound connections to switch to InterestMode. Do this only
once, regardless of how many times the LN (re)connects.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-05-30 17:21:15 -06:00
Ivan Kozlovic
2d4c3dd38f Added logging of account interest mode switch for gateways
Both sides will log when an account is switched to interest-only
mode. There are 2 traces (start/complete) per account.
They are logged at [INF] level.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-05-28 14:55:45 -06:00
Ivan Kozlovic
5478eaf01e Added /gatewayz endpoint
Such endpoint will list the gateway/cluster name, address and port
then list of outbound/inbound connections.
For each remote gateway there will be at most one outbound connection.
There can be 0 or more inbound connections for the same remote
gateway.

For each of these outbound/inbound connection, the connection info
similar to Connz is reported. Optionally, one can include the
interest mode/stats for each account.

Here are possible options:

* No specific options

http://host:port/gatewayz

* Limit to specific remote gateway, say name "B":

http://host:port/gatewayz/gw_name=B

* Include accounts (default limit to 1024 accounts)

http://host:port/gatewayz/accs=1

* Specific limit, say 200 (note accs=1 in this case is optional)

http://host:port/gatewayz/accs=1&accs_limit=200

* Specific account, say "acc_1". Note that accs=1 is not required then

http://host:port/gatewayz/acc_name=acc_1

* Above options can be mixed: specific remote gateway (B), with 100
  accounts reported

http://host:port/gatewayz/gw_name=B&accs_limit=200

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-05-28 12:41:09 -06:00
Ivan Kozlovic
4ed08dde07 Merge pull request #1013 from nats-io/fix_gw_qinterest_loss
Fixed loss of queue subscription interest across Gateways in some cases
2019-05-26 18:23:06 -06:00
Ivan Kozlovic
ce1e6defab Fix flappers
- TestSystemAccountConnectionUpdatesStopAfterNoLocal: I believe that
  the check on number of notifications was wrong. Since we did not
  consume the ones for the connect, the expected count after the
  disconnect is 8 instead of 4.

- Possible fix GW tests complaining about number of outbound/inbound
  I think that it may be possible that connection does not succeed
  right away (remote to fully started, etc) and due to dial timeout
  and reconnect attempt delay, I suspect that when given a max time
  of 1sec to complete, it may not be enough.
  Quick change for now is to override to 2secs for now in the
  wait helpers. If that proves conclusive, we could remove the
  timeout given to these helpers.

- TestGatewaySendAllSubsBadProtocol: used a t.Fatalf() in checkFor
  instead of return fmt.Errorf().

- TestLeafNodeResetsMSGProto: this test is not about change to
  interest mode only, so to avoid possible mix of protos, delay
  a bit creation of gateway after creation of leaf node.

- Some defer s.Shutdown() were missing

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-05-26 17:17:08 -06:00
Ivan Kozlovic
b325cf1e4a Fixed loss of queue subscription interest across Gateways in some cases
Suppose two servers, SA in cluster A and SB in cluster B. If SA
sends a message to SB on an account for which there is no interest
at all (account not known or no subscription), SB will send an A-
and keep track that it sent an A- for this account.

When a queue subscription is created on SB, SB will send and RS+
to A because A needs to have perfect knowledge of all queue subs
in all clusters.

If then a regular subscription is also created on SB, SB will
think that it needs to send an A+ because it had sent an A- for
this account. However, SA had an entry for this account for the
queue sub. The A+ would clear the entry in the map and would cause
SA to not send messages to SB even if they would have been a
match for the queue sub on SB.

We fix this in two ways:
- Clear the possible A- in SB when sending an RS+ for queue sub
- Processing of A-/A+ to be aware of a possible entry in the map
  due to queue subs.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-05-25 16:27:00 -06:00
Derek Collison
933f5d0df4 Add in TestGatewayServiceExportWithWildcards
Signed-off-by: Derek Collison <derek@nats.io>
2019-05-21 15:28:42 -07:00
Derek Collison
67bb08af8b Fixes for a few flappers.
TestJWTAccountImportActivationExpires
TestGatewayServiceImportWithQueue

Signed-off-by: Derek Collison <derek@nats.io>
2019-05-21 15:12:31 -07:00
Ivan Kozlovic
1eff7bc112 Fixed gateway test race report
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-05-13 11:49:29 -06:00
Derek Collison
d7140a0fd1 Update for client rename
Signed-off-by: Derek Collison <derek@nats.io>
2019-05-10 15:11:30 -07:00
Derek Collison
acfe372d63 Changes for rename from gnatsd -> nats-server
Signed-off-by: Derek Collison <derek@nats.io>
2019-05-06 15:04:24 -07:00
Derek Collison
f320f318b7 Fixed merge conflict
Signed-off-by: Derek Collison <derek@nats.io>
2019-04-23 17:28:42 -07:00
Derek Collison
bfe83aff81 Make account lookup faster with sync.Map
Signed-off-by: Derek Collison <derek@nats.io>
2019-04-23 17:13:23 -07:00
Ivan Kozlovic
bb4e8ae0f9 Gateways: Fix race for request reply
This addresses the following race:
- client connection creates a subscription on a reply subject
- client connection sends a request
- server sends the subscription to inbound gateway
- server sends the message to outbound gateway (those may be
  to different servers)
- receiving server sends to sub interested in request subject
- app sends reply
- its server then check for interest on the reply's subject

In interestOnly mode, there is a possibility that this server
has not received the interest on the reply subject yet and would
then drop the reply.

This PR detects above scenario and will prefix the reply subject
to identify the origin cluster if it is detected that the last
subscription from the sending connection was created less than
a second ago.
Once the destination has this prefix, the destination cluster
will always send back that message to origin cluster even if
there is no registered interest.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-04-22 20:00:21 -06:00
Ivan Kozlovic
d8098c134b Reduce startup memory for gateways
Similar to #956 but for gateways code.
Also fixing route test TestLargeClusterMem.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-04-17 15:18:46 -06:00
Ivan Kozlovic
4ea96337ed Added test gateway tlsConfig.ServerName
Checks that if not provided server fails to connect to remote
gateway. Once set to expected hostname ("localhost"), connection
works.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-04-12 11:21:57 -06:00
Ivan Kozlovic
18399a3808 Gateways: Rework Account Sub/Unsub
We now send A- if an account does not exists, or if there is no
interest on a given subject and no existing subscription.
An A+ is sent if an A- was previously sent and a subscription
for this account is registered.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-02-26 18:34:30 -07:00
Ivan Kozlovic
7ad4498a09 Gateways: Remove unused permissions options
Permissions were configured but not implemented. Removing for now.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-01-10 09:49:36 -07:00
Ivan Kozlovic
7c220ba700 Support for service export with wildcards
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-12-13 21:22:01 -07:00
Ivan Kozlovic
519c3dab47 Add Gateway test for service import and interest only
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-12-11 14:44:02 -08:00
Ivan Kozlovic
4b70cdfc89 Fix Gateways with Service Imports
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-12-11 00:27:40 -08:00
Ivan Kozlovic
6eaa1dc351 Resolve IP if gateway listen is 0.0.0.0 or ::
Otherwise, this may be sent to servers in the cluster and to other
gateways which may result in attempt to connect to self which
in case of TLS would produce error.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-12-07 17:28:21 -07:00
Ivan Kozlovic
95a5f79ac7 Added Gateway test for service import with queue group
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-12-06 19:13:39 -07:00
Ivan Kozlovic
111e050d32 Allow service import to work with Gateways
This is not complete solution and is a bit hacky but is a start
to be able to have service import work at least in some basic
cases.

Also fixed a bug where replySub would not be removed from
connection's list of subs after delivery.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-12-05 20:35:43 -07:00
Derek Collison
2d54fc3ee7 Account lookup failures, account and client limits, options reload.
Changed account lookup and validation failures to be more understandable by users.
Changed limits to be -1 for unlimited to match jwt pkg.

The limits changed exposed problems with options holding real objects causing issues with reload tests under race mode.
Longer term this code should be reworked such that options only hold config data, not real structs, etc.

Signed-off-by: Derek Collison <derek@nats.io>
2018-12-05 14:25:40 -08:00
Ivan Kozlovic
e7b6c5731e Update based on comments
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-12-03 17:17:55 -07:00
Ivan Kozlovic
a23ef5b740 Switch to send-all-subs when number of RS- gets too big
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-12-03 13:15:11 -07:00
Ivan Kozlovic
f011db47c7 Fixed race issue with lookup/update of the sent no-interest map
We can't use a simple sync.Map here because the noInterest map
for inbound gateway connections are used concurrently. Indeed,
whenever an account would have been registered or a new sub created
this could trigger an update of that map in order to clear the
fact that we had sent an A-/RS- and now are sending an A+/RS+.
So changed to simple map but protected by gw connection's lock.

Without this change, server would panic if there are messages
published to cluster A that are sent to server B while a sub
is then created on matching subject on B.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-11-29 14:22:56 -07:00
Ivan Kozlovic
60462b2a44 Fixed gateway flapper
Need to make sure message is received before unsub'ing because
otherwise it would be possible that the unsub happens before
message is delivered, which would have resulted in an RS- while
we were expecting the message to not cause one.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-11-28 19:16:35 -07:00
Ivan Kozlovic
cfc5ec4d44 Fixed test and remove grace period from total duration.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-11-28 18:25:15 -07:00
Ivan Kozlovic
086b26f14a Gateways: Ignore reference to self
Allows the use of a global include for all gateways and each
gateway will ignore its own reference.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-11-28 14:24:28 -07:00
Ivan Kozlovic
d78b1ae464 Fixed issue with gateways
- If/when splitting buffer to pass to queueOutbound(), it has to
  be include full protocol.
- Fix counting of total queue subs
- Fix tests
- Send RS- if no plain sub interest even if there is queue sub
  interest.
- Removed a one-liner function

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-11-28 13:15:47 -07:00
Ivan Kozlovic
52c724a83c Updates based on comments
- Solve RS+ with wildcards
- Solve issue with messages not send to remote gateways queue subs
  if there was a qsub on local server.
- Made rcache a perAccountCache since it is now used by routes and
  gateways
- Order outbound gateways only on RTT updates
- Print a server's gateway name on startup
- Augment/add some tests
- Update TLS handling: when connecting, use hostname for ServerName
  if url is not IP, otherwise use a hostname that we saved when
  parsing/adding URLs for the remote gateway.
- Send big buffer in chunks if needed.
- Add caching for qsubs match

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2018-11-27 19:39:41 -07:00