Commit Graph

58 Commits

Author SHA1 Message Date
Derek Collison
75ae7c6032 When an account asked for connz should be client and leaf connections only by default.
Signed-off-by: Derek Collison <derek@nats.io>
2021-08-15 11:04:23 -07:00
Derek Collison
f07a86c6db Merge branch 'main' into acc-connz
Signed-off-by: Derek Collison <derek@nats.io>
2021-08-14 18:13:43 -07:00
Derek Collison
cdb5a56329 Fix for flapping test
Signed-off-by: Derek Collison <derek@nats.io>
2021-08-14 15:26:27 -07:00
Derek Collison
14572b080b Fixed and moved large purge test to no race
Signed-off-by: Derek Collison <derek@nats.io>
2021-08-14 13:07:46 -07:00
Derek Collison
10167b1bcf Added in ability for normal accounts to access scoped connz info.
Added in client kind and sub type for clients.
Added in ability to filter connections based on matching subject interest.

Signed-off-by: Derek Collison <derek@nats.io>
2021-08-13 10:19:12 -07:00
Matthias Hanel
d6de19c649 Made test more predictable by waiting for leader after leader shutdown
Signed-off-by: Matthias Hanel <mh@synadia.com>
2021-08-12 12:34:15 -04:00
Derek Collison
c9875e09a0 Fix for a flapper
Signed-off-by: Derek Collison <derek@nats.io>
2021-08-12 06:13:47 -07:00
Derek Collison
29536629eb Simplified flow control, avoid stalls due to msg loss
Signed-off-by: Derek Collison <derek@nats.io>
2021-08-09 20:13:17 -07:00
Derek Collison
4e92b0ed6e When a server was restarting, if a stream had a MaxAge and there were a very large amount of messages to expire, this would take too long.
During normal operation and quick restarts the number of expired messages per cycle is manageable and correct.
However if a server is shutdown for quite a long time and many messages have expired this process is too slow.

This commit introduces an optimized expiration tailored for startup vs running state.

Signed-off-by: Derek Collison <derek@nats.io>
2021-07-30 12:48:47 -07:00
Derek Collison
f13fa767c2 Remove the swapping of accounts during processing of service imports.
When processing service imports we would swap out the accounts during processing.
With the addition of internal subscriptions and internal clients publishing in JetStream we had an issue with the wrong account being used.
This was specific to delyaed pull subscribers trying to unsubscribe due to max of 1 while other JetStream API calls were running concurrently.
2021-07-26 07:57:10 -07:00
Derek Collison
6eef31c0fc Fixed peer info reports that had large last active values.
Also put in safety for lag going upside down as well.

Signed-off-by: Derek Collison <derek@nats.io>
2021-07-06 10:14:43 -07:00
Derek Collison
960c45df81 Use of sync.Pool for filestore could cause msg corruption.
Signed-off-by: Derek Collison <derek@nats.io>
2021-07-06 08:41:01 -07:00
Derek Collison
63479ff8fd Bump threshold
Signed-off-by: Derek Collison <derek@nats.io>
2021-06-27 08:33:46 -07:00
Derek Collison
a27f198b83 Skip for now, covermode blows up memory and latency thresholds
Signed-off-by: Derek Collison <derek@nats.io>
2021-06-23 13:50:14 -07:00
Derek Collison
225c8b4a85 Bump threshold
Signed-off-by: Derek Collison <derek@nats.io>
2021-06-22 17:44:19 -07:00
Derek Collison
b3753aba1b Improvements to filtered purge and general memory use for filestore.
We optimized the filtered purge to skip msgBlks that are not in play.
Also optimized msgBlock buffer usage by using two sync.Pools to enhance reuse.

Signed-off-by: Derek Collison <derek@nats.io>
2021-06-22 15:47:26 -07:00
R.I.Pienaar
c6b85fd101 update for review
Signed-off-by: R.I.Pienaar <rip@devco.net>
2021-06-22 08:47:08 +02:00
R.I.Pienaar
c9bf329a99 test to show slow purges
Signed-off-by: R.I.Pienaar <rip@devco.net>
2021-06-21 17:01:49 +02:00
Derek Collison
6219f0381d Test rename for no race versions
Signed-off-by: Derek Collison <derek@nats.io>
2021-06-15 09:41:11 -07:00
Derek Collison
d9a0ff904c Bump timeout threshold
Signed-off-by: Derek Collison <derek@nats.io>
2021-06-15 08:53:11 -07:00
Derek Collison
08cdb2d2ea Make filtered consumers in large mixed streams more efficient.
Allow wider scoped filtered subjects.

We introduce a per subject information tracking to filestore to optimize for large mux'd streams and more efficient filtered consumers.

Signed-off-by: Derek Collison <derek@nats.io>
2021-06-15 04:44:05 -07:00
Derek Collison
820c76d3c8 Fix flapper
Signed-off-by: Derek Collison <derek@nats.io>
2021-04-19 11:43:43 -07:00
Derek Collison
946335d62f Increase clients and runtime
Signed-off-by: Derek Collison <derek@nats.io>
2021-04-16 14:18:40 -07:00
Derek Collison
d7641b9d38 Move test to norace
Signed-off-by: Derek Collison <derek@nats.io>
2021-04-16 14:00:11 -07:00
Derek Collison
adba4fde5a Add large stress test, skipped by default
Signed-off-by: Derek Collison <derek@nats.io>
2021-04-16 13:58:32 -07:00
Derek Collison
395728bab9 Allow control messages like heartbeats to pass the old sub test.
Signed-off-by: Derek Collison <derek@nats.io>
2021-04-14 14:11:02 -07:00
Jaime Piña
d929ee1348 Check errors when removing test directories and files
Currently in tests, we have calls to os.Remove and os.RemoveAll where we
don't check the returned error. This hides useful error messages when
tests fail to run, such as "too many open files".

This change checks for more filesystem related errors and calls t.Fatal
if there is an error.
2021-04-07 11:09:47 -07:00
Jaime Piña
6941bb3ade Update Go client in tests 2021-03-30 13:17:34 -07:00
Derek Collison
2ed53035ed Reworked flow control for sources and mirrors.
Signed-off-by: Derek Collison <derek@nats.io>
2021-03-24 07:07:33 -07:00
Matthias Hanel
b316cccfd1 Fixed a quorum formation issue that caused truncation
When a new leader is elected it has to give everyone a chance to reply,
so that we can observe rejections with higher term.

The maximum election timeout is 7.5 seconds.
The new behavior of waiting for the election timeout caused unit tests
to fail. Hence upping the timeout there as well.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2021-03-11 19:44:47 -05:00
Derek Collison
2b2a776411 Disable flaky tests for now
Signed-off-by: Derek Collison <derek@nats.io>
2021-03-11 07:11:05 -05:00
Ivan Kozlovic
e7e756034a Switch Gateway JS accounts to interest-only mode + some other fixes
- Fixed the close of a TLS connection which starting Go 1.16
set the deadline to 5 seconds.

- Fixed an issue with setHeader that was causing these error messages
```
=== RUN   TestServiceImportReplyMatchCycleMultiHops
nats: message could not decode headers on connection [4] for subscription on "foo"
--- PASS: TestServiceImportReplyMatchCycleMultiHops (0.04s)
```

- Fixed names of tests in norace_test.go since they must start with
TestNoRace in order to make sure that we execute them in Travis:
```
go test -v -run=TestNoRace --failfast -p=1 ./...
```

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2021-03-03 19:15:28 -07:00
Derek Collison
401484299d Flaps with cluster size of 5 too much
Signed-off-by: Derek Collison <derek@nats.io>
2021-03-03 06:34:07 -08:00
Derek Collison
09e3d26fa3 Add in support for stream mirrors and sources.
Add in proper support for stream updates in clustered mode.
Don't send API updates without subjects, caused GW parser errors.
Stream internal loops use their own clients now.

Signed-off-by: Derek Collison <derek@nats.io>
2021-02-23 10:57:27 -08:00
Derek Collison
c16f6e193d Move JetStream direct APIs to private.
Signed-off-by: Derek Collison <derek@nats.io>
2021-02-07 15:19:22 -08:00
Ivan Kozlovic
46a4969813 Moved test to ones run without -race and cap number of conns
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-10-22 10:11:16 -06:00
Ivan Kozlovic
0c804f5ffb Moving TestQueueAutoUnsubscribe to norace_test.go
This test has been found to cause TestAccountNATSResolverFetch to
fail on macOS. We did not find the exact reason yet, but it seem
that with `-race`, the queue auto-unsub test (that creates 2,000
queue subs and sends 1,000 messages) cause mem to grow to 256MB
(which we know -race is memory hungry) and that may be causing
interactions with the account resolver test.

For now, moving it to norace_test.go, which consumes much less
memory (25MB) and anyway is a better place since it would stress
better the "races" of having a queue sub being unsubscribed while
messages were inflight to this queue sub.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-09-29 18:06:16 -06:00
Ivan Kozlovic
22833c8d1a Fix sysSubscribe races
Made changes to processSub() to accept subscription properties,
including the icb callback so that it is set prior to add the
subscription to the account's sublist, which prevent races.
Fixed some other racy conditions, notably in addServiceImportSub()

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-08-03 14:59:00 -06:00
Ivan Kozlovic
96ccf91566 [FIXED] Possible deadlock with solicited leafnodes when cluster conflict
We cannot call c.closeConnection() under the server lock because
closeConnection() can invoke server lock in some cases.

Created a test that should run without `-race` to reproduce the deadlock
(which it does) but sometimes would fail because cluster would not be
formed. This unconvered an issue with conflict resolution which
test TestRouteClusterNameConflictBetweenStaticAndDynamic() can reproduce
easily. The issue was that we were not updating a dynamic name with
the remote if the remote was non dynamic.

Resolves #1543

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-07-30 18:45:36 -06:00
aricart
e7590f3065 jwt2 testbed 2020-06-01 18:00:13 -04:00
Derek Collison
2bd7553c71 System Account on by default.
Most of the changes are to turn it off for tests that were watching subscriptions and such.

Signed-off-by: Derek Collison <derek@nats.io>
2020-05-29 17:56:45 -07:00
Ivan Kozlovic
d20efffccb Fix TestNoRaceRouteCache test
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2020-05-25 06:58:23 -07:00
Ivan Kozlovic
1b2754475b Refactor async client tests
Updated all tests that use "async" clients.
- start the writeLoop (this is in preparation for changes in the
  server that will not do send-in-place for some protocols, such
  as PING, etc..)
- Added missing defers in several tests
- fixed an issue in client.go where test was wrong possibly causing
  a panic.
- Had to skip a test for now since it would fail without server code
  change.

The next step will be ensure that all protocols are sent through
the writeLoop and that the data is properly flushed on close (important
for -ERR for instance).

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-12-12 11:58:24 -07:00
Ivan Kozlovic
0bfd03091b Clean tmp accounts map when race gets duplicate
Added check to the test to ensure that tmp map is empty.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-16 18:14:23 -07:00
Ivan Kozlovic
3e1728d623 [FIXED] Some accounts locking issues
- Risk of deadlock when checking if issuer claim are trusted. There
  was a RLock() in one thread, then a request for Lock() in another
  that was waiting for RLock() to return, but the first thread was
  then doing RLock() which was not acquired because this was blocked
  by the Lock() request (see e2160cc571)

- Use proper account/locking mode when checking if stream/service
  exports/signer have changed.

- Account registration race (regression from https://github.com/nats-io/nats-server/pull/890)

- Move test from #890 to "no race" test since only then could it detect
  the double registration.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-16 16:59:38 -07:00
Ivan Kozlovic
aa843945c9 Work on Gateways reply mapping
- New prefix that includes origin server for the request
- Mapping done if request is service import or requestor has
  recent subscription
- Subscription considered recent if less than 250ms
- Destination server strip GW prefix before giving to client
  and restore when getting a reply on that subject
- Mapping removed aftert 250ms
- Server rejects client publish on "$GNR." (the new prefix)
- Cluster and server hash are now 8 chars long and from base 62
  alphabets
- Mapped replies need to be sent to leafnode servers due to race
  (cluster B sends RS+ on GW inbound then RMSG on outbound, the
  RS+ may be processed later and cluster A may have given message
  to LN before RS+ on reply subject. So LN needs to accept the
  mapped reply but will strip to give to client and reassemble
  before sending it back)

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-11-06 16:06:49 -07:00
Ivan Kozlovic
2f48ad5150 Fixed subscription close
I noticed that TestNoRaceRoutedQueueAutoUnsubscribe started to
fail a lot on Travis. Running locally I could see a 45 to 50%
failures. After investigation I realized that the issue was that
we have wrongly re-used `subscription.nm` and set to -1 on unsubscribe
however, I believe that it was possible that when subscription was
closed, the server may have already picked that consumer for a delivery
which then causes nm==-1 to be bumped to 0, which was wrong.
Commenting out the subscription.close() that sets nm to -1, I could
not get the test to fail on macOS but would still get 7% failure on
Linux VM. Adding the check to see if sub is closed in deliverMsg()
completely erase the failures, even on Linux VM.

We could still use `nm` set to -1 but check on deliverMsg(), the
same way I use the closed int32 now.

Fixed some flappers.
Updated .travis.yml to failfast if one of the command in the
`script` fails. User `set -e` and `set +e` as recommended in
https://github.com/travis-ci/travis-ci/issues/1066

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-08-20 14:39:23 -06:00
Ivan Kozlovic
07e3db6b8e Prepare for v2.0.4 with goreleaser
Also fixed some flappers

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-08-15 09:06:56 -06:00
Ivan Kozlovic
6fd6ac2821 Update based on comments
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-07-29 20:38:22 -06:00
Ivan Kozlovic
887e744d07 [FIXED] Reduce memory usage on routes
When a route receives a message, it uses a thread local cache to
find the account and subscriptions match for a given subject.
When not found, an entry is added to this cache. The problem is
that this cache will reference subscriptions that in turn
reference connections.
When the subscriptions/connections are closed, this thread local
cannot be purged from those closed subscriptions (since it is
thread local - no lockin is used).
The real issue is that connection's buffer was not set to nil on
close, which then could cause more than expected memory to be
still referenced. Setting the buffer to nil will help reduce the
memory being used.
When an entry is added to the cache, the cache may reach a size
that will cause the server to prune some entries. From time to
time, the cache will be scanned to look for entries that contain
only closed subscriptions and remove those.

Resolves #1082

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2019-07-29 17:54:21 -06:00