Underestimated the effort to get stream restore working properly in cluster mode.
Some good bug fixes and stability improvments.
Signed-off-by: Derek Collison <derek@nats.io>
I noticed that some consumer go routines were left running at the end
of the test suite.
It turns out that there was a race the way the consumer's qch was closed.
Since it was closed and then set to nil, it is possible that the go
routines that are started and then try to capture o.qch would actually
get qch==nil, wich then when doing a select on that nil channel would
block forever.
So we know pass the qch to the 2 go routines loopAndGatherMsgs() and
loopAndDeliverMsgs() so that when we close the channel there is
no risk of that race happening.
I do believe that there is still something that should be looked at:
it seems that a consumer's delivery loop can now be started/stopped
many times based on leadership acquired/lost. If that is the case,
I think that the consumer should wait for previous go routine to
complete before trying to start new ones.
Also moved 3 JetStream tests to the test/norace_test.go file because
they would consumer several GB of memory when running with the -race flag.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Since we were creating subs on the fly, sub.im would always be nil.
We were passing a client because it was needed in sendRouteSubOrUnSubProtos().
This PR simply fills the buffer with each account's subscriptions.
There is also no need to have subs sent from different go routine
based on some threshold. Routes are no longer subject to max pending.
Some code has been made into a function so that they can be shared
by sendSubsToRoute() and sendRouteSubOrUnSubProtos(). The function
is simply adding to given buffer the RS+/- protocol.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This is how it was up to v2.1.2 included (changed in v2.1.4 onward).
I added a benchmark that has 3 subscribers running and increase
the number of publishers: 1, 2, 5 and 10. This is the comparison
between the pre-PR and post-PR:
```
benchcmp old.txt new.txt
benchmark old ns/op new ns/op delta
Benchmark___BumpPubCount_1x3-16 396 385 -2.78%
Benchmark___BumpPubCount_2x3-16 495 406 -17.98%
Benchmark___BumpPubCount_5x3-16 542 395 -27.12%
Benchmark__BumpPubCount_10x3-16 549 515 -6.19%
benchmark old MB/s new MB/s speedup
Benchmark___BumpPubCount_1x3-16 717.27 737.54 1.03x
Benchmark___BumpPubCount_2x3-16 574.31 699.02 1.22x
Benchmark___BumpPubCount_5x3-16 524.35 718.80 1.37x
Benchmark__BumpPubCount_10x3-16 517.26 551.53 1.07x
```
It is inline with what the user reported, seeing a 20% drop in performance
when going from 1 publisher to 2. But, as we can see, the difference
between go channel and cond variable reduces with the increased number
of publishers after a certain number.
I am not sure of the performance impact on other situations, so this
PR is more of a proposal than a fix.
Resolves#1786
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
The server that solicits a LeafNode connection does not send an
INFO, so the accepting side had no way to know if the remote
supports headers or not. The solicit side will now send the headers
support capability in the CONNECT protocol so that the receiving
side can mark the inbound connection with headers support based
on that and its own support for headers.
Resolves#1781
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Just increased the AckWait from 20ms to 100ms and reduced max
deliveries from 4 to 3.
I believe that there is still the risk that the message is redelivered
while the server is being shutdown and that message is not making it
to the sub.
But using those new values (100ms/3), I have ran 200 rounds on a Linux
VM and did not get the failure (but did before the change).
Again, this is not proper test fix, but may help. This test has been
failing 11 times already (keeping track in spreadsheet) and causes
several minutes of tests to have to be recycled.
Note that the test ran in about 0.4s and now 0.7s, so not that much
and would be worth the added delay if it helps not breaking the whole
test suite!
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Presence of TLS config in any remote gateway or leafnode would
cause the config reload to fail (because TLS config internal
content may change which fails the DeepEqual check).
This PR excludes the TLS configs in such case to check for
changes in gateways and leafnodes.
Although GW and LN config reload is technically supported, this
PR updates the internal remotes' TLS configuration so that
changes/updates to TLS certificates would take effect after
a configuration reload.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Had a deadlock with new preconditions. We need to hold lock across Store() call but that call could call into storeUpdate() such that we may need to acquire the lock. We can enter this callback from the storage layer itself and the lock would not be held so added an atomic.
Signed-off-by: Derek Collison <derek@nats.io>
When a system account was configured and not the default when we did a reload we would lose the JetStream service exports.
Signed-off-by: Derek Collison <derek@nats.io>
Made several changes based on feedback.
1. Made PubAckResponse only optionally include an ApiError and not force an API type.
2. Allow FilterSubject to be set on a consumer config and cleared if it matches the only stream subject.
3. Remove LookupStream by subject, and add in filters for stream names API.
Signed-off-by: Derek Collison <derek@nats.io>