38 Commits

Author SHA1 Message Date
Derek Collison
edbaa57e87 Fixes for move test.
The default timeout for JetStream API calls is 10s. If we determine that we are the leader but the stream info endpoint has not yet registered with the server we are connected to, the stream info call can fail, and because that single call blocks for the full 10s it exhausts the whole checkFor window. The fix is to override the timeout so that multiple attempts are possible.

Signed-off-by: Derek Collison <derek@nats.io>
2023-09-12 11:38:35 -07:00
Derek Collison
8544cb7adf Merge branch 'main' into dev
Signed-off-by: Derek Collison <derek@nats.io>
2023-08-22 20:04:59 -07:00
Derek Collison
ddb7f9f9d5 Fix for a peer-remove of an R1 that would brick the stream.
Signed-off-by: Derek Collison <derek@nats.io>
2023-08-22 17:45:19 -07:00
Derek Collison
bcf5da04e3 Merge branch 'main' into dev
2023-08-22 06:50:36 -07:00
Derek Collison
e5d208bf33 When moving streams, we could check too soon and be in a gap where the replica peer has not registered a catchup request.
This would cause us to incorrectly think the replica was caught up and drop our leadership, which would cancel any catchup requests.

Signed-off-by: Derek Collison <derek@nats.io>
2023-08-21 20:07:48 -07:00
Jean-Noël Moyne
7ff114162c Adds the same check for valid stream name for Mirror
Fix test using invalid stream names

Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>
2023-06-08 07:49:47 -07:00
Derek Collison
724160ebac Fix flapping tests
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-28 14:30:23 -08:00
Derek Collison
2d794d09e1 Fix to flapping test to make sure we do not quickly blow away all consumer state.
Signed-off-by: Derek Collison <derek@nats.io>
2023-02-17 14:23:34 -08:00
Marco Primi
f8a030bc4a Use testing.TempDir() where possible
Refactor tests to use Go's built-in temporary directory utility for tests.

Also avoid binding to default port (which may be in use)
2022-12-12 13:18:44 -08:00
Ivan Kozlovic
170ff49837 [ADDED] JetStream: peer (the hash of server name) in statsz/jsz
A request to `$SYS.REQ.SERVER.PING.JSZ` would now return something
like this:
```
...
    "meta_cluster": {
      "name": "local",
      "leader": "A",
      "peer": "NUmM6cRx",
      "replicas": [
        {
          "name": "B",
          "current": true,
          "active": 690369000,
          "peer": "b2oh2L6w"
        },
        {
          "name": "Server name unknown at this time (peerID: jZ6RvVRH)",
          "current": false,
          "offline": true,
          "active": 0,
          "peer": "jZ6RvVRH"
        }
      ],
      "cluster_size": 3
    }
```
Note the "peer" field following the "leader" field that contains
the server name. The new field is the node ID, which is a hash of
the server name.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-09-16 15:31:37 -06:00
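One use of the new "peer" field from 170ff49837 is identifying replicas whose server name is not yet known. The sketch below decodes the sample response shown in that commit; the struct types are assumptions that mirror only the fields visible in the sample, not the server's own definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// replica and metaCluster cover only the fields seen in the sample output.
type replica struct {
	Name    string `json:"name"`
	Current bool   `json:"current"`
	Offline bool   `json:"offline"`
	Active  int64  `json:"active"`
	Peer    string `json:"peer"`
}

type metaCluster struct {
	Name        string    `json:"name"`
	Leader      string    `json:"leader"`
	Peer        string    `json:"peer"`
	Replicas    []replica `json:"replicas"`
	ClusterSize int       `json:"cluster_size"`
}

// sample is the "meta_cluster" object from the commit message.
const sample = `{
  "name": "local", "leader": "A", "peer": "NUmM6cRx",
  "replicas": [
    {"name": "B", "current": true, "active": 690369000, "peer": "b2oh2L6w"},
    {"name": "Server name unknown at this time (peerID: jZ6RvVRH)",
     "current": false, "offline": true, "active": 0, "peer": "jZ6RvVRH"}
  ],
  "cluster_size": 3
}`

// offlinePeers returns the node IDs of all replicas flagged as offline.
func offlinePeers(data []byte) ([]string, error) {
	var mc metaCluster
	if err := json.Unmarshal(data, &mc); err != nil {
		return nil, err
	}
	var ids []string
	for _, r := range mc.Replicas {
		if r.Offline {
			ids = append(ids, r.Peer)
		}
	}
	return ids, nil
}

func main() {
	ids, err := offlinePeers([]byte(sample))
	fmt.Println(ids, err)
}
```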
Ivan Kozlovic
29224c8ea9 Split more tests to speed up Travis run
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-09-09 12:45:48 -06:00
Matthias Hanel
f7cb5b1f0d changed format of JSClusterNoPeers error (#3459)
* changed format of JSClusterNoPeers error

This error was introduced in #3342 and reveals too much information.
This change gets rid of cluster names and peer counts.

All other counts were changed to booleans,
which are only included in the output when the filter was hit.

In addition, the set of non-matching tags is included.
Furthermore, the static error description in server/errors.json
is moved into selectPeerError.

sample errors:
1) no suitable peers for placement, tags not matched ['cloud:GCP', 'country:US']
2) no suitable peers for placement, insufficient storage

Signed-off-by: Matthias Hanel <mh@synadia.com>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Co-authored-by: Ivan Kozlovic <ivan@synadia.com>
2022-09-08 18:25:48 -07:00
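The message format introduced by f7cb5b1f0d (booleans instead of counts, plus the set of unmatched tags) can be approximated as follows. `placementFailure` is a hypothetical type written for this sketch; it is not the server's `selectPeerError`, and it covers only the two reasons that appear in the commit's sample errors.

```go
package main

import (
	"fmt"
	"strings"
)

// placementFailure records why no peer qualified. Hypothetical type; other
// filters mentioned in the commit history are omitted here.
type placementFailure struct {
	NoStorage   bool     // at least one peer was rejected for lack of storage
	NoMatchTags []string // placement tags that no peer matched
}

// Error renders only the reasons that were actually hit.
func (f placementFailure) Error() string {
	var reasons []string
	if len(f.NoMatchTags) > 0 {
		quoted := make([]string, len(f.NoMatchTags))
		for i, t := range f.NoMatchTags {
			quoted[i] = "'" + t + "'"
		}
		reasons = append(reasons, "tags not matched ["+strings.Join(quoted, ", ")+"]")
	}
	if f.NoStorage {
		reasons = append(reasons, "insufficient storage")
	}
	msg := "no suitable peers for placement"
	if len(reasons) > 0 {
		msg += ", " + strings.Join(reasons, ", ")
	}
	return msg
}

func main() {
	fmt.Println(placementFailure{NoMatchTags: []string{"cloud:GCP", "country:US"}})
	fmt.Println(placementFailure{NoStorage: true})
}
```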
Ivan Kozlovic
b69ffe244e Fixed some tests
Code change:
- Do not start the processMirrorMsgs and processSourceMsgs go routine
if the server has been detected to be shutdown. This would otherwise
leave some go routine running at the end of some tests.
- Pass the fch and qch to the consumerFileStore's flushLoop otherwise
in some tests this routine could be left running.

Tests changes:
- Added missing defer NATS connection close
- Added missing defer server shutdown

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-09-08 11:28:23 -06:00
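The second code change in b69ffe244e, handing the flush and quit channels to the loop so it can always be stopped, follows a common Go pattern. The sketch below shows that pattern in isolation; apart from the names `flushLoop`, `fch` and `qch`, which come from the commit message, everything here is an assumption and not the consumerFileStore implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// flushLoop flushes whenever fch fires and exits as soon as qch is closed,
// so no goroutine is left running after shutdown.
func flushLoop(fch, qch <-chan struct{}, flush func(), wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-fch:
			flush()
		case <-qch:
			return
		}
	}
}

// runOnce starts the loop, requests one flush, shuts the loop down and
// reports how many flushes happened.
func runOnce() int {
	fch, qch := make(chan struct{}), make(chan struct{})
	var wg sync.WaitGroup
	flushes := 0

	wg.Add(1)
	go flushLoop(fch, qch, func() { flushes++ }, &wg)

	fch <- struct{}{} // unbuffered: returns once the loop has received it
	close(qch)        // signal shutdown
	wg.Wait()         // the goroutine has exited; safe to read flushes
	return flushes
}

func main() {
	fmt.Println("flushes:", runOnce())
}
```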
Derek Collison
9c3bd17059 Updates to tests
Signed-off-by: Derek Collison <derek@nats.io>
2022-09-06 13:33:39 -07:00
Derek Collison
b850a95d4c Remove auto-promotion of direct get. Force stream config to set AllowDirect to true.
Signed-off-by: Derek Collison <derek@nats.io>
2022-09-06 13:33:39 -07:00
Ivan Kozlovic
88ece75765 [FIXED] JetStream: Some nodes may never be reported as offline
In some rare situations, it is possible that nodes are added
to the cluster but are not properly tracked and not shown as
offline when they exit the cluster.

Relates to #3258

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-09-01 12:48:12 -06:00
Ivan Kozlovic
8d1fb4bc92 [FIXED] JetStream: possible routing issues through gateways
Internally jetstream may subscribe to some subject and then send
a request with a reply subject matching that subscription.
Due to interest propagation through a super cluster, it is possible
that the reply comes back to a node that is not yet aware of
the subscription interest which would cause the reply to be dropped.

Some code detects that the subscription is recent and "maps" the
reply subject so that it can be routed back to the origin server.
However, this was done with the use of the connection object that
created the subscription, but at the time of the send, a different
internal "*client" object may be used which would then cause
the code to not be aware of the recent subscription and not do
the mapping.

This code was changed to scope at the account level instead of
connection.

A recent change in PR #3412 is no longer needed and was reverted
in favor of changes in this PR.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-31 14:18:28 -06:00
Derek Collison
98bf861a7a Updates to stream and consumer move logic.
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-30 16:11:35 -07:00
Ivan Kozlovic
380fa4499f Merge pull request #3383 from nats-io/gw_switch_to_interest_only_right_away
[CHANGED] Gateway: Switch all accounts to interest-only mode
2022-08-23 08:44:15 -06:00
Ivan Kozlovic
5663bc2fa3 Reduce length of some clustering tests
Since PR #3381, the 2 tests modified here would take twice as
long (around 245 seconds) to complete.
After talking with Matthias, he suggested using a variable instead of
a const and setting it to 0 for those 2 tests since they don't really
need that to be set.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-22 12:35:37 -06:00
Ivan Kozlovic
f6c4e5fcee [CHANGED] Gateway: Switch all accounts to interest-only mode
We are phasing out the optimistic-only mode. Servers accepting
inbound gateway connections will switch the accounts to interest-only
mode.

The servers with outbound gateway connection will check interest
and ignore the "optimistic" mode if it is known that the corresponding
inbound is going to switch the account to interest-only. This is
done using a boolean in the gateway INFO protocol.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-19 16:41:44 -06:00
Matthias Hanel
6bf50dbb77 induce delay prior to scale down (#3381)
This is to avoid a narrow race between adding servers and them catching
up, where they also register as current.

Also wait for all peers to be caught up.

This also avoids clearing the catchup marker once a catchup has stalled.
A stalled catchup would remove the marker, causing the peer to
register as current.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-18 13:47:40 -07:00
Matthias Hanel
9892a132e7 Improve StreamMoveInProgressError (#3376)
by adding progress indicators

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-17 15:12:32 -07:00
Matthias Hanel
76219f8e5b fix unit test (#3359)
Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-11 01:46:30 +02:00
Matthias Hanel
c6e37cf7af Fix race between stream stop and monitorStream (#3350)
* Fix race between stream stop and monitorStream

monitorCluster stops the stream. When doing so, monitorStream
needs to be stopped to avoid miscounting of store size.
In a test, the stop and reset of store size happened first and was then
followed by storing more messages via monitorStream.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-10 19:01:21 +02:00
Matthias Hanel
7015e46dd9 fix move cancel issue where tags and peers diverge (#3354)
This can happen if the move was initiated by the user.
A subsequent cancel resets the initial peer list.
The original peer list was picked on the old set of tags.
A cancel would then keep the new list of tags but reset
to the old peers. Thus tags and peers diverge.

The problem is that at the time of cancel, the old
placement tags can't be found anymore.

This fix causes cancel to remove the placement tags, if
the old peers do not satisfy the new placement tags.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-10 18:48:18 +02:00
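The rule adopted in 7015e46dd9 (on cancel, drop the placement tags if the restored peers do not satisfy them) can be sketched as below. The `peer` type and the function names are hypothetical, and the check assumes a peer satisfies placement only when it carries every requested tag.

```go
package main

import "fmt"

// peer is a hypothetical stand-in for a server with placement tags.
type peer struct {
	Name string
	Tags []string
}

// hasAll reports whether p carries every tag in want.
func hasAll(p peer, want []string) bool {
	have := make(map[string]bool, len(p.Tags))
	for _, t := range p.Tags {
		have[t] = true
	}
	for _, t := range want {
		if !have[t] {
			return false
		}
	}
	return true
}

// tagsAfterCancel returns the placement tags to keep after a cancelled move:
// the new tags if every restored peer satisfies them, otherwise none.
func tagsAfterCancel(oldPeers []peer, newTags []string) []string {
	for _, p := range oldPeers {
		if !hasAll(p, newTags) {
			return nil
		}
	}
	return newTags
}

func main() {
	old := []peer{{"s1", []string{"cloud:aws"}}, {"s2", []string{"cloud:aws"}}}
	fmt.Println(tagsAfterCancel(old, []string{"cloud:gcp"})) // tags removed
	fmt.Println(tagsAfterCancel(old, []string{"cloud:aws"})) // tags kept
}
```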
Derek Collison
758b733d43 Attempt to improve long RTT catchup time during stream moves.
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-08 11:06:10 -06:00
Matthias Hanel
52c4872666 better error when peer selection fails (#3342)
* better error when peer selection fails

It is pretty hard to diagnose what went wrong when not enough peers for
an operation were found. This change now returns counts of reasons why
peers were discarded.

Changed the error to JSClusterNoPeers as it seems more appropriate
as an error for that operation. Not having enough resources is one of
the conditions for a peer not being considered, but so is having a
non-matching tag, which is why JSClusterNoPeers seems more appropriate.
In addition, JSClusterNoPeers was already used as the error after one call
to selectPeerGroup.

example:
no suitable peers for placement: peer selection cluster 'C' with 3 peers
offline: 0
excludeTag: 1
noTagMatch: 2
noSpace: 0
uniqueTag: 0
misc: 0

Example for MQTT:
mid:12 - "mqtt" - unable to connect: create sessions stream for account "$G":
no suitable peers for placement: peer selection cluster 'MQTT' with 3 peers
        offline: 0
        excludeTag: 0
        noTagMatch: 0
        noSpace: 0
        uniqueTag: 0
        misc: 0
         (10005)

Signed-off-by: Matthias Hanel <mh@synadia.com>

* review comment

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-08-06 00:17:01 +02:00
Derek Collison
28ccaa4371 Direct get across a leafnode using cross domain mappings to a queue subscriber did not work.
The interest moved across the leafnode would be for the mapping, and not the actual qsub.
So when a message is received, if we detect that we are mapped and do not have a queue filter present, make sure to ignore the mapping.
This will allow queue subscriber processing on the local server that received the message from the leafnode.

Signed-off-by: Derek Collison <derek@nats.io>
2022-08-03 20:21:28 -07:00
Derek Collison
748890adb1 Auto-set and upgrade AllowDirect when MaxMsgsPerSubject is set.
Also allow mirrors to inherit properly.

Signed-off-by: Derek Collison <derek@nats.io>
2022-08-03 12:36:52 -07:00
Ivan Kozlovic
38727417df Moving super-cluster tests from cluster tests file to supercluster file
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-07-27 17:14:19 -06:00
Matthias Hanel
d53d2d0484 [Added] account specific monitoring endpoint(s) (#3250)
Added HTTP monitoring endpoint /accstatz.
It responds with a list of statz for all accounts with local connections.
The argument "unused=1" can be provided to get statz for all accounts.

This monitoring endpoint is also exposed as a NATS request via the system account under:
$SYS.REQ.ACCOUNT.*.STATZ
Each server will respond with connection statistics for the requested
account. The format of the data section is a list (size 1) identical to the event
$SYS.ACCOUNT.%s.SERVER.CONNS which is sent periodically as well as on
connect/disconnect. Unless requested by options, server without the account,
or server where the account has no local connections, will not respond.

A PING endpoint exists as well. The response format is identical to
$SYS.REQ.ACCOUNT.*.STATZ
(however the data section will contain more than one account, if they exist)
In addition to general filter options the request takes a list of accounts and
an argument to include accounts without local connections (disabled by default)
$SYS.REQ.ACCOUNT.PING.STATZ

Each account has a new system account import where the local subject
$SYS.REQ.ACCOUNT.PING.STATZ essentially responds as if
the importing account name was used for $SYS.REQ.ACCOUNT.*.STATZ

The only difference between requesting ACCOUNT.PING.STATZ from within
the system account and from a regular account is that the latter can only
retrieve statz for the account the client requests from.

Also exposed the monitoring /healthz via the system account under
$SYS.REQ.SERVER.*.HEALTHZ
$SYS.REQ.SERVER.PING.HEALTHZ
No dedicated options are available for these.
HEALTHZ also accepts general filter options.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-07-12 21:50:32 +02:00
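The endpoints listed in d53d2d0484 can be addressed with a few small helpers. This sketch only builds the URL and subject strings named in the commit message; it sends nothing. The base address `http://localhost:8222` and the idea that the `*` token is filled with an account name or server ID are assumptions, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// Subject taken from the commit message.
const accountPingStatzSubject = "$SYS.REQ.ACCOUNT.PING.STATZ"

// accountStatzSubject fills the wildcard token with an account name.
func accountStatzSubject(account string) string {
	return fmt.Sprintf("$SYS.REQ.ACCOUNT.%s.STATZ", account)
}

// serverHealthzSubject fills the wildcard token with a server ID (assumed).
func serverHealthzSubject(serverID string) string {
	return fmt.Sprintf("$SYS.REQ.SERVER.%s.HEALTHZ", serverID)
}

// accstatzURL builds the HTTP monitoring URL; includeUnused adds "unused=1"
// to request statz for all accounts, not only those with local connections.
func accstatzURL(base string, includeUnused bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/accstatz"
	if includeUnused {
		q := u.Query()
		q.Set("unused", "1")
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

func main() {
	// The monitoring address below is an assumption for illustration.
	u, err := accstatzURL("http://localhost:8222", true)
	fmt.Println(u, err)
	fmt.Println(accountStatzSubject("ACC"))
	fmt.Println(accountPingStatzSubject)
	fmt.Println(serverHealthzSubject("SRVID"))
}
```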
Matthias Hanel
f0ee56cf0a Fix unique_tag issue with stream replica increase
When increasing the replica count, unique tags for already existing peers
were ignored, which could lead to bad placement.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-07-07 21:22:55 +02:00
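The problem fixed in f0ee56cf0a can be pictured as a selection loop that must seed its set of used unique-tag values from the peers a stream already has. The sketch below is not the server's peer selection code; the `server` type, the function names, and the treatment of the unique tag as a prefix such as `az:` are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// server is a hypothetical stand-in for a placement candidate.
type server struct {
	Name string
	Tags []string
}

// uniqueValue returns the first tag of s that starts with prefix.
func uniqueValue(s server, prefix string) (string, bool) {
	for _, t := range s.Tags {
		if strings.HasPrefix(t, prefix) {
			return t, true
		}
	}
	return "", false
}

// selectAdditional picks up to n candidates whose unique tag value is not
// already taken, counting the values of the existing peers as taken.
func selectAdditional(existing, candidates []server, prefix string, n int) []server {
	used := make(map[string]bool)
	for _, s := range existing { // the step that must not be skipped
		if v, ok := uniqueValue(s, prefix); ok {
			used[v] = true
		}
	}
	var picked []server
	for _, c := range candidates {
		if len(picked) == n {
			break
		}
		v, ok := uniqueValue(c, prefix)
		if !ok || used[v] {
			continue
		}
		used[v] = true
		picked = append(picked, c)
	}
	return picked
}

func main() {
	existing := []server{{"A", []string{"az:1"}}}
	candidates := []server{{"B", []string{"az:1"}}, {"C", []string{"az:2"}}}
	fmt.Println(selectAdditional(existing, candidates, "az:", 1))
}
```

Without the seeding loop, server B would be chosen and two replicas would share the same `az:1` value.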
Derek Collison
47bef915ed Allow all members of a replicated stream to participate in direct access.
We will wait until a non-leader replica is current to subscribe.

Signed-off-by: Derek Collison <derek@nats.io>
2022-07-03 11:08:24 -07:00
Derek Collison
4075721651 Allow direct msg get for a stream to operate in a queue group and allow mirrors to opt in to the same group.
Signed-off-by: Derek Collison <derek@nats.io>
2022-07-02 14:16:55 -07:00
Ivan Kozlovic
53e3c53d96 [FIXED] JetStream: consumer with deliver new may miss messages
This could happen when a consumer had not sent anything to the
attached NATS subscription and there was a consumer leader
step down or server restart.

Signed-off-by: Derek Collison <derek@nats.io>
2022-05-23 12:01:48 -06:00
Derek Collison
f702e279ab Fix for a consumer recovery issue.
Also update healthz to check all assets that are assigned, not just running.

Signed-off-by: Derek Collison <derek@nats.io>
2022-04-26 19:22:19 -07:00
Ivan Kozlovic
06ff4b2b29 Split JS cluster and super clusters tests and compile only on 1.16
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-04-26 16:24:05 -06:00