Commit Graph

278 Commits

Author SHA1 Message Date
Waldemar Quevedo
286a1632ca Use monotonic time for measuring time internally
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-05-12 08:27:46 -07:00
Derek Collison
d107ba3549 Under certain scenarios we have witnessed healthz() that never retrun healthy due to a stream or consumer being missing or stopped.
This will now allow the healthy call to attempt to restart those assets.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-28 17:11:08 -07:00
Derek Collison
cae91b8cad In single server mode healthz could mistake a snapshot staging directory during a restore as an account.
If the restore took a long time, stalled, or was aborted, would cause healthz to fail.

Signed-off-by: Derek Collison <derek@nats.io>
2023-04-24 22:14:04 -07:00
Waldemar Quevedo
d12152c48f Add server name / remote server name to routez
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-04-14 12:47:00 -07:00
Derek Collison
c16915bff4 For checking the health of jetstream, do not hold the lock as we traverse the streams and consumers.
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-06 11:56:55 -07:00
Derek Collison
59175c491f Fix for a datarace
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-03 14:46:57 -07:00
Derek Collison
ff3f102cdd Fix for datarace in healthcheck
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-02 16:30:13 -07:00
Derek Collison
4b8229ee42 Do not hold js lock for health check, use healthy not current for meta
Signed-off-by: Derek Collison <derek@nats.io>
2023-04-02 03:52:54 -07:00
Derek Collison
027f2e42c8 Remove snapshot of cores and maxprocs
Signed-off-by: Derek Collison <derek@nats.io>
2023-03-17 15:09:50 -07:00
Jeremy Saenz
26f241cb62 Updated LEAFZ names to use remoteServer name/id and added is_spoke 2023-02-28 18:09:24 -08:00
Jeremy Saenz
9d4a603aaf Update LEAFZ to include leafnode server/connection name 2023-02-28 14:20:18 -08:00
Waldemar Quevedo
74b703549d Add raft query parameter to /jsz to include raft group info
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-02-27 05:42:11 -08:00
Neil Twigg
68961ffedd Refactor ipQueue to use generics, reduce allocations 2023-02-21 14:50:09 +00:00
Neil Twigg
83932b4be6 Don't mark a clustered stream as unhealthy if making forward progress, add TestJetStreamClusterCurrentVsHealth 2023-01-26 16:57:34 +00:00
Derek Collison
2aeb5e2c5a Update snapshots to numCores and maxProcs after maxrocs.Set()
Signed-off-by: Derek Collison <derek@nats.io>
2023-01-20 11:30:43 -08:00
Derek Collison
713f632fa7 If a stream's meta was not properly written but the file existed, we could re-add the stream but a subsequent restart would lose the stream again.
Also added in healthz for single server systems to make sure all stream directories resulted in recovered streams.

Signed-off-by: Derek Collison <derek@nats.io>
2022-12-29 20:08:56 -08:00
Byron Ruth
1477b675ff Add back existing HealthzOptions.JSEnabled field
This fixes a backwards incompat change for library usage as well as
using the healthz NATS API which depends on the JSON payload.

Signed-off-by: Byron Ruth <byron@nats.io>
2022-12-26 08:45:22 -05:00
Byron Ruth
566d1adfa7 Fix /healthz?js-enabled=true behavior
When js-enabled is set to true, the condition was only checked if
the `getJetStream()` call returned `nil`. However, if it non-nil,
all remaining checks were executed, including assessing the health
of the assets (streams and consumers).

This change addresses two issues:

- Switch to use `js.isEnabled()` which will check whether the value
  is nil OR `js.disabled = true` which can occur if the subsystem
  is temporarily disabled (insufficient resources).
- Correctly exit the check after the assertion and before meta and
  asset checks are performed.

In addition, the option has been renamed to `js-enabled-only` to align
with the `js-server-only` naming. The previous `js-enabled` name still
works, but is mapped to this new option. A warning is emitted noting
the previous option is deprecated.

Fix #3703

Signed-off-by: Byron Ruth <b@devel.io>
2022-12-10 07:34:32 -05:00
Raymond
4d8964e57b Added stream created timestamp to stream detail 2022-11-17 13:59:58 +01:00
Ivan Kozlovic
170ff49837 [ADDED] JetStream: peer (the hash of server name) in statsz/jsz
A request to `$SYS.REQ.SERVER.PING.JSZ` would now return something
like this:
```
...
    "meta_cluster": {
      "name": "local",
      "leader": "A",
      "peer": "NUmM6cRx",
      "replicas": [
        {
          "name": "B",
          "current": true,
          "active": 690369000,
          "peer": "b2oh2L6w"
        },
        {
          "name": "Server name unknown at this time (peerID: jZ6RvVRH)",
          "current": false,
          "offline": true,
          "active": 0,
          "peer": "jZ6RvVRH"
        }
      ],
      "cluster_size": 3
    }
```
Note the "peer" field following the "leader" field that contains
the server name. The new field is the node ID, which is a hash of
the server name.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-09-16 15:31:37 -06:00
Waldemar Quevedo
46d73eddae js: add per account reserved mem/store bytes
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2022-09-06 16:43:10 -07:00
Ivan Kozlovic
03ac1f256f Update based on code review
- Change finger_prints to cert_sha256 and use hex.EncodeToString
- Add spki_sha256 for RawSubjectPublicKeyInfo with hex.EncodeToString

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-24 14:16:37 -06:00
Ivan Kozlovic
d2784589a0 Change json tag name to finger_prints
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-22 12:40:20 -06:00
Ivan Kozlovic
951b7c38f6 [ADDED] Monitoring: TLS Peer Certificates in Connz when auth is on
Add basic peer certificates information in /connz endpoint when
the "auth" option is provided.

Resolves #3317

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-22 11:48:49 -06:00
Ivan Kozlovic
5d3ee8ebf4 [FIXED] Gateway: possible panic if monitor endpoint inspected too soon
The monitoring http server is started early and the gateway setup
(when configured) may not be fully ready when the `/gatewayz`
endpoint is inspected and could cause a panic.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-08-17 13:30:58 -06:00
Ivan Kozlovic
a4bf4e87f6 Merge pull request #3326 from mfaizanse/health_endpoint_params
Added param options to /healthz endpoint
2022-08-09 08:49:22 -06:00
Muhammad Faizan
1634f33de7 Added param options to /healthz endpoint 2022-08-09 08:32:54 +02:00
Derek Collison
2120be6476 nit: Cap stats
Signed-off-by: Derek Collison <derek@nats.io>
2022-08-05 07:52:23 -07:00
Matthias Hanel
d53d2d0484 [Added] account specific monitoring endpoint(s) (#3250)
Added http monitoring endpoint /accstatz
It responds with a list of statz for all accounts with local connections
the argument "unused=1" can be provided to get statz for all accounts
This endpoint is also exposed as nats request under:

This monitoring endpoint is exposed via the system account.
$SYS.REQ.ACCOUNT.*.STATZ
Each server will respond with connection statistics for the requested
account. The format of the data section is a list (size 1) identical to the event
$SYS.ACCOUNT.%s.SERVER.CONNS which is sent periodically as well as on
connect/disconnect. Unless requested by options, server without the account,
or server where the account has no local connections, will not respond.

A PING endpoint exists as well. The response format is identical to
$SYS.REQ.ACCOUNT.*.STATZ
(however the data section will contain more than one account, if they exist)
In addition to general filter options the request takes a list of accounts and
an argument to include accounts without local connections (disabled by default)
$SYS.REQ.ACCOUNT.PING.STATZ

Each account has a new system account import where the local subject
$SYS.REQ.ACCOUNT.PING.STATZ essentially responds as if
the importing account name was used for $SYS.REQ.ACCOUNT.*.STATZ

The only difference between requesting ACCOUNT.PING.STATZ from within
the system account and an account is that the later can only retrieve
statz for the account the client requests from.

Also exposed the monitoring /healthz via the system account under
$SYS.REQ.SERVER.*.HEALTHZ
$SYS.REQ.SERVER.PING.HEALTHZ
No dedicated options are available for these.
HEALTHZ also accept general filter options.

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-07-12 21:50:32 +02:00
Derek Collison
e6479dafd2 Close leafnode connection when same cluster name detected
Signed-off-by: Derek Collison <derek@nats.io>
2022-06-30 15:34:22 -07:00
Ivan Kozlovic
5261d98781 [ADDED] Monitoring: Routez's individual route has now more info
Added Start, LastActivity, Uptime and Idle that we normally have
in a Connz for non route connections. This info can be useful
to determine if a route is recent, etc..

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-05-18 13:18:53 -06:00
Ivan Kozlovic
5050092468 [FIXED] JetStream: possible lock inversion
When updating usage, there is a lock inversion in that the jetStream
lock was acquired while under the stream's (mset) lock, which is
not correct. Also, updateUsage was locking the jsAccount lock, which
again, is not really correct since jsAccount contains streams, so
it should be jsAccount->stream, not the other way around.

Removed the locking of jetStream to check for clustered state since
js.clustered is immutable.

Replaced using jsAccount lock to update usage with a dedicated lock.

Originally moved all the update/limit fields in jsAccount to new
structure to make sure that I would see all code that is updating
or reading those fields, and also all functions so that I could
make sure that I use the new lock when calling these. Once that
works was done, and to reduce code changes, I put the fields back
into jsAccount (although I grouped them under the new usageMu mutex
field).

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-05-02 09:50:32 -06:00
Derek Collison
f702e279ab Fix for a consumer recovery issue.
Also update healthz to check all assets that are assigned, not just running.

Signed-off-by: Derek Collison <derek@nats.io>
2022-04-26 19:22:19 -07:00
Ivan Kozlovic
50c3986863 [FIXED] JetStream stream catchup issues
- A stream could become leader when it should not, causing
messages to be lost.
- A catchup could stall because the server sending data
could bail out of the runCatchup routine but still send
the EOF signal.
- Deadlock with monitoring of Jsz

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-04-12 16:05:12 -06:00
Ivan Kozlovic
9e6f965913 [ADDED] LeafNode min_version new option
If set, a server configured to accept leafnode connections will
reject a remote server whose version is below that value. Note
that servers prior to v2.8.0 are not sending their version
in the CONNECT protocol, which means that anything below 2.8.0
would be rejected.

Configuration example:
```
leafnodes {
    port: 7422
    min_version: 2.8.0
}
```
The option is a string and can have the "v" prefix:
```
min_version: "v2.9.1"
```
Note that although suffix such as `-beta` would be accepted,
only the major, minor and update are used for the version comparison.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-04-06 18:40:33 -06:00
Ivan Kozlovic
14f54b8dd7 [ADDED] Monitoring: MQTT and Websocket blocks in /varz endpoint
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-04-04 10:11:55 -06:00
Ivan Kozlovic
34650e9dd5 Fixed data race and some flappers
Data race that has been seen:
```
Read at 0x00c00134bec0 by goroutine 159:
  github.com/nats-io/nats-server/v2/server.(*client).msgHeaderForRouteOrLeaf()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:2935 +0x254
  github.com/nats-io/nats-server/v2/server.(*client).processMsgResults()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:4364 +0x2147
(...)
Previous write at 0x00c00134bec0 by goroutine 201:
  github.com/nats-io/nats-server/v2/server.(*Server).addRoute()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/route.go:1475 +0xdb4
  github.com/nats-io/nats-server/v2/server.(*client).processRouteInfo()
      /home/travis/gopath/src/github.com/nats-io/nats-server/server/route.go:641 +0x1704
```

Also fixed some flappers and removed use of `s.js.` since we have
already captured `js` in Jsz monitoring.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-03-31 10:05:34 -06:00
R.I.Pienaar
4c4aa3e87f skips jsz on non js machines when leader only requested
This is a regression introduced in 055703f4fa
that leads to panics in management tooling

Signed-off-by: R.I.Pienaar <rip@devco.net>
2022-03-31 12:02:09 +02:00
Samuel Torres
9868bb71a7 Add logs to healthcheck handler
Kubernetes probes don't use nor log the reponse body of health
endpoints. This means that for some reason a nats node running in
Kubernetes becomes on a Not Ready state we won't have a way to know why
other than to manually access the cluster and call the /healthz endpoint
manually and see the error.

This change adds an error log so we can observe what is going wrong with
a nats node that is not ready.

Signed-off-by: Samuel Torres <samuel.torres@form3.tech>
2022-03-30 14:14:22 +01:00
Matthias Hanel
0c5f3688a7 [ADDED] Tiered limits and fix limit issues on updates (#2945)
* Adding tiered limits and fix limit issues on updates

Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-03-28 20:47:54 -04:00
R.I.Pienaar
055703f4fa ensures the cluster info in jsz is sent from the leader only
The data from other nodes are usually wrong, this can be quite
confusing for users so we now only send it when we are the leader

Signed-off-by: R.I.Pienaar <rip@devco.net>
2022-03-25 18:27:35 +01:00
Ivan Kozlovic
a23b1b73ef Merge pull request #2931 from nats-io/ipq_changes
Changes to IPQueues
2022-03-17 19:13:02 -06:00
Ivan Kozlovic
c3da392832 Changes to IPQueues
Removed the warnings, instead have a sync.Map where they are
registered/unregistered and can be inspected with an undocumented
monitor page.
Added the notion of "in progress" which is the number of messages
that have beend pop()'ed. When recycle() is invoked this count
goes down.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-03-17 17:53:06 -06:00
Derek Collison
fa098f1af0 Show version on main monitoring page with link to source
Signed-off-by: Derek Collison <derek@nats.io>
2022-03-17 11:04:11 -07:00
Ivan Kozlovic
2c0f5046f1 Merge pull request #2923 from nats-io/gw_detect_duplicate_srv_name
[CHANGED] Gateway: Detect duplicate names between clusters
2022-03-17 10:57:08 -06:00
Derek Collison
287b567b1c Add consumer check to healthz and allow to be called directly
Signed-off-by: Derek Collison <derek@nats.io>
2022-03-16 20:52:31 -07:00
Ivan Kozlovic
63c750e295 [CHANGED] Gateway: Detect duplicate names between clusters
Gateway connection will be closed and error reported if a remote
has a name that is a duplicate of the local cluster.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2022-03-15 15:00:13 -06:00
Matthias Hanel
d0c183106a Fixed lock inversion by not using account lock to get the name
Signed-off-by: Matthias Hanel <mh@synadia.com>
2022-03-07 21:22:41 -05:00
Derek Collison
037e3c6bbe Spiffied up monitoring landing page a bit
Signed-off-by: Derek Collison <derek@nats.io>
2022-03-05 09:18:07 -08:00
Ivan Kozlovic
7f81f2d4c6 Merge pull request #2816 from nats-io/revocation-issue-442
Fix jwt based user/activation token revocation and granularity
2022-01-25 13:42:14 -07:00