
7931 Commits

Author SHA1 Message Date
Neil
d474e3b725 [ADDED] $SYS server request to 'kick' or 'LDM' a client connection (#4298)
- [X] Link to issue, e.g. `Resolves #NNN`
- [X] Branch rebased on top of current main (`git pull --rebase origin
main`)
- [ ] Changes squashed to a single commit (described
[here](http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html))
 - [x] Build is green in Travis CI
- [X] You have certified that the contribution is your original work and
that you license the work to the project under the [Apache 2
license](https://github.com/nats-io/nats-server/blob/main/LICENSE)

Resolves #1556

### Changes proposed in this pull request:

Adds two new $SYS server API endpoints:

- `$SYS.REQ.SERVER.%s.KICK` (where %s is the server_id), which 'kicks' a
client connection (effectively a 'rebalance', as the client application
reconnects right away, potentially to another server in the cluster). The
service takes a JSON payload containing either an "id" or a "name" field.
"id" disconnects the client connection with that id, "name" disconnects
_all_ of the clients connected to the server with that name.

- `$SYS.REQ.SERVER.%s.LDM` (where %s is the server_id), which takes a JSON
payload containing either an "id" or a "name" field. "id" sends an LDM
Info message to the client connection with that id, "name" sends an LDM
Info message to _all_ of the clients connected to the server with that name.

These features allow administrators to manually 're-balance' client
connections between the servers in the cluster (e.g. after a rolling
upgrade of the servers where one server ends up with no client
connections after the upgrade), by kicking some of the client
connections from one of the 'overloaded' (in comparison to other
servers) servers in the cluster, causing them to re-establish their
connection to (hopefully) another server.
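
As a rough illustration of the two requests described above, the following Go sketch builds the request subject and the JSON body. It uses only the standard library and does not send anything. The type and function names are hypothetical, the numeric type of "id" is an assumption, and rejecting a payload that sets both fields (or neither) is this sketch's own choice, not something the PR states.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clientTarget is a hypothetical request body: set exactly one of ID or Name.
// The uint64 type for "id" is an assumption made for this sketch.
type clientTarget struct {
	ID   uint64 `json:"id,omitempty"`
	Name string `json:"name,omitempty"`
}

// sysSubject fills the server ID into the subject templates quoted above.
// op is expected to be "KICK" or "LDM".
func sysSubject(serverID, op string) string {
	return fmt.Sprintf("$SYS.REQ.SERVER.%s.%s", serverID, op)
}

// payload encodes the target as JSON and rejects a target that sets
// both fields or neither (a choice made by this sketch).
func payload(t clientTarget) ([]byte, error) {
	if (t.ID == 0) == (t.Name == "") {
		return nil, fmt.Errorf("set exactly one of id or name")
	}
	return json.Marshal(t)
}

func main() {
	body, err := payload(clientTarget{Name: "orders-worker"})
	if err != nil {
		panic(err)
	}
	// The subject and body would be handed to a NATS client request call.
	fmt.Println(sysSubject("SERVER_ID", "KICK"), string(body))
}
```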
2023-08-11 09:39:42 +01:00
Jean-Noël Moyne
fc41ab1a5a Adds LDM and KICK server $SYS requests
Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>
2023-08-10 17:08:09 -07:00
Waldemar Quevedo
37d3220dfb test: fixes for TestLeafNodeSlowConsumer (#4388)
It would fail sometimes locally otherwise...
```
=== RUN   TestLeafNodeSlowConsumer
    leafnode_test.go:7069: got: 0, expected: 1
--- FAIL: TestLeafNodeSlowConsumer (0.29s)
=== RUN   TestLeafNodeSlowConsumer
    leafnode_test.go:7069: got: 0, expected: 1
--- FAIL: TestLeafNodeSlowConsumer (0.28s)
=== RUN   TestLeafNodeSlowConsumer
--- PASS: TestLeafNodeSlowConsumer (0.28s)
=== RUN   TestLeafNodeSlowConsumer
    leafnode_test.go:7069: got: 0, expected: 1
--- FAIL: TestLeafNodeSlowConsumer (0.28s)
=== RUN   TestLeafNodeSlowConsumer
--- PASS: TestLeafNodeSlowConsumer (0.28s)
```
2023-08-10 01:12:21 -07:00
Waldemar Quevedo
9e1e92e325 test: update TestWSTLSVerifyClientCert for go1.21 (#4387)
Similar to #4380 , this TLS error message changed in Go 1.21.
2023-08-10 01:05:38 -07:00
Waldemar Quevedo
f16582e2a4 test: update TestWSTLSVerifyClientCert for go1.21
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-09 21:50:46 -07:00
Waldemar Quevedo
7c9ea91296 test: fix TestLeafNodeSlowConsumer flake
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-09 21:35:24 -07:00
Waldemar Quevedo
0f7fa284cc test: fix TestAccountImportSubjectMapping hanging build (#4386)
Added timeout to test to prevent running into go test timeout in case
messages did not arrive on time.

```
=== RUN   TestAccountImportSubjectMapping
panic: test timed out after 30m0s
goroutine 85 [chan receive, 29 minutes]:
github.com/nats-io/nats-server/v2/test.TestAccountImportSubjectMapping(0xc000007d40)
	/workspace/build/buildkite/synadia/nats-server-go-v1-21/test/accounts_cycles_test.go:466 +0x5d2
testing.tRunner(0xc000007d40, 0x11e1818)
	/usr/local/go/src/testing/testing.go:1595 +0x239
created by testing.(*T).Run in goroutine 1
	/usr/local/go/src/testing/testing.go:1648 +0x82b
```
2023-08-09 20:13:10 -07:00
Waldemar Quevedo
05e2fa9373 test: fix TestAccountImportSubjectMapping hanging build
Added timeout to test to prevent running into go test timeout
in case messages did not arrive on time.

Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-09 19:49:24 -07:00
Waldemar Quevedo
8ad592a20d test: fix TestNoRaceJetStreamMemstoreWithLargeInteriorDeletes flake (#4385)
The snapshot can sometimes take just over 50ms, but it has never been
observed to take longer than 90ms:

```
=== RUN   TestNoRaceJetStreamMemstoreWithLargeInteriorDeletes
    norace_test.go:4078: Took too long to snapshot: 50.838542ms
--- FAIL: TestNoRaceJetStreamMemstoreWithLargeInteriorDeletes (7.64s)
=== RUN   TestNoRaceJetStreamMemstoreWithLargeInteriorDeletes
    norace_test.go:4078: Took too long to snapshot: 50.920709ms
--- FAIL: TestNoRaceJetStreamMemstoreWithLargeInteriorDeletes (7.06s)
=== RUN   TestNoRaceJetStreamMemstoreWithLargeInteriorDeletes
    norace_test.go:4078: Took too long to snapshot: 62.469125ms
--- FAIL: TestNoRaceJetStreamMemstoreWithLargeInteriorDeletes (6.25s)
=== RUN   TestNoRaceJetStreamMemstoreWithLargeInteriorDeletes
    norace_test.go:4078: Took too long to snapshot: 69.397834ms
--- FAIL: TestNoRaceJetStreamMemstoreWithLargeInteriorDeletes (6.49s)
=== FAIL: server TestNoRaceJetStreamMemstoreWithLargeInteriorDeletes (5.66s)
    norace_test.go:4078: Took too long to snapshot: 81.595512ms
```
2023-08-09 17:42:08 -07:00
Waldemar Quevedo
3cec8dc451 test: fix TestNoRaceJetStreamMemstoreWithLargeInteriorDeletes flake
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-09 13:33:48 -07:00
Waldemar Quevedo
5c538671f7 test: delay slightly between filestore test permutations (#4382)
This is to try to prevent tests from failing when they access the
tempdir while it is being torn down.
(go issue: https://github.com/golang/go/issues/43547)

```
=== RUN   TestFileStoreMsgBlkFailOnKernelFaultLostDataReporting/AES-GCM-S2
    testing.go:1225: TempDir RemoveAll cleanup: unlinkat ./TestFileStoreMsgBlkFailOnKernelFaultLostDataReportingAES-GCM-S23605508670/001/msgs: directory not empty
--- FAIL: TestFileStoreMsgBlkFailOnKernelFaultLostDataReporting (0.02s)
```

Also slightly increases the timeout of `TestFileStoreNewWriteIndexInfo`,
which sometimes runs close to the 1ms deadline:

```
=== FAIL: server TestFileStoreNewWriteIndexInfo/None-None (4.85s)
  | filestore_test.go:5489: Unexpected elapsed time: 1.054065ms
  | --- FAIL: TestFileStoreNewWriteIndexInfo/None-None (4.85s)
```
2023-08-09 13:21:10 -07:00
Waldemar Quevedo
af766b78ce test: bump timeout from TestFileStoreNewWriteIndexInfo to 3ms
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-09 11:16:44 -07:00
Waldemar Quevedo
4625234bba test: delay slightly between filestore test permutations
This is to try to prevent tests from failing when they access
the tempdir while it is being torn down.
(go issue: https://github.com/golang/go/issues/43547)

```
=== RUN   TestFileStoreMsgBlkFailOnKernelFaultLostDataReporting/AES-GCM-S2
    filestore_test.go:5195: ------------> 128
    testing.go:1225: TempDir RemoveAll cleanup: unlinkat ./TestFileStoreMsgBlkFailOnKernelFaultLostDataReportingAES-GCM-S23605508670/001/msgs: directory not empty
--- FAIL: TestFileStoreMsgBlkFailOnKernelFaultLostDataReporting (0.02s)
```

Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-09 10:53:13 -07:00
Neil
f13dd59311 [ADDED] Checking HEALTHZ for specific accounts/stream (#4097)
This PR adds `account` and `stream` options to `HEALTHZ` system request
and `/healthz` monitoring endpoint.
It allows for checking health of a specific stream, without having to
rely on other streams, which (when under stress) may not have reached
consensus yet and would return an error.

Additionally, `HealthzOptions.Details` alters the response, returning an
array of errors containing each skewed stream/consumer, e.g.:

```json
{
    "status": "error",
    "status_code": 500,
    "errors": [
        {
            "type": "STREAM",
            "account": "js",
            "stream": "test:123",
            "error": "JetStream stream js \u003e test:123 is not current"
        },
        {
            "type": "STREAM",
            "account": "js",
            "stream": "test:125",
            "error": "JetStream stream js \u003e test:125 is not current"
        },
        {
            "type": "STREAM",
            "account": "js",
            "stream": "test:42",
            "error": "JetStream stream js \u003e test:42 is not current"
        },
        {
            "type": "STREAM",
            "account": "js",
            "stream": "test:126",
            "error": "JetStream stream js \u003e test:126 is not current"
        },
        {
            "type": "STREAM",
            "account": "js",
            "stream": "test:128",
            "error": "JetStream stream js \u003e test:128 is not current"
        }
    ]
}
```
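
A client consuming the detailed response could decode it along these lines. The struct names and the `laggingStreams` helper are hypothetical. The JSON field names are taken from the sample above. Note that `\u003e` in the sample is simply the JSON escape for `>`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthError mirrors one entry of the "errors" array in the sample above.
type healthError struct {
	Type    string `json:"type"`
	Account string `json:"account"`
	Stream  string `json:"stream"`
	Error   string `json:"error"`
}

// healthStatus mirrors the top-level fields in the sample above.
type healthStatus struct {
	Status     string        `json:"status"`
	StatusCode int           `json:"status_code"`
	Errors     []healthError `json:"errors"`
}

// laggingStreams returns "account/stream" for every STREAM error in body.
func laggingStreams(body []byte) ([]string, error) {
	var hs healthStatus
	if err := json.Unmarshal(body, &hs); err != nil {
		return nil, err
	}
	var out []string
	for _, e := range hs.Errors {
		if e.Type == "STREAM" {
			out = append(out, e.Account+"/"+e.Stream)
		}
	}
	return out, nil
}

func main() {
	body := []byte(`{"status":"error","status_code":500,"errors":[
		{"type":"STREAM","account":"js","stream":"test:123",
		 "error":"JetStream stream js \u003e test:123 is not current"}]}`)
	streams, err := laggingStreams(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(streams)
}
```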
2023-08-09 16:52:54 +01:00
Piotr Piotrowski
27dc50eb8f [ADDED] Filter Healthz results based on stream and consumer names, add `details` param
Signed-off-by: Piotr Piotrowski <piotr@synadia.com>
2023-08-09 16:44:45 +02:00
Waldemar Quevedo
65e8db731c Track slow consumers per connection type (#4330)
Adds a new `slow_consumer_stats` field to varz to get more details about
the types of connections that are becoming slow consumers:

```
"slow_consumer_stats": {
    "clients": 0,
    "routes": 0,
    "gateways": 0,
    "leafs": 0
  }
```
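
For monitoring code that reads varz, the new field could be picked out as in the sketch below. The type names are hypothetical and the counters are assumed to be unsigned integers. Only the JSON keys come from the fragment above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// slowConsumerStats mirrors the four counters in the fragment above.
// uint64 is an assumption about the counter type.
type slowConsumerStats struct {
	Clients  uint64 `json:"clients"`
	Routes   uint64 `json:"routes"`
	Gateways uint64 `json:"gateways"`
	Leafs    uint64 `json:"leafs"`
}

// varzSubset decodes only the new field and ignores the rest of varz.
type varzSubset struct {
	SlowConsumerStats slowConsumerStats `json:"slow_consumer_stats"`
}

// total sums the per-connection-type counters.
func (s slowConsumerStats) total() uint64 {
	return s.Clients + s.Routes + s.Gateways + s.Leafs
}

// parseSlowConsumers extracts the slow consumer counters from a varz body.
func parseSlowConsumers(body []byte) (slowConsumerStats, error) {
	var v varzSubset
	err := json.Unmarshal(body, &v)
	return v.SlowConsumerStats, err
}

func main() {
	body := []byte(`{"slow_consumer_stats":{"clients":2,"routes":0,"gateways":1,"leafs":0}}`)
	stats, err := parseSlowConsumers(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(stats.Clients, stats.total())
}
```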
2023-08-09 06:46:15 -07:00
Waldemar Quevedo
8b7dfe7d74 monitoring: track slow consumers per connection type
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-09 05:57:42 -07:00
Neil
514588935d Allow switching from limits-based to interest-based retention in stream update (#4361)
This should make it possible to switch from limits-based retention to
interest-based retention on an existing stream.

Signed-off-by: Neil Twigg <neil@nats.io>
2023-08-09 12:42:05 +01:00
Neil Twigg
d7f76da597 Allow switching from limits-based to interest-based retention in stream update
Signed-off-by: Neil Twigg <neil@nats.io>
2023-08-09 11:46:49 +01:00
Neil
6eb77fd46b test: fix TestAccountImportCycle flake (#4381)
Add extra flushes to make test more precise and try to avoid timeouts

```
=== RUN   TestAccountImportCycle
    accounts_test.go:3447: require no error, but got: nats: timeout
--- FAIL: TestAccountImportCycle (1.01s)
```
2023-08-09 11:39:52 +01:00
Neil
617d69d6c7 Match --signal PIDs with globular-style expression. (#4370)
When multiple instances are running on the machine, a PID argument
suffixed with a '*' character will signal all matching PIDs.

Example: `nats-server --signal reload=*`

 - [ ] Link to issue, e.g. `Resolves #NNN`
 - [ ] Documentation added (if applicable)
 - [X] Tests added
 - [X] Branch rebased on top of current ~~main~~ dev
- [X] Changes squashed to a single commit (described
[here](http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html))
 - [ ] Build is green in Travis CI
- [X] You have certified that the contribution is your original work and
that you license the work to the project under the [Apache 2
license](https://github.com/nats-io/nats-server/blob/main/LICENSE)
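
The matching rule described above (a trailing '*' selects every PID that starts with the given text) can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: `matchPIDs` is a hypothetical helper, and the list of running PIDs is supplied by the caller rather than discovered from the system.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// matchPIDs returns the PIDs selected by pattern. A trailing '*' selects
// every PID whose decimal form starts with the text before the '*';
// without it, the pattern must equal one PID exactly.
func matchPIDs(pattern string, running []int) []int {
	glob := strings.HasSuffix(pattern, "*")
	prefix := strings.TrimSuffix(pattern, "*")
	var out []int
	for _, pid := range running {
		s := strconv.Itoa(pid)
		if (glob && strings.HasPrefix(s, prefix)) || (!glob && s == prefix) {
			out = append(out, pid)
		}
	}
	return out
}

func main() {
	running := []int{123, 1290, 456}
	fmt.Println(matchPIDs("12*", running)) // PIDs starting with "12"
	fmt.Println(matchPIDs("*", running))   // every PID, as in reload=*
	fmt.Println(matchPIDs("456", running)) // exact match only
}
```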
2023-08-09 11:16:56 +01:00
Neil
1e3e88b528 test: fix TestMQTTTLSVerifyAndMap on Go 1.21 (#4380)
Reported error changed slightly in Go 1.21

```
=== RUN   TestMQTTTLSVerifyAndMap
=== RUN   TestMQTTTLSVerifyAndMap/no_filtering,_client_does_not_provide_cert
    mqtt_test.go:1033: Unexpected error: Error reading: remote error: tls: certificate required
--- FAIL: TestMQTTTLSVerifyAndMap (0.04s)
```
2023-08-09 10:44:50 +01:00
Waldemar Quevedo
14a56e28dd test: fix TestAccountImportCycle flake
add extra flushes to make test more precise

Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-08 23:41:18 -07:00
Waldemar Quevedo
e68c411b74 test: fix TestMQTTTLSVerifyAndMap on Go 1.21
reported error changed slightly in Go 1.21

```
=== RUN   TestMQTTTLSVerifyAndMap
=== RUN   TestMQTTTLSVerifyAndMap/no_filtering,_client_does_not_provide_cert
    mqtt_test.go:1033: Unexpected error: Error reading: remote error: tls: certificate required
--- FAIL: TestMQTTTLSVerifyAndMap (0.04s)
```

Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-08 23:10:29 -07:00
Waldemar Quevedo
6703bd7ee3 test: fix TestFileStoreNewWriteIndexInfo hanging (#4378)
`t.Fatalf` being called while holding a lock would sometimes leave
builds hanging until test timeout.

```
=== RUN   TestFileStoreNewWriteIndexInfo/AES-GCM-None
=== RUN   TestFileStoreNewWriteIndexInfo/AES-GCM-S2
    filestore_test.go:5483: require true, but got false
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
```
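
One likely mechanism, stated here as an assumption: `t.Fatalf` ends the test goroutine via `runtime.Goexit`, so a lock taken without a deferred unlock stays held, and any cleanup that needs the same lock then blocks. A common way to avoid this is to copy the guarded state out under the lock and assert only after releasing it. The sketch below uses a hypothetical `store` type, not the filestore code.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for an object guarded by a mutex.
type store struct {
	mu    sync.Mutex
	dirty bool
}

// isDirty copies the guarded state out and releases the lock before
// returning, so the caller can assert (and fail) without holding mu.
func (s *store) isDirty() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.dirty
}

func main() {
	s := &store{dirty: true}
	ok := s.isDirty()
	// In a test, the assertion (for example t.Fatalf when !ok) would go
	// here, at a point where the lock has already been released.
	fmt.Println(ok, s.mu.TryLock())
}
```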
2023-08-08 17:40:24 -07:00
Waldemar Quevedo
1492cf717f test: fix TestFileStoreNewWriteIndexInfo hanging
t.Fatalf being called while holding a lock would
sometimes leave builds hanging.

Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-08 16:41:15 -07:00
Waldemar Quevedo
961c0d7187 Add Go 1.20 to Travis and Nightly images (#4336)
Picks up https://github.com/nats-io/nats-server/pull/4297 into main.
Includes:

- Using Go 1.20 for the nightly images and Travis tests
- Drop Go 1.18
- Updates to GitHub Actions
- Upgrade to golang-ci
2023-08-08 10:36:23 -07:00
Waldemar Quevedo
0ffd455e32 test: update TestNoRaceJetStreamServiceImportAccountSwapIssue flake (#4376)
Let pull consumer in test fetch messages for slightly longer instead of
at the same time as the producer, to avoid failing due to missing a few
messages:

```
=== RUN   TestNoRaceJetStreamServiceImportAccountSwapIssue
    norace_test.go:1194: Expected to receive 14982 msgs, only got 14981
--- FAIL: TestNoRaceJetStreamServiceImportAccountSwapIssue (3.03s)
```
2023-08-08 02:01:44 -07:00
Tomasz Pietrek
b57675b24d Fix race in consumer create (#4377)
This fixes the race condition in consumer create API by adding a missing
return statement, probably introduced while solving conflicts.

Signed-off-by: Tomasz Pietrek <tomasz@nats.io>
2023-08-08 10:36:57 +02:00
Waldemar Quevedo
b081f8c2ea test: update TestNoRaceJetStreamServiceImportAccountSwapIssue flake
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-08 01:07:19 -07:00
Tomasz Pietrek
54fe8cb14f Fix race in consumer create
Signed-off-by: Tomasz Pietrek <tomasz@nats.io>
2023-08-08 09:16:44 +02:00
Sylvain Rabot
64b2f5b364 Add Go 1.20 to Travis
- Use golang-ci in go test workflow

Signed-off-by: Sylvain Rabot <sylvain@abstraction.fr>
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-07 17:12:20 -07:00
Waldemar Quevedo
ab5eeff1c3 test: bump timeout from TestAccountReloadServiceImportPanic (#4374)
The test can run close to the deadline in Travis, so this bumps the
timeout:

```
=== RUN   TestAccountReloadServiceImportPanic
--- PASS: TestAccountReloadServiceImportPanic (10.60s)
=== RUN   TestAccountReloadServiceImportPanic
    accounts_test.go:3621: Have not received all responses, want 187876 got 182649
--- FAIL: TestAccountReloadServiceImportPanic (14.09s)
```
2023-08-07 17:05:08 -07:00
Waldemar Quevedo
59b82198b6 test: fix TestClusterTLSMixedIPAndDNS test in +go1.20 (#4373)
Test would fail now with the leafnode not being able to connect due to
the following:

```
[4257] [INF] 127.0.0.1:63538 - lid:6 - Leafnode connection created for account: $G 
[4257] [INF] 127.0.0.1:63547 - lid:6 - Leafnode connection created 
[4257] [DBG] 127.0.0.1:63547 - lid:6 - Starting TLS leafnode server handshake
[4257] [DBG] 127.0.0.1:63538 - lid:6 - Starting TLS leafnode client handshake
[4257] [ERR] 127.0.0.1:63538 - lid:6 - TLS leafnode handshake error: x509: certificate is not valid for any names, but wanted to match localhost
[4257] [INF] 127.0.0.1:63538 - lid:6 - Leafnode connection closed: TLS Handshake Failure - Account: $G
[4257] [ERR] 127.0.0.1:63547 - lid:6 - TLS leafnode handshake error: remote error: tls: bad certificate
[4257] [INF] 127.0.0.1:63547 - lid:6 - Leafnode connection closed: TLS Handshake Failure
```
2023-08-07 17:04:27 -07:00
Waldemar Quevedo
2630e9b597 test: bump timeout from TestAccountReloadServiceImportPanic
It can take slightly longer in a testing environment.

Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-07 16:42:12 -07:00
Waldemar Quevedo
9d43fb9606 test: fix TestClusterTLSMixedIPAndDNS test on +go1.20
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-07 15:11:49 -07:00
Jason Volk
9c4ae764a1 Match --signal PIDs with globular-style expression.
When multiple instances are running on the machine, a PID argument suffixed with
a '*' character will signal all matching PIDs.

Example: `nats-server --signal reload=*`

Signed-off-by: Jason Volk <jason@zemos.net>
2023-08-07 10:16:05 -07:00
Derek Collison
6ca7887992 [IMPROVED] Delete blocks performance (#4371)
For the memory store, track deleted messages with a single avl.SeqSet
dmap for now, instead of the old method.

For fileStore, we were trying to be too smart to save space at the
expense of encoding time, so revert to the simple version, which is
sometimes 100x faster.

The size of the encoding may be a bit bigger than we wanted, but we
prefer speed over size.

Signed-off-by: Derek Collison <derek@nats.io>
2023-08-07 09:18:48 -07:00
Waldemar Quevedo
abe0791313 Fixes to service system imports on reload also when using custom system account (#4372)
Adds back the fix from #4369 and also fixes the export that was going
missing in dev branch when a custom system account was being used.
2023-08-07 09:02:48 -07:00
Neil
c3f256ded6 Add consumer api action (#4217)
Add distinction between create and update to consumer API

The server has only one API for consumer management, covering both create
and update. If clients want to guard users against overriding an existing
consumer with a create operation, or accidentally creating one with an
update, they need to rely on calling `Info`.
That adds latency, traffic and load on the server and is still racy,
as state on the server can change between the info and create calls.

This PR adds `Action` to CreateConsumerRequest, which is a non-breaking
change that allows clients to present their intent without splitting the
Consumer API into create and update.

This is not a perfect solution, but it avoids such a split, is not
breaking, and does not require a new API version.
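
The guard that an explicit action makes possible can be sketched as below. All names here are hypothetical stand-ins, not the server's types, and the sketch deliberately ignores details a real implementation may handle, such as treating a create with an identical configuration as a no-op.

```go
package main

import (
	"errors"
	"fmt"
)

// action is a hypothetical stand-in for the Action field described above.
type action int

const (
	actionCreateOrUpdate action = iota // no guard, either outcome is accepted
	actionCreate                       // must not overwrite an existing consumer
	actionUpdate                       // must not create a missing consumer
)

var (
	errExists   = errors.New("consumer already exists")
	errNotFound = errors.New("consumer does not exist")
)

// checkAction decides whether a request may proceed, given whether the
// consumer currently exists. Doing this in one place on the server side
// avoids the race between a separate Info call and the create call.
func checkAction(a action, exists bool) error {
	switch {
	case a == actionCreate && exists:
		return errExists
	case a == actionUpdate && !exists:
		return errNotFound
	}
	return nil
}

func main() {
	fmt.Println(checkAction(actionCreate, true))          // rejected
	fmt.Println(checkAction(actionUpdate, false))         // rejected
	fmt.Println(checkAction(actionCreateOrUpdate, false)) // allowed
}
```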

TODO:
- [x] Add concrete error types to errors.json and use them
- [ ] Add ADR (after LGTM)

Signed-off-by: Tomasz Pietrek <tomasz@nats.io>
2023-08-07 10:55:57 +01:00
Jean-Noël Moyne
2d5c5d68ce Adds a few tests to verify that addConsumerWithAction also works for named ephemeral consumers as well as for durables
Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>
2023-08-07 08:28:21 +02:00
Tomasz Pietrek
d105e68c96 Add consumer api action for create and update
Signed-off-by: Tomasz Pietrek <tomasz@nats.io>
2023-08-07 08:28:21 +02:00
Waldemar Quevedo
6b9008c1f4 Fixes to service imports on reload
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-05 18:21:01 -07:00
Derek Collison
75e1171bdd No longer compacting multiple blocks, so remove test check
Signed-off-by: Derek Collison <derek@nats.io>
2023-08-05 13:20:38 -07:00
Derek Collison
3b235059fa We were trying to be too smart to save space at the expense of encoding time for filestore.
Revert back to very simple but way faster method. Sometimes 100x faster and only ~8% size increase.

Signed-off-by: Derek Collison <derek@nats.io>
2023-08-05 12:33:30 -07:00
Derek Collison
1f00d0e3f2 Track deleted with single avl.SeqSet dmap for now vs old method.
Size of encoding may be a bit bigger than we wanted, but still way better than old method and very fast.

Signed-off-by: Derek Collison <derek@nats.io>
2023-08-05 12:32:29 -07:00
Waldemar Quevedo
0e7394a788 Remove reload fix from main (#4369)
The fix from #4360 will not work for v2.10 branch features so removing
from dev and working on a different PR.
2023-08-04 17:29:54 -07:00
Waldemar Quevedo
eecb8af997 Remove reload fix from main
This workaround will not work for v2.10 branch features

Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-08-04 16:57:39 -07:00
Derek Collison
c0c9633024 Fix for flapping test
Signed-off-by: Derek Collison <derek@nats.io>
2023-08-04 15:13:44 -07:00
Derek Collison
20532c28dd Merge branch 'main' into dev
2023-08-04 12:03:13 -07:00