5718 Commits

Author SHA1 Message Date
Ivan Kozlovic
49907c4537 [FIXED] Configuration Reload: possible panic if done during Shutdown
If a configuration reload is issued as the server is being shut down,
we could get 2 different panics. One due to JetStream if an account
is JetStream enabled, and one due to the send to a go channel that
has been closed.

```
panic: send on closed channel [recovered]
        panic: send on closed channel

goroutine 440 [running]:
testing.tRunner.func1.2({0x1038d58e0, 0x1039e1270})
        /usr/local/go/src/testing/testing.go:1545 +0x274
testing.tRunner.func1()
        /usr/local/go/src/testing/testing.go:1548 +0x448
panic({0x1038d58e0?, 0x1039e1270?})
        /usr/local/go/src/runtime/panic.go:920 +0x26c
github.com/nats-io/nats-server/v2/server.(*Server).reloadAuthorization(0xc00024fb00)
        /Users/ik/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1998 +0x788
github.com/nats-io/nats-server/v2/server.(*Server).applyOptions(0xc00024fb00, 0xc00021dc00, {0xc00038e4e0, 0x2, 0xc00021dc28?})
        /Users/ik/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1746 +0x2b8
github.com/nats-io/nats-server/v2/server.(*Server).reloadOptions(0xc000293500?, 0xc000118a80, 0xc000293500)
        /Users/ik/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1121 +0x178
github.com/nats-io/nats-server/v2/server.(*Server).ReloadOptions(0xc00024fb00, 0xc000293500)
        /Users/ik/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1060 +0x368
github.com/nats-io/nats-server/v2/server.(*Server).Reload(0xc00024fb00)
        /Users/ik/dev/go/src/github.com/nats-io/nats-server/server/reload.go:995 +0x104
```

and

```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x10077b224]

goroutine 8 [running]:
testing.tRunner.func1.2({0x101351640, 0x101b7d2a0})
	/usr/local/go/src/testing/testing.go:1545 +0x274
testing.tRunner.func1()
	/usr/local/go/src/testing/testing.go:1548 +0x448
panic({0x101351640?, 0x101b7d2a0?})
	/usr/local/go/src/runtime/panic.go:920 +0x26c
github.com/nats-io/nats-server/v2/server.(*Account).EnableJetStream(0xc00020fb80, 0xc000220240)
	/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/jetstream.go:1045 +0xa4
github.com/nats-io/nats-server/v2/server.(*Server).configJetStream(0xc000226d80, 0xc00020fb80)
	/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/jetstream.go:707 +0xdc
github.com/nats-io/nats-server/v2/server.(*Server).configAllJetStreamAccounts(0xc000226d80)
	/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/jetstream.go:768 +0x2b0
github.com/nats-io/nats-server/v2/server.(*Server).enableJetStreamAccounts(0xc000226d80?)
	/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/jetstream.go:637 +0x128
github.com/nats-io/nats-server/v2/server.(*Server).reloadAuthorization(0xc000226d80)
	/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/reload.go:2039 +0x93c
github.com/nats-io/nats-server/v2/server.(*Server).applyOptions(0xc000226d80, 0xc000171c50, {0xc000074600, 0x2, 0xc000171c78?})
	/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1746 +0x2b8
github.com/nats-io/nats-server/v2/server.(*Server).reloadOptions(0xc000276000?, 0xc0000a6000, 0xc000276000)
	/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1121 +0x178
github.com/nats-io/nats-server/v2/server.(*Server).ReloadOptions(0xc000226d80, 0xc000276000)
	/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1060 +0x368
github.com/nats-io/nats-server/v2/server.(*Server).Reload(0xc000226d80)
	/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/reload.go:995 +0x104
```
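
The first panic comes from sending on a channel that shutdown has already closed. A minimal sketch of one way to avoid it is shown here. The `notifier` type and its methods are hypothetical and are not the server's actual code; the idea is only that the closed flag is checked under the same lock that guards `close`.

```go
package main

import (
	"fmt"
	"sync"
)

// notifier is a hypothetical stand-in for a component that owns a channel
// which is closed on shutdown.
type notifier struct {
	mu     sync.Mutex
	closed bool
	ch     chan string
}

func newNotifier() *notifier {
	return &notifier{ch: make(chan string, 1)}
}

// shutdown marks the notifier as closed and closes the channel exactly once.
func (n *notifier) shutdown() {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.closed {
		return
	}
	n.closed = true
	close(n.ch)
}

// trySend reports whether v was queued. Because the flag check and the send
// happen under the lock that also guards close, it cannot send on a closed
// channel.
func (n *notifier) trySend(v string) bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.closed {
		return false
	}
	select {
	case n.ch <- v:
		return true
	default:
		// Buffer full: drop instead of blocking while holding the lock.
		return false
	}
}

func main() {
	n := newNotifier()
	fmt.Println(n.trySend("reload")) // queued before shutdown
	n.shutdown()
	fmt.Println(n.trySend("reload")) // rejected after shutdown, no panic
}
```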

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-10-16 15:25:02 -06:00
Derek Collison
4df6c9aeb8 [ADDED] TLS: Handshake First for client connections (#4642)
A new option instructs the server to perform the TLS handshake first,
that is prior to sending the INFO protocol to the client.

Only clients that implement an equivalent option would be able to connect
if the server runs with this option enabled.

The configuration would look something like this:
```
...
tls {
    cert_file: ...
    key_file: ...

    handshake_first: true
}
```

The same option can be set to "auto" or a Go time duration to fall back
to the old behavior. This is intended for deployments where it is known
that not all clients have been upgraded to a client library providing
the TLS handshake first option.

After the delay has elapsed without receiving the TLS handshake from the
client, the server reverts to sending the INFO protocol so that older
clients can connect. Clients that do connect with the "TLS first" option
will be marked as such in the monitoring's Connz page/result. It will
allow the administrator to keep track of applications still needing to
upgrade.

The configuration would be similar to:
```
...
tls {
    cert_file: ...
    key_file: ...

    handshake_first: auto
}
```
With the above value, the fallback delay used by the server is 50ms.

The duration can be explicitly set, say 300 milliseconds:
```
...
tls {
    cert_file: ...
    key_file: ...

    handshake_first: "300ms"
}
```

It is understood that any configuration other than "true" will result in
the server sending the INFO protocol after the elapsed amount of time
without the client initiating the TLS handshake. Therefore, for
administrators that do not want any data transmitted in plain text, the
value must be set to "true" only. It will require applications to be
updated to a library that provides the option, which may or may not be
readily available.
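
As an illustration of the three forms described above, the following sketch maps a `handshake_first` value to a "handshake first" flag and a fallback delay. The function `parseHandshakeFirst` and the constant `defaultFallback` are hypothetical names for this example, not the server's parsing code, and the sketch assumes the value has already been read as a string. The 50ms default is the one stated above for "auto".

```go
package main

import (
	"fmt"
	"time"
)

// defaultFallback is the delay this message gives for "auto".
const defaultFallback = 50 * time.Millisecond

// parseHandshakeFirst is a hypothetical helper. It returns whether the TLS
// handshake is expected first and, if a fallback applies, how long to wait
// before sending INFO instead. A zero fallback means no fallback.
func parseHandshakeFirst(v string) (bool, time.Duration, error) {
	switch v {
	case "true":
		return true, 0, nil
	case "false":
		return false, 0, nil
	case "auto":
		return true, defaultFallback, nil
	}
	d, err := time.ParseDuration(v)
	if err != nil {
		return false, 0, fmt.Errorf("invalid handshake_first value %q: %w", v, err)
	}
	if d <= 0 {
		return false, 0, fmt.Errorf("handshake_first duration must be positive, got %v", d)
	}
	return true, d, nil
}

func main() {
	for _, v := range []string{"true", "auto", "300ms", "bogus"} {
		first, fallback, err := parseHandshakeFirst(v)
		fmt.Println(v, first, fallback, err)
	}
}
```

Only the "true" case yields a zero fallback, which matches the point above that it is the only setting under which the server never sends INFO in plain text.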

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-10-16 07:49:08 -07:00
Derek Collison
a797f0d794 Add fan-in/out benchmarks (#4660)
Benchmarks for NATS core fan-in and fan-out pattern workloads. 

Signed-off-by: Reuben Ninan <reuben@nats.io>
2023-10-14 10:00:31 -07:00
R.I.Pienaar
d61ecf8a89 Report the raft group name in stream and consumer info
Signed-off-by: R.I.Pienaar <rip@devco.net>
2023-10-14 12:28:36 +03:00
Reuben Ninan
524c1f544a Add fan-in/out benchmarks
Signed-off-by: reubenninan <reuben@nats.io>
2023-10-14 00:56:09 -04:00
Waldemar Quevedo
996bf2bf1c Release v2.10.3
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-10-12 13:46:11 -07:00
Derek Collison
e2414e6a04 Bump to 2.10.3-RC.3
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-12 13:12:19 -07:00
Derek Collison
0a64f18060 Only mark fs as dirty vs full write on mb compaction.
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-12 12:59:19 -07:00
Derek Collison
ea70590aa2 Bump to 2.10.3-RC.2
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-12 12:35:54 -07:00
Derek Collison
b7b40b0a69 Fixed a bug that was not correctly selecting next first because it was not skipping dbit entries.
This could result in lookups failing, e.g. after a change in max msgs per subject to a lower value.

Also fixed a bug that would not properly update psim during compact when throwing away the whole block and a subject had more than one message.

Signed-off-by: Derek Collison <derek@nats.io>
2023-10-12 10:58:37 -07:00
Derek Collison
1e8f6bf1e1 Fix updating a non unique consumer on workqueue stream not returning an error (#4654)
This is a possible fix for #4653.

Changes made:
1. Added tests for creating and updating consumers on a work queue
stream with overlapping subjects.
2. Check for overlapping subjects before
[updating](a25af02c73/server/consumer.go (L770))
the consumer config.
3. Changed [`func (*stream).partitionUnique(partitions []string)
bool`](a25af02c73/server/stream.go (L5269))
to accept the consumer name being checked so we can skip it while
checking for overlapping subjects (Required for
[`FilterSubjects`](a25af02c73/server/consumer.go (L75))
updates), wasn't needed before because the checks were made on creation
only.
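
The idea in item 3 can be sketched as follows. `subjectsOverlap` and `filtersUnique` are hypothetical helpers written for this illustration, not the server's `partitionUnique`; the sketch assumes well-formed subjects where `*` matches one token and `>` appears only as the last token.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectsOverlap reports whether two subjects could both match some literal
// subject. "*" matches exactly one token, ">" matches one or more trailing
// tokens.
func subjectsOverlap(a, b string) bool {
	ta, tb := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(ta) && i < len(tb); i++ {
		if ta[i] == ">" || tb[i] == ">" {
			return true
		}
		if ta[i] != tb[i] && ta[i] != "*" && tb[i] != "*" {
			return false
		}
	}
	return len(ta) == len(tb)
}

// filtersUnique checks the filters of consumer `name` against every other
// consumer's filters. The consumer's own entry is skipped, so an update does
// not conflict with the configuration it is replacing.
func filtersUnique(name string, filters []string, existing map[string][]string) bool {
	for other, otherFilters := range existing {
		if other == name {
			continue
		}
		for _, f := range filters {
			for _, of := range otherFilters {
				if subjectsOverlap(f, of) {
					return false
				}
			}
		}
	}
	return true
}

func main() {
	existing := map[string][]string{
		"c1": {"orders.*"},
		"c2": {"events.>"},
	}
	// Updating c1 only overlaps with c1's own filter, so it is allowed.
	fmt.Println(filtersUnique("c1", []string{"orders.new"}, existing))
	// A different consumer with the same filter overlaps with c1.
	fmt.Println(filtersUnique("c3", []string{"orders.new"}, existing))
}
```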

There's only 1 thing that I'm not sure about.

In the [current work queue stream conflict
checks](a25af02c73/server/consumer.go (L796)),
the consumer config `Direct` is being checked if `false`, should we also
make this check before the update?

Signed-off-by: Pierre Mdawar <pierre@mdawar.dev>
2023-10-12 07:27:27 -07:00
Pierre Mdawar
c46d8093bc Fix updating a non unique consumer on workqueue stream not returning an error
2023-10-12 12:18:24 +03:00
Derek Collison
38794e5af9 Bump to 2.10.3-RC.1
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-11 08:26:09 -07:00
Derek Collison
94545f3206 [FIXED] Compaction with compression and added out of band compaction (#4645)
This will also reclaim more space for streams with lots of interior
deletes.


Signed-off-by: Derek Collison <derek@nats.io>
2023-10-11 08:22:10 -07:00
Derek Collison
842d600e3f Grab blk fn while mb lock held
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-11 07:54:36 -07:00
Lev Brouk
de1282c98d Fixed a crash in MQTT outgoing PUBREL
This really was a cut/paste/typo error.

The effect was that when there was a pending PUBREL in JetStream, we would sometimes attempt to deliver it immediately once the client connected, cpending was already initialized, but the pubrel map was not (yet).
2023-10-10 18:08:18 -07:00
Derek Collison
f4387ec74e Fix for compaction with compression and added an out of band compaction in syncBlocks to reclaim more space.
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-10 17:17:55 -07:00
Ivan Kozlovic
ce96de2ed5 [ADDED] TLS: Handshake First for client connections
A new option instructs the server to perform the TLS handshake first,
that is prior to sending the INFO protocol to the client.

Only clients that implement an equivalent option would be able to
connect if the server runs with this option enabled.

The configuration would look something like this:
```
...
tls {
    cert_file: ...
    key_file: ...

    handshake_first: true
}
```

The same option can be set to "auto" or a Go time duration to fall back
to the old behavior. This is intended for deployments where it is known
that not all clients have been upgraded to a client library providing
the TLS handshake first option.

After the delay has elapsed without receiving the TLS handshake from
the client, the server reverts to sending the INFO protocol so that
older clients can connect. Clients that do connect with the "TLS first"
option will be marked as such in the monitoring's Connz page/result.
It will allow the administrator to keep track of applications still
needing to upgrade.

The configuration would be similar to:
```
...
tls {
    cert_file: ...
    key_file: ...

    handshake_first: auto
}
```
With the above value, the fallback delay used by the server is 50ms.

The duration can be explicitly set, say 300 milliseconds:
```
...
tls {
    cert_file: ...
    key_file: ...

    handshake_first: "300ms"
}
```

It is understood that any configuration other than "true" will result
in the server sending the INFO protocol after the elapsed amount of
time without the client initiating the TLS handshake. Therefore, for
administrators that do not want any data transmitted in plain text,
the value must be set to "true" only. It will require applications
to be updated to a library that provides the option, which may or
may not be readily available.

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
2023-10-10 09:46:01 -06:00
Byron Ruth
4ab65b1871 Bump v2.10.3
Signed-off-by: Byron Ruth <byron@nats.io>
2023-10-06 16:39:45 -04:00
Byron Ruth
f8c9d8e686 Release v2.10.2
Signed-off-by: Byron Ruth <byron@nats.io>
2023-10-06 15:23:06 -04:00
Derek Collison
0c3609ed2a Bump to 2.10.2-RC.15
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-06 09:58:55 -07:00
Derek Collison
f29c7863e7 [FIXED] Setting initial min on dmap caused subtle bugs with dmap. (#4631)
Under heavy load with max msgs per subject of 1 the dmap, when
considered empty and resetting the initial min, could cause lookup
misses that would lead to excess messages in a stream and longer restore
issues.

Signed-off-by: Derek Collison <derek@nats.io>
2023-10-06 09:58:17 -07:00
Derek Collison
dd646f6b71 Set initial min on dmap caused subtle bugs with dmap. Some minor cleanup.
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-06 09:42:09 -07:00
Lev
beee6fc72a [FIXED] MQTT PUBREL header incompatibility (#4616)
https://hivemq.github.io/mqtt-cli/docs/test/ pointed out the
incompatibility.
2023-10-05 08:07:50 -07:00
Waldemar Quevedo
4e414f1f05 Skip processing consumer assignments after JS has shutdown (#4625)
Signed-off-by: Waldemar Quevedo <wally@nats.io>
2023-10-04 13:17:22 -07:00
Neil Twigg
7124dc7bdc Revert changes to nbPoolPut, force compressor to forget byte buffer
Signed-off-by: Neil Twigg <neil@nats.io>
2023-10-04 17:41:36 +01:00
Neil Twigg
e20ca9043f Don't append empty slices in the unfragmented path
Signed-off-by: Neil Twigg <neil@nats.io>
2023-10-04 17:18:47 +01:00
Neil Twigg
6b65452bc7 Reduce allocations in WebSocket compression
Signed-off-by: Neil Twigg <neil@nats.io>
2023-10-04 12:36:32 +01:00
Derek Collison
dbe700d192 Bump to 2.10.0-RC.14
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-03 16:11:30 -07:00
Derek Collison
3f1afb4ca2 [IMPROVED] Bumped inflight updates to 16 and move one lock to rlock. (#4621)
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-03 16:10:59 -07:00
Derek Collison
2d21bc7008 Fix datarace
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-03 15:35:20 -07:00
Derek Collison
1ccc6dbf30 Bumped inflight updates to 16 and move one lock to rlock.
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-03 15:01:34 -07:00
Derek Collison
2f1a384bcb Holding onto the compressor and not recycling the internal byte slice was causing havoc with GC.
This needs to be improved but this at least should allow the GC to cleanup more effectively.

Signed-off-by: Derek Collison <derek@nats.io>
2023-10-03 14:39:00 -07:00
Derek Collison
195227edfd Bump to 2.10.0-RC.12
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-02 09:53:30 -07:00
Derek Collison
e4ca15c2c3 Optimize locking for consumer info
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-02 09:22:44 -07:00
Derek Collison
4165f869d2 Bump to 2.10.2-RC.11
Signed-off-by: Derek Collison <derek@nats.io>
2023-10-01 08:18:28 -07:00
Derek Collison
00839280fb [IMPROVED] Reduce contention for high connections in a JetStream enabled account with high API usage. (#4613)
Several strategies are used which are listed below.

1. Checking a RaftNode to see if it is the leader now uses atomics.
2. Checking if we are the JetStream meta leader from the server now uses
an atomic.
3. Accessing the JetStream context no longer requires a server lock,
uses atomic.Pointer.
4. Filestore syncBlocks would hold msgBlock locks during sync, now does
not.
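
Strategies 1 to 3 can be sketched with the standard `sync/atomic` types. The `node` and `jsContext` types below are hypothetical and only illustrate the pattern of lock-free reads; they are not the server's data structures.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// jsContext is a hypothetical placeholder for a shared context object.
type jsContext struct {
	name string
}

// node keeps hot, read-mostly state in atomics so that readers do not need
// to take a lock.
type node struct {
	leader atomic.Bool
	js     atomic.Pointer[jsContext]
}

// setLeader records the leadership state.
func (n *node) setLeader(v bool) { n.leader.Store(v) }

// isLeader reads the leadership state without locking.
func (n *node) isLeader() bool { return n.leader.Load() }

// setContext publishes a new context pointer.
func (n *node) setContext(c *jsContext) { n.js.Store(c) }

// context returns the current context, or nil if none has been set.
func (n *node) context() *jsContext { return n.js.Load() }

func main() {
	n := &node{}
	fmt.Println(n.isLeader(), n.context() == nil)

	n.setLeader(true)
	n.setContext(&jsContext{name: "js"})
	fmt.Println(n.isLeader(), n.context().name)
}
```

`atomic.Bool` and `atomic.Pointer` require Go 1.19 or later.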

Signed-off-by: Derek Collison <derek@nats.io>
2023-10-01 08:17:15 -07:00
Derek Collison
dba03dbc2f Optimizations to reduce contention for high connections in a JetStream enabled account with high API usage.
Several strategies which are listed below.

1. Checking a RaftNode to see if it is the leader now uses atomics.
2. Checking if we are the JetStream meta leader from the server now uses an atomic.
3. Accessing the JetStream context no longer requires a server lock, uses atomic.Pointer.
4. Filestore syncBlocks would hold msgBlock locks during sync, now does not.

Signed-off-by: Derek Collison <derek@nats.io>
2023-09-30 14:52:15 -07:00
Tomasz Pietrek
1f4b986125 Fix consumer info if consumer was closed
Co-authored-by: Derek Collison <derek@nats.io>
Signed-off-by: Tomasz Pietrek <tomasz@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
Signed-off-by: Tomasz Pietrek <tomasz@nats.io>
2023-09-29 21:40:55 +02:00
Neil Twigg
212d92ca7e Add more pprof labels to consumers, sources, mirrors
Signed-off-by: Neil Twigg <neil@nats.io>
2023-09-29 19:12:47 +01:00
Derek Collison
720ac605a2 Bump to 2.10.0-RC.10
Signed-off-by: Derek Collison <derek@nats.io>
2023-09-28 14:43:08 -07:00
Derek Collison
c9fa001ebf [IMPROVED] Add in additional warning when subject skew detected (#4606)
Signed-off-by: Derek Collison <derek@nats.io>
2023-09-28 14:42:29 -07:00
Derek Collison
fa5b7afcb6 [FIXED] Do not bypass authorization blocks when turning on $SYS account access (#4605)
Only setup auto no-auth for $G account iff no authorization block was
defined.

Signed-off-by: Derek Collison <derek@nats.io>

Resolves #4535
2023-09-28 14:17:24 -07:00
Derek Collison
cb74f3f26e Add in additional warning when subject skew detected
Signed-off-by: Derek Collison <derek@nats.io>
2023-09-28 14:16:27 -07:00
Derek Collison
2737c56352 Only setup auto no-auth for $G account iff no authorization block was defined.
Signed-off-by: Derek Collison <derek@nats.io>
2023-09-28 13:51:45 -07:00
Derek Collison
3d5564bbb1 [FIXED] Flapping TestMQTTLockedSession (#4604)
A test-only fix.

I can not reproduce the flapping behavior, but did see a race during
debugging suggesting that the CONNACK is delivered to the test before
`mqttProcessConnect` finishes and releases the record.
2023-09-28 13:16:46 -07:00
Lev Brouk
214711654e PR feedback: use checkFor
2023-09-28 12:42:18 -07:00
Lev Brouk
a05d4416ef PR feedback: nit
2023-09-28 12:02:35 -07:00
Derek Collison
783edaa36d [FIXED] Race condition in some leader failover scenarios leading to messages being potentially sourced more than once. (#4592)
- [X] Tests added
- [X] Branch rebased on top of current main (`git pull --rebase origin
main`)
- [X] Changes squashed to a single commit (described
[here](http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html))
 - [x] Build is green in Travis CI
- [X] You have certified that the contribution is your original work and
that you license the work to the project under the [Apache 2
license](https://github.com/nats-io/nats-server/blob/main/LICENSE)

### Changes proposed in this pull request:

Fixes a race condition in some leader failover scenarios leading to
messages being potentially sourced more than once.

In some failure scenarios, the current leader of a stream sourcing from
other stream(s) gets shut down while publications are happening on the
stream(s) being sourced. `setLeader(true)` can then be called on the new
leader for the sourcing stream before all the messages sourced by the
previous leader have been completely processed. When the new leader does
its reverse scan from the last message in its view of the stream, in
order to know what sequence number to start the consumer for the stream
being sourced from, the last message(s) sourced by the previous leader
get sourced again, leading to some messages being sourced more than
once.

The existing `TestNoRaceJetStreamSuperClusterSources` test would
sidestep the issue by relying on the deduplication window in the
sourcing stream. Without deduplication the test is a flapper.

This avoids the race condition by adding a small delay before scanning
for the last message(s) having been sourced and starting the sources'
consumer(s). Now the test (without using the deduplication window) never
fails because more messages than expected have been received in the
sourcing stream.

(Also adds a guard to give up if `setupSourceConsumers()` is called and
we are no longer the leader for the stream. That check was already
present in `setupMirrorConsumer()`, so presumably it was forgotten for
`setupSourceConsumers()`.)
2023-09-28 11:22:20 -07:00
Lev Brouk
4b59efd6e7 [FIXED] Flapping TestMQTTLockedSession
2023-09-28 11:13:48 -07:00