If a configuration reload is issued as the server is being shutdown,
we could get 2 different panics. One due to JetStream if an account
is JetStream enabled, and one due to the send to a go channel that
has been closed.
```
panic: send on closed channel [recovered]
panic: send on closed channel
goroutine 440 [running]:
testing.tRunner.func1.2({0x1038d58e0, 0x1039e1270})
/usr/local/go/src/testing/testing.go:1545 +0x274
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1548 +0x448
panic({0x1038d58e0?, 0x1039e1270?})
/usr/local/go/src/runtime/panic.go:920 +0x26c
github.com/nats-io/nats-server/v2/server.(*Server).reloadAuthorization(0xc00024fb00)
/Users/ik/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1998 +0x788
github.com/nats-io/nats-server/v2/server.(*Server).applyOptions(0xc00024fb00, 0xc00021dc00, {0xc00038e4e0, 0x2, 0xc00021dc28?})
/Users/ik/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1746 +0x2b8
github.com/nats-io/nats-server/v2/server.(*Server).reloadOptions(0xc000293500?, 0xc000118a80, 0xc000293500)
/Users/ik/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1121 +0x178
github.com/nats-io/nats-server/v2/server.(*Server).ReloadOptions(0xc00024fb00, 0xc000293500)
/Users/ik/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1060 +0x368
github.com/nats-io/nats-server/v2/server.(*Server).Reload(0xc00024fb00)
/Users/ik/dev/go/src/github.com/nats-io/nats-server/server/reload.go:995 +0x104
```
and
```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x10077b224]
goroutine 8 [running]:
testing.tRunner.func1.2({0x101351640, 0x101b7d2a0})
/usr/local/go/src/testing/testing.go:1545 +0x274
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1548 +0x448
panic({0x101351640?, 0x101b7d2a0?})
/usr/local/go/src/runtime/panic.go:920 +0x26c
github.com/nats-io/nats-server/v2/server.(*Account).EnableJetStream(0xc00020fb80, 0xc000220240)
/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/jetstream.go:1045 +0xa4
github.com/nats-io/nats-server/v2/server.(*Server).configJetStream(0xc000226d80, 0xc00020fb80)
/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/jetstream.go:707 +0xdc
github.com/nats-io/nats-server/v2/server.(*Server).configAllJetStreamAccounts(0xc000226d80)
/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/jetstream.go:768 +0x2b0
github.com/nats-io/nats-server/v2/server.(*Server).enableJetStreamAccounts(0xc000226d80?)
/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/jetstream.go:637 +0x128
github.com/nats-io/nats-server/v2/server.(*Server).reloadAuthorization(0xc000226d80)
/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/reload.go:2039 +0x93c
github.com/nats-io/nats-server/v2/server.(*Server).applyOptions(0xc000226d80, 0xc000171c50, {0xc000074600, 0x2, 0xc000171c78?})
/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1746 +0x2b8
github.com/nats-io/nats-server/v2/server.(*Server).reloadOptions(0xc000276000?, 0xc0000a6000, 0xc000276000)
/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1121 +0x178
github.com/nats-io/nats-server/v2/server.(*Server).ReloadOptions(0xc000226d80, 0xc000276000)
/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/reload.go:1060 +0x368
github.com/nats-io/nats-server/v2/server.(*Server).Reload(0xc000226d80)
/Users/ivan/dev/go/src/github.com/nats-io/nats-server/server/reload.go:995 +0x104
```
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
A new option instructs the server to perform the TLS handshake first,
that is prior to sending the INFO protocol to the client.
Only clients that implement equivalent option would be able to connect
if the server runs with this option enabled.
The configuration would look something like this:
```
...
tls {
cert_file: ...
key_file: ...
handshake_first: true
}
```
The same option can be set to "auto" or a Go time duration to fallback
to the old behavior. This is intended for deployments where it is known
that not all clients have been upgraded to a client library providing
the TLS handshake first option.
After the delay has elapsed without receiving the TLS handshake from the
client, the server reverts to sending the INFO protocol so that older
clients can connect. Clients that do connect with the "TLS first" option
will be marked as such in the monitoring's Connz page/result. It will
allow the administrator to keep track of applications still needing to
upgrade.
The configuration would be similar to:
```
...
tls {
cert_file: ...
key_file: ...
handshake_first: auto
}
```
With the above value, the fallback delay used by the server is 50ms.
The duration can be explcitly set, say 300 milliseconds:
```
...
tls {
cert_file: ...
key_file: ...
handshake_first: "300ms"
}
```
It is understood that any configuration other that "true" will result in
the server sending the INFO protocol after the elapsed amount of time
without the client initiating the TLS handshake. Therefore, for
administrators that do not want any data transmitted in plain text, the
value must be set to "true" only. It will require applications to be
updated to a library that provides the option, which may or may not be
readily available.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
This could result in lookups failing, e.g. after a change in max msgs per subject to a lower value.
Also fixed a bug that would not prperly update psim during compact when throwing away the whole block and a subject had more than one message.
Signed-off-by: Derek Collison <derek@nats.io>
This is a possible fix for #4653.
Changes made:
1. Added tests for creating and updating consumers on a work queue
stream with overlapping subjects.
2. Check for overlapping subjects before
[updating](a25af02c73/server/consumer.go (L770))
the consumer config.
3. Changed [`func (*stream).partitionUnique(partitions []string)
bool`](a25af02c73/server/stream.go (L5269))
to accept the consumer name being checked so we can skip it while
checking for overlapping subjects (Required for
[`FilterSubjects`](a25af02c73/server/consumer.go (L75))
updates), wasn't needed before because the checks were made on creation
only.
There's only 1 thing that I'm not sure about.
In the [current work queue stream conflict
checks](a25af02c73/server/consumer.go (L796)),
the consumer config `Direct` is being checked if `false`, should we also
make this check before the update?
Signed-off-by: Pierre Mdawar <pierre@mdawar.dev>
This really was a cut/paste/typo error.
The effect was that when there was a pending PUBREL in JetStream, we would sometimes attempt to deliver it immediately once the client connected, cpending was already initialized, but the pubrel map was not (yet).
A new option instructs the server to perform the TLS handshake first,
that is prior to sending the INFO protocol to the client.
Only clients that implement equivalent option would be able to
connect if the server runs with this option enabled.
The configuration would look something like this:
```
...
tls {
cert_file: ...
key_file: ...
handshake_first: true
}
```
The same option can be set to "auto" or a Go time duration to fallback
to the old behavior. This is intended for deployments where it is known
that not all clients have been upgraded to a client library providing
the TLS handshake first option.
After the delay has elapsed without receiving the TLS handshake from
the client, the server reverts to sending the INFO protocol so that
older clients can connect. Clients that do connect with the "TLS first"
option will be marked as such in the monitoring's Connz page/result.
It will allow the administrator to keep track of applications still
needing to upgrade.
The configuration would be similar to:
```
...
tls {
cert_file: ...
key_file: ...
handshake_first: auto
}
```
With the above value, the fallback delay used by the server is 50ms.
The duration can be explcitly set, say 300 milliseconds:
```
...
tls {
cert_file: ...
key_file: ...
handshake_first: "300ms"
}
```
It is understood that any configuration other that "true" will result
in the server sending the INFO protocol after the elapsed amount of
time without the client initiating the TLS handshake. Therefore, for
administrators that do not want any data transmitted in plain text,
the value must be set to "true" only. It will require applications
to be updated to a library that provides the option, which may or
may not be readily available.
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Under heavy load with max msgs per subject of 1 the dmap, when
considered empty and resetting the initial min, could cause lookup
misses that would lead to excess messages in a stream and longer restore
issues.
Signed-off-by: Derek Collison <derek@nats.io>
Several strategies are used which are listed below.
1. Checking a RaftNode to see if it is the leader now uses atomics.
2. Checking if we are the JetStream meta leader from the server now uses
an atomic.
3. Accessing the JetStream context no longer requires a server lock,
uses atomic.Pointer.
4. Filestore syncBlocks would hold msgBlock locks during sync, now does
not.
Signed-off-by: Derek Collison <derek@nats.io>
Several strategies which are listed below.
1. Checking a RaftNode to see if it is the leader now uses atomics.
2. Checking if we are the JetStream meta leader from the server now uses an atomic.
3. Accessing the JetStream context no longer requires a server lock, uses atomic.Pointer.
4. Filestore syncBlocks would hold msgBlock locks during sync, now does not.
Signed-off-by: Derek Collison <derek@nats.io>
A test-only fix.
I can not reproduce the flapping behavior, but did see a race during
debugging suggesting that the CONNACK is delivered to the test before
`mqttProcessConnect` finishes and releases the record.
- [X] Tests added
- [X] Branch rebased on top of current main (`git pull --rebase origin
main`)
- [X] Changes squashed to a single commit (described
[here](http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html))
- [x] Build is green in Travis CI
- [X] You have certified that the contribution is your original work and
that you license the work to the project under the [Apache 2
license](https://github.com/nats-io/nats-server/blob/main/LICENSE)
### Changes proposed in this pull request:
Fixes a race condition in some leader failover scenarios leading to
messages being potentially sourced more than once.
In some failure scenarios where the current leader of a stream sourcing
from other stream(s) gets shutdown while publications are happening on
the stream(s) being sourced leads to `setLeader(true)` being called on
the new leader for the sourcing stream before all the messages having
been sourced by the previous leader are completely processed such that
when the new leader does it's reverse scan from the last message in it's
view of the stream in order to know what sequence number to start the
consumer for the stream being sourced from, such that the last
message(s) sourced by the previous leader get sourced again, leading to
some messages being sourced more than once.
The existing `TestNoRaceJetStreamSuperClusterSources` test would
sidestep the issue by relying on the deduplication window in the
sourcing stream. Without deduplication the test is a flapper.
This avoid the race condition by adding a small delay before scanning
for the last message(s) having been sourced and starting the sources'
consumer(s). Now the test (without using the deduplication window) never
fails because more messages than expected have been received in the
sourcing stream.
(Also adds a guard to give up if `setupSourceConsumers()` is called and
we are no longer the leader for the stream (that check was already
present in `setupMirrorConsumer()` so assuming it was forgotten for
`setupSourceConsumers()`)