mirror of
https://github.com/taigrr/nats.docs
synced 2025-01-18 04:03:23 -08:00
Major rewrite of NATS Streaming Server concepts section
and updates to the developing section. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
nats_streaming/fault-tolerance/active-server.md (new file)

# Active Server

There is a single active server in the group: the first server to obtain the exclusive lock on the storage. For the `FileStore` implementation, this means trying to get an advisory lock on a file located in the shared datastore. For the `SQLStore` implementation, a dedicated table is used in which the owner of the lock periodically updates a column; other instances will steal the lock if the column is not updated for a certain amount of time.

If a server trying to activate fails to grab this lock because it is already held, it goes back to standby.

***Only the active server accesses the store and services all clients.***

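For the `FileStore` case, the exclusive-lock idea can be sketched with a POSIX advisory lock. This is a minimal illustration, not the server's actual implementation; the lock-file path and function name are hypothetical:

```python
import fcntl
import os

def try_acquire_store_lock(path):
    """Attempt a non-blocking exclusive advisory lock on a lock file
    in the shared datastore. Returns the open fd on success (it must
    stay open to keep the lock), or None if another server holds it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd    # this instance becomes the active server
    except BlockingIOError:
        os.close(fd)
        return None  # lock already held: stay in standby

# First acquirer becomes active; a second attempt is rejected.
path = "/tmp/ft-demo.lock"  # hypothetical path inside the shared datastore
active_fd = try_acquire_store_lock(path)
standby_fd = try_acquire_store_lock(path)
print(active_fd is not None, standby_fd is None)  # True True
```

Because the lock is advisory, every server in the group must go through the same acquisition routine for the exclusion to hold.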
nats_streaming/fault-tolerance/failover.md (new file)

# Failover

When the active server fails, all standby servers try to activate. The activation process consists of trying to get an exclusive lock on the storage.

The first server that succeeds becomes active and goes through the process of recovering the store and servicing clients, as if a server running in standalone mode had been automatically restarted.

All other servers that failed to get the store lock go back to standby mode and stay there until they stop receiving heartbeats from the current active server.

It is possible that a standby trying to activate cannot immediately acquire the store lock. When that happens, it goes back into standby mode, but if it still fails to receive heartbeats from an active server, it will try again to acquire the store lock. The retry interval is randomized, but is currently a bit more than one second.

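The standby behaviour described above can be sketched as a loop. The helper names (`missed_heartbeats`, `try_store_lock`) and the intervals are hypothetical stand-ins for the server's real heartbeat and lock machinery:

```python
import random
import time

def standby_loop(missed_heartbeats, try_store_lock, retry_base=1.0):
    """Run until this server activates. While heartbeats from the
    active server arrive, do nothing. Once they stop, try to grab the
    store lock; on failure go back to standby and retry after a
    randomized interval of a bit more than `retry_base` seconds."""
    while True:
        if not missed_heartbeats():
            time.sleep(0.01)  # active server is healthy; keep standing by
            continue
        if try_store_lock():
            # Won the lock: recover the store and start servicing clients,
            # as a restarted standalone server would.
            return "active"
        # Lock still held elsewhere: wait with jitter, then re-check.
        time.sleep(retry_base + random.random() * 0.1)

# Simulated run: heartbeats are gone, the lock frees up on the third attempt.
attempts = {"n": 0}
def try_store_lock():
    attempts["n"] += 1
    return attempts["n"] >= 3
print(standby_loop(lambda: True, try_store_lock, retry_base=0.01))  # active
```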
nats_streaming/fault-tolerance/ft.md (new file)

# Fault Tolerance

To minimize the single point of failure, the NATS Streaming server can be run in Fault Tolerance mode. It works by having a group of servers, one acting as the active server (accessing the store and handling all communication with clients) and all others acting as standby servers.

Note that it is not possible to run the NATS Streaming server in Fault Tolerance mode and Clustering mode at the same time.

To start a server in Fault Tolerance (FT) mode, you specify an FT group name.

Here is an example of how to start two servers in FT mode, running on the same host and embedding the NATS servers:

```
nats-streaming-server -store file -dir datastore -ft_group "ft" -cluster nats://localhost:6222 -routes nats://localhost:6223 -p 4222

nats-streaming-server -store file -dir datastore -ft_group "ft" -cluster nats://localhost:6223 -routes nats://localhost:6222 -p 4223
```
nats_streaming/fault-tolerance/shared-state.md (new file)

# Shared State

Actual file replication to multiple disks is not handled by the Streaming server; if required, this needs to be handled by the user. For the `FileStore` implementation that we currently provide, the data store needs to be mounted by all servers in the FT group, e.g. an NFS mount (Gluster in Google Cloud or EFS in Amazon).

nats_streaming/fault-tolerance/standby-server.md (new file)

# Standby Servers

There can be as many standby servers as you want in the same group. These servers do not access the store and do not receive any data from streaming clients. They simply run, waiting to detect the failure of the active server.
