Diffusion™ communication is composed of discrete messages. Each client sends various types of message to its server, and separately receives messages from the server. There are messages that carry new values for subscribed topics, application messages sent using the messaging API, and service messages that implement API features. All of Diffusion’s protocols assure message delivery while a client remains connected to a server.
Diffusion achieves exceptional data rates by keeping messages in memory and not persisting them to disk. The trade-off for high performance is that messages can be lost if a client or server process crashes, or there is a network failure between client and server.
This article covers how Diffusion protects against message loss in the event of network failure, including the Diffusion 5.8 feature – Reliable Reconnection. Reliable Reconnection ensures that if a network connection is lost, a client session will either establish a new connection to the server with no message loss, or message loss will be detected and the session will be closed.
Connections and Reconnection
The server has a queue of messages to be delivered in order for each of its client sessions. Similarly, each client has a separate queue of messages to be delivered in order to the server. Depending on the protocol, each client maintains one or many long-lived TCP connections to a server. The WebSocket protocol uses a single bi-directional connection. HTTP long-polling protocols use a connection for messages from server to client, and a separate connection for messages from client to server.
Connections can fail and be re-established without loss of a session, a process known as reconnection. Reconnection is configured on the server, and further controlled by the reconnection strategy of the client’s session factory. The server configuration determines the reconnection timeout, a limit on how long the server will maintain a session for a disconnected client. If the client fails to reconnect before the reconnection timeout expires, the server will close the session. Reconnection can be disabled by setting the timeout to 0.
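The timeout rule can be illustrated with a small model. This is only a sketch and not the Diffusion API: the `ServerSession` class, its method names, and the choice to treat a reconnect arriving exactly at the timeout as too late are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerSession:
    """Hypothetical model of a server-side session (not the Diffusion API)."""
    keep_alive_ms: int                        # reconnection timeout; 0 disables reconnection
    disconnected_at_ms: Optional[int] = None  # None while the client is connected
    closed: bool = False

    def on_disconnect(self, now_ms: int) -> None:
        if self.keep_alive_ms == 0:
            # Reconnection disabled: the session ends with the connection.
            self.closed = True
        else:
            self.disconnected_at_ms = now_ms

    def on_reconnect_attempt(self, now_ms: int) -> bool:
        """Return True if the session resumes, False if it is (or becomes) closed."""
        if self.closed or self.disconnected_at_ms is None:
            return False
        if now_ms - self.disconnected_at_ms >= self.keep_alive_ms:
            # Assumed boundary: arriving exactly at the timeout counts as expired.
            self.closed = True
            return False
        self.disconnected_at_ms = None
        return True
```

With a 60 second timeout, a client that returns after 59,999 ms resumes its session, while one that returns after 60,000 ms finds it closed.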
The server configuration allows different reconnection settings for each connector. Here’s an example configuration from Connectors.xml that enables reconnection for an Example connector with a reconnection timeout of 60 seconds.
<!-- Connectors.xml -->
...
<connector name="Example">
  ...
  <reconnect>
    <!-- The number of milliseconds the server will maintain a
         disconnected session. If the client fails to reconnect
         within this time, the server will close the session. -->
    <keep-alive>60000</keep-alive>
  </reconnect>
</connector>
...
Even though a session is disconnected, both server and client will continue to queue messages for delivery. For example, the server will queue new update messages for topics to which the session is subscribed. Each message queue has a maximum size. If a new message causes the queue to overflow the configured maximum number of messages, the session will be closed.
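The overflow behaviour described above can be sketched as a bounded queue that closes its session when full. The `SessionQueue` class and its fields are hypothetical names used only for illustration; they do not reflect how Diffusion implements its queues.

```python
from collections import deque


class SessionQueue:
    """Hypothetical bounded message queue for one session."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.messages = deque()
        self.session_closed = False

    def enqueue(self, message):
        """Queue a message; close the session if the queue would overflow."""
        if self.session_closed:
            return False
        if len(self.messages) >= self.max_depth:
            # Overflow: the backlog is abandoned and the session is closed.
            self.session_closed = True
            self.messages.clear()
            return False
        self.messages.append(message)
        return True
```

Messages continue to be accepted while the session is disconnected, so the limit is most likely to be reached during an outage.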
The maximum queue size for server queues is configured in Server.xml.
<!-- Server.xml -->
<server>
  ...
  <client-queues>
    <default-queue-definition>default</default-queue-definition>
    <queue-definition name="default">
      <!-- The maximum queue size -->
      <max-depth>1000</max-depth>
    </queue-definition>
    ...
A larger backlog is likely to accumulate in the message queue while a session is disconnected than while it is connected. To let queues extend to accommodate more messages for disconnected sessions, a higher maximum queue size can be configured in the reconnection settings.
<!-- Connectors.xml -->
...
<connector name="Example">
  ...
  <reconnect>
    <keep-alive>60000</keep-alive>
    <!-- Maximum queue size that applies when disconnected.
         Ignored if less than the queue-definition setting. -->
    <max-depth>5000</max-depth>
  </reconnect>
</connector>
...
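How the two `max-depth` settings combine can be expressed as a short function. The function name and parameters are hypothetical; the rule itself follows the configuration comment, which says the reconnection value is ignored if it is less than the queue-definition value.

```python
def effective_max_depth(queue_definition_depth, reconnect_depth, connected):
    """Return the queue limit that applies to a session (illustrative only).

    queue_definition_depth: <max-depth> from the queue definition in Server.xml
    reconnect_depth: <max-depth> from <reconnect> in Connectors.xml, or None if unset
    connected: whether the session currently has a live connection
    """
    if connected or reconnect_depth is None:
        return queue_definition_depth
    # While disconnected, the reconnect setting can only raise the limit.
    return max(queue_definition_depth, reconnect_depth)
```

Using the values from the examples, a disconnected session on the Example connector could queue up to 5000 messages, and the limit returns to 1000 once it is connected again.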
For some applications, the message backlog can be effectively managed using conflation. Conflation reduces the likelihood of a message queue overflowing, discards stale data, and shrinks the number of messages to be delivered when the client reconnects.
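As a rough illustration of the idea, the sketch below conflates a backlog of topic updates by keeping only the newest value for each topic. This is one simple policy chosen for the example, and the `conflate` function is hypothetical; Diffusion's conflation policies are configurable and may behave differently.

```python
from collections import OrderedDict


def conflate(updates):
    """Collapse a backlog of (topic, value) updates to the latest value per topic.

    A topic keeps the queue position of its first pending update, so the
    relative order of topics is preserved while stale values are discarded.
    """
    latest = OrderedDict()
    for topic, value in updates:
        latest[topic] = value  # overwriting an existing key keeps its position
    return list(latest.items())
```

For a backlog such as `[("a", 1), ("b", 1), ("a", 2)]`, only two messages remain to be delivered on reconnection, and the stale value of topic `a` is never sent.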
For reconnection to work, the client needs to detect failure in a timely fashion. Intermediate network devices such as routers, load balancers, and firewalls add connection hops between the client and the server, which can delay failure detection. To address this problem, Diffusion has several features to monitor the end-to-end health of connections, including automatic server pings (a “heartbeat” mechanism where the server regularly sends a request message to the client and expects it to respond with a separate message), and client-side connection activity monitoring.
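Connection activity monitoring can be pictured as a timer that is reset whenever anything arrives on the connection. The sketch below is a simplified model under that assumption; the `ActivityMonitor` class, its method names, and the timeout value are hypothetical and do not describe Diffusion's client implementation.

```python
class ActivityMonitor:
    """Hypothetical client-side monitor that flags a silent connection."""

    def __init__(self, timeout_ms, started_at_ms=0):
        self.timeout_ms = timeout_ms
        self.last_activity_ms = started_at_ms

    def on_message(self, now_ms):
        # Any inbound message, including a server ping, counts as activity.
        self.last_activity_ms = now_ms

    def is_connection_suspect(self, now_ms):
        """True if nothing has been received for longer than the timeout."""
        return now_ms - self.last_activity_ms > self.timeout_ms
```

Because regular server pings guarantee some traffic on a healthy connection, a long silence is a reasonable signal for the client to abandon the connection and begin reconnecting rather than wait for the TCP stack to report a failure.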