Cirrus Link frequently gets questions about Sparkplug and quality of service (QoS). Users want to understand how Sparkplug leverages MQTT, which QoS Sparkplug uses and why, and whether data can be lost when using MQTT Sparkplug. Below we discuss Sparkplug's use of MQTT QoS 0, why this QoS was chosen, and how we minimize potential data loss when using QoS 0.
Legend
edge client = publishing client
host client = consuming client
Why do we use MQTT QoS 0 for Sparkplug Birth and Data messages?
Why not QoS 1 or 2?
QoS 1 and 2 messages can be persisted on the MQTT server
QoS 1 and 2 messages can be persisted on the server side when a host (consuming) client (e.g., MQTT Engine) goes offline. These messages will be stored until the host client comes back online. In most OT control systems, you do not want old messages to be treated as “live” and take actions like opening/closing of valves far later than when the message was originally published.
Not only do we have to worry about old messages arriving late to host clients, but we also have to worry about the impact on MQTT server resources (RAM and/or disk depending on implementation) when messages are persisted server-side. At large scale, servers will run out of resources quickly.
QoS 1 and 2 messages have higher publish-to-receive times due to the additional client-server handshaking.
How to ensure data is not lost when using QoS 0?
Sparkplug Sessions and Primary Host ID
Sparkplug sessions and the Primary Host ID ensure that the edge client (e.g., MQTT Transmission) is connected to the MQTT server at the same time as the host client (e.g., MQTT Engine). Once a session is established, the edge client knows the host is online and that it is safe to publish data messages.
MQTT Store and Forward at the edge
Sparkplug’s use of MQTT Store and Forward at the edge allows the edge client to buffer data locally if the host is offline or the edge connection to the MQTT server is down.
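The interaction between session state and Store and Forward can be sketched roughly as follows. This is a minimal illustration, not the MQTT Transmission implementation: the class `StoreAndForward`, its method names, and the injected `publish` callback are all hypothetical, and the history store is modeled as an in-memory queue rather than a persistent store.

```python
from collections import deque
from typing import Callable, Deque, Tuple

# (timestamp, tag name, value); a hypothetical shape for a tag change
TagChange = Tuple[float, str, float]


class StoreAndForward:
    """Publish live while a session is established, otherwise buffer locally."""

    def __init__(self, publish: Callable[[TagChange], None]) -> None:
        self._publish = publish                      # stands in for a QoS 0 publish
        self._history: Deque[TagChange] = deque()    # stands in for the history store
        self.server_connected = False
        self.host_online = False

    @property
    def session_established(self) -> bool:
        # Data flows only when the edge is connected AND the primary host is online.
        return self.server_connected and self.host_online

    def on_tag_change(self, change: TagChange) -> None:
        if self.session_established:
            self._publish(change)
        else:
            self._history.append(change)

    def on_state_change(self, server_connected: bool, host_online: bool) -> None:
        self.server_connected = server_connected
        self.host_online = host_online
        if self.session_established:
            self.flush()

    def flush(self) -> None:
        # Forward buffered changes oldest-first once the session is back.
        while self._history:
            self._publish(self._history.popleft())
```

In this sketch a tag change is never published unless both conditions hold, so nothing is sent while the host is offline or the connection is known to be down.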
MQTT Keep Alive
MQTT client-server connections can fail ungracefully. When this happens, the client still has an open socket on its side and keeps publishing data that will never reach the MQTT server. Any MQTT message published after the connection fails ungracefully, but before the client realizes the connection has failed, is lost.
MQTT provides a mechanism to identify failed connections like this: the MQTT Keep Alive. The Keep Alive is an interval of time (measured in seconds) that the client must not allow to elapse without transmitting a packet to the MQTT server. In the absence of transmitted data during this interval, the client transmits a Ping Request control packet and waits for the server to reply with a Ping Response. When the server fails to receive any transmission from the client within 1.5x the Keep Alive value, it closes the network connection with the client. Similarly, if the client does not receive the Ping Response within a reasonable amount of time after sending a Ping Request, it should close the network connection with the server.
Once the Sparkplug edge client detects the failure, it will typically be configured to store all tag changes locally in the MQTT Store and Forward History Store until the connection to the server is reestablished. To mitigate the possibility of data loss in the MQTT Keep Alive window, Cirrus Link has implemented a rolling buffer that covers 2x the client's Keep Alive and contains all tag changes that happened in that window. If the edge client loses its connection to the MQTT server, it flushes the rolling buffer once the connection is restored, replaying any MQTT messages that may have been lost.
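To make the rolling buffer idea concrete, here is a minimal sketch of a time-windowed buffer sized at 2x the Keep Alive. It is not Cirrus Link's implementation: the class `RollingBuffer`, its method names, and the choice to pass timestamps in explicitly are assumptions made for illustration.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class RollingBuffer:
    """Keep only the tag changes recorded within the last window_s seconds."""

    def __init__(self, keep_alive_s: float, window_factor: float = 2.0) -> None:
        # The article describes a window of 2x the client's Keep Alive.
        self.window_s = keep_alive_s * window_factor
        self._items: Deque[Tuple[float, Any]] = deque()

    def record(self, now: float, change: Any) -> None:
        # Called for every change that is published live.
        self._items.append((now, change))
        self._evict(now)

    def _evict(self, now: float) -> None:
        cutoff = now - self.window_s
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()

    def snapshot_on_disconnect(self, detected_at: float) -> List[Any]:
        # Return the changes that may have been published into a dead socket,
        # oldest first, and empty the buffer so they are replayed only once.
        self._evict(detected_at)
        replay = [change for _, change in self._items]
        self._items.clear()
        return replay
```

With a 30 second Keep Alive, the window is 60 seconds. In this sketch the changes returned by `snapshot_on_disconnect` would be replayed after reconnecting, alongside whatever was stored locally after the failure was detected. Replay can produce duplicates on the host side, because some of the buffered messages may in fact have been delivered.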
Sparkplug Message Sequence Numbers
If a message is lost to a transient network issue while the connection never fails the Keep Alive (an unlikely case), that message is genuinely lost. The host client can still detect the loss, because a message sequence number will be missing.
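Gap detection on the host side can be sketched as below. This assumes the sequence counter described in the Sparkplug B specification, which runs from 0 to 255 and then wraps back to 0. The class `SequenceTracker` and its method names are hypothetical, and what the host does after detecting a gap is outside the scope of the sketch.

```python
from typing import Optional

SEQ_MODULUS = 256  # assumes the 0..255 wrapping counter from the Sparkplug B spec


class SequenceTracker:
    """Track the expected sequence number for one edge node."""

    def __init__(self) -> None:
        self._expected: Optional[int] = None  # unknown until a birth is seen

    def on_birth(self, seq: int) -> None:
        # A birth message (re)starts the sequence for this edge node.
        self._expected = (seq + 1) % SEQ_MODULUS

    def on_data(self, seq: int) -> bool:
        """Return True if seq is the next expected value, False on a gap."""
        if self._expected is None or seq != self._expected:
            # Once a gap is seen, stay out of sync until the next birth.
            self._expected = None
            return False
        self._expected = (seq + 1) % SEQ_MODULUS
        return True
```

The modulo arithmetic keeps the wrap from 255 to 0 from being reported as a gap.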
To summarize, data loss with QoS 0 messages can occur in the MQTT Keep Alive window when a connection fails ungracefully. To mitigate this risk, Sparkplug uses the Primary Host ID and MQTT Store and Forward at the edge with a rolling buffer*. This does not guarantee zero data loss, but it offers the best chance of delivering high-speed OT data at scale with minimal risk of data loss.
*The rolling buffer is a Cirrus Link solution implemented on top of Sparkplug and is not currently part of the Sparkplug specification.