<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Coral Blocks &#187; CoralSequencer</title>
	<atom:link href="https://www.coralblocks.com/index.php/category/coralsequencer/feed" rel="self" type="application/rss+xml" />
	<link>https://www.coralblocks.com/index.php</link>
	<description>Building amazing software, one piece at a time.</description>
	<lastBuildDate>Fri, 03 Apr 2026 15:31:21 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.9.1</generator>
	<item>
		<title>State-of-the-Art Distributed Systems with CoralSequencer</title>
		<link>https://www.coralblocks.com/index.php/state-of-the-art-distributed-systems-with-coralmq/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=state-of-the-art-distributed-systems-with-coralmq</link>
		<comments>https://www.coralblocks.com/index.php/state-of-the-art-distributed-systems-with-coralmq/#comments</comments>
		<pubDate>Sat, 09 Apr 2016 02:47:27 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralMQ]]></category>
		<category><![CDATA[CoralSequencer]]></category>
		<category><![CDATA[distributed system]]></category>
		<category><![CDATA[ECN]]></category>
		<category><![CDATA[exchange]]></category>
		<category><![CDATA[mold]]></category>
		<category><![CDATA[MoldUDP]]></category>
		<category><![CDATA[MQ]]></category>
		<category><![CDATA[pub/sub]]></category>
		<category><![CDATA[rabbitmq]]></category>
		<category><![CDATA[Sequencer]]></category>
		<category><![CDATA[soup]]></category>
		<category><![CDATA[soupbin]]></category>
		<category><![CDATA[SoupBinTCP]]></category>
		<category><![CDATA[tibco]]></category>
		<category><![CDATA[zeromq]]></category>

		<guid isPermaLink="false">http://www.coralblocks.com/index.php/?p=1093</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<link rel="canonical" href="https://www.coralblocks.com/index.php/state-of-the-a…s-with-coralmq/" />
<p>In this article we introduce the big picture of CoralSequencer, a full-fledged, ultra-low-latency, high-reliability, software-based middleware for the development of distributed systems based on asynchronous messages. We discuss CoralSequencer&#8217;s main parts and how it uses a sophisticated and low-latency protocol to distribute messages across nodes through reliable UDP multicast. <span id="more-1093"></span><br />
<br/></p>
<p>You should also check out <a href="https://www.youtube.com/watch?v=DyktSiBTCdk" target="_blank">the YouTube video below</a>, where we present the main characteristics of the sequencer architecture together with some advanced features of CoralSequencer. <b>Note:</b> Unlike the YouTube version, the video below has <font color="red"><b>no ads</b></font>.<br />
<center><br />
<div style="width: 600px; max-width: 100%;" class="wp-video"><!--[if lt IE 9]><script>document.createElement('video');</script><![endif]-->
<video class="wp-video-shortcode" id="video-1093-1" width="600" height="338" preload="metadata" controls="controls"><source type="video/mp4" src="/wp-content/uploads/videos/CoralSequencer.mp4?_=1" /><a href="/wp-content/uploads/videos/CoralSequencer.mp4">/wp-content/uploads/videos/CoralSequencer.mp4</a></video></div><br />
</center></p>
<style>
.li_faq { margin: 0 0 17px 0; }
</style>
<h3 class="coral">Quick Facts</h3>
<ul style="padding: 12px 40px">
<li class="li_faq">All nodes read all messages in the exact same order, dropping messages they are not interested in</li>
<li class="li_faq">All messages are persisted so late-joining nodes can rewind and catch up to build the exact same state as other nodes</li>
<li class="li_faq">Message broadcasting is done through a reliable multicast UDP protocol: no message is ever lost</li>
<li class="li_faq">Supports all cloud environments through TCP without using multicast.</li>
<li class="li_faq">Change the transport protocol from UDP to TCP by simply flipping a configuration flag with no code changes.</li>
<li class="li_faq">Hybrid approach (UDP + TCP) to support extensions of the distributed system in the cloud.</li>
<li class="li_faq">No single point of failure, from the software down to the hardware infrastructure</li>
<li class="li_faq">Each session is automatically archived with all its messages so it can be replayed later for testing, simulation, analysis and auditing</li>
<li class="li_faq">High-level, straightforward API makes it easy to write nodes that publish/consume messages</li>
<li class="li_faq">As low-latency as possible through UDP multicast</li>
<li class="li_faq">Zero garbage created per message &#8211; no GC overhead</li>
</ul>
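<p>To illustrate the first two facts above, here is a tiny, generic sketch (not CoralSequencer code; the class and method names are made up for illustration): any node that applies the same totally-ordered message stream through a deterministic transition ends up with exactly the same state, which is what makes perfect clusters and hot failover possible.</p>

```java
import java.util.List;

// Generic illustration (not CoralSequencer's API): a node as a deterministic
// state machine. Replicas fed the same totally-ordered log of messages
// always arrive at the same state.
public class CounterNode {

    private long state = 0;

    // deterministic transition: same message + same prior state -> same next state
    public void onMessage(long delta) {
        state += delta;
    }

    public long state() {
        return state;
    }

    // replaying an ordered log reproduces the exact same final state every time
    public static long replay(List<Long> orderedLog) {
        CounterNode node = new CounterNode();
        for (long m : orderedLog) node.onMessage(m);
        return node.state();
    }
}
```

<p>A late-joining replica that rewinds and replays the archived session reaches the same state as a node that has been live the whole time.</p>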
<h3 class="coral">Quick Features</h3>
<ul style="padding: 12px 40px">
<li class="li_faq">Message agnostic &#8211; send and receive anything you want</li>
<li class="li_faq">Automatic replayer discovery through multicast, making it easy to move your replayers across machines</li>
<li class="li_faq">Message fragmentation at the protocol level, so you can transparently send messages of any size</li>
<li class="li_faq">Comprehensive test framework for deterministic single-threaded memory-transport automated tests</li>
<li class="li_faq">Choose your transport protocol through configuration without changing a single line of your application code: TCP (for the cloud), UDP (multicast), Shared-Memory (same machine) and Memory (for tests)</li>
<li class="li_faq">TCP Rewind</li>
<li class="li_faq">Non-rewinding nodes</li>
<li class="li_faq">Transparent batching and in-flight messages</li>
<li class="li_faq"><a href="/index.php/shared-memory-transport-x-multicast-transport/" target="_blank">Shared-memory Dispatcher node</a> to avoid multicast fan-out in machines running several nodes pinned to the same CPU core</li>
<li class="li_faq">Full cloud support through TCP transport; to later switch from TCP to multicast UDP, simply flip a configuration flag</li>
<li class="li_faq">Easily replay a full past session archive file through an offline node for testing, validation, auditing, reports, etc. (from a local file or from a centralized remote server)</li>
<li class="li_faq">Tiered replayer architecture for scalability (optional)</li>
<li class="li_faq">Comprehensive, zero-garbage, binary and high-performance native serialization protocol with repeating groups, optional fields, IDL, etc. (optional)</li>
<li class="li_faq">Full duplex bridges (UDP and TCP)</li>
<li class="li_faq">Long distance bridges with TCP and UDP redundant channels for performance</li>
<li class="li_faq">A variety of internal messages providing features like node active/passive, node heartbeats, force passive, etc</li>
<li class="li_faq">Fully deterministic sequencer clock for the centralized distributed system time</li>
<li class="li_faq">Local and centralized timers with nanosecond precision</li>
<li class="li_faq">Sequencer-generated messages</li>
<li class="li_faq">Hot-Hot nodes in a perfect cluster, using the same node account with different instance IDs</li>
<li class="li_faq">Multiple sequencers in parallel with cross-connect nodes</li>
<li class="li_faq"><a href="/index.php/writing-a-c-coralsequencer-node/" target="_blank">C++ Node</a> support (write a node in C++ that receives and sends messages through JNI)</li>
<li class="li_faq">Nodes can choose which sequence number to rewind from (allowing for managing state in customer snapshot servers)</li>
<li class="li_faq">Nodes can commit a sequence number so that they don&#8217;t need to reprocess the whole event-stream in case of rewinding</li>
<li class="li_faq">Remote administration (telnet, rest and http)</li>
<li class="li_faq">Logger node</li>
<li class="li_faq">Archiver node</li>
<li class="li_faq">Admin Node</li>
<li class="li_faq">And many others</li>
</ul>
<h3 class="coral">Node Example</h3>
<p><br/></p>
<pre class="brush: java; title: ; notranslate">
package com.coralblocks.coralsequencer.node;

import java.nio.ByteBuffer;

import com.coralblocks.coralbits.util.ByteBufferUtils;
import com.coralblocks.coralbits.util.DateTimeUtils;
import com.coralblocks.coralreactor.admin.AdminAction;
import com.coralblocks.coralreactor.nio.NioReactor;
import com.coralblocks.coralreactor.util.Configuration;
import com.coralblocks.coralsequencer.message.Message;
import com.coralblocks.coralsequencer.mq.Node;

public class SampleNode extends Node {
	
	public SampleNode(NioReactor nio, String name, Configuration config) {
		
	    super(nio, name, config);
	    
	    addAdminAction(new AdminAction(&quot;sendTime&quot;) {
			@Override
			public boolean execute(CharSequence args, StringBuilder results) {
				sendTime();
				results.append(&quot;Time successfully sent!&quot;);
				return true;
			}
	    });
    }
	
	private void sendTime() {
		sendCommand(&quot;TIME-&quot; + System.currentTimeMillis());
	}
	
	@Override
    protected void handleMessage(boolean isMine, Message msg) {
		
		if (!isMine) return; // not interested, quickly drop it...
		
		ByteBuffer data = msg.getData(); // the raw bytes of the message...
		
		long epochInNanos = eventStreamEpoch(); // deterministic centralized sequencer clock...
		
		CharSequence now = DateTimeUtils.formatDateTimeInNanos(epochInNanos);
		
		System.out.println(&quot;Saw my message at &quot; + now + &quot;: &quot; + ByteBufferUtils.parseString(data));
    }
}
</pre>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/state-of-the-art-distributed-systems-with-coralmq/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On-Premises and Cloud Infrastructure with CoralSequencer</title>
		<link>https://www.coralblocks.com/index.php/on-premises-and-cloud-infrastructure-with-coralsequencer/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=on-premises-and-cloud-infrastructure-with-coralsequencer</link>
		<comments>https://www.coralblocks.com/index.php/on-premises-and-cloud-infrastructure-with-coralsequencer/#comments</comments>
		<pubDate>Fri, 03 Apr 2026 14:47:09 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[CoralSequencer]]></category>
		<category><![CDATA[architecture]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[multicast]]></category>
		<category><![CDATA[Sequencer]]></category>
		<category><![CDATA[tcp]]></category>
		<category><![CDATA[udp]]></category>

		<guid isPermaLink="false">https://www.coralblocks.com/index.php/?p=3212</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<style>
* { font-size: 101%; }
.li_facts { margin: 0 0 17px 0; }
</style>
<p style="margin-top: 20px;">
CoralSequencer supports multiple transport protocols, offering flexibility when building on-premises and cloud infrastructures. You can use UDP, multicast or unicast, TCP, or a combination of both to design your infrastructure in the way that best fits your needs, whether on-premises, in the cloud, or across both. <span id="more-3212"></span> In this article we&#8217;ll see some examples through diagrams.
</p>
<h3>On-Premises with Multicast</h3>
<p>
Multicast UDP is the primary and recommended transport for CoralSequencer, using an industry-established reliable multicast UDP protocol. It is not only more efficient for distributing messages across nodes, but it also offers useful features such as multicast discovery.
</p>
<p><center><img src="https://www.coralblocks.com/wp-content/uploads/2026/04/Screenshot-2026-04-03-at-9.35.44-AM.png" alt="OnPremMulticast" width="737" height="655" class="alignnone size-full wp-image-3216" /></center></p>
<p style="margin-top: 1px;">
The <font color="#6bdc7e">green</font> arrows represent multicast UDP connections, while the <font color="#72bcf9">blue</font> arrows represent TCP connections. The <strong>Bridge</strong> serves as a TCP entry point into the distributed system.</p>
<p/>
<p>
<strong><font color="#0d75c4">IMPORTANT:</font></strong> Below, we&#8217;ll see how to run the entire sequencer in the cloud using only TCP, with no multicast at all.
</p>
<p>
To make this diagram smaller, let&#8217;s abbreviate some components: Replayer = <strong>R</strong>, Logger = <strong>L</strong>, Bridge = <strong>B</strong>, Archiver = <strong>A</strong> and Sequencer = <strong>SEQ</strong>.<br />
<center><img src="https://www.coralblocks.com/wp-content/uploads/2026/04/Screenshot-2026-04-03-at-9.55.14-AM.png" alt="OnPremSmall2" width="381" height="333" class="alignnone size-full wp-image-3231" /></center>
</p>
<h3>Extending the Distributed System to the Cloud</h3>
<p>
When multicast UDP is not available, we can use an industry-established sequenced TCP protocol. Your choice of CoralSequencer transport protocol does not affect your application code or logic in any way.
</p>
<p><center><img src="https://www.coralblocks.com/wp-content/uploads/2026/04/Screenshot-2026-04-03-at-10.07.45-AM.png" alt="ExtensionToCloud" width="760" height="652" class="alignnone size-full wp-image-3242" /></center></p>
<p>
Note that we are extending our distributed system to the cloud through a <strong>single bridge-to-bridge TCP connection</strong>. The bridge on the cloud side provides connectivity to all cloud instances. It can have a backup bridge ready to take over in case it fails, so it does not become a single point of failure. You can also deploy more than one bridge on the cloud side to better distribute the load. It is important to understand that <font color="#72bcf9">bridges can be chained together to build any network graph</font>, but simplicity is often the best approach.
</p>
<p>
Also note that <strong>there is <em>no</em> multicast UDP connectivity in the cloud</strong>, only TCP and shared memory. The <strong>Dispatcher</strong> provides shared-memory connectivity to nodes on the same machine, in this case within the same cloud instance. It connects to the bridge over TCP and distributes all messages to all nodes through shared memory, using a single memory-mapped file. As a result, only a single TCP connection is required for the dispatcher, instead of one TCP socket per node.
</p>
<p>
<strong><font color="#0d75c4">IMPORTANT:</font></strong> Both the bridge and the dispatcher operate in <strong>full-duplex</strong>, handling both downstream messages and upstream commands.
</p>
<h3 style="margin-top: 35px;">Extending to Data Centers and External Clients</h3>
<p><center><img src="https://www.coralblocks.com/wp-content/uploads/2026/04/Screenshot-2026-04-03-at-10.39.50-AM.png" alt="SmallDataCenter" width="709" height="575" class="alignnone size-full wp-image-3252" /></center></p>
<h3>Pure TCP Sequencer Infrastructure</h3>
<p><center><img src="https://www.coralblocks.com/wp-content/uploads/2026/04/Screenshot-2026-04-03-at-10.51.55-AM.png" alt="PureTcpSeq" width="728" height="649" class="alignnone size-full wp-image-3265" /></center></p>
<p>Note that there are <em>no</em> multicast UDP connections anywhere, only TCP connections.</p>
<h3 style="margin-top: 35px;">Sequencer Deployed on the Cloud</h3>
<p><center><img src="https://www.coralblocks.com/wp-content/uploads/2026/04/Screenshot-2026-04-03-at-10.57.36-AM.png" alt="AllCloud" width="1060" height="794" class="alignnone size-full wp-image-3269" /></center></p>
<p>
There are <em>no</em> multicast UDP connections anywhere, only TCP. Different cloud regions are connected through bridge-to-bridge connections. Dispatchers reduce the number of TCP connections by using shared memory when running on the same cloud instance.
</p>
<h3 style="margin-top: 35px;">CoralSequencer Transport Protocols</h3>
<p>CoralSequencer offers a variety of transport protocols that can be used without any application code changes. Simply change a configuration setting from UDP to TCP and your application is ready to be deployed with a completely different transport protocol. The available CoralSequencer transport protocols are listed below:
</p>
<ul>
<li style="margin-top: 15px;"><strong>Multicast UDP</strong>: The primary and recommended transport for CoralSequencer, using an industry-established reliable multicast UDP protocol.</li>
<li style="margin-top: 15px;"><strong>TCP</strong>: The transport used by CoralSequencer when multicast UDP is not available (such as in the cloud) or not desirable (such as for external clients), using an industry-established sequenced TCP protocol.</li>
<li style="margin-top: 15px;"><strong>Shared Memory</strong>: The transport used within the same machine or cloud instance to minimize the number of network connections. The Dispatcher is the component that provides connectivity through shared memory to nodes on the same machine or cloud instance.</li>
<li style="margin-top: 15px;"><strong>Dual (UDP + TCP)</strong>: Two redundant connections, one TCP and one unicast UDP, streaming identical messages. The receiving side processes whichever arrives first and discards the duplicate.</li>
<li style="margin-top: 15px;"><strong>Fuse (reliable UDP)</strong>: One unicast UDP connection streaming messages, along with an idle TCP connection used for retransmission of lost messages. The receiving side requests retransmission of any gaps through the TCP connection.</li>
</ul>
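<p>As a minimal sketch of the arbitration idea behind the <strong>Dual</strong> transport (illustrative only; this is not CoralSequencer&#8217;s actual API), the receiving side can track the next expected sequence number, process whichever copy arrives first, and drop the duplicate from the slower line:</p>

```java
// Illustrative dual-line arbiter (not CoralSequencer code): two redundant
// connections deliver identical, sequenced messages; we process whichever
// copy of each sequence number arrives first and drop the duplicate.
public class DualFeedArbiter {

    private long nextExpected = 1; // next sequence number we want to deliver

    // Called for every message from either line; returns true if the message
    // should be processed, false if it is a duplicate (or an out-of-order gap).
    public boolean onMessage(long seq) {
        if (seq == nextExpected) {
            nextExpected++;
            return true;
        }
        if (seq < nextExpected) return false; // duplicate already seen on the other line
        // seq > nextExpected means a gap on both lines; a real implementation
        // would buffer the message and request retransmission of the gap
        return false;
    }
}
```

<p>The same gap-detection logic is what the <strong>Fuse</strong> transport relies on: on a gap, the receiver requests retransmission over the idle TCP connection instead of waiting for a second line.</p>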
<h3 style="margin-top: 35px;">Conclusion</h3>
<p>
CoralSequencer is designed to adapt to the infrastructure you have, rather than forcing you into a single network model. Whether your deployment is fully on premises, fully in the cloud, or split across both, you can combine multicast and unicast UDP, TCP, shared memory, bridges, and dispatchers to build a topology that matches your performance, reliability, high availability, and operational requirements. Most importantly, these transport choices do not require application code changes, enabling incremental evolution of your infrastructure while preserving deterministic behavior and message ordering across the system.</p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/on-premises-and-cloud-infrastructure-with-coralsequencer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Building a first-class exchange architecture with CoralSequencer</title>
		<link>https://www.coralblocks.com/index.php/building-a-first-class-exchange-architecture-with-coralsequencer/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=building-a-first-class-exchange-architecture-with-coralsequencer</link>
		<comments>https://www.coralblocks.com/index.php/building-a-first-class-exchange-architecture-with-coralsequencer/#comments</comments>
		<pubDate>Wed, 30 Jun 2021 22:17:24 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralSequencer]]></category>
		<category><![CDATA[ECN]]></category>
		<category><![CDATA[exchange]]></category>
		<category><![CDATA[matching engine]]></category>
		<category><![CDATA[messaging middleware]]></category>
		<category><![CDATA[MQ]]></category>
		<category><![CDATA[Sequencer]]></category>

		<guid isPermaLink="false">http://www.coralblocks.com/index.php/?p=2512</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<style>
* { font-size: 101%; }
.li_facts { margin: 0 0 17px 0; }
</style>
<p style="margin-top: 20px;">In this article we will explore an architecture employed by some of the most sophisticated electronic exchanges, the ones that need to handle millions of orders per day with ultra-low latency and high availability. Most of the exchange&#8217;s main architecture components will be presented as <strong>CoralSequencer nodes</strong> and discussed through diagrams. The goal of this article is to demonstrate how a <strong><a href="https://en.wikipedia.org/wiki/Atomic_broadcast" target="_blank">total-ordered messaging middleware</a></strong> such as CoralSequencer naturally enables the implementation and control of complex distributed systems through a tight integration of all their moving parts. This article doesn&#8217;t mention specifics about any exchange&#8217;s internal systems and instead talks about the big picture and the general concepts, which will vary from exchange to exchange.<span id="more-2512"></span>
</p>
<p style="margin-top: 20px; margin-bottom: 20px; font-size: 1.1em;">
<pre>
You should also check <a href="https://www.youtube.com/watch?v=DyktSiBTCdk" target="_blank"><b>our Sequencer Architecture YouTube video</b></a> where we present the main characteristics of the sequencer architecture
together with some advanced features of CoralSequencer. The video link is <a href="https://www.youtube.com/watch?v=DyktSiBTCdk" target="_blank">https://www.youtube.com/watch?v=DyktSiBTCdk</a>.
</pre>
<pre>
You should also check our open-source matching engine =&gt; <a href="https://www.github.com/coralblocks/CoralME" target="_blank"><b>CoralME</b> (https://www.github.com/coralblocks/CoralME)</a>.
</pre>
</p>
<h3 class="coral" style="margin-top: 40px;">TCP Order Port</h3>
<p><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/OrderPorts3.png" alt="OrderPorts3" width="2052" height="1546" class="aligncenter size-full wp-image-2574" /></p>
<p>Although it is technically possible to run more than one Order Port server inside the same node, each node will usually run a single Order Port server (FIX, BINARY or <em>XXXX</em>) capable of accepting several gateway connections from customers. The same JVM can be configured to run several nodes, each node running an Order Port server, each server listening to incoming gateway connections on its own network port. <strong>And all that is run by a single critical high-performance thread</strong>, pinned to an isolated CPU core. When we say single-thread, we actually mean it. The handling of the messaging middleware (sending and receiving messages to and from CoralSequencer) is also done by this same critical thread, through non-blocking network I/O operations.</p>
<h3 class="coral">Order Port High-Availability and Failover</h3>
<p><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/OrderPortsFailover3.png" alt="OrderPortsFailover3" width="2000" height="1504" class="aligncenter size-full wp-image-2575" /></p>
<p>With <strong>hot-hot</strong> failover, the customer can literally <em>pull the plug</em> from one of his clustered gateway machines, for a zero-downtime failover. Note that the same is true on the exchange side: the exchange can <em>pull the plug</em> from any Order Port instance for a zero-downtime failover.</p>
<p>With <strong>hot-warm</strong> failover, there is a small downtime as the backup Order Port node needs to be activated (manually or automatically) and the customer gateway needs to connect to the new machine, either through a virtual IP change on the exchange side or through a destination IP change on the customer side.</p>
<h3 class="coral">Matching Engine</h3>
<p><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/MatchingEngine.png" alt="MatchingEngine" width="1924" height="1410" class="aligncenter size-full wp-image-2532" /></p>
<p>The matching engine is the <em>brain</em> of the exchange, where buy orders are matched against sell orders (and vice-versa). It builds and maintains a double-sided (bids and asks) liquidity order book for each symbol and matches incoming orders accordingly.</p>
<p>The matching engine is critical (i.e. can never stop) and needs to be as fast as possible (i.e. can never block) as all other exchange components depend on it.</p>
<p>Instead of using one matching engine for all symbols, it makes a lot of sense to use a <em>sharding strategy</em> to spread the load across several matching engines. For example, an exchange may choose to have one matching engine for symbol names beginning with A-G, another one for H-N and another one for O-Z. Of course a good sharding strategy will depend on how active each symbol is, so as to spread the load as evenly as possible.</p>
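<p>The sharding rule above is simple enough to show in a few lines. The sketch below (hypothetical code, not part of any real exchange) routes a symbol to one of three matching-engine shards by its first letter, exactly as in the A-G / H-N / O-Z example:</p>

```java
// Hypothetical symbol-sharding helper: A-G -> shard 0, H-N -> shard 1,
// O-Z (and everything else) -> shard 2. A production sharding strategy
// would instead weight the shards by how active each symbol is.
public final class SymbolSharding {

    private SymbolSharding() {} // static helper, not meant to be instantiated

    public static int shardFor(String symbol) {
        char first = Character.toUpperCase(symbol.charAt(0));
        if (first >= 'A' && first <= 'G') return 0;
        if (first >= 'H' && first <= 'N') return 1;
        return 2;
    }
}
```

<p>Each shard would then run as its own matching-engine node, all of them reading the same ordered event-stream and dropping orders for symbols outside their range.</p>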
<h3 class="coral">TCP Market Data</h3>
<p><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/MarketDataTCP.png" alt="MarketDataTCP" width="1862" height="1432" class="aligncenter size-full wp-image-2536" /></p>
<p>TCP market data is slower but has some advantages over Multicast UDP market data. First, the customer market data feed can subscribe to receive only the symbols it is interested in. With Multicast UDP market data, all symbols are pushed to the customer: not every symbol on the exchange, but all symbols present in that Multicast UDP channel as defined by the sharding strategy. Second, upon subscription, the customer usually receives a snapshot of the entire market data book. With Multicast UDP, only incremental updates are sent, never snapshots. And third, because the customer can identify himself over the TCP connection, it is usually easy for the exchange to flag the market data orders belonging to that customer. That&#8217;s useful information the exchange customer can use to prevent <em>trading to himself</em>, which a compliant exchange might choose to disallow through order rejects.</p>
<h3 class="coral">Multicast UDP Market Data</h3>
<p><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/MarketDataUDP2.png" alt="MarketDataUDP2" width="2014" height="1622" class="aligncenter size-full wp-image-2542" /></p>
<p>An exchange can (and will) have thousands of customers interested in connectivity for market data. Maintaining a TCP market data channel for each one of them can quickly become impractical. The solution is to push out market data through multicast UDP to whoever wants to listen. That&#8217;s usually done through private network lines and colocation, never through the Internet, due to the unreliability of UDP and because multicast is not supported over the public Internet.</p>
<p>It is important to note that a multicast UDP market data node will only broadcast incremental market data updates, never book snapshots. Therefore, some special logic needs to be implemented on the customer side: first, listen to the multicast UDP market data channel; buffer the initial incremental update messages; request the initial market data book (i.e. a snapshot) from a TCP market data channel; close the TCP connection; apply the buffered messages; then start applying the live incremental updates from the UDP channel.</p>
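<p>The recovery sequence above can be sketched as follows (a simplified, hypothetical example, not any exchange&#8217;s real API): incremental updates are buffered until the TCP snapshot arrives, buffered updates already covered by the snapshot are discarded, the rest are applied, and only then does the feed go live:</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simplified, hypothetical snapshot-plus-incrementals recovery for a UDP
// market data feed: buffer incremental updates, fetch a TCP book snapshot,
// skip buffered updates the snapshot already covers, apply the rest, go live.
public class BookRecovery {

    private final Queue<long[]> buffered = new ArrayDeque<>(); // {seq, update}
    private boolean live = false;

    // sequence numbers applied to the book, recorded here for clarity
    public final List<Long> applied = new ArrayList<>();

    // incremental update arriving from the multicast UDP channel
    public void onUdpIncremental(long seq, long update) {
        if (!live) {
            buffered.add(new long[] { seq, update }); // not live yet: buffer it
        } else {
            apply(seq, update);
        }
    }

    // book snapshot from the TCP channel, covering everything up to lastSeq
    public void onTcpSnapshot(long lastSeq) {
        for (long[] m; (m = buffered.poll()) != null; ) {
            if (m[0] > lastSeq) apply(m[0], m[1]); // skip already-covered updates
        }
        live = true; // from now on, apply live UDP updates directly
    }

    private void apply(long seq, long update) {
        applied.add(seq); // a real feed would update the order book here
    }
}
```

<p>A real feed would also detect sequence-number gaps after going live and either request retransmission or start the whole recovery over with a fresh snapshot.</p>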
<p>Because of the unreliable nature of UDP, the exchange will usually multicast the market data over two redundant channels. The customer market data feed will listen to both at the same time and consume the packet that arrives first, discarding the slower duplicate. That minimizes the chances of gaps in the line (i.e. lost packets), as losing the same packet on both lines becomes less likely, though still possible. Some exchanges will provide retransmission servers, especially for those customers that want or need to have all incremental updates for historical reference. If a retransmission server is not provided and/or the customer does not want to go through the trouble of implementing the retransmission request logic, the customer market data feed can simply start from scratch by requesting a new book snapshot after a packet loss.</p>
<p>Last but not least, the customer can use the order id present in the market data incremental updates and check it against its open orders to find out whether that order is its own. As noted earlier, that&#8217;s useful information the exchange customer can use to prevent illegal <em>trades-to-self</em>.</p>
<h3 class="coral">TCP Drop Copy &nbsp;<font size="4px;"><i>( position, position, position )</i></font></h3>
<p><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/DropCopy.png" alt="DropCopy" width="2014" height="1106" class="aligncenter size-full wp-image-2549" /></p>
<p>For an exchange customer to rely solely on its gateways to track the position it is trading and maintaining can be very dangerous. It is always possible for a gateway bug to slip into production and create large losses without the customer even knowing about it. Therefore it is important to always match the position the gateway is reporting against the position the drop copy is reporting. A mismatch there is a serious indication that something is very wrong and needs to be addressed by the exchange customer immediately.</p>
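<p>The position check described above can be sketched as follows. <em>PositionReconciler</em> and its methods are hypothetical names used for illustration only, not a real drop copy API:</p>

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reconciliation of the position reported by the gateway
// against the position reported by the drop copy feed.
public class PositionReconciler {

    private final Map<String, Long> gatewayPos = new HashMap<>();
    private final Map<String, Long> dropCopyPos = new HashMap<>();

    void onGatewayFill(String symbol, long qty) {
        gatewayPos.merge(symbol, qty, Long::sum);
    }

    void onDropCopyFill(String symbol, long qty) {
        dropCopyPos.merge(symbol, qty, Long::sum);
    }

    // Returns true if both views agree for the symbol; a mismatch is a
    // serious red flag that must be investigated immediately.
    boolean matches(String symbol) {
        long g = gatewayPos.getOrDefault(symbol, 0L);
        long d = dropCopyPos.getOrDefault(symbol, 0L);
        return g == d;
    }
}
```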
<h3 class="coral">And Much More</h3>
<p><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/MuchMore.png" alt="MuchMore" width="1680" height="1216" class="aligncenter size-full wp-image-2564" /></p>
<p>Because all CoralSequencer nodes see the exact same set of messages, in the exact same order, <strong>always</strong>, the exchange can keep evolving by adding more nodes to perform any sort of work/action/logic/control without disrupting any other part of the distributed system.</p>
<p>To summarize, CoralSequencer provides the following features which are critical for a first-class electronic exchange: <strong>parallelism</strong> (nodes can truly run in parallel); <strong>tight integration</strong> (all nodes see the same messages in the same order); <strong>decoupling</strong> (nodes can evolve independently); <strong>high-availability/failover</strong> (when a node fails, another one can be running and building state to take over immediately); <strong>scalability/load balancing</strong> (just add more nodes); <strong>elasticity</strong> (nodes can lag during activity peaks without affecting the system as a whole); and <strong>resiliency</strong> (nodes can fail / stop working without taking the whole system down).</p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/building-a-first-class-exchange-architecture-with-coralsequencer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Shared Memory Transport x Multicast Transport</title>
		<link>https://www.coralblocks.com/index.php/shared-memory-transport-x-multicast-transport/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=shared-memory-transport-x-multicast-transport</link>
		<comments>https://www.coralblocks.com/index.php/shared-memory-transport-x-multicast-transport/#comments</comments>
		<pubDate>Thu, 14 Oct 2021 12:43:35 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralSequencer]]></category>
		<category><![CDATA[fan-out]]></category>
		<category><![CDATA[multicast]]></category>
		<category><![CDATA[Sequencer]]></category>
		<category><![CDATA[shared memory]]></category>

		<guid isPermaLink="false">http://www.coralblocks.com/index.php/?p=2637</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>A CPU core is a scarce resource in high demand from all the different processes running on a machine. When choosing how to allocate your nodes to the available CPU cores, CoralSequencer gives you the option to run several nodes inside the same NioReactor thread, which in turn is pinned to an isolated CPU core. As the number of nodes inside the same CPU core grows, fan-out might become a latency issue, as the thread now has to cycle through all nodes as it reads multicast messages from the event-stream. In this article we explain (with diagrams) how you can deal with that scenario by choosing CoralSequencer&#8217;s shared-memory transport instead of multicast at the node machine level.<span id="more-2637"></span></p>
<p><br/></p>
<h4>Multicast Fan-Out</h4>
<p>So let&#8217;s say you have 20 nodes running on the same thread. After Node 1 reads its multicast message, it will have to wait for all other 19 nodes to read the multicast message before it can proceed to read the next message. The network card receiving the multicast message has to make 20 copies of the message, placing each copy in the individual <em>underlying native receive socket buffer</em> of each node. The diagram below illustrates this scenario:</p>
<p><a href="http://www.coralblocks.com/wp-content/uploads/2021/10/Screen-Shot-2021-10-14-at-9.32.58-AM.png"><img src="http://www.coralblocks.com/wp-content/uploads/2021/10/Screen-Shot-2021-10-14-at-9.32.58-AM.png" alt="Screen Shot 2021-10-14 at 9.32.58 AM" width="1824" height="1274" class="alignnone size-full wp-image-2659" /></a></p>
<p>
So that raises the question: <em><strong>How will an individual node&#8217;s latency be affected as we keep adding more nodes to its thread?</strong></em> Below are the results of the benchmark tests:
</p>
<p><br/></p>
<pre>
1 Node: <font color="blue"><em>(baseline or zero fan-out)</em></font>

20:00:56.170513-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">4.503 micros</font></strong> | Min Time: 3.918 micros | Max Time: 704.095 micros | 75% = [avg: 4.272 micros, max: 4.69 micros] | 90% = [avg: 4.35 micros, max: 4.799 micros] | 99% = [avg: 4.41 micros, max: 5.783 micros] | 99.9% = [avg: 4.425 micros, max: 7.037 micros] | 99.99% = [avg: 4.468 micros, max: 264.298 micros] | 99.999% = [avg: 4.497 micros, max: 436.127 micros]

2 Nodes:

20:22:48.562317-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">4.556 micros</font></strong> | Min Time: 3.983 micros | Max Time: 650.027 micros | 75% = [avg: 4.227 micros, max: 4.696 micros] | 90% = [avg: 4.38 micros, max: 5.213 micros] | 99% = [avg: 4.462 micros, max: 5.706 micros] | 99.9% = [avg: 4.478 micros, max: 6.871 micros] | 99.99% = [avg: 4.519 micros, max: 274.402 micros] | 99.999% = [avg: 4.55 micros, max: 490.207 micros]

3 Nodes:

20:21:33.572903-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">5.614 micros</font></strong> | Min Time: 3.943 micros | Max Time: 691.881 micros | 75% = [avg: 5.224 micros, max: 5.954 micros] | 90% = [avg: 5.355 micros, max: 6.098 micros] | 99% = [avg: 5.503 micros, max: 7.64 micros] | 99.9% = [avg: 5.527 micros, max: 9.422 micros] | 99.99% = [avg: 5.582 micros, max: 253.727 micros] | 99.999% = [avg: 5.608 micros, max: 398.772 micros]

4 Nodes:

20:20:08.231927-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">6.728 micros</font></strong> | Min Time: 4.555 micros | Max Time: 464.322 micros | 75% = [avg: 6.562 micros, max: 6.681 micros] | 90% = [avg: 6.587 micros, max: 6.76 micros] | 99% = [avg: 6.619 micros, max: 7.348 micros] | 99.9% = [avg: 6.631 micros, max: 9.444 micros] | 99.99% = [avg: 6.702 micros, max: 208.45 micros] | 99.999% = [avg: 6.724 micros, max: 331.632 micros]

5 Nodes:

20:04:31.776798-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">7.982 micros</font></strong> | Min Time: 5.433 micros | Max Time: 364.291 micros | 75% = [avg: 7.667 micros, max: 8.355 micros] | 90% = [avg: 7.795 micros, max: 8.509 micros] | 99% = [avg: 7.872 micros, max: 9.093 micros] | 99.9% = [avg: 7.886 micros, max: 11.372 micros] | 99.99% = [avg: 7.959 micros, max: 190.152 micros] | 99.999% = [avg: 7.978 micros, max: 278.21 micros]

10 Nodes:

20:10:03.527401-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">13.555 micros</font></strong> | Min Time: 7.098 micros | Max Time: 266.888 micros | 75% = [avg: 12.773 micros, max: 15.157 micros] | 90% = [avg: 13.201 micros, max: 15.634 micros] | 99% = [avg: 13.436 micros, max: 16.061 micros] | 99.9% = [avg: 13.475 micros, max: 58.272 micros] | 99.99% = [avg: 13.539 micros, max: 132.831 micros] | 99.999% = [avg: 13.552 micros, max: 189.882 micros]

15 Nodes:

20:14:16.156609-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">16.647 micros</font></strong> | Min Time: 11.668 micros | Max Time: 189.303 micros | 75% = [avg: 15.769 micros, max: 17.76 micros] | 90% = [avg: 16.205 micros, max: 19.094 micros] | 99% = [avg: 16.522 micros, max: 20.563 micros] | 99.9% = [avg: 16.58 micros, max: 55.764 micros] | 99.99% = [avg: 16.636 micros, max: 107.489 micros] | 99.999% = [avg: 16.645 micros, max: 143.581 micros]

20 Nodes:

20:18:00.346389-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">21.842 micros</font></strong> | Min Time: 13.524 micros | Max Time: 170.239 micros | 75% = [avg: 20.484 micros, max: 24.261 micros] | 90% = [avg: 21.237 micros, max: 25.691 micros] | 99% = [avg: 21.711 micros, max: 27.423 micros] | 99.9% = [avg: 21.788 micros, max: 59.232 micros] | 99.99% = [avg: 21.832 micros, max: 92.211 micros] | 99.999% = [avg: 21.84 micros, max: 132.365 micros]
</pre>
<p><br/></p>
<h4>Shared Memory Fan-Out through the Dispatcher node</h4>
<p>When you use CoralSequencer&#8217;s <strong>Dispatcher</strong> node, only the dispatcher reads the multicast messages from the network card and writes them to shared memory. Any other node running on that machine can then choose to read the messages from that shared memory (from the dispatcher) instead of from multicast (from the network card). The diagram below illustrates this scenario:</p>
<p><a href="http://www.coralblocks.com/wp-content/uploads/2021/10/Screen-Shot-2021-10-14-at-9.37.05-AM.png"><img src="http://www.coralblocks.com/wp-content/uploads/2021/10/Screen-Shot-2021-10-14-at-9.37.05-AM.png" alt="Screen Shot 2021-10-14 at 9.37.05 AM" width="1720" height="1144" class="alignnone size-full wp-image-2663" /></a></p>
<p><em><strong>Is reading from shared memory any faster than reading from the network, even when we are using non-blocking network reads?</strong></em> Let&#8217;s repeat the same benchmark tests to find out:<br />
<br/></p>
<pre>
1 Node: <font color="blue"><em>(baseline or zero fan-out)</em></font>

06:28:56.524909-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">5.136 micros</font></strong> | Min Time: 4.173 micros | Max Time: 594.837 micros | 75% = [avg: 4.903 micros, max: 5.171 micros] | 90% = [avg: 4.954 micros, max: 5.46 micros] | 99% = [avg: 5.021 micros, max: 6.042 micros] | 99.9% = [avg: 5.034 micros, max: 7.875 micros] | 99.99% = [avg: 5.102 micros, max: 268.368 micros] | 99.999% = [avg: 5.131 micros, max: 419.933 micros]

2 Nodes:

06:30:40.480024-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">5.42 micros</font></strong> | Min Time: 4.246 micros | Max Time: 581.791 micros | 75% = [avg: 5.156 micros, max: 5.591 micros] | 90% = [avg: 5.239 micros, max: 5.754 micros] | 99% = [avg: 5.303 micros, max: 6.321 micros] | 99.9% = [avg: 5.316 micros, max: 7.977 micros] | 99.99% = [avg: 5.385 micros, max: 292.002 micros] | 99.999% = [avg: 5.415 micros, max: 459.128 micros]

3 Nodes:

06:32:19.770820-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">5.649 micros</font></strong> | Min Time: 4.174 micros | Max Time: 1.023 millis | 75% = [avg: 5.366 micros, max: 5.793 micros] | 90% = [avg: 5.457 micros, max: 6.059 micros] | 99% = [avg: 5.527 micros, max: 6.491 micros] | 99.9% = [avg: 5.54 micros, max: 8.244 micros] | 99.99% = [avg: 5.614 micros, max: 282.519 micros] | 99.999% = [avg: 5.643 micros, max: 453.64 micros]

4 Nodes:

06:35:39.689059-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">5.77 micros</font></strong> | Min Time: 4.26 micros | Max Time: 645.956 micros | 75% = [avg: 5.479 micros, max: 5.948 micros] | 90% = [avg: 5.577 micros, max: 6.195 micros] | 99% = [avg: 5.65 micros, max: 6.781 micros] | 99.9% = [avg: 5.664 micros, max: 8.362 micros] | 99.99% = [avg: 5.736 micros, max: 264.038 micros] | 99.999% = [avg: 5.764 micros, max: 437.225 micros]

5 Nodes:

06:40:42.531026-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">6.291 micros</font></strong> | Min Time: 4.361 micros | Max Time: 753.637 micros | 75% = [avg: 5.929 micros, max: 6.636 micros] | 90% = [avg: 6.068 micros, max: 6.912 micros] | 99% = [avg: 6.164 micros, max: 7.488 micros] | 99.9% = [avg: 6.18 micros, max: 9.14 micros] | 99.99% = [avg: 6.256 micros, max: 272.228 micros] | 99.999% = [avg: 6.285 micros, max: 451.461 micros]

10 Nodes:

06:44:12.332180-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">6.722 micros</font></strong> | Min Time: 4.478 micros | Max Time: 509.009 micros | 75% = [avg: 6.245 micros, max: 7.258 micros] | 90% = [avg: 6.455 micros, max: 7.787 micros] | 99% = [avg: 6.599 micros, max: 8.502 micros] | 99.9% = [avg: 6.618 micros, max: 10.021 micros] | 99.99% = [avg: 6.693 micros, max: 235.364 micros] | 99.999% = [avg: 6.718 micros, max: 370.217 micros]

15 Nodes:

06:48:31.743422-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">7.531 micros</font></strong> | Min Time: 4.759 micros | Max Time: 499.443 micros | 75% = [avg: 7.028 micros, max: 7.884 micros] | 90% = [avg: 7.202 micros, max: 8.31 micros] | 99% = [avg: 7.373 micros, max: 10.167 micros] | 99.9% = [avg: 7.402 micros, max: 12.745 micros] | 99.99% = [avg: 7.499 micros, max: 251.521 micros] | 99.999% = [avg: 7.526 micros, max: 403.692 micros]

20 Nodes:

06:53:27.062402-INFO NODE1 Finished latency test! msgSize=16 results=Iterations: 1,000,000 | Avg Time: <strong><font color="blue">9.506 micros</font></strong> | Min Time: 4.866 micros | Max Time: 438.516 micros | 75% = [avg: 9.003 micros, max: 10.057 micros] | 90% = [avg: 9.213 micros, max: 10.482 micros] | 99% = [avg: 9.356 micros, max: 11.654 micros] | 99.9% = [avg: 9.381 micros, max: 17.448 micros] | 99.99% = [avg: 9.479 micros, max: 217.556 micros] | 99.999% = [avg: 9.502 micros, max: 348.865 micros]
</pre>
<p><br/></p>
<h4>Graph Comparison</h4>
<p><a href="http://www.coralblocks.com/wp-content/uploads/2021/10/Screen-Shot-2021-10-14-at-8.49.14-AM.png"><img src="http://www.coralblocks.com/wp-content/uploads/2021/10/Screen-Shot-2021-10-14-at-8.49.14-AM-1024x513.png" alt="Screen Shot 2021-10-14 at 8.49.14 AM" width="600" height="300" class="alignnone size-large wp-image-2647" /></a></p>
<p><br/></p>
<h4>FAQ:</h4>
<style>
.li_faq { margin: 0 0 30px 0; }
</style>
<ol style="padding: 12px 40px">
<li class="li_faq"><font color="#26619b"><strong>Does the dispatcher use a circular buffer as shared memory?</strong></font><br />
Yes, and because it is shared-memory through a memory-mapped file, it can be made as large as your disk space can handle. Furthermore, the circular buffer provides safety guards so that a consumer is not reading a message that is currently being written by the producer.
</li>
<li class="li_faq"><font color="#26619b"><strong>Do the nodes read from the same shared-memory space?</strong></font><br />
Yes, the dispatcher writes once to the shared-memory space. All nodes read from this same shared-memory space.
</li>
<li class="li_faq"><font color="#26619b"><strong>Can nodes in different threads/JVMs be reading from the same dispatcher at the same time?</strong></font><br />
Yes, the dispatcher can handle multiple threads consuming the same messages concurrently. They do not need to be in the same thread as they are in this example; they were placed in the same thread here to illustrate and address the fan-out issue. If the nodes read from multiple threads, the fan-out issue goes away, but you still gain by removing load from the network card, which no longer has to copy the same message to multiple underlying socket buffers. With the dispatcher, the network card delivers each message only once (to the dispatcher, which writes it to the shared-memory store) and all the nodes read the messages from that same shared-memory store.
</li>
</ol>
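<p>To make the circular-buffer idea from the FAQ concrete, below is a much-simplified, single-threaded sketch. The real dispatcher backs the buffer with a memory-mapped file and uses memory barriers for cross-process visibility; none of that is shown here, and all names are illustrative:</p>

```java
// Much-simplified sketch of a dispatcher-style circular buffer: one
// producer writes each message once, many consumers read the same slots
// at their own pace, each tracking its own consumer sequence.
public class RingBuffer {

    private final long[] slots;       // stands in for the shared-memory region
    private long producerSeq = 0;     // next sequence the producer will write

    RingBuffer(int capacity) { slots = new long[capacity]; }

    void publish(long msg) {
        slots[(int) (producerSeq % slots.length)] = msg;
        producerSeq++;                // in real code, a store fence goes here
    }

    // A consumer may only read sequences the producer has already published
    // and that have not been overwritten: the "safety guards" from the FAQ.
    long read(long consumerSeq) {
        if (consumerSeq >= producerSeq) {
            throw new IllegalStateException("not published yet");
        }
        if (producerSeq - consumerSeq > slots.length) {
            throw new IllegalStateException("slot overwritten: consumer too slow");
        }
        return slots[(int) (consumerSeq % slots.length)];
    }
}
```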
<p><br/></p>
<h4>Conclusion</h4>
<p>Adding a shared-memory dispatcher improves latency in fan-out scenarios, when for some reason you need/want to run several nodes inside the same NioReactor thread (i.e. on the same CPU core). The dispatcher takes the load off the network card, which now has to deliver each message only once (to the dispatcher) instead of 20 times (to each of the 20 nodes).</p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/shared-memory-transport-x-multicast-transport/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Is CoralSequencer really deterministic? What about the clock?</title>
		<link>https://www.coralblocks.com/index.php/is-coralsequencer-really-deterministic-what-about-the-clock/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=is-coralsequencer-really-deterministic-what-about-the-clock</link>
		<comments>https://www.coralblocks.com/index.php/is-coralsequencer-really-deterministic-what-about-the-clock/#comments</comments>
		<pubDate>Wed, 23 Jun 2021 00:10:14 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralSequencer]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[cluster]]></category>
		<category><![CDATA[determinism]]></category>
		<category><![CDATA[failover]]></category>
		<category><![CDATA[messaging]]></category>
		<category><![CDATA[messaging queue]]></category>
		<category><![CDATA[MQ]]></category>
		<category><![CDATA[Sequencer]]></category>
		<category><![CDATA[total ordered]]></category>

		<guid isPermaLink="false">http://www.coralblocks.com/index.php/?p=2430</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In this article we explore the deterministic nature of CoralSequencer and how its <a href="https://en.wikipedia.org/wiki/Atomic_broadcast" target="_blank">total ordered message stream</a> is a natural enabler of high-availability clusters. <span id="more-2430"></span></p>
<p>One of the most important features of CoralSequencer is the guarantee that all nodes will consume the exact same set of messages in the exact same order, always. What follows from this premise is that all nodes can become deterministic <a href="https://en.wikipedia.org/wiki/Finite-state_machine" target="_blank">finite-state machines</a> (FSMs), where the same input messages will always transition the node into the exact same state. In other words, when a node starts from scratch (i.e. an initial blank state) and consumes the same input messages, its final state will always be the same. That allows a backup node to late-join a CoralSequencer session and operate as a mirror (i.e. exact copy) of the primary node, in what is called a <a href="https://en.wikipedia.org/wiki/High-availability_cluster" target="_blank">high-availability cluster</a>.</p>
<p>When we say that CoralSequencer is <em>deterministic</em>, what we are saying is that CoralSequencer supports nodes whose state is deterministic with respect to the event-stream messages. That feature allows for the creation of clusters of nodes that can be used for high-availability and failover. But what about the non-deterministic nature of clocks?</p>
<p>To explore this issue, let&#8217;s code a simple matching engine that will work as a FSM for high-availability and failover:</p>
<pre class="brush: java; highlight: [55]; title: ; notranslate">
package com.coralblocks.coralsequencer.node;

import static com.coralblocks.corallog.Log.*;

import com.coralblocks.coralbits.util.DateTimeUtils;
import com.coralblocks.coralreactor.nio.NioReactor;
import com.coralblocks.coralreactor.util.Configuration;
import com.coralblocks.coralsequencer.message.Message;
import com.coralblocks.coralsequencer.mq.Node;

public class MatchingEngineNode extends Node {
	
	private long evenMsg;
	private long previousMatchTime;
	
	private final StringBuilder sb = new StringBuilder(64);

	public MatchingEngineNode(NioReactor nio, String name, Configuration config) {
		super(nio, name, config);
	}
	
	@Override
	protected void handleOpened() {
		// initial/blank state...
		evenMsg = -1;
		previousMatchTime = -1;
	}
	
	@Override
	protected void handleRewinded() { // caught up with live event-stream...
		
		sb.setLength(0);
		if (previousMatchTime != -1) {
			DateTimeUtils.formatDateTimeInMillis(previousMatchTime, sb);
		} else {
			sb.append(&quot;-1&quot;);
		}
		
		Sysout.log(name, 
				  &quot;State after catching up with live event-stream:&quot;,
				  &quot;evenMsg=&quot;, evenMsg, &quot;previousMatchTime=&quot;, sb);
	}
	
	@Override
	protected void handleMessage(boolean isMine, Message msg) {
		
		if (isMine) return; // I'm not going to match my own messages...
		
		long seq = msg.getSequence();
		
		if (seq % 2 != 0) return; // I only match even sequence numbers...
		
		if (evenMsg &gt; 0) {
			
			long nowInMillis = System.currentTimeMillis(); // non-deterministic clock
			
			sb.setLength(0);
			sb.append(&quot;MATCHED &quot;);
			sb.append(evenMsg).append(&quot; =&gt; &quot;).append(seq);
			sb.append(&quot; @ &quot;);
			DateTimeUtils.formatDateTimeInMillis(nowInMillis, sb);
			sb.append(&quot; previous=&quot;);
			if (previousMatchTime != -1) {
				DateTimeUtils.formatDateTimeInMillis(previousMatchTime, sb);
			} else {
				sb.append(&quot;-1&quot;);
			}
			
			Sysout.log(name, &quot;isRewinding=&quot;, isRewinding(), sb);
			
			sendCommand(sb);
			
			previousMatchTime = nowInMillis;
			
			evenMsg = -1;
			
		} else {
			
			evenMsg = seq;
			
		}
	}
}
</pre>
<p>The logic above matches messages that have an even sequence number, skipping its own messages. Below is the output of this <em>MatchingEngineNode</em> when it sees some live messages in the event-stream:</p>
<p><a href="http://www.coralblocks.com/wp-content/uploads/2021/06/D7.png" target="_blank"><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/D7.png" alt="D7" width="2156" height="1206" class="aligncenter size-full wp-image-2478" /></a></p>
<p><a href="http://www.coralblocks.com/wp-content/uploads/2021/06/D2.png" target="_blank"><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/D2.png" alt="D2" width="2670" height="844" class="aligncenter size-full wp-image-2456" /></a></p>
<p>Now when we go ahead and start a second node instance to form a cluster, we notice <font color="red" face="verdana">a state inconsistency that breaks determinism</font>. <strong><font color="red" face="verdana">And the cluster!</font></strong></p>
<p><a href="http://www.coralblocks.com/wp-content/uploads/2021/06/D3.png" target="_blank"><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/D3.png" alt="D3" width="2326" height="1228" class="aligncenter size-full wp-image-2458" /></a></p>
<p>The code is using <code>System.currentTimeMillis()</code> to compute the timestamp, which clearly returns a different value when called in the future by a node joining the cluster late. The solution is not to use this non-deterministic clock, resorting instead to CoralSequencer&#8217;s deterministic event-stream clock. Below is the single-line change to the <em>MatchingEngineNode</em> code that fixes everything:</p>
<pre class="brush: java; title: ; notranslate">
// Instead of this:
long nowInMillis = System.currentTimeMillis(); // non-deterministic clock

// We should use this:
long nowInMillis = currentSequencerTime() / 1000000L; // deterministic clock
</pre>
<p>The <code>currentSequencerTime()</code> method returns the time as determined by the sequencer and placed in the event-stream, so it will always return the same value for the message the node is currently consuming. In other words, it always returns the same value for the same position in the event-stream, no matter when the node calls it, now or two hours later. It returns the epoch time in nanoseconds, so we divide by 1,000,000 to get the epoch time in milliseconds, which is what we need.</p>
<p>With this change in the code, we now run the same experiment again, starting everything from scratch, with a new CoralSequencer session. The first node instance:</p>
<p><a href="http://www.coralblocks.com/wp-content/uploads/2021/06/D4.png" target="_blank"><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/D4.png" alt="D4" width="2178" height="1222" class="aligncenter size-full wp-image-2466" /></a></p>
<p>The second node instance:</p>
<p><a href="http://www.coralblocks.com/wp-content/uploads/2021/06/D5.png" target="_blank"><img src="http://www.coralblocks.com/wp-content/uploads/2021/06/D5.png" alt="D5" width="2342" height="1226" class="aligncenter size-full wp-image-2467" /></a></p>
<p><strong><font color="blue" face="verdana">So there you go!</font></strong> No matter when you start the node instance, its clock will always derive the same deterministic time from the event-stream, producing the exact same state for the node joining the cluster. State is always deterministic and consistent, and <font color="blue" face="verdana">running a high-availability cluster with zero downtime failover becomes straightforward as you can see in the <b>video below</b></font>. CoralSequencer actually goes a step further and, in addition to the deterministic clock, gives you deterministic timers, but that&#8217;s a topic for another article.</p>
<p>(<strong><font face="verdana" color="blue">Maximize</font></strong> the video below for a better viewing experience)</p>
<div style="width: 600px; max-width: 100%;" class="wp-video"><video class="wp-video-shortcode" id="video-2430-2" width="600" height="338" preload="metadata" controls="controls"><source type="video/mp4" src="/wp-content/uploads/videos/HotHot.mp4?_=2" /><a href="/wp-content/uploads/videos/HotHot.mp4">/wp-content/uploads/videos/HotHot.mp4</a></video></div>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/is-coralsequencer-really-deterministic-what-about-the-clock/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Writing a C++ CoralSequencer Node</title>
		<link>https://www.coralblocks.com/index.php/writing-a-c-coralsequencer-node/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=writing-a-c-coralsequencer-node</link>
		<comments>https://www.coralblocks.com/index.php/writing-a-c-coralsequencer-node/#comments</comments>
		<pubDate>Wed, 07 Feb 2024 11:43:06 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralSequencer]]></category>
		<category><![CDATA[c]]></category>
		<category><![CDATA[jni]]></category>
		<category><![CDATA[native]]></category>
		<category><![CDATA[node]]></category>
		<category><![CDATA[Sequencer]]></category>

		<guid isPermaLink="false">https://www.coralblocks.com/index.php/?p=2942</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<style>
* {
font-size: 101%;
}

.li_facts { margin: 0 0 17px 0; }
</style>
<p>
Writing C++ code that gets called by your Java code is trivial through shared libraries, but a more interesting project is to do the inverse: <strong>to call Java code from a C++ system</strong>. In this article we write a C++ CoralSequencer node to perform a latency benchmark test. <span id="more-2942"></span>
</p>
<p><br/></p>
<h3 class="coral">Overview</h3>
<p>
To effectively call Java from C++, your C++ code must instantiate and start a Java Virtual Machine to execute the Java code. The same JVM will then call your C++ system back through a shared library. The goal is to allow any C++ subsystem to directly interact with the CoralSequencer distributed system.
</p>
<p><br/></p>
<h3 class="coral">The Code</h3>
<p>
Below is a simple CNode CoralSequencer node that your C++ system can use to receive a callback with each CoralSequencer event-stream message.
</p>
<pre class="brush: java; title: ; notranslate">
public class CNode extends Node {
	
	static {
        	System.loadLibrary(&quot;CNode&quot;); // the shared library used to send the callback to C++
	}

	public CNode(NioReactor nio, String name, Configuration config) {
		super(nio, name, config);
	}
	
	public native void handleMessageC(boolean isMine, Message msg);
	
	@Override
	protected void handleMessage(boolean isMine, Message msg) {
		handleMessageC(isMine, msg); // call C++
	}

}
</pre>
<p>
To run this node we use the simple mq file below (file path: ./mqs/cnode.mq):
</p>
<pre class="brush: plain; title: ; notranslate">
VM addAdmin telnet 57
VM newNode NODE7 com.coralblocks.coralsequencer.mq.CNode
NODE7 open
NODE7 activate
</pre>
<p>Now we write the C++ code to instantiate a JVM and run this mq file to start our CNode.</p>
<pre class="brush: cpp; title: ; notranslate">
    #include &lt;jni.h&gt;
    // Many other includes here (omitted for clarity)
    #include &quot;com_coralblocks_coralsequencer_mq_CNode.h&quot;
    using namespace std;

    static const int ITERATIONS = 2000000;
    static const int WARMUP = 1000000;
    static const int MSG_SIZE = 256;

    long get_nano_ts(timespec* ts) {
        clock_gettime(CLOCK_MONOTONIC, ts);
        return ts-&gt;tv_sec * 1000000000 + ts-&gt;tv_nsec;
    }

    struct mi {
        long value;
    };

    void add_perc(stringstream&amp; ss, int size, double perc, map&lt;int, mi*&gt;* map) {
    
        // omitted for clarity
    }

    char* createRandomCharArray(int size) {
        // omitted for clarity
    }

    int main(int argc, char **argv) {

        JavaVM *jvm;                       // Pointer to the JVM (Java Virtual Machine)
        JNIEnv *env;                       // Pointer to native interface
        JavaVMInitArgs vm_args;            // JVM initialization arguments
        JavaVMOption options[24];          // JVM options

        // add the 24 JVM options here (omitted for clarity)
        
        vm_args.version = JNI_VERSION_1_6;                      // Set the JNI version
        vm_args.nOptions = 24;                                  // Set the number of options
        vm_args.options = options;                              // Set the options to the JVM
        
        // Load and initialize the JVM
        JNI_CreateJavaVM(&amp;jvm, (void**)&amp;env, &amp;vm_args);

        cout &lt;&lt; &quot;JVM created!!!&quot; &lt;&lt; endl;

        jvm-&gt;AttachCurrentThread((void**)&amp;env, NULL);

        jclass startJavaClass = env-&gt;FindClass(&quot;com/coralblocks/coralsequencer/Start&quot;);
        jmethodID findAppMethod = env-&gt;GetStaticMethodID(startJavaClass, &quot;findApplication&quot;, &quot;(Ljava/lang/String;)Lcom/coralblocks/coralsequencer/app/Application;&quot;);
        jclass nodeClass = env-&gt;FindClass(&quot;com/coralblocks/coralsequencer/mq/CNode&quot;);
        jmethodID sendCommandMethod = env-&gt;GetMethodID(nodeClass, &quot;sendCommand&quot;, &quot;(Ljava/lang/CharSequence;)Z&quot;);

        jstring str1 = env-&gt;NewStringUTF(&quot;mqs/cnode.mq&quot;);
        jclass stringClass = env-&gt;FindClass(&quot;java/lang/String&quot;);
        jobjectArray args = env-&gt;NewObjectArray(1, stringClass, str1);
        jmethodID mainMethod = env-&gt;GetStaticMethodID(startJavaClass, &quot;main&quot;, &quot;([Ljava/lang/String;)V&quot;);
        jmethodID isActiveMethod = env-&gt;GetMethodID(nodeClass, &quot;isActive&quot;, &quot;()Z&quot;);

        cout &lt;&lt; &quot;About to call Java main method...&quot; &lt;&lt; endl;

        env-&gt;CallStaticVoidMethod(startJavaClass, mainMethod, args);

        cout &lt;&lt; &quot;Returned from Java main method!&quot; &lt;&lt; endl;

        // Get the node (NODE7)
        jobject node = env-&gt;CallStaticObjectMethod(startJavaClass, findAppMethod, env-&gt;NewStringUTF(&quot;NODE7&quot;));

        cout &lt;&lt; &quot;Waiting for node to become active...&quot; &lt;&lt; endl;

        // Sleep until node becomes active
        while(env-&gt;CallBooleanMethod(node, isActiveMethod) == JNI_FALSE) sleep(1);

        cout &lt;&lt; &quot;isActive() returned true!&quot; &lt;&lt; endl;

        jstring msgToSend = env-&gt;NewStringUTF(createRandomCharArray(MSG_SIZE));

        cout &lt;&lt; &quot;About to send first message!&quot; &lt;&lt; endl;

        env-&gt;CallBooleanMethod(node, sendCommandMethod, msgToSend); // sendCommand returns a boolean

        cout &lt;&lt; &quot;First message sent!&quot; &lt;&lt; endl;

        jvm-&gt;DetachCurrentThread();

        // Release the JVM
        jvm-&gt;DestroyJavaVM(); // this will wait for Java threads to die...

        cout &lt;&lt; &quot;JVM Destroyed!!!&quot; &lt;&lt; endl;

        return 0;
    }

    struct timespec ts;
    long startTime = 0;
    long endTime = 0;
    map&lt;int, mi*&gt;* results;
    int iterations = 0;

    jobject node;
    jmethodID sendCommandMethod;
    jmethodID isRewindingMethod;
    jstring msgToSend;

    JNIEXPORT void JNICALL Java_com_coralblocks_coralsequencer_mq_CNode_handleMessageC
    (JNIEnv *env, jobject obj, jboolean isMine, jobject msg) {

        endTime = get_nano_ts(&amp;ts);

        if (node == NULL) {
            jclass startJavaClass = env-&gt;FindClass(&quot;com/coralblocks/coralsequencer/Start&quot;);
            jmethodID findAppMethod = env-&gt;GetStaticMethodID(startJavaClass, &quot;findApplication&quot;, &quot;(Ljava/lang/String;)Lcom/coralblocks/coralsequencer/app/Application;&quot;);
            jclass nodeClass = env-&gt;FindClass(&quot;com/coralblocks/coralsequencer/mq/CNode&quot;);
            sendCommandMethod = env-&gt;GetMethodID(nodeClass, &quot;sendCommand&quot;, &quot;(Ljava/lang/CharSequence;)Z&quot;);
            isRewindingMethod = env-&gt;GetMethodID(nodeClass, &quot;isRewinding&quot;, &quot;()Z&quot;);

            node = env-&gt;CallStaticObjectMethod(startJavaClass, findAppMethod, env-&gt;NewStringUTF(&quot;NODE7&quot;));

            node = env-&gt;NewGlobalRef(node);

            msgToSend = env-&gt;NewStringUTF(createRandomCharArray(MSG_SIZE));

            results = new map&lt;int, mi*&gt;();
        }

        if (env-&gt;CallBooleanMethod(node, isRewindingMethod) == JNI_TRUE) return;

        int res = startTime &gt; 0 ? (endTime - startTime) : 1; // 1 only for first message/pass

        if (res &lt;= 0) res = 1;

        if (iterations++ &gt;= WARMUP) {
            
            // add the result (omitted for clarity)
        }

        if (iterations == ITERATIONS) {

            // print the results (omitted for clarity) 

        } else {

            startTime = get_nano_ts(&amp;ts);

            env-&gt;CallBooleanMethod(node, sendCommandMethod, msgToSend); // sendCommand returns a boolean

        }

    }
</pre>
<p><strong>NOTE:</strong> The full source code can be seen <a href="https://gist.github.com/coralblocks/0c9a7ce73bc345a318ecace42b285fe9" target="_blank">here</a>.</p>
<p>
The trick is to compile this C++ code <strong>twice</strong>: once as the main program to be executed (the one that starts the JVM) and once as the shared library loaded by CNode.java (the one that receives the callback):</p>
<pre class="brush: plain; title: ; notranslate">
# Using Java 21 and clang 14.0.6

# Compile the main C++ program to start the JVM
clang++ -I&quot;$JAVA_HOME/include&quot; -I&quot;$JAVA_HOME/include/linux&quot; -o bin/linux/Bench src/main/c/linux/Bench.cpp -L&quot;$JAVA_HOME/lib/server&quot; -ljvm -Wno-write-strings

# Generate the com_coralblocks_coralsequencer_mq_CNode.h header file
javac -h src/main/c/linux -d target/classes -sourcepath src/main/java -cp target/coralsequencer-all.jar src/main/java/com/coralblocks/coralsequencer/mq/CNode.java

# Compile the shared library to be used by CNode.java
clang++ -shared -fPIC -I&quot;$JAVA_HOME/include&quot; -I&quot;$JAVA_HOME/include/linux/&quot; src/main/c/linux/Bench.cpp -o lib/libCNode.so -L&quot;$JAVA_HOME/lib/server&quot; -ljvm -Wno-write-strings
</pre>
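<p>For reference, the Java side of such a JNI bridge is just a <code>native</code> method declaration plus a <code>System.loadLibrary()</code> call. The sketch below is hypothetical (the actual CNode.java is not shown in this article); in the real <code>com.coralblocks.coralsequencer.mq</code> package the corresponding C++ symbol becomes <code>Java_com_coralblocks_coralsequencer_mq_CNode_handleMessageC</code>:</p>

```java
// Hypothetical sketch of the Java side of the JNI bridge; the real
// CNode.java is not shown in this article. The native method below is
// the one implemented in C++ and compiled into libCNode.so.
public class CNode {

    static {
        // Loads libCNode.so from java.library.path (e.g. the lib/ folder)
        System.loadLibrary("CNode");
    }

    // Called for every message; the C++ side receives
    // (JNIEnv*, jobject, jboolean, jobject) matching this signature.
    private native void handleMessageC(boolean isMine, Object msg);
}
```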
<p>Now, when we execute our C++ application with the command line below:</p>
<pre class="brush: plain; title: ; notranslate">
$ LD_LIBRARY_PATH=$JAVA_HOME/lib/server ./bin/linux/Bench
</pre>
<p>We get the following latency benchmark results:</p>
<pre>
Message Size: 256 bytes
Messages: 1,000,000
Avg Time: <font color="blue"><strong>4.808 micros</strong></font>
Min Time: 3.754 micros
Max Time: 717.729 micros
75% = [avg: 4.522 micros, max: 5.055 micros]
90% = [avg: 4.618 micros, max: 5.159 micros]
99% = [avg: 4.681 micros, max: 6.339 micros]
99.9% = [avg: 4.717 micros, max: 20.547 micros]
99.99% = [avg: 4.773 micros, max: 268.823 micros]
99.999% = [avg: 4.803 micros, max: 427.988 micros]
</pre>
<p>As expected, this is <strong>very close</strong> to our official CoralSequencer latency numbers, as described <a href="https://www.coralblocks.com/index.php/coralmq-performance-numbers/" target="_blank">here</a>.<br />
<br/></p>
<h3 class="coral">Conclusion</h3>
<p>
It is straightforward (but not trivial) to write C++ applications that use the CoralSequencer infrastructure to interact with your distributed system. The performance cost of crossing the native-to-JVM boundary and back is very small, as the benchmark results in this article demonstrate.</p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/writing-a-c-coralsequencer-node/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HotSpot, JIT, AOT and Warm-Up</title>
		<link>https://www.coralblocks.com/index.php/hotspot-jit-aot-and-warm-up/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=hotspot-jit-aot-and-warm-up</link>
		<comments>https://www.coralblocks.com/index.php/hotspot-jit-aot-and-warm-up/#comments</comments>
		<pubDate>Mon, 04 Nov 2024 16:09:32 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralSequencer]]></category>
		<category><![CDATA[aot]]></category>
		<category><![CDATA[jit]]></category>
		<category><![CDATA[latency]]></category>
		<category><![CDATA[leyden]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[readynow]]></category>
		<category><![CDATA[warm-up]]></category>

		<guid isPermaLink="false">https://www.coralblocks.com/index.php/?p=3047</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<style>
* {
font-size: 101%;
}

.li_facts { margin: 0 0 17px 0; }
</style>
<p style="margin-top: 20px;">
The HotSpot JVM takes some time to profile a running Java application for hot spots in the code, and then optimizes those hot methods by compiling them (to assembly) and inlining them (when possible). That&#8217;s great because the JIT (just-in-time) compiler can surgically and aggressively optimize the parts of your application that matter the most, instead of taking the AOT (ahead-of-time) approach of compiling and trying to optimize the whole thing beforehand. <span id="more-3047"></span> For example, method inlining is an aggressive form of optimization that usually requires runtime profiling information, since inlining everything is impractical/impossible.
</p>
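<p>To see this profiling-driven optimization in action, you can run a small program with the diagnostic flags <code>-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining</code> and watch HotSpot report which call sites it inlined once they become hot. Below is a minimal sketch (our own illustrative example, not CoralSequencer code):</p>

```java
public class InlineDemo {

    // A small, hot method like this is a prime inlining candidate once
    // HotSpot's profiling marks its call site as hot.
    static int square(int x) {
        return x * x;
    }

    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += square(i); // hot call site: the JIT will likely inline square()
        }
        return total;
    }

    public static void main(String[] args) {
        long result = 0;
        // Enough iterations for the JIT to profile, compile and inline
        for (int i = 0; i < 20_000; i++) result = sumOfSquares(1_000);
        System.out.println(result); // 332833500
    }
}
```

<p>Running it with <code>java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InlineDemo</code> prints the inlining decisions as the loop warms up.</p>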
<p style="margin-top: 20px; margin-bottom: 20px;">
In this article we explore the JVM options <code>-Xcomp -XX:-TieredCompilation</code>, which compile every method right before its first invocation. The drawback is that without any profiling this compilation is conservative. For example, even though some basic method inlining is still performed, more aggressive inlining cannot happen without runtime profiling. The advantage is that your application can perform at native/assembly speed right when it starts (even if not with the most optimized code), without having to wait until the HotSpot JVM has gathered enough profiling data to compile and optimize the hot methods.
</p>
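<p>The warm-up effect itself is easy to observe: the very first call of a method runs interpreted (or, with <code>-Xcomp</code>, in conservatively compiled code), while later calls run fully JIT-compiled. Below is a minimal, self-contained sketch; the timings are illustrative and will vary by machine and JVM flags:</p>

```java
import java.util.Arrays;

public class WarmupDemo {

    // Hypothetical hot-path work: sum a small array.
    static long sum(long[] a) {
        long s = 0;
        for (long v : a) s += v;
        return s;
    }

    static long timeOneCall(long[] data) {
        long t0 = System.nanoTime();
        long s = sum(data);
        long t1 = System.nanoTime();
        if (s != data.length) throw new AssertionError(); // keep the call alive
        return t1 - t0;
    }

    public static void main(String[] args) {
        long[] data = new long[4096];
        Arrays.fill(data, 1L);

        long coldNanos = timeOneCall(data); // first call: not yet optimized

        // Let HotSpot profile and compile the hot method
        for (int i = 0; i < 200_000; i++) sum(data);

        long warmNanos = timeOneCall(data); // now likely JIT-compiled

        System.out.println("cold: " + coldNanos + " ns, warm: " + warmNanos + " ns");
    }
}
```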
<p style="margin-top: 20px; margin-bottom: 20px;">
We also explore <a href="https://docs.azul.com/prime/Use-ReadyNow" target="_blank">Azul Zing ReadyNow</a>, which allows the profiling information to be saved from a previous run and re-applied on startup to improve the warm-up time.
</p>
<p style="margin-top: 20px; margin-bottom: 20px;">
Finally we conclude by talking a bit about <a href="https://openjdk.org/projects/leyden/" target="_blank">Project Leyden</a> from Oracle.
</p>
<h2>CoralSequencer with -Xcomp -XX:-TieredCompilation</h2>
<p style="margin-bottom: 25px; margin-top: 20px;">
Below we explore the difference in performance that <code>-Xcomp -XX:-TieredCompilation</code> makes for the CoralSequencer benchmark latency numbers.
</p>
<h4>Benchmark Environment</h4>
<pre>
$ java -version
java version "21.0.1" 2023-10-17 LTS
Java(TM) SE Runtime Environment (build 21.0.1+12-LTS-29)
Java HotSpot(TM) 64-Bit Server VM (build 21.0.1+12-LTS-29, mixed mode, sharing)

$ uname -a
Linux hivelocity 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/issue | head -n 1
Ubuntu 18.04.6 LTS \n \l

$ cat /proc/cpuinfo | grep "model name" | head -n 1 | awk -F ": " '{print $NF}'
Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz
</pre>
<h4 style="margin-top: 25px;">Regular JIT <i>with</i> warm-up</h4>
<pre>
Iterations: 1,000 | Avg Time: <font color="blue">4.372 micros</font> | Min Time: 3.634 micros | Max Time: 137.246 micros | 75% = [avg: 3.958 micros, max: 4.391 micros] | 90% = [avg: 4.035 micros, max: 4.474 micros] | 99% = [avg: 4.127 micros, max: 6.152 micros] | 99.9% = [avg: 4.239 micros, max: 22.849 micros] | 99.99% = [avg: 4.372 micros, max: 137.246 micros] | 99.999% = [avg: 4.372 micros, max: 137.246 micros]
</pre>
<h4 style="margin-top: 25px;">Regular JIT <i>without</i> warm-up</h4>
<pre>
Iterations: 1,000 | Avg Time: 58.026 micros | Min Time: 26.206 micros | Max Time: 2.809 millis | 75% = [avg: 36.13 micros, max: 52.503 micros] | 90% = [avg: 40.976 micros, max: 80.534 micros] | 99% = [avg: 46.905 micros, max: 324.93 micros] | 99.9% = [avg: 55.272 micros, max: 2.23 millis] | 99.99% = [avg: 58.026 micros, max: 2.809 millis] | 99.999% = [avg: 58.026 micros, max: 2.809 millis]
</pre>
<h4 style="margin-top: 25px;">-Xcomp -XX:-TieredCompilation <i>with</i> warm-up</h4>
<pre>
Iterations: 1,000 | Avg Time: 6.803 micros | Min Time: 5.741 micros | Max Time: 97.289 micros | 75% = [avg: 6.443 micros, max: 6.737 micros] | 90% = [avg: 6.499 micros, max: 6.88 micros] | 99% = [avg: 6.6 micros, max: 10.323 micros] | 99.9% = [avg: 6.712 micros, max: 24.872 micros] | 99.99% = [avg: 6.802 micros, max: 97.289 micros] | 99.999% = [avg: 6.802 micros, max: 97.289 micros]
</pre>
<h4 style="margin-top: 25px;">-Xcomp -XX:-TieredCompilation <i>without</i> warm-up</h4>
<pre>
Iterations: 1,000 | Avg Time: <font color="blue">7.005 micros</font> | Min Time: 6.029 micros | Max Time: 126.315 micros | 75% = [avg: 6.545 micros, max: 6.994 micros] | 90% = [avg: 6.625 micros, max: 7.084 micros] | 99% = [avg: 6.737 micros, max: 11.505 micros] | 99.9% = [avg: 6.885 micros, max: 47.461 micros] | 99.99% = [avg: 7.005 micros, max: 126.315 micros] | 99.999% = [avg: 7.005 micros, max: 126.315 micros]
</pre>
<p style="margin-top: 25px; margin-bottom: 20px;">
As you can see from the latency numbers above, by using <code>-Xcomp -XX:-TieredCompilation</code> we can mitigate the warm-up time by paying a price in performance (average <font color="blue">7.005 micros</font> versus average <font color="blue">4.372 micros</font>). <b>This emphasizes the value of runtime information (i.e. profiling) for the critical-path optimizations performed by the HotSpot JIT compiler.</b> Without profiling, there is only so much an AOT compiler can do, and the most aggressive optimizations may not be applicable beforehand. Of course this conclusion cannot be generalized to every application, as it will depend heavily on the characteristics and particularities of the source code and its critical path.
</p>
<h2>CoralSequencer with Azul Zing ReadyNow</h2>
<p style="margin-bottom: 25px; margin-top: 20px;">
We performed three training runs of our application to record three generations of the ReadyNow profile log as instructed by the <a href="https://docs.azul.com/prime/Use-ReadyNow-Training" target="_blank">ReadyNow guide</a>. On each training run we performed 3 million iterations of the critical path. The size of the final profile log was 4.5 megabytes.
</p>
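<p>In terms of command-line usage, a ReadyNow training/production cycle looks roughly like the sketch below. The <code>-XX:ProfileLogOut</code>/<code>-XX:ProfileLogIn</code> flags are taken from the Azul documentation (check them against your Zing version); the jar, class, and file names are our own placeholders:</p>

```shell
# Training run on Zing: record the ReadyNow profile log
# (repeat a few times, feeding each generation back in, per the ReadyNow guide)
java -XX:ProfileLogOut=readynow.profile -cp app.jar com.example.Benchmark

# Production run: replay the recorded profile on startup to cut warm-up time
java -XX:ProfileLogIn=readynow.profile -cp app.jar com.example.Benchmark
```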
<h4>Benchmark Environment</h4>
<pre>
$ java -version
openjdk version "21.0.4" 2024-09-27 LTS
OpenJDK Runtime Environment Zing24.09.0.0+5 (build 21.0.4+4-LTS)
Zing 64-Bit Tiered VM Zing24.09.0.0+5 (build 21.0.4-zing_24.09.0.0-b5-release-linux-X86_64, mixed mode)

$ uname -a
Linux hivelocity 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/issue | head -n 1
Ubuntu 18.04.6 LTS \n \l

$ cat /proc/cpuinfo | grep "model name" | head -n 1 | awk -F ": " '{print $NF}'
Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz
</pre>
<h4 style="margin-top: 25px;">Regular Zing JIT <i>with</i> warm-up</h4>
<pre>
Iterations: 1,000 | Avg Time: <font color="#203fad">4.175 micros</font> | Min Time: <font color="#203fad">3.317 micros</font> | Max Time: 90.359 micros | 75% = [avg: 3.841 micros, max: 4.095 micros] | 90% = [avg: 3.923 micros, max: 4.418 micros] | 99% = [avg: 3.997 micros, max: 5.768 micros] | 99.9% = [avg: 4.088 micros, max: 20.067 micros] | 99.99% = [avg: 4.175 micros, max: 90.359 micros] | 99.999% = [avg: 4.175 micros, max: 90.359 micros]
</pre>
<h4 style="margin-top: 25px;">Regular Zing JIT <i>without</i> warm-up</h4>
<pre>
Iterations: 1,000 | Avg Time: <font color="blue">55.314 micros</font> | Min Time: 21.089 micros | Max Time: 5.628 millis | 75% = [avg: 33.077 micros, max: 53.887 micros] | 90% = [avg: 38.055 micros, max: 75.808 micros] | 99% = [avg: 43.08 micros, max: 133.613 micros] | 99.9% = [avg: 49.735 micros, max: <font color="blue">2.563 millis</font>] | 99.99% = [avg: 55.314 micros, max: 5.628 millis] | 99.999% = [avg: 55.314 micros, max: 5.628 millis]
</pre>
<h4 style="margin-top: 25px;">Zing ReadyNow <i>with</i> warm-up</h4>
<pre>
Iterations: 1,000 | Avg Time: <font color="#203fad">4.273 micros</font> | Min Time: <font color="#203fad">3.396 micros</font> | Max Time: 94.648 micros | 75% = [avg: 3.905 micros, max: 4.126 micros] | 90% = [avg: 3.987 micros, max: 4.501 micros] | 99% = [avg: 4.066 micros, max: 6.042 micros] | 99.9% = [avg: 4.182 micros, max: 21.433 micros] | 99.99% = [avg: 4.272 micros, max: 94.648 micros] | 99.999% = [avg: 4.272 micros, max: 94.648 micros]
</pre>
<h4 style="margin-top: 25px;">Zing ReadyNow <i>without</i> warm-up</h4>
<pre>
Iterations: 1,000 | Avg Time: <font color="blue">29.47 micros</font> | Min Time: 18.966 micros | Max Time: 279.449 micros | 75% = [avg: 24.666 micros, max: 32.724 micros] | 90% = [avg: 26.66 micros, max: 42.978 micros] | 99% = [avg: 28.679 micros, max: 71.379 micros] | 99.9% = [avg: 29.219 micros, max: <font color="blue">141.065 micros</font>] | 99.99% = [avg: 29.469 micros, max: 279.449 micros] | 99.999% = [avg: 29.469 micros, max: 279.449 micros]
</pre>
<p style="margin-top: 25px; margin-bottom: 20px;">
As you can see from the latency numbers above, by using <code>ReadyNow</code> we obtained a roughly 50% improvement in performance without warm-up (average <font color="blue">29.470 micros</font> versus average <font color="blue">55.314 micros</font>). We were also able to limit the outliers above the 99.9 percentile (max <font color="blue">141.065 micros</font> versus max <font color="blue">2.563 millis</font>) before warming up. After warming up, the results with ReadyNow were similar to the results without ReadyNow (average <font color="#203fad">4.273 micros</font> versus average <font color="#203fad">4.175 micros</font>, and min <font color="#203fad">3.396 micros</font> versus min <font color="#203fad">3.317 micros</font>).
</p>
<h2>Project Leyden from Oracle</h2>
<p style="margin-bottom: 18px; margin-top: 20px;">
At the time of this article (Nov/2024) <a href="https://openjdk.org/projects/leyden/" target="_blank">Project Leyden</a> from Oracle is fairly new (May/2022), but it has been making great progress in the Java warm-up area. Our opinion, based on our own experiments with <a href="https://www.coralblocks.com/coralsequencer" target="_blank">CoralSequencer</a> and <a href="https://www.graalvm.org/" target="_blank">GraalVM</a>, is that AOT is not as fast as JIT (after the code has warmed up), so real-time and past-run (archived) profiling information becomes crucial for achieving maximum AOT + JIT performance together with minimum time-to-peak (i.e. quick warm-up). It is important to emphasize that this AOT vs. JIT conclusion applies to CoralSequencer in particular and cannot be generalized to every application, as it will depend heavily on the characteristics of the source code and its critical path. It is also important to clarify that by <i>fast</i> we mean the ability to achieve the lowest possible latency for the critical path; we are <i>not</i> referring to throughput, JVM start-up time, or application start-up time.
</p>
<p>
That said, we are particularly excited about the following JEPs:
</p>
<ul>
<li style="margin-bottom: 18px;">
<a href="https://openjdk.org/jeps/8325147" target="_blank">JEP draft 8325147</a>: <i>Ahead-of-Time Method Profiling</i> => Method profiles from training runs are stored in the CDS archive, thereby enabling the JIT to begin compiling earlier during warmup. As a result, Java applications can reach peak performance faster. This feature is enabled by the VM flags <code>-XX:+RecordTraining</code> and <code>-XX:+ReplayTraining</code>.
</li>
<li>
<a href="https://openjdk.org/jeps/8335368" target="_blank">JEP draft 8335368</a>: <i>Ahead-of-Time Code Compilation</i> => Methods that are frequently used during the training run can be compiled and stored along with the CDS archive. As a result, as soon as the application starts up in the production run, its methods can be natively executed. This feature is enabled by the VM flags <code>-XX:+StoreCachedCode</code>, <code>-XX:+LoadCachedCode</code>, and <code>-XX:CachedCodeFile</code>.
</li>
</ul>
<p style="margin-bottom: 25px; margin-top: 25px;">
<font color="#2449c1">We are currently working to be able to test CoralSequencer with the <a href="https://jdk.java.net/leyden/" target="_blank">early-access builds of Project Leyden</a> and we&#8217;ll report our findings soon.</font></p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/hotspot-jit-aot-and-warm-up/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exploring the Sequencer Architecture through our SimSequencer</title>
		<link>https://www.coralblocks.com/index.php/exploring-the-sequencer-architecture-through-our-simsequencer/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=exploring-the-sequencer-architecture-through-our-simsequencer</link>
		<comments>https://www.coralblocks.com/index.php/exploring-the-sequencer-architecture-through-our-simsequencer/#comments</comments>
		<pubDate>Thu, 08 Feb 2024 02:33:09 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralSequencer]]></category>
		<category><![CDATA[Sequencer]]></category>
		<category><![CDATA[SimSequencer]]></category>
		<category><![CDATA[simulator]]></category>

		<guid isPermaLink="false">https://www.coralblocks.com/index.php/?p=2986</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<style>
* {
font-size: 101%;
}

.li_facts { margin: 0 0 17px 0; }
</style>
<p style="margin-top: 20px;">
SimSequencer is a framework that lets you simulate, through code, most aspects of the sequencer architecture <b><i>without any networking involved</i></b>. It can be very useful for prototyping, testing and learning purposes. With SimSequencer you can code your nodes to test all the moving parts of your application as if they were running and interacting in a real distributed system. And it has the same API as CoralSequencer. <span id="more-2986"></span>
</p>
<p style="margin-top: 30px;">
Some of the main premises of the sequencer architecture are:
</p>
<ul>
<li style="padding: 5px; margin: 5px 0; color: darkblue;"><em>All nodes receive all messages in the exact same order, always</em></li>
<li style="padding: 5px; margin: 5px 0; color: darkblue;"><em>Nodes become deterministic finite-state machines</em></li>
<li style="padding: 5px; margin: 5px 0; color: darkblue;"><em>Clusters for high-availability and failover become trivial</em></li>
</ul>
<p style="margin-top: 30px;">
You should also check the video below, where we present the main characteristics of the sequencer architecture together with some advanced features of CoralSequencer. <b>Note:</b> Unlike the <a href="https://www.youtube.com/watch?v=DyktSiBTCdk" target="_blank">YouTube version</a>, the video below has <font color="red"><b>no ads</b></font>.<br />
<center><br />
<div style="width: 600px; max-width: 100%;" class="wp-video"><video class="wp-video-shortcode" id="video-2986-3" width="600" height="338" preload="metadata" controls="controls"><source type="video/mp4" src="/wp-content/uploads/videos/CoralSequencer.mp4?_=3" /><a href="/wp-content/uploads/videos/CoralSequencer.mp4">/wp-content/uploads/videos/CoralSequencer.mp4</a></video></div><br />
</center>
</p>
<p><br/></p>
<p>
Below we demonstrate the sequencer architecture&#8217;s main premises through working SimSequencer code:
</p>
<p><br/></p>
<h4 class="coral"><b><i>All nodes receive all messages in the exact same order, always</i></b></h4>
<p><br/></p>
<pre class="brush: java; title: ; notranslate">
  Sequencer sequencer = new PassThroughSequencer(&quot;SEQ&quot;);
  
  final List&lt;Message&gt; messages1 = new LinkedList&lt;Message&gt;();
  final List&lt;Message&gt; messages2 = new LinkedList&lt;Message&gt;();
  
  Node node1 = new Node(&quot;NODE1&quot;) {
    @Override
    protected void handleMessage(boolean isMine, Message msg) {
      messages1.add(msg);
    }
  };
  
  Node node2 = new Node(&quot;NODE2&quot;) {
    @Override
    protected void handleMessage(boolean isMine, Message msg) {
      messages2.add(msg);
    }
  };
  
  sequencer.addNode(node1, node2);
  
  sequencer.open().activate();
  node1.open().activate();
  node2.open().activate();
  
  Random rand = new Random();
  
  final int messagesToSend = 200;
  
  for(int i = 0; i &lt; messagesToSend; i++) {
    Node n = rand.nextBoolean() ? node1 : node2;
    n.sendCommand(&quot;Hi&quot; + rand.nextInt(1000));
  }
  
  Assert.assertEquals(messagesToSend, messages1.size());
  Assert.assertEquals(messagesToSend, messages2.size());
  
  Iterator&lt;Message&gt; iter1 = messages1.iterator();
  Iterator&lt;Message&gt; iter2 = messages2.iterator();
  
  while(iter1.hasNext() &amp;&amp; iter2.hasNext()) {
    Message m1 = iter1.next();
    Message m2 = iter2.next();
    Assert.assertTrue(m1.isEqualTo(m2));
  }
  
  node1.close();
  node2.close();
  sequencer.close();
</pre>
<p><br/></p>
<h4 class="coral"><b><i>Nodes become deterministic finite-state machines</i></b></h4>
<p><br/></p>
<pre class="brush: java; title: ; notranslate">
import static com.coralblocks.simsequencer.util.Log.*;

import com.coralblocks.simsequencer.Message;
import com.coralblocks.simsequencer.Node;

public class CounterNode extends Node {
	
	private long counter;

	public CounterNode(String name) {
		super(name);
	}
	
	@Override
	protected void handleOpened() {
		this.counter = 1;
		Info.log(name, &quot;Counter was reset to 1&quot;);
	}
	
	public long getCounter() {
		return counter;
	}
	
	@Override
	protected void handleMessage(boolean isMine, Message msg) {
		
		String[] tokens = msg.getDataAsString().split(&quot;\\|&quot;);
		
		String type = tokens[0];
		
		if (type.equals(&quot;ADD_DETERMINISTIC_TIMESTAMP&quot;)) {
			counter += currentSequencerTime();
		} else if (type.equals(&quot;ADD_VALUE&quot;)) {
			counter += Integer.parseInt(tokens[1]);
		}
	}
}
</pre>
<pre class="brush: java; title: ; notranslate">
  Sequencer sequencer = new PassThroughSequencer(&quot;SEQ&quot;);
  
  CounterNode node = new CounterNode(&quot;NODE1&quot;);
  
  sequencer.addNode(node);

  sequencer.open().activate();
  node.open().activate();
  
  Random rand = new Random();

  final int messagesToSend = 200;

  for(int i = 0; i &lt; messagesToSend; i++) {
      int type = rand.nextInt(2);
      if (type == 0) {
        node.sendCommand(&quot;ADD_DETERMINISTIC_TIMESTAMP&quot;);
      } else if (type == 1) {
        node.sendCommand(&quot;ADD_VALUE|&quot; + rand.nextInt(300000));
      }
  }
  
  final long counter = node.getCounter();

  for(int i = 0; i &lt; 5; i++) {
    node.close();
    node.open(); // resets the counter to 1...
    // node then rewinds, receiving all messages again...
    Assert.assertEquals(counter, node.getCounter());
  }

  node.close();
  sequencer.close();
</pre>
<p><br/></p>
<h4 class="coral"><b><i>Clusters for high-availability and failover become trivial</i></b></h4>
<p><br/></p>
<pre class="brush: java; title: ; notranslate">
  // Hot-Warm (Active-Passive) Cluster
  
  Sequencer sequencer = new PassThroughSequencer(&quot;SEQ&quot;);
  
  // Two nodes with the same account &quot;NODE1&quot; for a cluster
  CounterNode nodeA = new CounterNode(&quot;NODE1&quot;);
  CounterNode nodeB = new CounterNode(&quot;NODE1&quot;);
  
  sequencer.addNode(nodeA, nodeB);
  
  sequencer.open().activate();
  
  nodeA.open().activate(); // hot (active)
  nodeB.open(); // warm (passive)

  Random rand = new Random();

  final int messagesToSend = 200;

  for(int i = 0; i &lt; messagesToSend; i++) {
    int type = rand.nextInt(2);
    if (type == 0) {
      nodeA.sendCommand(&quot;ADD_DETERMINISTIC_TIMESTAMP&quot;);
    } else if (type == 1) {
      nodeA.sendCommand(&quot;ADD_VALUE|&quot; + rand.nextInt(300000));
    }
  }
  
  Assert.assertEquals(nodeA.getCounter(), nodeB.getCounter());
  
  // fail over to the warm node
  nodeA.deactivate(); // now it is warm
  nodeB.activate(); // now it is hot
  
  for(int i = 0; i &lt; messagesToSend; i++) {
    int type = rand.nextInt(2);
    if (type == 0) {
      nodeB.sendCommand(&quot;ADD_DETERMINISTIC_TIMESTAMP&quot;);
    } else if (type == 1) {
      nodeB.sendCommand(&quot;ADD_VALUE|&quot; + rand.nextInt(300000));
    }
  }
  
  Assert.assertEquals(nodeA.getCounter(), nodeB.getCounter());

  nodeA.close();
  nodeB.close();
  sequencer.close();
</pre>
<pre class="brush: java; title: ; notranslate">
  // Hot-Hot (Active-Active) Cluster

  Sequencer sequencer = new PassThroughSequencer(&quot;SEQ&quot;);

  // Two nodes with the same account &quot;NODE1&quot; for a cluster
  CounterNode nodeA = new CounterNode(&quot;NODE1&quot;);
  CounterNode nodeB = new CounterNode(&quot;NODE1&quot;);
  
  sequencer.addNode(nodeA, nodeB);
  
  sequencer.open().activate();

  nodeA.open().activate(); // hot (active)
  nodeB.open().activate(); // hot (active)
  
  Random rand = new Random();

  final int messagesToSend = 200;
  
  for(int i = 0; i &lt; messagesToSend; i++) {
    int type = rand.nextInt(2);
    if (type == 0) {
      if (rand.nextBoolean()) { // order does not matter
        nodeA.sendCommand(&quot;ADD_DETERMINISTIC_TIMESTAMP&quot;);
        nodeB.sendCommand(&quot;ADD_DETERMINISTIC_TIMESTAMP&quot;);
      } else {
        nodeB.sendCommand(&quot;ADD_DETERMINISTIC_TIMESTAMP&quot;);
        nodeA.sendCommand(&quot;ADD_DETERMINISTIC_TIMESTAMP&quot;);
      }
    } else if (type == 1) {
      int value = rand.nextInt(300000);
      if (rand.nextBoolean()) { // order does not matter
        nodeA.sendCommand(&quot;ADD_VALUE|&quot; + value);
        nodeB.sendCommand(&quot;ADD_VALUE|&quot; + value);
      } else {
        nodeB.sendCommand(&quot;ADD_VALUE|&quot; + value);
        nodeA.sendCommand(&quot;ADD_VALUE|&quot; + value);
      }
    }
  }

  Assert.assertEquals(nodeA.getCounter(), nodeB.getCounter());

  // pull the plug from one of the nodes
  nodeA.close(); // now it is dead
  nodeA = null;

  for(int i = 0; i &lt; messagesToSend; i++) {
    int type = rand.nextInt(2);
    if (type == 0) {
      nodeB.sendCommand(&quot;ADD_DETERMINISTIC_TIMESTAMP&quot;);
    } else if (type == 1) {
      nodeB.sendCommand(&quot;ADD_VALUE|&quot; + rand.nextInt(300000));
    }
  }

  // bring another node to the cluster...
  CounterNode nodeC = new CounterNode(&quot;NODE1&quot;);
  sequencer.addNode(nodeC);
  nodeC.open().activate(); // hot (active)

  Assert.assertEquals(nodeB.getCounter(), nodeC.getCounter());

  nodeB.close();
  nodeC.close();
  sequencer.close();
</pre>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/exploring-the-sequencer-architecture-through-our-simsequencer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CoralSequencer&#8217;s structured data serialization framework</title>
		<link>https://www.coralblocks.com/index.php/coralsequencers-structured-data-serialization-framework/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=coralsequencers-structured-data-serialization-framework</link>
		<comments>https://www.coralblocks.com/index.php/coralsequencers-structured-data-serialization-framework/#comments</comments>
		<pubDate>Sun, 27 Jun 2021 14:25:38 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralSequencer]]></category>
		<category><![CDATA[avro]]></category>
		<category><![CDATA[protobuf]]></category>
		<category><![CDATA[schema]]></category>
		<category><![CDATA[serialization]]></category>
		<category><![CDATA[thrift]]></category>

		<guid isPermaLink="false">http://www.coralblocks.com/index.php/?p=2498</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>CoralSequencer uses its own binary and garbage-free serialization framework to read and write its internal messages. For your application messages, you are free to use any serialization library or binary data model you choose. The fact that CoralSequencer is message agnostic gives you total flexibility in that decision. But you can also consider using CoralSequencer&#8217;s native serialization framework described in this article.<span id="more-2498"></span></p>
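<p>To illustrate the general idea behind a binary, garbage-free codec, below is our own minimal sketch (<i>not</i> the actual CoralSequencer API): fields are read and written at fixed offsets straight into a reusable buffer, so encoding and decoding allocate no objects after startup:</p>

```java
import java.nio.ByteBuffer;

// Minimal sketch of a binary, garbage-free codec (not the CoralSequencer API):
// fixed field offsets, one reusable buffer, zero allocation per message.
public class OrderCodec {

    static final int SIDE_OFFSET  = 0;  // 1 byte
    static final int SIZE_OFFSET  = 1;  // 8 bytes
    static final int PRICE_OFFSET = 9;  // 8 bytes
    static final int MSG_LENGTH   = 17;

    static void encode(ByteBuffer buf, char side, long size, long price) {
        buf.put(SIDE_OFFSET, (byte) side); // absolute puts: no position churn
        buf.putLong(SIZE_OFFSET, size);
        buf.putLong(PRICE_OFFSET, price);
    }

    static char decodeSide(ByteBuffer buf)  { return (char) buf.get(SIDE_OFFSET); }
    static long decodeSize(ByteBuffer buf)  { return buf.getLong(SIZE_OFFSET); }
    static long decodePrice(ByteBuffer buf) { return buf.getLong(PRICE_OFFSET); }

    public static void main(String[] args) {
        // Allocated once at startup and reused for every message
        ByteBuffer buf = ByteBuffer.allocateDirect(MSG_LENGTH);
        encode(buf, 'B', 100L, 2599L);
        System.out.println(decodeSide(buf) + " " + decodeSize(buf) + " " + decodePrice(buf)); // B 100 2599
    }
}
```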
<p>To define the schema of a message, you simply inherit from <code>AbstractProto</code> and declare the message data fields. Below is a self-explanatory example:</p>
<pre class="brush: java; title: ; notranslate">
import com.coralblocks.coralsequencer.protocol.AbstractProto;
import com.coralblocks.coralsequencer.protocol.field.*;

public class OrderNew extends AbstractProto {

	private static final int SYMBOL_LENGTH = 8;

	public static final char TYPE = 'O';
	public static final char SUBTYPE = 'N';

	public final TypeField typeField = new TypeField(this, TYPE);
	public final SubtypeField subtypeField = new SubtypeField(this, SUBTYPE);

	public final CharsField symbol = new CharsField(this, SYMBOL_LENGTH);
	public final CharField side = new CharField(this);
	public final LongField size = new LongField(this);
	public final LongField price = new LongField(this);
	public final LongField myTimestamp = new LongField(this);
	public final LongField splitTimestamp = new LongField(this);
	public final BooleanField isLastChild = new BooleanField(this);
}
</pre>
<p>To send out the <code>OrderNew</code> message above, you can simply reuse the same <code>OrderNew</code> instance over and over again, creating zero garbage. Below is an example of how you would send an <code>OrderNew</code> message from a CoralSequencer node:</p>
<pre class="brush: java; title: ; notranslate">
if (topBook.isSignal.get()) {

	long splitTimestamp = useEpoch ? timestamper.nanoEpoch() : timestamper.nanoTime();

	for (int i = 0; i &lt; ordersToSend; i++) {

		boolean isBid = (i % 2 == 0);

		orderNew.symbol.set(topBook.symbol.get());

		if (isBid) {
			orderNew.side.set('B');
			orderNew.size.set(topBook.bidSize.get());
			orderNew.price.set(topBook.bidPrice.get());
		} else {
			orderNew.side.set('S');
			orderNew.size.set(topBook.askSize.get());
			orderNew.price.set(topBook.askPrice.get());
		}

		orderNew.myTimestamp.set(useEpoch ? timestamper.nanoEpoch() : timestamper.nanoTime());
		orderNew.splitTimestamp.set(splitTimestamp);

		orderNew.isLastChild.set(i == ordersToSend - 1);

		if (batching) {
			writeCommand(orderNew);
		} else {
			sendCommand(orderNew);
		}
	}

	if (batching) flush();
}
</pre>
<p>As you can see, you simply populate the fields with data and call the <code>sendCommand(Proto)</code> method of a CoralSequencer node.</p>
<p>Now, to receive a CoralSequencer Proto message, you first need to define a parser for your Proto messages. Luckily that&#8217;s super easy, as you can see below:</p>
<pre class="brush: java; highlight: [8]; title: ; notranslate">
import com.coralblocks.coralsequencer.protocol.AbstractMessageProtoParser;
import com.coralblocks.coralsequencer.protocol.Proto;

public class ProtoParser extends AbstractMessageProtoParser {

	@Override
	protected Proto[] createProtoMessages() {
		return new Proto[] { new OrderNew(), new TopBook(), new OrderCancel() };
	}
}
</pre>
<p>Then you can use the proto parser above inside your node&#8217;s <code>handleMessage</code> method to parse a Proto message out of a CoralSequencer message:</p>
<pre class="brush: java; highlight: [1,13]; title: ; notranslate">
private final ProtoParser protoParser = new ProtoParser();

@Override
protected void handleMessage(Message msg) {

	if (isRewinding()) return; // do nothing during rewind...

	char type = protoParser.getType(msg);
	char subtype = protoParser.getSubtype(msg);

	if (type == OrderNew.TYPE &amp;&amp; subtype == OrderNew.SUBTYPE) {

		OrderNew orderNew = (OrderNew) protoParser.parse(msg);

		if (orderNew == null) {
			Error.log(name, &quot;Can't parse OrderNew:&quot;, msg.toCharSequence());
			return;
		}

		long now = useEpoch ? timestamper.nanoEpoch() : timestamper.nanoTime();

		long latency = now - orderNew.myTimestamp.get();

		ordersBench.measure(latency);

		if (orderNew.isLastChild.get()) {
			latency = now - orderNew.splitTimestamp.get();
			splitBench.measure(latency);
		}
	}
}
</pre>
<p><strong><font face="verdana" color="blue">Cool! That&#8217;s great!</font></strong> So what is the downside of using CoralSequencer&#8217;s serialization framework? To keep things simple, super fast and garbage-free, it gives you no help with <font color="red">schema evolution</font>, <font color="red">versioning</font> and <font color="red">backwards compatibility</font>. You can add (i.e. append) new fields to an existing message without having to update all nodes, but if you remove a field or change the order of the fields in a message, then every node of your distributed system must be updated with the new schema class code before it can support the new message format. <font face="verdana" color="blue">There is also support for <a href="/index.php/coralsequencers-structured-data-serialization-framework-version-2-0/" target="_blank">IDL, optional fields and repeating groups</a>.</font></p>
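<p>To illustrate why appending is safe while removing or reordering is not, here is a tiny stand-alone sketch in plain Java (this uses a raw <code>ByteBuffer</code>, not CoralSequencer&#8217;s actual wire format, and the appended <code>accountId</code> field is hypothetical): an old reader consumes only the fields it knows about and never reaches the appended bytes, whereas a field inserted earlier in the message would shift every offset the old reader relies on.</p>
<pre class="brush: java; title: ; notranslate">
import java.nio.ByteBuffer;

public class AppendOnlyDemo {

	public static void main(String[] args) {

		ByteBuffer buf = ByteBuffer.allocate(64);

		// new writer: the original fields first, the new field appended at the end
		buf.putChar('B');  // side (original field)
		buf.putLong(100);  // size (original field)
		buf.putLong(2550); // price (original field)
		buf.putLong(999);  // accountId (hypothetical field appended by the new schema)
		buf.flip();

		// old reader: reads only the fields it knows about and simply
		// never touches the appended bytes, so nothing breaks
		char side = buf.getChar();
		long size = buf.getLong();
		long price = buf.getLong();

		System.out.println(side);         // prints B
		System.out.println(size + price); // prints 2650
	}
}
</pre>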
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/coralsequencers-structured-data-serialization-framework/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Nodes (CoralSequencer article series)</title>
		<link>https://www.coralblocks.com/index.php/nodes-coralmq-article-series/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=nodes-coralmq-article-series</link>
		<comments>https://www.coralblocks.com/index.php/nodes-coralmq-article-series/#comments</comments>
		<pubDate>Tue, 23 Feb 2016 23:05:37 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralMQ]]></category>
		<category><![CDATA[CoralSequencer]]></category>
		<category><![CDATA[coralmq]]></category>
		<category><![CDATA[MQ]]></category>
		<category><![CDATA[node]]></category>
		<category><![CDATA[rabbitmq]]></category>
		<category><![CDATA[zeromq]]></category>

		<guid isPermaLink="false">http://www.coralblocks.com/index.php/?p=1977</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In a distributed system, <strong>Nodes</strong> are responsible for executing the application logic in a decentralized/distributed way. With CoralSequencer you can easily code a node that will send commands to the sequencer and listen to messages in the <i>event-stream</i> (i.e. message-bus). <span id="more-1977"></span> Below we show an example of a simple node that sends a <code>TIME</code> command to the sequencer and waits to see the corresponding message in the event-stream. When it receives the message, it waits 3 seconds and sends another command, repeating the process. </p>
<pre class="brush: java; highlight: [41,45]; title: ; notranslate">
package com.coralblocks.coralsequencer.node;

import java.nio.ByteBuffer;

import com.coralblocks.coralbits.ts.TimeUnit;
import com.coralblocks.coralbits.util.ByteBufferUtils;
import com.coralblocks.coralreactor.nio.NioReactor;
import com.coralblocks.coralreactor.util.Configuration;
import com.coralblocks.coralsequencer.message.Message;
import com.coralblocks.coralsequencer.mq.Node;

public class SampleNode extends Node {
	
	private final static int PERIOD = 3; // 3 seconds...
	
	public SampleNode(NioReactor nio, String name, Configuration config) {
	    super(nio, name, config);
    }
	
	@Override
	protected void handleActivated() {
		// this method is called when the node becomes active
		sendCommand();
	}
	
	@Override
	protected void handleDeactivated() {
		// called when a node has been deactivated
		// once deactivated a node will not send commands
		removeEventTimeout(); // turn off event timeout if set
	}
	
	@Override
	protected void handleEventTimeout(long now, long period, TimeUnit unit) {
		// this method is triggered by the event timeout you are setting in the handleMessage method
		// Note: it is triggered only once so you must re-register the timeout if you want to do it again (it is not a loop timer)
		sendCommand();
	}
	
	private void sendCommand() {
		sendCommand(&quot;TIME-&quot; + System.currentTimeMillis());
	}

	@Override
    protected void handleMessage(boolean isMine, Message msg) {
		
		if (!isMine || isRewinding()) return; // not interested, quickly ignore them...
		
		ByteBuffer data = msg.getData();
		
		System.out.println(&quot;Saw my message in the event-stream: &quot; + ByteBufferUtils.parseString(data));
		
		setEventTimeout(PERIOD, TimeUnit.SECONDS); // set a trigger to send the command again after 3 seconds
    }
}
</pre>
<p>Note that every command sent to the sequencer by a node will make the sequencer send a corresponding message to the event-stream with the sender&#8217;s account. That&#8217;s how the node makes sure that its command was received and processed by the sequencer. You don&#8217;t need to worry about that or do anything, but under the hood CoralSequencer will resend the command if it does not see the corresponding message (i.e. the ack) in the event-stream after N milliseconds. Again, this is totally transparent for the developer coding the node, as you can see in the source code above.</p>
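<p>For the curious, the resend idea can be sketched in a few lines of plain Java. This is only an illustration of the concept, not CoralSequencer&#8217;s actual implementation, and the 200-millisecond timeout is a made-up value: the node remembers its pending command, clears it when it sees its own message (the ack) in the event-stream, and resends on timeout.</p>
<pre class="brush: java; title: ; notranslate">
public class ResendSketch {

	private static final long TIMEOUT_MILLIS = 200; // illustrative value only

	private long pendingSeq = -1; // -1 means no command awaiting its ack
	private long sentAtMillis;
	int sendCount = 0;

	void sendCommand(long seq, long nowMillis) {
		pendingSeq = seq;
		sentAtMillis = nowMillis;
		sendCount++;
	}

	// called for every message observed in the event-stream
	void onMessage(long seq, boolean isMine) {
		if (!isMine) return;
		if (seq == pendingSeq) pendingSeq = -1; // our ack arrived
	}

	// called periodically by the event loop
	void onTick(long nowMillis) {
		if (pendingSeq == -1) return; // nothing pending
		if (nowMillis - sentAtMillis &gt; TIMEOUT_MILLIS) {
			sendCommand(pendingSeq, nowMillis); // resend the unacked command
		}
	}

	public static void main(String[] args) {
		ResendSketch node = new ResendSketch();
		node.sendCommand(7, 0);
		node.onTick(100); // not timed out yet, no resend
		node.onTick(300); // timed out, so it resends
		node.onMessage(7, true); // ack finally observed
		node.onTick(600); // nothing pending anymore
		System.out.println(node.sendCount); // prints 2
	}
}
</pre>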
<p>A node can be read-only, in other words, it will only listen to the event-stream and never send any command to the sequencer. Our node above does send a command to the sequencer (i.e. it is not read-only) using the <code>sendCommand(String)</code> method. Besides that method, you can also use <code>sendCommand(byte[])</code>, <code>sendCommand(ByteBuffer)</code> and <code>sendCommand(Proto)</code>.</p>
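<p>As a sketch of how the <code>sendCommand(ByteBuffer)</code> variant can stay garbage-free, a node can keep a single reusable buffer and refill it before each send. The buffer layout below is illustrative only, not CoralSequencer&#8217;s command format:</p>
<pre class="brush: java; title: ; notranslate">
import java.nio.ByteBuffer;

public class ReusableCommand {

	// one buffer allocated up front and refilled for every command (zero garbage)
	private final ByteBuffer cmd = ByteBuffer.allocateDirect(64);

	ByteBuffer fill(char type, long timestamp) {
		cmd.clear();       // reset position/limit so the buffer can be reused
		cmd.putChar(type); // illustrative payload layout
		cmd.putLong(timestamp);
		cmd.flip();        // ready to be drained by sendCommand(ByteBuffer)
		return cmd;
	}

	public static void main(String[] args) {
		ReusableCommand rc = new ReusableCommand();
		ByteBuffer b = rc.fill('T', 123456789);
		System.out.println(b.remaining()); // prints 10 (2-byte char + 8-byte long)
	}
}
</pre>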
<p>The <code>isMine</code> flag passed by the <code>handleMessage(boolean, Message)</code> method is important as it tells you whether the message you are receiving belongs to this node. Recall that the sequencer always broadcasts all messages to all nodes, so you will be seeing messages from other nodes in this method. If you are only interested in your own messages, you can quickly drop the others by checking the <code>isMine</code> boolean.</p>
<p>Another important check is done with <code>isRewinding()</code>. The first time the node connects to the sequencer, it receives a replay of all the previous messages from the current session, in a process called <i>rewinding</i>. You can use these messages to rebuild state if you need to. In our simple example we don&#8217;t want to do anything with past messages, so we simply drop them.</p>
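<p>To make the rewinding idea concrete, here is a minimal stand-alone sketch in plain Java (not the actual <code>Node</code> API): past messages are applied to the node&#8217;s state exactly like live ones, while side effects such as sending commands are skipped during the rewind.</p>
<pre class="brush: java; title: ; notranslate">
public class RewindDemo {

	private long counter = 0;

	// state is rebuilt from past messages exactly like from live ones;
	// only side effects (e.g. sending commands) are skipped while rewinding
	void handleMessage(boolean isRewinding, long delta) {
		counter += delta;
		if (isRewinding) return;
		// ...react to live messages here (send commands, log, etc.)...
	}

	public static void main(String[] args) {
		RewindDemo node = new RewindDemo();
		long[] pastMessages = { 5, 10, 15 }; // the session replay
		for (long m : pastMessages) node.handleMessage(true, m);
		node.handleMessage(false, 70); // first live message
		System.out.println(node.counter); // prints 100
	}
}
</pre>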
<p><br/></p>
<h3 class="coral">Configuring the Node</h3>
<p>Below is the CoralSequencer DSL to configure a node:</p>
<pre class="brush: java; title: ; notranslate">
# allow this node to be managed by telnet
VM addAdmin telnet 51

# creates the node (account = NODE1)
VM newNode NODE1 com.coralblocks.coralsequencer.node.SampleNode

# the lines below can also be executed manually by admin
NODE1 open
NODE1 activate
</pre>
<p>Add the lines above to a file <i>time.mq</i> and use the script <code>./bin/start.sh</code> to execute the DSL and start the node:</p>
<p><a href="http://www.coralblocks.com/wp-content/uploads/2016/02/Screen-Shot-2016-02-23-at-4.52.56-PM.png"><img src="http://www.coralblocks.com/wp-content/uploads/2016/02/Screen-Shot-2016-02-23-at-4.52.56-PM.png" alt="Screen Shot 2016-02-23 at 4.52.56 PM" width="1105" height="1049" class="alignnone size-full wp-image-1990" /></a></p>
<h3 class="coral">Managing the Node</h3>
<p>Because we configured the <i>telnet admin</i> in the DSL above, we can telnet to the admin port (i.e. 50000 + port id) to execute DSL commands on the node. For example, the last lines to open and activate the node can be executed manually through telnet, as the example below shows:</p>
<p><b>NOTE:</b> The highlighted lines below are the commands executed</p>
<pre class="brush: java; highlight: [8,21,25]; title: ; notranslate">
$ telnet localhost 50051
Trying ::1...
Connected to localhost (::1).
Escape character is '^]'.

Hi! What can I do for you? You can start by typing 'list'...

list NODE1

NODE1 open
NODE1 close
NODE1 setMessageReceiver
NODE1 setCommandSender
NODE1 activate
NODE1 sendCommand
NODE1 status

NODE1-CommandSender-255.255.255.255:60010
NODE1-MessageReceiver-0.0.0.0:60066

NODE1 open

NODE1 was opened!

NODE1 activate true

activate called!
</pre>
<p><br/></p>
<h3 class="coral">Conclusion</h3>
<p>Writing a Node using CoralSequencer is extremely easy. Moreover, you can configure and manage your nodes using CoralSequencer&#8217;s DSL.</p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/nodes-coralmq-article-series/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
