<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Coral Blocks &#187; CoralQueue</title>
	<atom:link href="https://www.coralblocks.com/index.php/category/coralqueue/feed" rel="self" type="application/rss+xml" />
	<link>https://www.coralblocks.com/index.php</link>
	<description>Building amazing software, one piece at a time.</description>
	<lastBuildDate>Fri, 03 Apr 2026 15:31:21 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.9.1</generator>
	<item>
		<title>Getting Started with CoralQueue</title>
		<link>https://www.coralblocks.com/index.php/getting-started-with-coralqueue/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=getting-started-with-coralqueue</link>
		<comments>https://www.coralblocks.com/index.php/getting-started-with-coralqueue/#comments</comments>
		<pubDate>Mon, 16 Jun 2014 17:45:18 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralQueue]]></category>
		<category><![CDATA[batching]]></category>
		<category><![CDATA[consumer]]></category>
		<category><![CDATA[disruptor]]></category>
		<category><![CDATA[inter-thread]]></category>
		<category><![CDATA[lazySet]]></category>
		<category><![CDATA[lock-free]]></category>
		<category><![CDATA[memory barrier]]></category>
		<category><![CDATA[pipelining]]></category>
		<category><![CDATA[polling]]></category>
		<category><![CDATA[producer]]></category>
		<category><![CDATA[queue]]></category>
		<category><![CDATA[ring buffer]]></category>
		<category><![CDATA[semi-volatile write]]></category>
		<category><![CDATA[wait strategy]]></category>

		<guid isPermaLink="false">http://cb.soliveirajr.com/index.php/?p=414</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>CoralQueue is an ultra-low-latency, lock-free and garbage-free queue for inter-thread communication. It can be defined as <em>a batching queue backed by a circular array (i.e. the ring buffer) filled with pre-allocated transfer objects, which uses memory barriers to synchronize producers and consumers through sequences.</em> Fortunately you don&#8217;t have to understand all its intrinsic details to use it. In this article we show how to use CoralQueue to send messages from a producer thread to a consumer thread very fast and without producing any garbage. <span id="more-414"></span></p>
<h2 class="coral">The Queue</h2>
<p>The queue is a circular array with pre-allocated transfer objects. For the example below we use a <code>StringBuilder</code> as the transfer object.</p>
<pre class="brush: java; title: ; notranslate">
Queue&lt;StringBuilder&gt; queue = new AtomicQueue&lt;StringBuilder&gt;(1024, StringBuilder.class);
</pre>
<p>The code above creates a queue with 1024 pre-allocated <code>StringBuilder</code>s. Note that it uses the default constructor of <code>StringBuilder</code>, which creates a <code>StringBuilder</code> with an initial capacity of only 16 characters. That may be too small for our transfer objects, and we don&#8217;t want the <code>StringBuilder</code> resizing itself at runtime. To create a bigger <code>StringBuilder</code> we can use a <code>com.coralblocks.coralbits.util.Builder</code> class like below:</p>
<pre class="brush: java; title: ; notranslate">
Builder&lt;StringBuilder&gt; builder = new Builder&lt;StringBuilder&gt;() {
	@Override
    public StringBuilder newInstance() {
		return new StringBuilder(1024);
    }
};

final Queue&lt;StringBuilder&gt; queue = new AtomicQueue&lt;StringBuilder&gt;(1024, builder);
</pre>
<h2 class="coral">Sending Messages</h2>
<p>To send a message to the queue, you grab a transfer object from the queue, fill it with your data and call <code>flush()</code> as the code below illustrates:</p>
<pre class="brush: java; title: ; notranslate">
StringBuilder sb;
while((sb = queue.nextToDispatch()) == null); // busy spin...
sb.setLength(0);
sb.append(&quot;Hello there!&quot;);
queue.flush();
</pre>
<p>If the queue is full we just busy spin until a transfer object becomes available. Later we will see how we can also use a <code>WaitStrategy</code> instead of busy spinning.</p>
<p>You can also send messages in batches:</p>
<pre class="brush: java; title: ; notranslate">
StringBuilder sb;

while((sb = queue.nextToDispatch()) == null); // busy spin...
sb.setLength(0);
sb.append(&quot;Hello there!&quot;);

while((sb = queue.nextToDispatch()) == null); // busy spin...
sb.setLength(0);
sb.append(&quot;Hello again!&quot;);

queue.flush();
</pre>
<h2 class="coral">Polling Messages</h2>
<p>To read messages from the queue, you poll them from a consumer thread, as the code below shows:</p>
<pre class="brush: java; title: ; notranslate">
long avail;
while((avail = queue.availableToPoll()) == 0); // busy spin
for(int i = 0; i &lt; avail; i++) {
    StringBuilder sb = queue.poll();
    // do whatever you want with the StringBuilder
    // just do not create garbage
    // copy char-by-char instead
}
queue.donePolling();
</pre>
<p>Again we busy spin if the queue is empty. Later we will see how we can also use a <code>WaitStrategy</code> instead of busy spinning.</p>
<p>Note that we poll in batches, reducing the number of times we have to check for an empty queue through <code>availableToPoll()</code>.</p>
<h2 class="coral">Wait Strategies</h2>
<p>By default, you busy spin when the queue is full (on the producer side) or empty (on the consumer side). That&#8217;s usually the fastest approach, but not always the best, as you might want to save energy or allow other threads to use the idle processor. CoralQueue comes with many wait strategies that you can use instead of busy spinning, or you can create your own by implementing the <code>WaitStrategy</code> interface. Below are some examples of wait strategies that come with CoralQueue:</p>
<ul>
<li style="margin-bottom: 8px;"><code>ParkWaitStrategy</code>: parks (i.e. sleeps) for 1 nanosecond, with the option to back off up to a maximum of N nanoseconds. N defaults to 1 million nanoseconds (1 millisecond) if not specified.</li>
<li style="margin-bottom: 8px;"><code>SpinParkWaitStrategy</code>: first busy spins for C cycles (defaults to 1 million cycles), then starts to park (i.e. sleep) for 1 nanosecond, with the option to back off up to a maximum of N nanoseconds (defaults to 1 million nanoseconds).</li>
<li><code>SpinYieldParkWaitStrategy</code>: busy spins for some cycles, yields for some cycles, then starts to sleep for 1 nanosecond, with the option to back off up to a maximum of N nanoseconds (defaults to 1 million nanoseconds).</li>
</ul>
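<p>For illustration, the sketch below shows what a custom strategy might look like. The <code>WaitStrategy</code> interface is re-declared locally here as an assumption (the real CoralQueue interface may differ); the strategy yields for a fixed number of calls and then parks for 1 nanosecond:</p>

```java
import java.util.concurrent.locks.LockSupport;

// Hypothetical stand-in for CoralQueue's WaitStrategy interface,
// declared locally so this sketch is self-contained.
interface WaitStrategy {
	void block();
	void reset();
}

// A custom strategy: yield for the first maxYields calls to block(),
// then park (i.e. sleep) for 1 nanosecond on every subsequent call.
class YieldParkWaitStrategy implements WaitStrategy {

	private final int maxYields;
	private int count = 0;

	YieldParkWaitStrategy(int maxYields) {
		this.maxYields = maxYields;
	}

	@Override
	public void block() {
		if (count < maxYields) {
			count++;
			Thread.yield();
		} else {
			LockSupport.parkNanos(1L);
		}
	}

	@Override
	public void reset() {
		count = 0; // start yielding again after a successful operation
	}

	int yieldsUsed() {
		return count; // how many yields were consumed since the last reset
	}
}
```

<p>The custom strategy would then be used exactly like the bundled ones, by calling <code>block()</code> inside the loop and <code>reset()</code> after a successful operation.</p>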
<p>To use a wait strategy, all you have to do is call its <code>block</code> and <code>reset</code> methods instead of busy spinning:</p>
<pre class="brush: java; highlight: [4,9]; title: ; notranslate">
WaitStrategy waitStrategy = new ParkWaitStrategy();
StringBuilder sb;
while((sb = queue.nextToDispatch()) == null) {
    waitStrategy.block();
}
sb.setLength(0);
sb.append(&quot;Hello there!&quot;);
queue.flush();
waitStrategy.reset(); // you can reset here to save some nanoseconds...
</pre>
<p>Same thing when polling:</p>
<pre class="brush: java; highlight: [4,13]; title: ; notranslate">
WaitStrategy waitStrategy = new SpinParkWaitStrategy();
long avail;
while((avail = queue.availableToPoll()) == 0) {
    waitStrategy.block();
}
for(int i = 0; i &lt; avail; i++) {
    StringBuilder sb = queue.poll();
    // do whatever you want with the StringBuilder
    // just do not create garbage
    // copy char-by-char instead
}
queue.donePolling();
waitStrategy.reset(); // you can reset here to save some nanoseconds...
</pre>
<h2 class="coral">Semi-volatile writes (lazySet)</h2>
<p>To squeeze every bit of performance out of CoralQueue, you can use <i>semi-volatile writes</i> when sending and polling messages. Basically, a semi-volatile write is done through the <code>lazySet</code> method of <code>java.util.concurrent.atomic.AtomicLong</code>. It is a faster operation for the thread that&#8217;s modifying the variable, at the expense of the thread that&#8217;s interested in knowing about updates to the variable. For example, if you want to minimize the latency in the producer, you should use lazySet. If you want to minimize the message transit time, you should not use lazySet, so the consumer is notified as soon as possible about a new message in the queue.</p>
<p>By default, CoralQueue does not use lazySet, but you can easily take control of that by using the methods below:</p>
<pre class="brush: java; highlight: [2,5]; title: ; notranslate">
queue.flush(); // no lazySet by default
queue.flush(true); // use lazySet

queue.donePolling(); // no lazySet by default
queue.donePolling(true); // use lazySet
</pre>
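<p>The effect of <code>lazySet</code> can be seen with plain JDK classes. The snippet below illustrates the JDK method itself, not CoralQueue internals: <code>lazySet</code> performs an ordered (semi-volatile) store that is cheaper for the writing thread, while still being immediately visible within that same thread:</p>

```java
import java.util.concurrent.atomic.AtomicLong;

public class LazySetDemo {

	public static void main(String[] args) {
		AtomicLong sequence = new AtomicLong(0);

		sequence.set(1);     // volatile write: full fence, visible to readers right away
		sequence.lazySet(2); // semi-volatile write: cheaper for the writer,
		                     // may reach reader threads slightly later

		// Within the writing thread the value is always current:
		System.out.println(sequence.get()); // prints 2
	}
}
```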
<h2 class="coral">Complete Example</h2>
<p>To put all the parts together, we write a simple program that sends 10 timestamps to a consumer thread and then exits:</p>
<pre class="brush: java; highlight: [11,22,24,31,42,44]; title: ; notranslate">
package com.coralblocks.coralqueue.sample.queue;

import com.coralblocks.coralqueue.AtomicQueue;
import com.coralblocks.coralqueue.Queue;
import com.coralblocks.coralqueue.util.MutableLong;

public class Sample {
	
	public static void main(String[] args) throws InterruptedException {
		
		final Queue&lt;MutableLong&gt; queue = new AtomicQueue&lt;MutableLong&gt;(1024, MutableLong.class);
		
		Thread consumer = new Thread() {
			
			@Override
			public void run() {
				
				boolean running = true;
				
				while(running) {
					long avail;
					while((avail = queue.availableToPoll()) == 0); // busy spin
					for(int i = 0; i &lt; avail; i++) {
						MutableLong ml = queue.poll();
						if (ml.get() == -1) { // message to flag exit...
							running = false;
							break;
						}
						System.out.println(ml.get());
					}
					queue.donePolling();
				}
			}
			
		};
		
		consumer.start();
		
		MutableLong ml;
		
		for(int i = 0; i &lt; 10; i++) {
			while((ml = queue.nextToDispatch()) == null); // busy spin
			ml.set(System.nanoTime());
			queue.flush();
		}
		
		// send a message to stop consumer...
		while((ml = queue.nextToDispatch()) == null); // busy spin
		ml.set(-1);
		queue.flush();
		
		consumer.join(); // wait for the consumer thread to die...
	}
}
</pre>
<h2 class="coral">Conclusion</h2>
<p>CoralQueue makes the development of ultra-low-latency, lock-free and garbage-free multithreaded applications easy by pipelining messages among threads. It also offers batching, semi-volatile writes and wait strategies through a simple API. CoralQueue also provides a <a href="/index.php/2014/06/multiplexing-with-coralqueue/">multiplexer</a> (multiple-producers to one-consumer), a <a href="/index.php/2014/06/demultiplexing-with-coralqueue-for-parallel-processing/">demultiplexer</a> (one-producer to multiple-consumers) and an <a href="/index.php/2016/03/multiple-producers-to-multiple-consumers-queue/">mpmc queue</a> (multiple-producers to multiple-consumers).</p>
<p><br/><br/></p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/getting-started-with-coralqueue/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Demultiplexing with CoralQueue for Parallel Processing</title>
		<link>https://www.coralblocks.com/index.php/demultiplexing-with-coralqueue-for-parallel-processing/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=demultiplexing-with-coralqueue-for-parallel-processing</link>
		<comments>https://www.coralblocks.com/index.php/demultiplexing-with-coralqueue-for-parallel-processing/#comments</comments>
		<pubDate>Wed, 18 Jun 2014 21:14:11 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralQueue]]></category>
		<category><![CDATA[cas]]></category>
		<category><![CDATA[demultiplexer]]></category>
		<category><![CDATA[demux]]></category>
		<category><![CDATA[modulus]]></category>
		<category><![CDATA[multithreading]]></category>
		<category><![CDATA[parallelization]]></category>
		<category><![CDATA[partition]]></category>

		<guid isPermaLink="false">http://cb.soliveirajr.com/index.php/?p=464</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In this article we examine how CoralQueue implements a demultiplexer, in other words, a <em>one-producer-to-multiple-consumers</em> queue. We also present throughput numbers for the three types of implementations supported: Atomic, CAS and modulus. <span id="more-464"></span></p>
<h2 class="coral">The Demultiplexer</h2>
<p>A demultiplexer is a queue that accepts a single producer sending messages and distributes them across a set of consumers so that <em>each message is only processed once</em> on the consumer side. Of course you should only use a demultiplexer (i.e. <em>demux</em>) if your messages can be processed <em>in parallel and out of order</em>. A demux is very useful to avoid queue contention and speed things up when consuming a message can potentially be slower than producing one. By increasing the number of consumers, each running on its own dedicated cpu core, you add horsepower on the consumer side to compensate for slower consumers.</p>
<p>Below is a simple example of a <code>Demux</code>:</p>
<pre class="brush: java; highlight: [13,21,30,32,40,51,54,60]; title: ; notranslate">
package com.coralblocks.coralqueue.sample.demux;

import com.coralblocks.coralqueue.demux.AtomicDemux;
import com.coralblocks.coralqueue.demux.Consumer;
import com.coralblocks.coralqueue.demux.Demux;

public class SampleWithConsumer {

	private static final int NUMBER_OF_CONSUMERS = 4;
	
	public static void main(String[] args) throws InterruptedException {
		
		final Demux&lt;StringBuilder&gt; demux = new AtomicDemux&lt;StringBuilder&gt;(1024, StringBuilder.class, NUMBER_OF_CONSUMERS);
		
		Thread[] consumers = new Thread[NUMBER_OF_CONSUMERS];
		
		for(int i = 0; i &lt; consumers.length; i++) {
			
    		consumers[i] = new Thread(&quot;Consumer-&quot; + i) {
    			
    			private final Consumer&lt;StringBuilder&gt; consumer = demux.nextConsumer();
    			
    			@Override
    			public void run() {
    				
    				boolean running = true;
    				
    				while(running) {
    					long avail;
    					while((avail = consumer.availableToPoll()) == 0); // busy spin
    					for(int i = 0; i &lt; avail; i++) {
    						StringBuilder sb = consumer.poll();
    						if (sb == null) break; // mandatory for demuxes!
    						if (sb.length() == 0) {
    							running = false;
    							break; // exit immediately...
    						}
    						System.out.println(sb.toString());
    					}
    					consumer.donePolling();
    				}
    			}
    		};
    		
    		consumers[i].start();
		}
		
		StringBuilder sb;
		
		for(int i = 0; i &lt; 3; i++) {
			while((sb = demux.nextToDispatch()) == null); // busy spin
			sb.setLength(0);
			sb.append(&quot;message &quot;).append(i);
			demux.flush();
		}
		
		// send a message to stop consumers...
		for(int i = 0; i &lt; NUMBER_OF_CONSUMERS; i++) {
			// routing is being used here...
			while((sb = demux.nextToDispatch(i)) == null); // busy spin
			sb.setLength(0);
		}
		
		demux.flush(); // sent batch
		
		for(int i = 0; i &lt; consumers.length; i++) consumers[i].join();
	}
}
</pre>
<p><b>NOTE:</b> For demuxes it is mandatory to check if the object returned is <code>null</code> and break out of the loop. See line 33 in the example above.</p>
<h2 class="coral">Routing Messages</h2>
<p>You can also choose to <i>route</i> a message to a specific consumer. To do that, all you have to do is call <code>nextToDispatch(int consumerIndex)</code> and your message is guaranteed to be processed by that specific consumer. That can be useful to partition certain types of messages to specific consumers and avoid processing them in parallel and out of order. You can also pass a negative number as the consumer index, in which case the demux falls back to sending the message to a random consumer.</p>
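<p>A common way to pick the consumer index is to hash a partition key (a symbol, an account id, etc.) modulo the number of consumers, so that all messages for the same key are processed by the same consumer, in order. The helper below is hypothetical (it is not part of CoralQueue) and just shows one way to compute the index you would pass to <code>nextToDispatch(int)</code>:</p>

```java
public class Router {

	// Maps a partition key to a consumer index in [0, numberOfConsumers),
	// so every message with the same key lands on the same consumer.
	static int consumerIndexFor(String key, int numberOfConsumers) {
		int h = key.hashCode();
		if (h == Integer.MIN_VALUE) h = 0; // Math.abs(Integer.MIN_VALUE) overflows
		return Math.abs(h) % numberOfConsumers;
	}

	public static void main(String[] args) {
		// The same key always routes to the same consumer:
		System.out.println(consumerIndexFor("AAPL", 4) == consumerIndexFor("AAPL", 4)); // true
	}
}
```

<p>With a demux you would then call <code>demux.nextToDispatch(consumerIndexFor(key, NUMBER_OF_CONSUMERS))</code>.</p>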
<h2 class="coral">Modulus vs Atomic vs CAS</h2>
<p>Modulus is the fastest, but it has the disadvantage that a slow consumer can cause the demux queue to fill up and block all consumers. Atomic and CAS do not have this drawback; in other words, a slow consumer will not affect the whole system, as messages will simply be processed by the other available consumers. Atomic maintains a queue for each consumer and supports routing. CAS does not support routing but has the ability to help a slow consumer by processing messages pending in its queue. Modulus does not support routing either; only Atomic does.</p>
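<p>The name of the Modulus implementation hints at its dispatch rule. The sketch below is not CoralQueue code, just an assumed illustration of modulus-based dispatch and why a fixed assignment lets one slow consumer back everything up:</p>

```java
public class ModulusDispatch {

	// Modulus-style dispatch: the message with sequence s is always
	// assigned to consumer (s % numberOfConsumers). The assignment is
	// fixed, so if that consumer is slow its slots are never drained
	// by anyone else and the ring buffer eventually fills up.
	static int consumerFor(long sequence, int numberOfConsumers) {
		return (int) (sequence % numberOfConsumers);
	}

	public static void main(String[] args) {
		System.out.println(consumerFor(0, 2)); // 0
		System.out.println(consumerFor(1, 2)); // 1
		System.out.println(consumerFor(2, 2)); // 0 again
	}
}
```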
<h2 class="coral">Throughput Numbers</h2>
<p>The machine used to run the benchmark tests was an Intel i7 quad-core (4 x 3.50GHz) Ubuntu box overclocked to 4.50GHz.</p>
<p><u>ATOMIC: One producer pinned to its own core sending to two consumers, each pinned to its own core:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 390.42 millis | Min Time: 386.647 millis | Max Time: 395.172 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 390420106 nanos
Messages per second: <font color="blue"><b>25,613,434</b></font>
</pre>
<p><u>ATOMIC: One producer pinned to its own core sending to two consumers, each pinned to the same core through hyper-threading:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 388.403 millis | Min Time: 383.134 millis | Max Time: 392.386 millis | Nano Timing Cost: 15.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 388402712 nanos
Messages per second: <font color="blue"><b>25,746,473</b></font>
</pre>
<p><u>ATOMIC: One producer pinned to its own core sending to four consumers, each pair sharing one core through hyper-threading:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 506.881 millis | Min Time: 505.326 millis | Max Time: 509.197 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 506881400 nanos
Messages per second: <font color="blue"><b>19,728,480</b></font>
</pre>
<p><u>MODULUS: One producer pinned to its own core sending to two consumers, each pinned to its own core:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 207.266 millis | Min Time: 203.14 millis | Max Time: 209.389 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 207,266,446 nanos
Messages per second: <font color="blue"><b>48,247,076</b></font>
</pre>
<p><u>MODULUS: One producer pinned to its own core sending to two consumers, each pinned to the same core through hyper-threading:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 112.38 millis | Min Time: 111.737 millis | Max Time: 112.703 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 112,380,297 nanos
Messages per second: <font color="blue"><b>88,983,569</b></font>
</pre>
<p><u>MODULUS: One producer pinned to its own core sending to four consumers, each pair sharing one core through hyper-threading:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 133.334 millis | Min Time: 123.943 millis | Max Time: 140.385 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 133,334,460 nanos
Messages per second: <font color="blue"><b>74,999,366</b></font>
</pre>
<p><u>CAS: One producer pinned to its own core sending to two consumers, each pinned to its own core:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 583.802 millis | Min Time: 555.686 millis | Max Time: 588.794 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 583,802,394 nanos
Messages per second: <font color="blue"><b>17,129,083</b></font>
</pre>
<p><u>CAS: One producer pinned to its own core sending to two consumers, each pinned to the same core through hyper-threading:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 396.183 millis | Min Time: 393.696 millis | Max Time: 401.272 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 396,182,573 nanos
Messages per second: <font color="blue"><b>25,240,888</b></font>
</pre>
<p><u>CAS: One producer pinned to its own core sending to four consumers, each pair sharing one core through hyper-threading:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 567.278 millis | Min Time: 565.862 millis | Max Time: 569.079 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 567,277,788 nanos
Messages per second: <font color="blue"><b>17,628,047</b></font>
</pre>
<h2 class="coral">Conclusions</h2>
<p>If you need more horsepower on the consumer side to keep up with the producer, you can increase the number of consumers and process your messages in parallel by using a demultiplexer. If you need to route messages to a specific consumer, you can also do that with CoralQueue&#8217;s demultiplexer.</p>
<p><br/><br/></p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/demultiplexing-with-coralqueue-for-parallel-processing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multiplexing with CoralQueue</title>
		<link>https://www.coralblocks.com/index.php/multiplexing-with-coralqueue/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=multiplexing-with-coralqueue</link>
		<comments>https://www.coralblocks.com/index.php/multiplexing-with-coralqueue/#comments</comments>
		<pubDate>Wed, 18 Jun 2014 06:44:33 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralQueue]]></category>
		<category><![CDATA[disruptor]]></category>
		<category><![CDATA[multiplexer]]></category>
		<category><![CDATA[multiplexor]]></category>
		<category><![CDATA[mux]]></category>
		<category><![CDATA[ops]]></category>
		<category><![CDATA[queue]]></category>
		<category><![CDATA[throughput]]></category>

		<guid isPermaLink="false">http://cb.soliveirajr.com/index.php/?p=453</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In this article we analyze how CoralQueue implements a <em>multiplexer</em> to allow multiple producers to send messages to a single consumer. We then present throughput numbers for different configurations, each using a different set of cpu cores. <span id="more-453"></span></p>
<h2 class="coral">The Multiplexer</h2>
<p>Below is an example of using the <code>AtomicMux</code>:</p>
<pre class="brush: java; highlight: [21,29,37,40,62,64,75]; title: ; notranslate">
package com.coralblocks.coralqueue.sample.mux;

import com.coralblocks.coralbits.util.Builder;
import com.coralblocks.coralqueue.mux.AtomicMux;
import com.coralblocks.coralqueue.mux.Mux;
import com.coralblocks.coralqueue.mux.Producer;

public class SampleWithProducer {
	
	private static final int NUMBER_OF_PRODUCERS = 4;
	
	public static void main(String[] args) throws InterruptedException {
		
		Builder&lt;StringBuilder&gt; builder = new Builder&lt;StringBuilder&gt;() {
			@Override
            public StringBuilder newInstance() {
	            return new StringBuilder(1024);
            }
		};
		
		final Mux&lt;StringBuilder&gt; mux = new AtomicMux&lt;StringBuilder&gt;(1024, builder, NUMBER_OF_PRODUCERS);
		
		Thread[] producers = new Thread[NUMBER_OF_PRODUCERS];
		
		for(int i = 0; i &lt; producers.length; i++) {
			
			producers[i] = new Thread(new Runnable() {
				
				private final Producer&lt;StringBuilder&gt; producer = mux.nextProducer();

				@Override
                public void run() {
					
					StringBuilder sb;
					
					for(int j = 0; j &lt; 4; j++) {
						while((sb = producer.nextToDispatch()) == null); // busy spin
						sb.setLength(0);
						sb.append(&quot;message &quot;).append(j).append(&quot; from producer &quot;).append(producer.getIndex());
						producer.flush();
					}
					
					// send final message
					while((sb = producer.nextToDispatch()) == null); // busy spin
					sb.setLength(0); // empty string to signal we are done
					producer.flush();
                }
			}, &quot;Producer&quot; + i);
		}
		
		Thread consumer = new Thread(new Runnable() {

			@Override
            public void run() {
				
				boolean running = true;
				int finalMessages = 0;
				
				while(running) {
					
					long avail;
					while((avail = mux.availableToPoll()) == 0); // busy spin
					for(int i = 0; i &lt; avail; i++) {
						StringBuilder sb = mux.poll();
						if (sb.length() == 0) {
							if (++finalMessages == NUMBER_OF_PRODUCERS) {
								// and we are done!
								running = false;
								break;
							}
						} else {
							System.out.println(sb.toString());
						}
					}
					mux.donePolling();
				}
            }
		}, &quot;Consumer&quot;);
		
		consumer.start();
		for(int i = 0; i &lt; producers.length; i++) producers[i].start();
		
		consumer.join();
		for(int i = 0; i &lt; producers.length; i++) producers[i].join();
	}
}
</pre>
<h2 class="coral">Throughput Numbers</h2>
<p>The machine used to run the benchmark tests was an Intel i7 quad-core (4 x 3.50GHz) Ubuntu box overclocked to 4.50GHz.</p>
<p><u>Two producers pinned to their own cores sending messages to one consumer pinned to its own core:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 613.93 millis | Min Time: 583.798 millis | Max Time: 639.488 millis | Nano Timing Cost: 15.0 nanos
Average time to send 20,000,000 messages per pass in 20 passes: 613,929,765 nanos
Messages per second: <font color="blue"><b>32,577,016</b></font>
</pre>
<p><u>Two producers pinned to the same core with hyper-threading sending messages to one consumer pinned to its own core:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 560.601 millis | Min Time: 535.936 millis | Max Time: 576.715 millis | Nano Timing Cost: 14.0 nanos
Average time to send 20,000,000 messages per pass in 20 passes: 560,601,268 nanos
Messages per second: <font color="blue"><b>35,675,980</b></font>
</pre>
<p><u>Four producers pinned to two cores with hyper-threading sending messages to one consumer pinned to its own core:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 1.061 secs | Min Time: 1.03 secs | Max Time: 1.091 secs | Nano Timing Cost: 14.0 nanos
Average time to send 40,000,000 messages per pass in 20 passes: 1,060,708,245 nanos
Messages per second: <font color="blue"><b>37,710,652</b></font>
</pre>
<h2 class="coral">Conclusions</h2>
<p>With CoralQueue you can easily send messages from multiple producers to a single consumer. Its throughput ranges from 30 to 40 million messages per second.</p>
<p><br/><br/></p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/multiplexing-with-coralqueue/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multiple-Producers to Multiple-Consumers Queue</title>
		<link>https://www.coralblocks.com/index.php/multiple-producers-to-multiple-consumers-queue/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=multiple-producers-to-multiple-consumers-queue</link>
		<comments>https://www.coralblocks.com/index.php/multiple-producers-to-multiple-consumers-queue/#comments</comments>
		<pubDate>Sun, 13 Mar 2016 18:06:47 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralQueue]]></category>
		<category><![CDATA[concurrent queue]]></category>
		<category><![CDATA[coralqueue]]></category>
		<category><![CDATA[memory barriers]]></category>
		<category><![CDATA[mpmc]]></category>
		<category><![CDATA[queue]]></category>

		<guid isPermaLink="false">http://www.coralblocks.com/index.php/?p=2012</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In this article we give an example of how to use the <code>MpmcQueue</code> so you can transfer messages between any number of producers and consumers through a lock-free concurrent queue. <span id="more-2012"></span></p>
<h3 class="coral">The Sample Code</h3>
<p>Without further ado, the sample code is listed below:</p>
<pre class="brush: java; highlight: [16,24,33,35,43,59,67,72,90]; title: ; notranslate">
package com.coralblocks.coralqueue.sample.mpmc;

import java.util.concurrent.atomic.AtomicLong;

import com.coralblocks.coralqueue.mpmc.Consumer;
import com.coralblocks.coralqueue.mpmc.MpmcQueue;
import com.coralblocks.coralqueue.mpmc.Producer;

public class Sample {

	private static final int NUMBER_OF_PRODUCERS = 4;
	private static final int NUMBER_OF_CONSUMERS = 2;
	
	public static void main(String[] args) throws InterruptedException {
		
		final MpmcQueue&lt;StringBuilder&gt; mpmc = new MpmcQueue&lt;StringBuilder&gt;(1024, StringBuilder.class, NUMBER_OF_PRODUCERS, NUMBER_OF_CONSUMERS);
		
		Thread[] consumers = new Thread[NUMBER_OF_CONSUMERS];
		
		for(int i = 0; i &lt; consumers.length; i++) {
			
    		consumers[i] = new Thread(&quot;Consumer-&quot; + i) {
    			
    			private final Consumer&lt;StringBuilder&gt; consumer = mpmc.nextConsumer();
    			
    			@Override
    			public void run() {
    				
    				boolean running = true;
    				
    				while(running) {
    					long avail;
    					while((avail = consumer.availableToPoll()) == 0); // busy spin
    					for(int i = 0; i &lt; avail; i++) {
    						StringBuilder sb = consumer.poll();
    						if (sb == null) break; // mandatory for mpmc!
    						if (sb.length() == 0) {
								running = false;
								break; // exit immediately...
    						}
    						System.out.println(sb.toString() + &quot; got by consumer &quot; + consumer.getIndex());
    					}
    					consumer.donePolling();
    				}
    			}
    		};
    		
    		consumers[i].start();
		}
		
		Thread[] producers = new Thread[NUMBER_OF_PRODUCERS];
		
		final AtomicLong counter = new AtomicLong(0);
		
		for(int i = 0; i &lt; producers.length; i++) {
			
    		producers[i] = new Thread(&quot;Producer-&quot; + i) {
    			
    			private final Producer&lt;StringBuilder&gt; producer = mpmc.nextProducer();
    			
    			@Override
    			public void run() {

    				StringBuilder sb;
    				
    				for(int i = 0; i &lt; 3; i++) {
    					while((sb = producer.nextToDispatch()) == null); // busy spin
    					long msgNumber = counter.getAndIncrement();
    					sb.setLength(0);
    					sb.append(&quot;message &quot;).append(msgNumber);
    					System.out.println(&quot;sending message &quot; + msgNumber + &quot; from producer &quot; + producer.getIndex());
    					producer.flush();
    				}
    			}
    		};
    		
    		producers[i].start();
		}
		
		for(int i = 0; i &lt; producers.length; i++) producers[i].join();
		
		Thread.sleep(4000);
		
		Producer&lt;StringBuilder&gt; p = mpmc.getProducer(0);
	
		for(int i = 0; i &lt; consumers.length; i++) {
			StringBuilder sb;
			// send a message to stop consumers...
			// routing is being used here...
			while((sb = p.nextToDispatch(i)) == null); // busy spin
			sb.setLength(0);
		}
		
		p.flush();
		
		for(int i = 0; i &lt; consumers.length; i++) consumers[i].join();
	}
}
</pre>
<p><br/></p>
<h3 class="coral">Routing Messages</h3>
<p>You can also choose to <i>route</i> a message to a specific consumer. To do that, all you have to do is call <code>nextToDispatch(int consumerIndex)</code> and your message is guaranteed to be processed by that specific consumer. That can be useful to partition certain types of messages to specific consumers and avoid processing them in parallel and out of order. You can also pass a negative number as the consumer index, in which case the mpmc queue falls back to sending the message to a random consumer.<br />
<br/></p>
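<p>The routing call above is typically driven by a message key. As a hedged illustration in plain Java (this is not the CoralQueue API, just the partitioning idea), a consumer index can be derived from a key so that all messages for the same key land on the same consumer and stay in order:</p>

```java
public class RoutingSketch {

    // Map a message key to a consumer index so that messages with the
    // same key are always handled by the same consumer (and stay ordered).
    public static int chooseConsumer(CharSequence key, int numberOfConsumers) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i); // same scheme as String.hashCode()
        }
        // mask the sign bit so the index is never negative
        return (h & 0x7fffffff) % numberOfConsumers;
    }

    public static void main(String[] args) {
        // the same key always routes to the same consumer index
        System.out.println(chooseConsumer("GOOG", 4) == chooseConsumer("GOOG", 4));
    }
}
```

<p>The index produced this way would then be passed straight to <code>nextToDispatch(int consumerIndex)</code>.</p>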
<h3 class="coral">Conclusion</h3>
<p>With CoralQueue you can easily build an architecture where multiple producers send messages to multiple consumers in a lock-free, super fast way.</p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/multiple-producers-to-multiple-consumers-queue/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multicasting with CoralQueue through a Splitter</title>
		<link>https://www.coralblocks.com/index.php/multicasting-with-coralqueue-through-a-splitter/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=multicasting-with-coralqueue-through-a-splitter</link>
		<comments>https://www.coralblocks.com/index.php/multicasting-with-coralqueue-through-a-splitter/#comments</comments>
		<pubDate>Tue, 17 Jun 2014 16:19:06 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralQueue]]></category>
		<category><![CDATA[affinity]]></category>
		<category><![CDATA[broadcast]]></category>
		<category><![CDATA[disruptor]]></category>
		<category><![CDATA[multicast]]></category>
		<category><![CDATA[queue]]></category>
		<category><![CDATA[splitter]]></category>
		<category><![CDATA[throughput]]></category>

		<guid isPermaLink="false">http://cb.soliveirajr.com/index.php/?p=440</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In this article we show how to use CoralQueue to multicast/broadcast the same message to multiple consumers so each consumer receives and processes all messages. We also present throughput numbers for different configurations, each one using a different set of cpu cores.<span id="more-440"></span></p>
<h2 class="coral">The Splitter</h2>
<p>Below is an example of how to use the <code>Splitter</code>:</p>
<pre class="brush: java; highlight: [13,21,30,32,40,51,54]; title: ; notranslate">
package com.coralblocks.coralqueue.sample.splitter;

import com.coralblocks.coralqueue.splitter.AtomicSplitter;
import com.coralblocks.coralqueue.splitter.Consumer;
import com.coralblocks.coralqueue.splitter.Splitter;

public class SampleWithConsumer {

	private static final int NUMBER_OF_CONSUMERS = 4;
	
	public static void main(String[] args) throws InterruptedException {
		
		final Splitter&lt;StringBuilder&gt; splitter = new AtomicSplitter&lt;StringBuilder&gt;(1024, StringBuilder.class, NUMBER_OF_CONSUMERS);
		
		Thread[] consumers = new Thread[NUMBER_OF_CONSUMERS];
		
		for(int i = 0; i &lt; consumers.length; i++) {
			
    		consumers[i] = new Thread(&quot;Consumer-&quot; + i) {
    			
    			private final Consumer&lt;StringBuilder&gt; consumer = splitter.nextConsumer();
    			
    			@Override
    			public void run() {
    				
    				boolean running = true;
    				
    				while(running) {
    					long avail;
    					while((avail = consumer.availableToPoll()) == 0); // busy spin
    					for(int i = 0; i &lt; avail; i++) {
    						StringBuilder sb = consumer.poll();
    						if (sb == null) break; // mandatory for splitters!
    						if (sb.length() == 0) {
    							running = false;
    							break; // exit immediately...
    						}
    						System.out.println(&quot;got &quot; + sb.toString() + &quot; at consumer &quot; + consumer.getIndex());
    					}
    					consumer.donePolling();
    				}
    			}
    		};
    		
    		consumers[i].start();
		}
		
		StringBuilder sb;
		
		for(int i = 0; i &lt; 3; i++) {
			while((sb = splitter.nextToDispatch()) == null); // busy spin
			sb.setLength(0);
			sb.append(&quot;message &quot;).append(i);
			splitter.flush();
		}
		
		// send a message to stop consumers...
		while((sb = splitter.nextToDispatch()) == null); // busy spin
		sb.setLength(0);
		splitter.flush();
		
		for(int i = 0; i &lt; consumers.length; i++) consumers[i].join();
	}
}
</pre>
<h2 class="coral">Throughput Numbers</h2>
<p>The machine used to run the benchmark tests was an Intel i7 quad-core (4 x 3.50GHz) Ubuntu box overclocked to 4.50GHz.</p>
<p><u>One producer pinned to its own core sending to two consumers, each pinned to its own core:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 206.235 millis | Min Time: 203.929 millis | Max Time: 207.908 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 206,235,222 nanos
Messages per second: <font color="blue"><b>48,488,322</b></font>
</pre>
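<p>As a sanity check, the reported rate follows directly from the averaged pass time (a quick plain-Java calculation, not part of the benchmark harness):</p>

```java
public class ThroughputCheck {

    // rate = messages / (elapsed nanos / 1e9), computed in long arithmetic
    public static long messagesPerSecond(long messages, long avgNanos) {
        return messages * 1_000_000_000L / avgNanos;
    }

    public static void main(String[] args) {
        // 10,000,000 messages in 206,235,222 nanos per pass
        System.out.println(messagesPerSecond(10_000_000L, 206_235_222L)); // prints 48488322
    }
}
```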
<p><u>One producer pinned to its own core sending to two consumers sharing a single core through hyper-threading:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 217.354 millis | Min Time: 216.239 millis | Max Time: 218.286 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 217,353,789 nanos
Messages per second: <font color="blue"><b>46,007,939</b></font>
</pre>
<p><u>One producer pinned to its own core sending to four consumers, each pair sharing one core through hyper-threading:</u></p>
<pre>
Results: Iterations: 20 | Avg Time: 225.742 millis | Min Time: 224.252 millis | Max Time: 228.717 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10,000,000 messages per pass in 20 passes: 225,742,165 nanos
Messages per second: <font color="blue"><b>44,298,325</b></font>
</pre>
<h2 class="coral">Conclusions</h2>
<p>CoralQueue can multicast messages to a set of consumers at an approximate rate of 45 million messages per second. It is very simple to use the <code>Splitter</code> to broadcast messages from a single producer to multiple consumers.</p>
<p><br/><br/></p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/multicasting-with-coralqueue-through-a-splitter/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Inter-Process Communication with CoralQueue</title>
		<link>https://www.coralblocks.com/index.php/inter-process-communication-with-coralqueue/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=inter-process-communication-with-coralqueue</link>
		<comments>https://www.coralblocks.com/index.php/inter-process-communication-with-coralqueue/#comments</comments>
		<pubDate>Thu, 30 Apr 2015 17:19:30 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralQueue]]></category>
		<category><![CDATA[coralqueue]]></category>
		<category><![CDATA[disruptor]]></category>
		<category><![CDATA[inter-process communication]]></category>
		<category><![CDATA[ipc]]></category>
		<category><![CDATA[putLongVolatile]]></category>
		<category><![CDATA[unsafe]]></category>

		<guid isPermaLink="false">http://www.coralblocks.com/index.php/?p=1468</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>CoralQueue is great for inter-thread communication, when both threads are running in the same JVM. However, it also supports <i>inter-process communication</i> (IPC) through a shared memory-mapped file, so that two threads running on the same machine but in different JVMs can exchange messages. This is much faster than the alternative, which would be network access through loopback. In this article we will examine how this can be easily done and present the benchmark numbers for IPC. <span id="more-1468"></span></p>
<p><b>NOTE:</b> The memory-mapped file is just used here as an in-memory <i>bridge</i> between the two JVMs. It is not used for any kind of file persistence. If you want to log huge amounts of data without producing any garbage and with ultra-low latency you can refer to <a href="/corallog" target="_blank">CoralLog</a>.</p>
<h3 class="coral">Serializing the Transfer Object to Memory</h3>
<p>In order to transfer your Java object with all its data from one JVM to another, you must make it implement the <code>MemorySerializable</code> interface. The purpose of this interface is to define how the object contents will be written to and read from memory as a sequence of primitives. The interface is listed below:</p>
<pre class="brush: java; title: ; notranslate">
package com.coralblocks.coralbits.mem;

public interface MemorySerializable {
	
	public void writeTo(long pointer, Memory memory);
	
	public void readFrom(long pointer, Memory memory);
	
}
</pre>
<p>For example, serializing the <code>MutableLong</code> class is easy:</p>
<pre class="brush: java; title: ; notranslate">
// com.coralblocks.coralqueue.util.MutableLong implements MemorySerializable

	@Override
    public void writeTo(long pointer, Memory memory) {
		// write the value of this MutableLong object to memory
		memory.putLong(pointer, get()); // get() is the method from MutableLong that returns its value
	}

	@Override
    public void readFrom(long pointer, Memory memory) {
		// read the value of this MutableLong object from memory
		set(memory.getLong(pointer)); // set() is the method from MutableLong to change its value
    }
</pre>
<p>The interface <code>Memory</code> has putXXX and getXXX methods for each primitive as you can see below:</p>
<pre class="brush: java; title: ; notranslate">
package com.coralblocks.coralbits.mem;

public interface Memory {
	
	public long getLong(long address);

	public void putLong(long address, long value);
	
	public int getInt(long address);
	
	public void putInt(long address, int value);
	
	public byte getByte(long address);
	
	public void putByte(long address, byte value);
	
	public short getShort(long address);
	
	public void putShort(long address, short value);

	public char getChar(long address);
	
	public void putChar(long address, char value);
}
</pre>
<p>Below are a couple more examples of how to serialize an object to memory:</p>
<pre class="brush: java; title: ; notranslate">
	public class Message1 implements MemorySerializable {
		
		private static final int LEN = 32;
		public static final int MAX_SIZE_IN_BYTES = LEN;
		
		private final StringBuilder sb = new StringBuilder(LEN);
		
		@Override
        public void writeTo(long pointer, Memory memory) {
			int sbLength = sb.length();
			for(int i = 0; i &lt; LEN; i++) {
				char c = ' '; // pad with blank
				if (i &lt; sbLength) c = sb.charAt(i);
				// putChar writes 1 byte
				memory.putChar(pointer + i, c);
			}
        }

		@Override
        public void readFrom(long pointer, Memory memory) {
			sb.setLength(0);
			for(int i = 0; i &lt; LEN; i++) {
				// getChar reads 1 byte...
				sb.append(memory.getChar(pointer + i));
			}
        }
		
		// (...)
	}
	
	public class Message2 implements MemorySerializable {
		
		private static final int SIZE_OF_LONG_IN_BYTES = 8;
		private static final int LEN = 32;
		public static final int MAX_SIZE_IN_BYTES = LEN * SIZE_OF_LONG_IN_BYTES;
		
		private final long[] data = new long[LEN];
		
		@Override
        public void writeTo(long pointer, Memory memory) {
			for(int i = 0; i &lt; LEN; i++) {
				// putLong writes 8 bytes...
				memory.putLong(pointer + (i * SIZE_OF_LONG_IN_BYTES), data[i]);
			}
        }

		@Override
        public void readFrom(long pointer, Memory memory) {
			for(int i = 0; i &lt; LEN; i++) {
				// getLong reads 8 bytes...
				data[i] = memory.getLong(pointer + (i * SIZE_OF_LONG_IN_BYTES));
			}
        }
		
		// (...)
	}
</pre>
<p><br/></p>
<h3 class="coral">Writing the Producer</h3>
<p>The IPC producer follows the same simple API and design patterns as the rest of CoralQueue. Below is a producer example:</p>
<pre class="brush: java; title: ; notranslate">
package com.coralblocks.coralqueue.test;

import com.coralblocks.coralbits.util.SystemUtils;
import com.coralblocks.coralqueue.offheap.OffHeapProducer;
import com.coralblocks.coralqueue.util.MutableLong;

public class TestOffHeapProducer {
	
	public static void main(String[] args) throws Exception {
		
		final int messages = args.length &gt; 0 ? Integer.parseInt(args[0]) : 10000000;
		
		final int queueCapacity = SystemUtils.getInt(&quot;capacity&quot;, 1024);
		
		final int maxObjectSize = 8; // a mutable long of course will have a max size of 8 bytes
		
		final OffHeapProducer&lt;MutableLong&gt; producer = 
			new OffHeapProducer&lt;MutableLong&gt;(queueCapacity, maxObjectSize, MutableLong.class, &quot;testIPC.mmap&quot;);

		long time = System.currentTimeMillis();
		
		for(int i = 1; i &lt;= messages; i++) {
			MutableLong ml;
			while((ml = producer.nextToDispatch()) == null); // busy spin... (no wait strategy)
			ml.set(i);
			producer.flush();
		}
		
		time = System.currentTimeMillis() - time;
		
		System.out.println(&quot;Number of messages: &quot; + messages);
		System.out.println(&quot;Total time: &quot; + time);
	}
}
</pre>
<p><b>NOTE:</b> You must correctly specify the max object size that will be transferred so that a proper queue size can be calculated. Since a <code>MutableLong</code> only has one long as its data, the max size is of course 8 bytes (i.e. the size of a Java long).</p>
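<p>As a rough back-of-envelope (hypothetical: the real file layout likely adds per-slot and header bookkeeping that we ignore here), the mapped file must hold at least capacity &#215; maxObjectSize bytes:</p>

```java
public class QueueSizeEstimate {

    // lower bound on the mapped file size: one fixed-size slot per queue entry
    // (any real implementation will add some bookkeeping on top of this)
    public static long minMappedBytes(int queueCapacity, int maxObjectSize) {
        return (long) queueCapacity * maxObjectSize;
    }

    public static void main(String[] args) {
        // 1024 slots of 8-byte MutableLongs => at least 8 KB of shared memory
        System.out.println(minMappedBytes(1024, 8)); // prints 8192
    }
}
```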
<p><br/></p>
<h3 class="coral">Writing the Consumer</h3>
<p>Again, the consumer follows the same simple API and design patterns. Below is a consumer example:</p>
<pre class="brush: java; title: ; notranslate">
package com.coralblocks.coralqueue.test;

import com.coralblocks.coralbits.util.SystemUtils;
import com.coralblocks.coralqueue.offheap.OffHeapConsumer;
import com.coralblocks.coralqueue.util.MutableLong;

public class TestOffHeapConsumer {

	public static void main(String[] args) throws Exception {
		
		final int messages = args.length &gt; 0 ? Integer.parseInt(args[0]) : 10000000;
		
		final int queueCapacity = SystemUtils.getInt(&quot;capacity&quot;, 1024);
		
		final int maxObjectSize = 8; // a mutable long of course will have a max size of 8 bytes
		
		final OffHeapConsumer&lt;MutableLong&gt; consumer = 
			new OffHeapConsumer&lt;MutableLong&gt;(queueCapacity, maxObjectSize, MutableLong.class, &quot;testIPC.mmap&quot;);
		
		long expectedSeq = 1;
		
		int count = 0;
		
		boolean running = true;
		
		long time = System.currentTimeMillis();

		while(running) {
			
			long x = consumer.availableToPoll();
			
			if (x &gt; 0) {
				
				for(int i = 0; i &lt; x; i++) {
				
					 MutableLong ml = consumer.poll();
					 
					 long seq = ml.get();
					 
					 if (seq == expectedSeq) {
						expectedSeq++;
					 } else {
						throw new IllegalStateException(&quot;Got bad sequence! expected=&quot;+ expectedSeq + &quot; received=&quot; + seq);
					 }
					 
					 if (++count == messages) {
						 running = false;
						 break;
					 }
				}
				
				consumer.donePolling();
				
			} else {
				// busy spin... (no wait strategy)
			}
		}
		
		time = System.currentTimeMillis() - time;
		
		System.out.println(&quot;Number of messages: &quot; + count);
		System.out.println(&quot;Total time: &quot; + time);
		
	}
}
</pre>
<p><br/></p>
<h3 class="coral">Latency Numbers</h3>
<p>IPC is very fast: not as fast as inter-thread communication inside the same JVM (<a href="/index.php/2014/04/coralqueue-performance-numbers/" target="_blank">around 53 nanos</a>), but much faster than network access (<a href="/index.php/coralreactor-performance-numbers/" target="_blank">around 2.15 micros</a>). Below we present the message latencies for CoralQueue&#8217;s IPC. The machine used for the benchmarks was an Intel i7 quad-core (4 x 3.50GHz) Ubuntu box overclocked to 4.50GHz.</p>
<pre>
Messages: 1,350,000 (8-byte size)
Avg Time: <font color="blue"><b>97.64 nanos</b></font>
Min Time: 61.0 nanos
Max Time: 3.922 micros
75% = [avg: 95.0 nanos, max: 103.0 nanos]
90% = [avg: 96.0 nanos, max: 105.0 nanos]
99% = [avg: 97.0 nanos, max: 110.0 nanos]
99.9% = [avg: 97.0 nanos, max: 163.0 nanos]
99.99% = [avg: 97.0 nanos, max: 239.0 nanos]
99.999% = [avg: 97.0 nanos, max: 1.597 micros]
</pre>
<p><br/></p>
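<p>Putting the three transports side by side, using the average figures quoted above, the relative costs work out as follows:</p>

```java
public class LatencyComparison {

    public static double ratio(double a, double b) {
        return a / b;
    }

    public static void main(String[] args) {
        double interThread = 53.0;  // nanos, same JVM (figure quoted above)
        double ipc = 97.64;         // nanos, shared memory-mapped file
        double network = 2150.0;    // nanos (2.15 micros), loopback TCP

        // IPC costs roughly 1.8x inter-thread communication...
        System.out.println(ratio(ipc, interThread));
        // ...while loopback network access costs roughly 22x IPC
        System.out.println(ratio(network, ipc));
    }
}
```

<p>So IPC sits much closer to the inter-thread end of the spectrum than to the network end.</p>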
<h3 class="coral">Conclusion</h3>
<p>Inter-process communication offers an <i>in-between</i> solution for messaging. At one extreme you have inter-thread communication inside the same JVM and at the other you have network access for processes running on different machines. IPC is a viable solution when, for some reason, you can&#8217;t run both threads inside the same JVM but are not willing to pay the extra cost of network access. CoralQueue offers IPC through a very easy and straightforward API, following the same design principles as CoralQueue&#8217;s <code>AtomicQueue</code>.</p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/inter-process-communication-with-coralqueue/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Thread Concurrency vs Network Asynchronicity</title>
		<link>https://www.coralblocks.com/index.php/thread-concurrency-vs-network-asynchronicity/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=thread-concurrency-vs-network-asynchronicity</link>
		<comments>https://www.coralblocks.com/index.php/thread-concurrency-vs-network-asynchronicity/#comments</comments>
		<pubDate>Mon, 09 Feb 2015 19:09:24 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralQueue]]></category>
		<category><![CDATA[CoralReactor]]></category>
		<category><![CDATA[asynchonous messages]]></category>
		<category><![CDATA[coralqueue]]></category>
		<category><![CDATA[coralreactor]]></category>
		<category><![CDATA[disruptor]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[MQ]]></category>
		<category><![CDATA[multiplexor]]></category>
		<category><![CDATA[nio]]></category>
		<category><![CDATA[rabbitmq]]></category>

		<guid isPermaLink="false">http://www.coralblocks.com/index.php/?p=1925</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In this article we study two different ways of handling client requests that involve a <i>blocking</i> operation: multithreaded programming through concurrent queues and asynchronous network calls through distributed systems. <span id="more-1925"></span></p>
<h3 class="coral">The Problem</h3>
<p>We have clients connected to an HTTP server (or any TCP server) sending requests that require a heavy computation; in other words, each request needs to execute some code that can take <strong>an arbitrary amount of time to complete</strong>. If we isolate this time-consuming code in a function, we can then call this function a <strong>blocking call</strong>. Simple examples would be a function that queries a database or a function that manipulates a large image file.</p>
<p><a target="_blank" href="http://www.coralblocks.com/wp-content/uploads/2017/08/Screen-Shot-2017-08-21-at-11.49.05-AM.png"><img src="http://www.coralblocks.com/wp-content/uploads/2017/08/Screen-Shot-2017-08-21-at-11.49.05-AM-300x157.png" alt="HLBC2" width="300" height="157" class="aligncenter size-medium wp-image-2166" /></a></p>
<p>In the old model, where each connection was handled by its own dedicated thread, there would be no problem. But in the new reactor model, where a single thread handles thousands of connections, all it takes is one connection executing a blocking call to block all the other connections. When you have a single-threaded system, the worst thing that can happen is blocking your critical thread. How do we solve this problem without reverting to the old <i>one-thread-per-connection</i> model?</p>
<p><a target="_blank" href="http://www.coralblocks.com/wp-content/uploads/2017/08/Screen-Shot-2017-08-21-at-11.55.55-AM.png"><img src="http://www.coralblocks.com/wp-content/uploads/2017/08/Screen-Shot-2017-08-21-at-11.55.55-AM-1024x795.png" alt="oldModel" width="600" height="465" class="aligncenter size-large wp-image-2168" /></a></p>
<p><br/></p>
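<p>To see concretely why a single blocking call stalls every other connection in a one-thread reactor, here is a minimal plain-Java sketch (not CoralReactor code, just the event-loop principle):</p>

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class BlockedReactorSketch {

    // A single-threaded "reactor": events are handled one at a time on the
    // same thread, so one slow handler delays every event queued behind it.
    public static long runLoop(Queue<Runnable> events) {
        long start = System.nanoTime();
        while (!events.isEmpty()) {
            events.poll().run(); // the handler runs on the reactor thread itself
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        Queue<Runnable> events = new ArrayDeque<>();
        events.add(() -> sleep(100)); // one "blocking call" (e.g. a db query)
        for (int i = 0; i < 10; i++) {
            events.add(() -> { /* a fast request, now stuck in line */ });
        }
        long elapsedMillis = runLoop(events) / 1_000_000;
        // every fast request had to wait for the 100ms blocking call to finish
        System.out.println(elapsedMillis >= 100);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```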
<h3 class="coral">Solution #1: Thread Concurrency</h3>
<p>The first solution is described in detail <a href="/index.php/2015/01/architecture-case-study-1-coralreactor-coralqueue/" target="_blank">in this article</a>. You basically use CoralQueue to distribute the requests&#8217; work (not the requests themselves) to a fixed number of threads that will execute them concurrently (i.e. in parallel). Let&#8217;s say you have 1000 simultaneous connections. Instead of having 1000 simultaneous threads (i.e. the impractical <em>one-thread-per-connection</em> model) you can analyze how many available CPU cores your machine has and choose a much smaller number of threads, let&#8217;s say 4. This architecture will give you the following advantages:</p>
<style>
.li_adv { margin: 0 0 17px 17px; }
</style>
<ul>
<li class="li_adv">The critical reactor thread handling the http server requests will never block because the work necessary for each request will be simply added to a queue, freeing the reactor thread to handle additional incoming http requests.</li>
<li class="li_adv">Even if a thread or two get a request that takes a long time to complete, the other threads can continue to drain the requests sitting on the queue.</li>
</ul>
<p>If you can guess in advance which requests will take a long time to execute, you can even partition the queue in lanes and have a fast-track lane for high-priority / fast requests, so they always find a free thread to execute.</p>
<p><a target="_blank" href="http://www.coralblocks.com/wp-content/uploads/2017/08/Screen-Shot-2017-08-21-at-12.09.04-PM.png"><img src="http://www.coralblocks.com/wp-content/uploads/2017/08/Screen-Shot-2017-08-21-at-12.09.04-PM-1024x769.png" alt="CoralQueue_model" width="600" height="450" class="aligncenter size-large wp-image-2170" /></a></p>
<p><br/></p>
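<p>The fast-track lane idea above can be sketched with standard-library queues (a hypothetical stand-in, not the CoralQueue lane API): requests expected to be quick go to their own lane so they never wait behind a long-running request.</p>

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class LaneSketch {

    private final Queue<String> fastLane = new ArrayDeque<>();
    private final Queue<String> slowLane = new ArrayDeque<>();

    // route requests we expect to be quick to their own lane so they
    // always find a free thread draining that lane
    public void dispatch(String request, boolean expectedFast) {
        (expectedFast ? fastLane : slowLane).add(request);
    }

    public int fastPending() { return fastLane.size(); }
    public int slowPending() { return slowLane.size(); }

    public static void main(String[] args) {
        LaneSketch lanes = new LaneSketch();
        lanes.dispatch("GET /price/GOOG", true);  // cheap lookup
        lanes.dispatch("POST /report", false);    // heavy computation
        System.out.println(lanes.fastPending() + " " + lanes.slowPending());
    }
}
```

<p>In a real deployment each lane would be a concurrent queue drained by its own pool of worker threads.</p>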
<h3 class="coral">Solution #2: Distributed Systems</h3>
<p>Instead of doing everything on a single machine, with limited CPU cores, you can use a distributed system architecture and take advantage of asynchronous network calls. That simplifies the http server handling the requests, which now does not need any additional threads or concurrent queues. It can do everything on a single, non-blocking reactor thread. It works like this:</p>
<ul>
<li class="li_adv">Instead of doing the heavy computation on the http server itself, you can move this task to another machine (i.e. node).</li>
<li class="li_adv">Instead of distributing work across threads using CoralQueue, you can simply make an asynchronous network call and pass the work to another node responsible for the heavy computation task.</li>
<li class="li_adv">The http server will <strong>asynchronously wait</strong> for the response from the heavy computation node. The response can take as long as necessary to arrive through the network because the <strong>http server will never block</strong>.</li>
<li class="li_adv">The http server can use only one thread to handle incoming http connections from external clients and outgoing tcp connections to the internal nodes doing the heavy computation work.</li>
<li class="li_adv">And the beauty of it is that you can scale by simply adding/removing nodes as necessary. Dynamic load balancing becomes trivial.</li>
<li class="li_adv">Failover is not that hard either: If one node fails, the clients waiting on that node can re-send their work to another node.</li>
</ul>
<p>Now you might ask: How do we implement the architecture for this new node responsible for the heavy computation work? Aren&#8217;t we just transferring the problem from one machine to another? Yes, but with one important difference: now you can add and remove nodes dynamically, as needed. Before, you were stuck with the number of available CPU cores in your single machine. It is also important to note that the http server does not care or need to know how the nodes choose to implement the heavy computation task. All it needs to do is send the asynchronous requests. As far as the http server is concerned, the heavy computation node can use the best or the worst architecture to do its job. The server will make a request and wait asynchronously for the answer.</p>
<p><a target="_blank" href="http://www.coralblocks.com/wp-content/uploads/2016/02/Screen-Shot-2017-08-21-at-12.26.54-PM.png"><img src="http://www.coralblocks.com/wp-content/uploads/2016/02/Screen-Shot-2017-08-21-at-12.26.54-PM-1024x679.png" alt="Screen Shot 2017-08-21 at 12.26.54 PM" width="600" height="397" class="aligncenter size-large wp-image-2174" /></a></p>
<p><br/></p>
<h3 class="coral">An Example</h3>
<p>Let&#8217;s say we have an http server that receives requests from clients for stock prices. The way it knows the price of a stock is by making an http request to GoogleFinance to discover the price. If making a request to Google is a blocking call (and it is, since you cannot know in advance how long it will take to get a response) we can use Solution #1. Requests will be distributed across threads that will process them in parallel, blocking if necessary to wait for Google to respond with a price. But wait a minute, <strong>why can&#8217;t we just treat Google as a separate node in our distributed system and make an asynchronous call to its http servers?</strong> That&#8217;s Solution #2 and the code below shows how it can be implemented:</p>
<pre class="brush: java; title: ; notranslate">
/*
 * Copyright (c) CoralBlocks LLC (c) 2017
 */
package com.coralblocks.coralreactor.client.bench.google;

import java.nio.ByteBuffer;
import java.util.Iterator;

import com.coralblocks.coralbits.ds.IdentityMap;
import com.coralblocks.coralbits.ds.PooledLinkedList;
import com.coralblocks.coralbits.util.ByteBufferUtils;
import com.coralblocks.coralreactor.client.Client;
import com.coralblocks.coralreactor.nio.NioReactor;
import com.coralblocks.coralreactor.server.Server;
import com.coralblocks.coralreactor.server.http.HttpServer;
import com.coralblocks.coralreactor.util.Configuration;
import com.coralblocks.coralreactor.util.MapConfiguration;

public class AsyncHttpServer extends HttpServer implements GoogleFinanceListener {
	
	public class AsyncHttpAttachment extends HttpAttachment {
		// store the symbol requested by each client so we can re-send during failover...
		StringBuilder symbol = new StringBuilder(32);
		
		@Override
		public void reset(long clientId, Client client) {
			super.reset(clientId, client);
			symbol.setLength(0); // start with a fresh empty one...
		}
	}
	
	// number of http clients used to connect to google
	private final int connectionsToGoogle; 
	
	// the clients used to connect to google
	private final GoogleFinanceClient[] googleClients; 
	
	// a list of clients waiting for responses from google (for each google http connection)
	private final IdentityMap&lt;GoogleFinanceClient, PooledLinkedList&lt;Client&gt;&gt; pendingRequests; 

	private final StringBuilder symbol = new StringBuilder(32);
	private final StringBuilder price = new StringBuilder(32);

	public AsyncHttpServer(NioReactor nio, int port, Configuration config) {
	    super(nio, port, config);
	    this.connectionsToGoogle = config.getInt(&quot;connectionsToGoogle&quot;);
	    this.googleClients  = new GoogleFinanceClient[connectionsToGoogle];
	    this.pendingRequests = new IdentityMap&lt;GoogleFinanceClient, PooledLinkedList&lt;Client&gt;&gt;(connectionsToGoogle);
	    
	    MapConfiguration googleFinanceConfig = new MapConfiguration();
	    googleFinanceConfig.add(&quot;readBufferSize&quot;, 512 * 1024); // the html page returned is big...
	    
	    for(int i = 0; i &lt; googleClients.length; i++) {
	    	googleClients[i] = new GoogleFinanceClient(nio, &quot;www.google.com&quot;, 80, googleFinanceConfig);
	    	googleClients[i].addListener(this);
	    	googleClients[i].open();
	    	pendingRequests.put(googleClients[i], new PooledLinkedList&lt;Client&gt;());
	    }
    }
	
	@Override
	protected Attachment createAttachment() {
		return new AsyncHttpAttachment(); // let's use our attachment
	}
	
	private CharSequence parseSymbolFromClientRequest(ByteBuffer request) {
		// for simplicity we assume that the symbol is the request
		// Ex: GET /GOOG HTTP/1.1 =&gt; the symbol is GOOG
		
		int pos = ByteBufferUtils.positionOf(request, '/');
		
		if (pos == -1) return null;
		
		request.position(pos + 1);
		
		pos = ByteBufferUtils.positionOf(request, ' ');
		
		if (pos == -1) return null;
		
		request.limit(pos);
		
		symbol.setLength(0);
		ByteBufferUtils.parseString(request, symbol); // read from ByteBuffer to StringBuilder
		
		return symbol;
	}
	
	private GoogleFinanceClient chooseGoogleClient(long clientId) {
		// try as much as you can to get a google client...
		// that's because some connections might be dead
		for(int i = 0; i &lt; connectionsToGoogle; i++) {
			int index = (int) ((clientId + i) % connectionsToGoogle);
			GoogleFinanceClient googleClient = googleClients[index];
			if (googleClient.isConnectionOpen()) return googleClient;
		}
		return null;
	}
	
	@Override
	protected void handleMessage(Client client, ByteBuffer msg) {
		
		AsyncHttpAttachment a = (AsyncHttpAttachment) getAttachment(client);
		
		ByteBuffer request = a.getRequest();
		
		CharSequence symbol = parseSymbolFromClientRequest(request);
		
		if (symbol == null) {
			System.err.println(&quot;Bad request from client: &quot; + client);
			return;
		}
		
		a.symbol.setLength(0);
		a.symbol.append(symbol);

		sendToGoogle(client, symbol);
	}
	
	private void sendToGoogle(Client client, CharSequence symbol) {
		
		long clientId = getClientId(client);
		
		// distribute requests across our Google http clients...
		GoogleFinanceClient googleClient = chooseGoogleClient(clientId);

		if (googleClient == null) {
			System.err.println(&quot;It looks like all google clients are dead! Dropping request from client: &quot; + client);
			return;
		}
		
		// send the request to google (it fully supports http pipelining)
		googleClient.sendPriceRequest(symbol);
		
		// add this client to the line of clients waiting for a response from the google http client
		pendingRequests.get(googleClient).add(client);
	}
	
	@Override // from GoogleFinanceListener interface
    public void onSymbolPrice(GoogleFinanceClient googleClient, CharSequence symbol, ByteBuffer priceBuffer) {
		
		// Got a response from google, respond to the client waiting for the price...
		
		PooledLinkedList&lt;Client&gt; clients = pendingRequests.get(googleClient);
		Client client = clients.removeFirst();
		
		price.setLength(0);
		ByteBufferUtils.parseString(priceBuffer, price);
		
		CharSequence response = getHttpResponse(price);
		client.send(response);
    }

	@Override // from GoogleFinanceListener interface
    public void onConnectionOpened(GoogleFinanceClient client) {
		// NOOP
    }

	@Override // from GoogleFinanceListener interface
    public void onConnectionTerminated(GoogleFinanceClient googleClient) {
		
		// Our connection to google was broken...
		// failover all clients waiting on this google connection by re-sending them to another google connection

		PooledLinkedList&lt;Client&gt; clients = pendingRequests.get(googleClient);
		Iterator&lt;Client&gt; iter = clients.iterator();
		while(iter.hasNext()) {
			Client c = iter.next();
			AsyncHttpAttachment a = (AsyncHttpAttachment) getAttachment(c);
			if (a.symbol.length() &gt; 0) {
				sendToGoogle(c, a.symbol); // re-send
			}
		}
		clients.clear();
    }
	
	public static void main(String[] args) {
		
		int connectionsToGoogle = Integer.parseInt(args[0]);
		int port = Integer.parseInt(args[1]);
		
		NioReactor nio = NioReactor.create();
		MapConfiguration config = new MapConfiguration();
		config.add(&quot;connectionsToGoogle&quot;, connectionsToGoogle);
		Server server = new AsyncHttpServer(nio, port, config);
		server.open();
		nio.start();
	}
}

</pre>
<p><br/><br />
The advantages of the code above are:</p>
<ul>
<li class="li_adv">It is small and simple.</li>
<li class="li_adv">It only uses one thread, the critical reactor thread, for all network activity.</li>
<li class="li_adv">There is no multithreaded programming, no blocking and no concurrent queues.</li>
<li class="li_adv">It distributes the load across a set of connections to GoogleFinance (load balancing).</li>
<li class="li_adv">If one connection to GoogleFinance fails, it re-sends the pending requests on that connection to other connections (failover).</li>
<li class="li_adv">You can scale the front-end to support a larger number of simultaneous clients and decrease latency by launching more http servers pinned to other cpu cores.</li>
<li class="li_adv">You can scale the back-end to increase throughput by adding more connections to GoogleFinance (i.e. <code>connectionsToGoogle</code> above).</li>
</ul>
<p><br/></p>
<h3 class="coral">Asynchronous Messages</h3>
<p>If you start to enjoy the idea of distributed systems, the next step is to dive into the world of <strong>true distributed systems based on asynchronous messages</strong>. Instead of making asynchronous network requests to a single node, messages are sent to the distributed system so that any node can take action if necessary. And because asynchronous messages are usually implemented on top of a reliable UDP protocol, you are able to build a truly distributed system that provides: parallelism (nodes can truly run in parallel); tight integration (all nodes see the same messages in the same order); decoupling (nodes can evolve independently); failover/redundancy (when a node fails, another one can be running and building state to take over immediately); scalability/load balancing (just add more nodes); elasticity (nodes can lag during activity peaks without affecting the system as a whole); and resiliency (nodes can fail or stop working without taking the whole system down). For more information about how an asynchronous messaging middleware works, you can check <a href="/index.php/2015/04/state-of-the-art-distributed-systems-with-coralmq/" target="_blank">CoralSequencer</a>.</p>
<p><a target="_blank" href="http://www.coralblocks.com/wp-content/uploads/2016/02/Screen-Shot-2017-08-21-at-12.38.13-PM.png"><img src="http://www.coralblocks.com/wp-content/uploads/2016/02/Screen-Shot-2017-08-21-at-12.38.13-PM-1024x561.png" alt="Screen Shot 2017-08-21 at 12.38.13 PM" width="600" height="328" class="aligncenter size-large wp-image-2176" /></a></p>
<p><br/></p>
<h3 class="coral">Conclusion</h3>
<p>Every system will eventually have to perform some kind of action that takes an arbitrary amount of time to complete. In the past, heavily multithreaded applications became very popular, but the <em>one-thread-per-request model</em> does not scale. By using concurrent queues you can build a multithreaded system without most of the multithreading complexity and, best of all, one that easily scales to thousands of simultaneous connections. There is also an alternative solution: distributed systems where, instead of using an in-memory concurrent queue to distribute work across threads, you use the network to distribute work across nodes, making asynchronous network calls to these nodes. The next architectural step is to use an asynchronous messaging middleware (MQ) instead of network requests, to design distributed systems that are not only easy to scale but also loosely coupled, providing parallelism, tight integration, failover, redundancy, load balancing, elasticity and resiliency.</p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/thread-concurrency-vs-network-asynchronicity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Architecture Case Study #1: CoralReactor + CoralQueue</title>
		<link>https://www.coralblocks.com/index.php/architecture-case-study-1-coralreactor-coralqueue/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=architecture-case-study-1-coralreactor-coralqueue</link>
		<comments>https://www.coralblocks.com/index.php/architecture-case-study-1-coralreactor-coralqueue/#comments</comments>
		<pubDate>Fri, 23 Jan 2015 00:14:20 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Consulting]]></category>
		<category><![CDATA[CoralQueue]]></category>
		<category><![CDATA[CoralReactor]]></category>
		<category><![CDATA[coralqueue]]></category>
		<category><![CDATA[coralreactor]]></category>
		<category><![CDATA[disruptor]]></category>
		<category><![CDATA[gc]]></category>
		<category><![CDATA[low-latency]]></category>
		<category><![CDATA[netty]]></category>
		<category><![CDATA[nio]]></category>
		<category><![CDATA[queue]]></category>
		<category><![CDATA[selector]]></category>
		<category><![CDATA[throughput]]></category>

		<guid isPermaLink="false">http://www.coralblocks.com/index.php/?p=780</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>You need a high-throughput application capable of handling thousands of client connections simultaneously, but some client requests might take a long time to process for whatever reason. How can that be done in an efficient way without impacting other connected clients and without leaving the application unresponsive to new client connections? <span id="more-780"></span></p>
<h3 class="coral">Solution</h3>
<p>To handle thousands of connections an application must use non-blocking sockets over a single <a href="http://docs.oracle.com/javase/7/docs/api/java/nio/channels/Selector.html" target="_blank">selector</a>, which means <font color="#26619b"><b>the same thread will handle thousands of connections simultaneously</b></font>. The problem is that if one of these connections lags for whatever reason, the other connections and the application as a whole must not be affected. In the past this problem was solved with the infamous <em>one-thread-per-client</em> approach, which does not scale and leads to all kinds of multithreading pitfalls like race conditions, visibility issues and deadlocks. By using <font color="#26619b"><b>one thread for the selector and a fixed number of threads for the heavy-duty work</b></font>, a system can solve this problem by distributing client work (and not client requests) among the heavy-duty threads without affecting the overall performance of the application. But how does this communication between the selector thread and the heavy-duty threads happen? Through CoralQueue demultiplexers and multiplexers.<br />
<br/></p>
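<p>CoralReactor hides this plumbing, but the mechanism underneath is standard Java NIO: a single thread multiplexing every connection over one <code>Selector</code>. Below is a minimal sketch of such an event loop using plain <code>java.nio</code> rather than the CoralReactor API (the class name and port are illustrative, not part of any Coral Blocks library):</p>

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SingleThreadSelectorLoop {

    // open a non-blocking server socket and register it with a fresh selector
    // (port 0 means "any free port")
    static Selector openReactor(int port) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.bind(new InetSocketAddress(port));
        server.register(selector, SelectionKey.OP_ACCEPT);
        return selector;
    }

    public static void main(String[] args) throws IOException {
        Selector selector = openReactor(45451);

        while (true) { // the single reactor thread services every connection
            selector.select();
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = ((ServerSocketChannel) key.channel()).accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    // read the request here and hand the heavy work off to
                    // another thread (e.g. through a demux) so this loop
                    // is never blocked by a slow client
                }
            }
        }
    }
}
```

<p>The key point is that nothing inside the loop may block: any work that can take an arbitrary amount of time must leave this thread.</p>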
<h3 class="coral">Diagram</h3>
<p><a href="http://www.coralblocks.com/wp-content/uploads/2015/02/arch1.jpg"><img src="http://www.coralblocks.com/wp-content/uploads/2015/02/arch1.jpg" alt="arch1" width="1024" height="768" class="alignnone size-full wp-image-872" /></a></p>
<h3 class="coral">Flow</h3>
<style>
.li_flow { margin: 0 0 4px 0; }
</style>
<ul style="padding: 0 40px">
<li class="li_flow">CoralReactor running on single thread pinned to an isolated cpu core with CoralThreads.</li>
<li class="li_flow">CoralReactor opens one or more servers listening on a local port. All servers are running on the same reactor thread.</li>
<li class="li_flow">A server can receive one or thousands of connections from many clients across the globe.</li>
<li class="li_flow">Each client sends requests with some work to be performed.</li>
<li class="li_flow">The server does not perform this work. Instead it passes a message describing the work to a heavy-duty thread using a CoralQueue demultiplexer.</li>
<li class="li_flow">The CoralQueue demux distributes the messages among the heavy-duty threads.</li>
<li class="li_flow">The heavy-duty threads are also pinned to an isolated cpu core with CoralThreads.</li>
<li class="li_flow">A heavy-duty thread executes the work and sends back a message with the results to the server using a CoralQueue multiplexer.</li>
<li class="li_flow">The server picks up the message from the CoralQueue mux and reports back the results to the client.</li>
</ul>
<h3 class="coral">FAQ</h3>
<style>
.li_faq { margin: 0 0 17px 0; }
</style>
<ol style="padding: 12px 40px">
<li class="li_faq"><font color="#26619b">Won&#8217;t you have to create garbage when passing messages back and forth among threads?</font><br />
<b>A:</b> No. CoralQueue is an ultra-low-latency, lock-free data structure for inter-thread communication that does not produce any garbage.
</li>
<li class="li_faq"><font color="#26619b">What happens if the queue gets full?</font><br />
<b>A:</b> A full queue will cause the reactor thread to block waiting for space, which adds latency. To avoid a full queue you can start by increasing the number of heavy-duty threads and/or increasing the size of the queue.
</li>
<li class="li_faq"><font color="#26619b">I did number 2 above but I am still getting a full queue. Now what?</font><br />
<b>A:</b> CoralQueue has a built-in feature to write messages to disk asynchronously when the queue becomes full, so the producer does not have to block waiting for space. The heavy-duty threads can then read messages from the queue file when they don&#8217;t find them in memory. This approach keeps the reactor thread undisturbed, but at this point it is probably also a good idea to make whatever work your heavy-duty threads are performing more efficient.
</li>
<li class="li_faq"><font color="#26619b">How many connections can the application handle?</font><br />
<b>A:</b> CoralReactor can easily handle 10k+ connections concurrently in a single thread. If your machine has additional cores, you can also add more reactor threads to increase this number even more.
</li>
<li class="li_faq"><font color="#26619b">How many heavy-duty threads should I have?</font><br />
<b>A:</b> That depends on the number of cpu cores available on your machine. A cpu core is a scarce resource, so you should allocate cores across your applications wisely. Creating more threads than the number of available cpu cores won&#8217;t bring any benefit and will actually degrade the performance of the system due to context switches. Ideally you should have a fixed number of heavy-duty threads, each pinned to its own isolated core so it is never interrupted.
</li>
</ol>
<h3 class="coral">Variations</h3>
<p>Instead of using one CoralQueue demultiplexer to randomly distribute messages across all heavy-duty threads, you can introduce the concept of <em>lanes</em>, with each lane having a <em>heaviness</em> number attached to it. For example, heavy tasks all go to lane 1, not-so-heavy tasks go to lane 2 and light tasks go to lane 3. The application then decides to which lane a message should be dispatched. If a lane is processed by a single heavy-duty thread, it can use a regular <em>one-producer-to-one-consumer</em> CoralQueue queue. If a lane is served by two or more heavy-duty threads, then it can use a CoralQueue demultiplexer. To report the results back to the server, all heavy-duty threads can continue to use a CoralQueue multiplexer.<br />
<br/></p>
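<p>As a rough illustration of the lane idea, the routing decision could look like the sketch below. Everything here is hypothetical: the <code>Lane</code> enum, the cost thresholds and the <code>laneFor()</code> method are illustrative and not part of the CoralQueue API.</p>

```java
public class LaneRouter {

    // hypothetical heaviness classification; the thresholds are arbitrary
    enum Lane { HEAVY, NOT_SO_HEAVY, LIGHT }

    // decide to which lane a request should be dispatched, based on an
    // estimated cost of the work it describes
    static Lane laneFor(long estimatedCostMicros) {
        if (estimatedCostMicros >= 1000) return Lane.HEAVY;        // lane 1
        if (estimatedCostMicros >= 100)  return Lane.NOT_SO_HEAVY; // lane 2
        return Lane.LIGHT;                                         // lane 3
    }

    public static void main(String[] args) {
        System.out.println(laneFor(5));    // prints LIGHT
        System.out.println(laneFor(250));  // prints NOT_SO_HEAVY
        System.out.println(laneFor(5000)); // prints HEAVY
    }
}
```

<p>Each lane would then map to its own queue (or demux), and the heavy-duty threads serving a lane poll only from that lane's queue.</p>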
<h3 class="coral">Code Example</h3>
<p>Below you can see a simple server illustrating the architecture described above. To keep it simple it receives a string (the request) and returns the string prepended by its length (the response). It supports many clients and distributes the work among worker threads using a demux. It then uses a mux to collect the results from the worker threads and respond to the appropriate client. In a more realistic scenario, the worker threads would be doing some heavier work, like accessing a database. You can easily test this server by connecting through a telnet client.</p>
<pre class="brush: java; title: ; notranslate">
package com.coralblocks.coralreactor.client.bench.queued;

import java.nio.ByteBuffer;

import com.coralblocks.coralbits.util.Builder;
import com.coralblocks.coralbits.util.ByteBufferUtils;
import com.coralblocks.coralqueue.demux.AtomicDemux;
import com.coralblocks.coralqueue.demux.Demux;
import com.coralblocks.coralqueue.mux.AtomicMux;
import com.coralblocks.coralqueue.mux.Mux;
import com.coralblocks.coralreactor.client.Client;
import com.coralblocks.coralreactor.nio.NioReactor;
import com.coralblocks.coralreactor.server.AbstractLineTcpServer;
import com.coralblocks.coralreactor.server.Server;
import com.coralblocks.coralreactor.util.Configuration;
import com.coralblocks.coralreactor.util.MapConfiguration;

public class QueuedTcpServer extends AbstractLineTcpServer {
	
	static class WorkerRequestMessage {
		
		long clientId;
		ByteBuffer buffer;
		
		WorkerRequestMessage(int maxRequestLength) {
			this.clientId = -1;
			this.buffer = ByteBuffer.allocateDirect(maxRequestLength);
		}
		
		void readFrom(ByteBuffer src) {
			buffer.clear();
			buffer.put(src);
			buffer.flip();
		}
	}
	
	static class WorkerResponseMessage {
		
		long clientId;
		ByteBuffer buffer;
		
		WorkerResponseMessage(int maxResponseLength) {
			this.clientId = -1;
			this.buffer = ByteBuffer.allocateDirect(maxResponseLength);
		}
	}
	
	private final int numberOfWorkerThreads;
	private final Demux&lt;WorkerRequestMessage&gt; demux;
	private final Mux&lt;WorkerResponseMessage&gt; mux;
	private final WorkerThread[] workerThreads;

	public QueuedTcpServer(NioReactor nio, int port, Configuration config) {
	    super(nio, port, config);
	    this.numberOfWorkerThreads = config.getInt(&quot;numberOfWorkerThreads&quot;);
	    final int maxRequestLength = config.getInt(&quot;maxRequestLength&quot;, 256);
	    final int maxResponseLength = config.getInt(&quot;maxResponseLength&quot;, 256);
	    
	    Builder&lt;WorkerRequestMessage&gt; requestBuilder = new Builder&lt;WorkerRequestMessage&gt;() {
			@Override
            public WorkerRequestMessage newInstance() {
	            return new WorkerRequestMessage(maxRequestLength);
            }
	    };
	    
	    this.demux = new AtomicDemux&lt;WorkerRequestMessage&gt;(1024, requestBuilder, numberOfWorkerThreads);
	    
	    Builder&lt;WorkerResponseMessage&gt; responseBuilder = new Builder&lt;WorkerResponseMessage&gt;() {
	    	@Override
            public WorkerResponseMessage newInstance() {
	            return new WorkerResponseMessage(maxResponseLength);
            }
	    };
	    
	    this.mux = new AtomicMux&lt;WorkerResponseMessage&gt;(1024, responseBuilder, numberOfWorkerThreads);
	    
	    this.workerThreads = new WorkerThread[numberOfWorkerThreads];
    }
	
	@Override
	public void open() {
		
		for(int i = 0; i &lt; numberOfWorkerThreads; i++) {
			if (workerThreads[i] != null) {
				try {
					// make sure it is dead!
					workerThreads[i].stopMe();
					workerThreads[i].join();
				} catch(Exception e) {
					throw new RuntimeException(e);
				}
			}
		}
		
		mux.clear();
		demux.clear();
			
		for(int i = 0; i &lt; numberOfWorkerThreads; i++) {
			workerThreads[i] = new WorkerThread(i);
			workerThreads[i].start();
		}
		
		nio.addCallback(this); // we want to constantly receive callbacks from 
							   // reactor thread on handleCallback() to drain responses from mux
		
		super.open();
	}
	
	@Override
	public void close() {
		
		for(int i = 0; i &lt; numberOfWorkerThreads; i++) {
			if (workerThreads[i] != null) {
				workerThreads[i].stopMe();
			}
		}
		
		nio.removeCallback(this);
		
		super.close();
	}
	
	@Override
	protected void handleMessage(Client client, ByteBuffer msg) {
		
		if (ByteBufferUtils.equals(msg, &quot;bye&quot;) || ByteBufferUtils.equals(msg, &quot;exit&quot;)) {
			client.close();
			return;
		}
		
		// on a new message, dispatch to the demux so worker threads can process it:
		
		WorkerRequestMessage req;
		
		while((req = demux.nextToDispatch()) == null); // busy spin...
		
		req.clientId = getClientId(client);
		req.readFrom(msg);
		
		demux.flush();
	}
	
	class WorkerThread extends Thread {
		
		private final int index;
		private volatile boolean running = true;
		
		public WorkerThread(int index) {
			super(&quot;WorkerThread-&quot; + index);
			this.index = index;
		}
		
		public void stopMe() {
			running = false;
		}
		
		@Override
        public void run() {
            
			while(running) {
			
    			// read from demux and process:
    			
    			long avail = demux.availableToPoll(index);
    			
    			if (avail &gt; 0) {
    				
    				for(int i = 0; i &lt; avail; i++) {
    					
    					// get the request:
    					WorkerRequestMessage req = demux.poll(index);
    					
    					// do something heavy with the request, like accessing database or big data...
    					// for our example we just prepend the message length
    					
    					long clientId = req.clientId;
    					int msgLen = req.buffer.remaining();
    					
    					// get a response object from mux:
    
    					WorkerResponseMessage res = null;
    					
    					while((res = mux.nextToDispatch(index)) == null); // busy spin
    
    					// notice below that we are just copying data from request to response:
    					res.clientId = clientId; // copy clientId
    					res.buffer.clear();
    					ByteBufferUtils.appendInt(res.buffer, msgLen);
    					res.buffer.put((byte) ':');
    					res.buffer.put((byte) ' ');
    					res.buffer.put(req.buffer); // copy buffer contents
    					res.buffer.flip(); // don't  forget
    				}
    				
					mux.flush(index);
    				demux.donePolling(index);
    				nio.wakeup(); // don't forget so handleCallback is called
    			}
			}
        }
	}
	
	@Override
	protected void handleCallback(long nowInMillis) {
		
		// this is the reactor thread calling us back to check whether the mux has pending results:
		
		long avail = mux.availableToPoll();
		
		if (avail &gt; 0) {
			
			for(long i = 0; i &lt; avail; i++) {
				
				WorkerResponseMessage res = mux.poll();
				
				Client client = getClient(res.clientId);
				
				if (client != null) { // client might have disconnected...
					client.send(res.buffer);
				}
			}
			
			mux.donePolling();
		}
	}
	
	public static void main(String[] args) {
		
		NioReactor nio = NioReactor.create();
		MapConfiguration config = new MapConfiguration();
		config.add(&quot;numberOfWorkerThreads&quot;, 4);
		Server server = new QueuedTcpServer(nio, 45451, config);
		server.open();
		nio.start();
		
	}
	
}
</pre>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/architecture-case-study-1-coralreactor-coralqueue/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CoralQueue Throughput Test Explained</title>
		<link>https://www.coralblocks.com/index.php/coralqueue-throughput-test-explained/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=coralqueue-throughput-test-explained</link>
		<comments>https://www.coralblocks.com/index.php/coralqueue-throughput-test-explained/#comments</comments>
		<pubDate>Mon, 16 Jun 2014 21:39:21 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralQueue]]></category>
		<category><![CDATA[AtomicLong]]></category>
		<category><![CDATA[corallog]]></category>
		<category><![CDATA[disruptor]]></category>
		<category><![CDATA[inter-thread]]></category>
		<category><![CDATA[lazySet]]></category>
		<category><![CDATA[queue]]></category>
		<category><![CDATA[semi-volatile write]]></category>
		<category><![CDATA[throughput]]></category>
		<category><![CDATA[volatile]]></category>

		<guid isPermaLink="false">http://cb.soliveirajr.com/index.php/?p=426</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In this article we will present the benchmark test used by CoralQueue that shows a throughput between 55 and 65 million messages per second without hyper-threading and between 85 and 95 million messages per second with hyper-threading. If you are interested in the CoralQueue Getting Started article, you can check it <a href="/index.php/2014/06/getting-started-with-coralqueue/" target="_blank">here</a> first. <span id="more-426"></span></p>
<h2 class="coral">Test Mechanics</h2>
<p>To calculate throughput we run 20 different passes of 10 million messages each. The average time of these 20 passes is then calculated to arrive at the average ops (<em>operations per second</em>) number. Note that a pass only ends after the consumer has received all messages sent by the producer. In order to receive feedback from the consumer, the producer uses an <code>AtomicInteger</code> so it can be notified when the consumer has processed all the messages. The producer then proceeds to the next pass.</p>
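<p>The arithmetic behind the reported rate is simple averaging. The sketch below uses made-up pass times, not actual benchmark output; the class and method names are illustrative:</p>

```java
public class ThroughputMath {

    // messages per second from an average pass time in nanos
    static long messagesPerSecond(long messagesPerPass, long avgPassTimeNanos) {
        return messagesPerPass * 1_000_000_000L / avgPassTimeNanos;
    }

    // average of the measured passes, skipping the warmup passes
    static long averageNanos(long[] passTimesNanos, int warmupPasses) {
        long sum = 0;
        for (int i = warmupPasses; i < passTimesNanos.length; i++) {
            sum += passTimesNanos[i];
        }
        return sum / (passTimesNanos.length - warmupPasses);
    }

    public static void main(String[] args) {
        long[] passes = { 200_000_000L, 190_000_000L, // warmup, ignored
                          160_000_000L, 170_000_000L, 165_000_000L };
        long avg = averageNanos(passes, 2); // 165,000,000 nanos
        System.out.println(messagesPerSecond(10_000_000L, avg)); // prints 60606060
    }
}
```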
<p>The test flow is described below:</p>
<ul>
<li>The producer sends 10 million messages to the consumer through the queue. Once it is done sending, it blocks on the AtomicInteger waiting for an acknowledgment from the consumer that it has received and processed all the messages.</li>
<li>The producer proceeds to the next pass and the cycle repeats.</li>
<li>Once the producer has executed 20 passes it sends a final message to the consumer to signal that we are done and the consumer can now die.</li>
<li>The results are then presented.</li>
<li>Note that we ignore the first 4 passes as warmup passes.</li>
<li>Note that the <em>pass total time</em> is calculated in the consumer side when it receives and processes the last message of the pass.</li>
</ul>
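<p>Stripped of the queue itself, the per-pass handshake between producer and consumer can be sketched with just the <code>AtomicInteger</code>. The queue traffic is elided and the class and method names are illustrative; the full benchmark with the real queue follows below.</p>

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PassHandshake {

    // runs the producer/consumer pass handshake and returns how many
    // passes the consumer acknowledged
    static int runPasses(int passes) throws InterruptedException {
        final AtomicInteger countdown = new AtomicInteger();
        final AtomicInteger acked = new AtomicInteger();

        Thread consumer = new Thread(() -> {
            for (int i = 0; i < passes; i++) {
                while (countdown.get() == 0); // busy spin: wait for pass to start
                // ...receive and process all messages of the pass here...
                acked.incrementAndGet();
                countdown.decrementAndGet();  // ack: pass fully processed
            }
        });
        consumer.start();

        for (int i = 0; i < passes; i++) { // producer side
            countdown.set(1);               // start a new pass
            // ...send all messages of the pass here...
            while (countdown.get() != 0);   // block until the consumer acks
        }

        consumer.join();
        return acked.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runPasses(3)); // prints 3
    }
}
```

<p>Because the producer only starts a pass after seeing the counter reach zero, and the consumer only decrements after seeing it non-zero, the two threads alternate strictly and no pass can be skipped.</p>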
<h2 class="coral">Test Source Code</h2>
<pre class="brush: java; title: ; notranslate">
package com.coralblocks.coralqueue.bench;

import java.util.concurrent.atomic.AtomicInteger;

import com.coralblocks.coralbits.MutableLong;
import com.coralblocks.coralbits.bench.Benchmarker;
import com.coralblocks.coralbits.util.SystemUtils;
import com.coralblocks.coralqueue.AtomicQueue;
import com.coralblocks.coralqueue.Queue;
import com.coralblocks.coralqueue.waitstrategy.ParkWaitStrategy;
import com.coralblocks.coralqueue.waitstrategy.WaitStrategy;
import com.coralblocks.coralthreads.Affinity;

public class Throughput {
	
	public static void main(String[] args) throws InterruptedException {
		
		final int messagesToSend = 10000000;
		final int queueSize = 1024;
		final int warmupPasses = 4;
		final int passes = 20;
		
		final int prodProcToBind = SystemUtils.getInt(&quot;producerProcToBind&quot;, -1);
		final int consProcToBind = SystemUtils.getInt(&quot;consumerProcToBind&quot;, -1);
		final boolean flushLazySet = SystemUtils.getBoolean(&quot;flushLazySet&quot;, true);
		final boolean donePollingLazySet = SystemUtils.getBoolean(&quot;donePollingLazySet&quot;, false);
		
		final Queue&lt;MutableLong&gt; queue = new AtomicQueue&lt;MutableLong&gt;(queueSize, MutableLong.class);
		
		final AtomicInteger countdown = new AtomicInteger();
		
		final WaitStrategy waitStrategy = new ParkWaitStrategy(true); // true =&gt; back off
		
		final Benchmarker bench = Benchmarker.create(warmupPasses);		
		
		Thread producer = new Thread(new Runnable() {
			
			@Override
			public void run() {
				
				Affinity.bind();
				
				MutableLong ml = null;
				
				for(int i = 0; i &lt; passes + warmupPasses; i++) {

					long count = 0;
					
					countdown.set(1);
					
					bench.mark();
				
    				while(count &lt; messagesToSend) {
    					while((ml = queue.nextToDispatch()) == null); // busy spin
    					ml.set(count++);
    					// we are not batching here so flush() is called many times...
    					// therefore it is much better to use lazySet here...
    					// change it to false and you will see the difference
    					queue.flush(flushLazySet);
    				}
    				
    				while(countdown.get() != 0) { // wait for consumer to finish...
    					waitStrategy.block();
    				}
    				
    				waitStrategy.reset();
				}
				
				// send the very last message signaling that we are done!
				while((ml = queue.nextToDispatch()) == null);
				ml.set(-1);
				queue.flush();
				
				Affinity.unbind();
				
				System.out.println(&quot;producer exiting...&quot;);				
			}
		}, &quot;Producer&quot;);
		
		Thread consumer = new Thread(new Runnable() {

			@Override
			public void run() {
				
				Affinity.bind();
				
				boolean running = true;
				
				int pass = 0;
				
				while (running) {
					
					long avail;
					while((avail = queue.availableToPoll()) == 0); // busy spin
					for(int i = 0; i &lt; avail; i++) {
						MutableLong ml = queue.poll();
						long x = ml.get();
						if (x == -1) {
							// the last message sent by the producer to indicate that we should die
							running = false;
						} else if (x == messagesToSend - 1) {
							// the last message of a pass... print some results and notify the producer...
							long t = bench.measure();
							System.out.println(&quot;Pass &quot; + pass + &quot;... &quot; + (pass &lt; warmupPasses ? &quot;(warmup)&quot; : &quot;(&quot; + Benchmarker.convertNanoTime(t) + &quot;)&quot;));
							pass++;

							countdown.decrementAndGet(); // let the producer know!
						}
					}
					// we are batching in the consumer side so let the producer
					// know asap that it can send more messages to the queue
					// therefore we do NOT use lazySet here...
					// using lazySet here decreases throughput but not much
					// change it to see the difference
					queue.donePolling(donePollingLazySet);
				}
				
				Affinity.unbind();
				
				System.out.println(&quot;consumer exiting...&quot;);
			}
		}, &quot;Consumer&quot;);
		
		if (Affinity.isAvailable()) {
			Affinity.assignToProcessor(prodProcToBind, producer);
			Affinity.assignToProcessor(consProcToBind, consumer);
		} else {
			System.err.println(&quot;Thread affinity not available!&quot;);
		}
		
		consumer.start();
		producer.start();
		
		consumer.join();
		producer.join();
		
		long time = Math.round(bench.getAverage());
		
		long mps = messagesToSend * 1000000000L / time;
		
		System.out.println(&quot;Results: &quot; + bench.results());
		System.out.println(&quot;Average time to send &quot; + messagesToSend + &quot; messages per pass in &quot; + passes + &quot; passes: &quot; + time + &quot; nanos&quot;);
		System.out.println(&quot;Messages per second: &quot; + mps);
	}
}
</pre>
<h2 class="coral">Test Results</h2>
<p>The machine used to run the benchmark tests was an Intel i7 quad-core (4 x 3.50GHz) Ubuntu box overclocked to 4.50GHz.</p>
<p><u>Results <em>without</em> hyper-threading:</u></p>
<pre>
$ java -server -verbose:gc -cp target/coralqueue-all.jar -Xms2g -Xmx8g -XX:NewSize=512m -XX:MaxNewSize=1024m -DproducerProcToBind=2 -DconsumerProcToBind=3 -DexcludeNanoTimeCost=true  com.coralblocks.coralqueue.bench.Throughput
Pass 0... (warmup)
Pass 1... (warmup)
Pass 2... (warmup)
Pass 3... (warmup)
Pass 4... (168.334 millis)
Pass 5... (168.546 millis)
Pass 6... (165.904 millis)
Pass 7... (168.469 millis)
Pass 8... (158.65 millis)
Pass 9... (166.946 millis)
Pass 10... (168.114 millis)
Pass 11... (160.557 millis)
Pass 12... (163.021 millis)
Pass 13... (168.204 millis)
Pass 14... (164.229 millis)
Pass 15... (168.085 millis)
Pass 16... (164.91 millis)
Pass 17... (165.532 millis)
Pass 18... (166.758 millis)
Pass 19... (164.743 millis)
Pass 20... (163.74 millis)
Pass 21... (164.291 millis)
Pass 22... (165.269 millis)
Pass 23... (158.166 millis)
producer exiting...
consumer exiting...
Results: Iterations: 20 | Avg Time: 165.123 millis | Min Time: 158.166 millis | Max Time: 168.546 millis | Nano Timing Cost: 16.0 nanos
Average time to send 10000000 messages per pass in 20 passes: 165123448 nanos
Messages per second: <font color="blue"><b>60,560,750</b></font>
</pre>
<p><u>Results <em>with</em> hyper-threading:</u></p>
<pre>
$ java -server -verbose:gc -cp target/coralqueue-all.jar -Xms2g -Xmx8g -XX:NewSize=512m -XX:MaxNewSize=1024m -DproducerProcToBind=2 -DconsumerProcToBind=6 -DexcludeNanoTimeCost=true   com.coralblocks.coralqueue.bench.Throughput
Pass 0... (warmup)
Pass 1... (warmup)
Pass 2... (warmup)
Pass 3... (warmup)
Pass 4... (110.678 millis)
Pass 5... (110.653 millis)
Pass 6... (110.82 millis)
Pass 7... (110.67 millis)
Pass 8... (110.612 millis)
Pass 9... (110.648 millis)
Pass 10... (110.668 millis)
Pass 11... (110.727 millis)
Pass 12... (110.643 millis)
Pass 13... (110.69 millis)
Pass 14... (110.594 millis)
Pass 15... (110.654 millis)
Pass 16... (110.672 millis)
Pass 17... (110.776 millis)
Pass 18... (110.647 millis)
Pass 19... (110.724 millis)
Pass 20... (110.764 millis)
Pass 21... (110.677 millis)
Pass 22... (110.753 millis)
Pass 23... (110.645 millis)
producer exiting...
consumer exiting...
Results: Iterations: 20 | Avg Time: 110.686 millis | Min Time: 110.594 millis | Max Time: 110.82 millis | Nano Timing Cost: 14.0 nanos
Average time to send 10000000 messages per pass in 20 passes: 110685691 nanos
Messages per second: <font color="blue"><b>90,345,914</b></font>
</pre>
<h2 class="coral">Conclusion</h2>
<p>CoralQueue can send up to 65 million messages per second without hyper-threading and up to 95 million messages per second with hyper-threading.</p>
<p><br/><br/></p>
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/coralqueue-throughput-test-explained/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CoralQueue Performance Numbers</title>
		<link>https://www.coralblocks.com/index.php/coralqueue-performance-numbers/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=coralqueue-performance-numbers</link>
		<comments>https://www.coralblocks.com/index.php/coralqueue-performance-numbers/#comments</comments>
		<pubDate>Mon, 21 Apr 2014 19:22:41 +0000</pubDate>
		<dc:creator><![CDATA[cb]]></dc:creator>
				<category><![CDATA[CoralQueue]]></category>
		<category><![CDATA[benchmark]]></category>
		<category><![CDATA[consumer]]></category>
		<category><![CDATA[latency]]></category>
		<category><![CDATA[producer]]></category>
		<category><![CDATA[throughput]]></category>

		<guid isPermaLink="false">http://coralblocks.soliveirajr.com/index.php/?p=138</guid>
		<description><![CDATA[ [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In this article we present CoralQueue performance numbers for four different scenarios: message-sender latency, message transit latency, message-sender throughput and message transit throughput. The standard scenario of one producer (message-sender) and one consumer (message-receiver) is used with two possible setups: producer and consumer pinned to the same core (hyper-threading) and producer and consumer pinned to different cores (no hyper-threading). <span id="more-138"></span></p>
<h2 class="coral">Message-sender Latencies</h2>
<p>In this test we measure the time it takes for the message-sender (i.e. the producer) to get rid of the message it has to send. In other words, we don&#8217;t care about the time it takes for the message to reach the consumer, just the time it takes for the producer to dispatch it. We use the <code>AtomicLong</code> lazySet operation to reduce the message-sender latency even further, even though that increases the message-transit latency. The benchmark source code can be <a href="/index.php/coralqueue-message-sender-latency/" target="_blank">seen here</a>.</p>
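<p>The lazySet operation is a standard feature of <code>java.util.concurrent.atomic</code>: it performs an ordered store without the full StoreLoad barrier of a volatile write, so the writer does not stall, at the cost of the write becoming visible to other threads slightly later. A minimal illustration (the class and method names are illustrative):</p>

```java
import java.util.concurrent.atomic.AtomicLong;

public class LazySetExample {

    static long demo() {
        AtomicLong sequence = new AtomicLong(0);

        // volatile write: full barrier, immediately visible to other
        // threads, but the writer pays for the StoreLoad fence
        sequence.set(1);

        // semi-volatile write: ordered store without the StoreLoad fence;
        // cheaper for the writer, visible to readers "eventually" (in
        // practice within nanoseconds) -- this is the trade-off a
        // lazySet-based flush makes when dispatch latency matters most
        sequence.lazySet(2);

        // the writing thread itself always observes its own writes
        return sequence.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 2
    }
}
```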
<ul>
<li>With hyper-threading:
<pre>
Messages: 1,100,000
Avg Time: <font color="blue"><b>14.74 nanos</b></font>
Min Time: 7.0 nanos
Max Time: 5.331 micros
75% = [avg: 13.0 nanos, max: 15.0 nanos]
90% = [avg: 13.0 nanos, max: 16.0 nanos]
99% = [avg: 13.0 nanos, max: 29.0 nanos]
99.9% = [avg: 14.0 nanos, max: 304.0 nanos]
99.99% = [avg: 14.0 nanos, max: 366.0 nanos]
99.999% = [avg: 14.0 nanos, max: 610.0 nanos]
</pre>
</li>
<li>Without hyper-threading:
<pre>
Messages: 1,100,000
Avg Time: <font color="blue"><b>29.54 nanos</b></font>
Min Time: 6.0 nanos
Max Time: 4.963 micros
75% = [avg: 26.0 nanos, max: 28.0 nanos]
90% = [avg: 26.0 nanos, max: 29.0 nanos]
99% = [avg: 28.0 nanos, max: 132.0 nanos]
99.9% = [avg: 29.0 nanos, max: 226.0 nanos]
99.99% = [avg: 29.0 nanos, max: 287.0 nanos]
99.999% = [avg: 29.0 nanos, max: 1.0 micros]
</pre>
</li>
</ul>
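<p>The lazySet trade-off described above can be sketched as follows. This is a minimal illustration using the JDK&#8217;s <code>AtomicLong</code>, not CoralQueue&#8217;s actual producer code:</p>

```java
import java.util.concurrent.atomic.AtomicLong;

public class LazySetSketch {
    // The producer advances this sequence to publish new messages.
    static final AtomicLong offerSequence = new AtomicLong(-1);

    public static void main(String[] args) {
        long seq = offerSequence.get() + 1;

        // set() is a volatile write: a full memory barrier that makes the
        // value visible to the consumer immediately, at a higher cost to
        // the producer.
        offerSequence.set(seq);

        // lazySet() is a semi-volatile write (store-store barrier only):
        // cheaper for the producer, but the consumer may observe the new
        // value slightly later.
        offerSequence.lazySet(seq + 1);

        // Within the same thread, program order still guarantees that we
        // read the value we just wrote.
        System.out.println(offerSequence.get()); // prints 1
    }
}
```

<p>This is why lazySet is used in the message-sender benchmarks (where only the producer&#8217;s cost matters) but not in the transit benchmarks (where consumer visibility matters).</p>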
<h2 class="coral">Message Transit Latencies</h2>
<p>In this test we measure the total transit time of a message, in other words, the time between the message being dispatched by the producer and received by the consumer. The <code>AtomicLong</code> lazySet operation is not used because we want to notify the consumer of new messages as soon as possible. The benchmark source code can be <a href="/index.php/coralqueue-message-transit-latency/" target="_blank">seen here</a>.</p>
<ul>
<li>With hyper-threading:
<pre>
Messages: 10,000,000
Avg Time: <font color="blue"><b>52.97 nanos</b></font>
Min Time: 32.0 nanos
Max Time: 9.052 micros
75% = [avg: 51.0 nanos, max: 56.0 nanos]
90% = [avg: 52.0 nanos, max: 58.0 nanos]
99% = [avg: 52.0 nanos, max: 61.0 nanos]
99.9% = [avg: 52.0 nanos, max: 66.0 nanos]
99.99% = [avg: 52.0 nanos, max: 287.0 nanos]
99.999% = [avg: 52.0 nanos, max: 1.27 micros]
</pre>
</li>
<li>Without hyper-threading:
<pre>
Messages: 10,000,000
Avg Time: <font color="blue"><b>88.18 nanos</b></font>
Min Time: 64.0 nanos
Max Time: 5.961 micros
75% = [avg: 84.0 nanos, max: 94.0 nanos]
90% = [avg: 86.0 nanos, max: 98.0 nanos]
99% = [avg: 87.0 nanos, max: 109.0 nanos]
99.9% = [avg: 88.0 nanos, max: 134.0 nanos]
99.99% = [avg: 88.0 nanos, max: 236.0 nanos]
99.999% = [avg: 88.0 nanos, max: 1.198 micros]
</pre>
</li>
</ul>
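<p>The transit-time measurement can be sketched as below: a minimal two-thread illustration that hands a single <code>System.nanoTime()</code> timestamp through an <code>AtomicLong</code> slot. The slot is a stand-in for the queue, not CoralQueue&#8217;s API:</p>

```java
import java.util.concurrent.atomic.AtomicLong;

public class TransitLatencySketch {
    // One-slot "queue": 0 means empty; the producer stores the send
    // timestamp here and the consumer busy-spins until it appears.
    static final AtomicLong slot = new AtomicLong(0);

    public static void main(String[] args) throws InterruptedException {
        Thread consumer = new Thread(() -> {
            long sentAt;
            while ((sentAt = slot.get()) == 0) { /* busy-spin wait strategy */ }
            long transitNanos = System.nanoTime() - sentAt;
            System.out.println("transit: " + transitNanos + " nanos");
        });
        consumer.start();

        // Plain volatile set (not lazySet): here we want the consumer to
        // see the new message as soon as possible.
        slot.set(System.nanoTime());
        consumer.join();
    }
}
```

<p><code>System.nanoTime()</code> is monotonic within a single JVM, so the difference taken on the consumer side is a valid (non-negative) transit time.</p>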
<h2 class="coral">Message-sender Throughput</h2>
<p>In this test we measure how long the message-sender thread (i.e. the producer) takes to dispatch a large batch of messages, and from that we calculate the messages-per-second rate. We make 20 passes and take the average. Again the <code>AtomicLong</code> lazySet operation is used to reduce latency on the producer side. The benchmark source code can be <a href="/index.php/coralqueue-message-sender-throughput/" target="_blank">seen here</a>.</p>
<ul>
<li>With hyper-threading:
<pre>
Passes: 20 | Avg Time: 102.703 millis | Min Time: 102.32 millis | Max Time: 109.105 millis
Average time to send 10,000,000 messages: 102,702,560 nanos
Messages per second: <font color="blue"><b>97,368,556</b></font>
</pre>
</li>
<li>Without hyper-threading:
<pre>
Passes: 20 | Avg Time: 139.555 millis | Min Time: 137.649 millis | Max Time: 145.542 millis
Average time to send 10,000,000 messages: 139,555,424 nanos
Messages per second: <font color="blue"><b>71,656,118</b></font>
</pre>
</li>
</ul>
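<p>The messages-per-second figure follows directly from the average pass time. For the hyper-threaded run above:</p>

```java
public class ThroughputRate {
    public static void main(String[] args) {
        long messages = 10_000_000L;
        long avgNanos = 102_702_560L;      // average time for one pass

        // rate = messages / seconds = messages * 1e9 / nanos
        // (10_000_000 * 1_000_000_000 = 1e16, well within long range)
        long perSecond = messages * 1_000_000_000L / avgNanos;

        System.out.println(perSecond); // prints 97368556
    }
}
```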
<h2 class="coral">Message Transit Throughput</h2>
<p>In this test we measure the maximum number of messages that can travel from producer to consumer in one second. We make 20 passes and take the average. The <code>AtomicLong</code> lazySet operation is not used because we want to notify the consumer of new messages as soon as possible. The benchmark source code can be <a href="/index.php/coralqueue-message-transit-throughput/" target="_blank">seen here</a>.</p>
<ul>
<li>With hyper-threading:
<pre>
Passes: 20 | Avg Time: 117.343 millis | Min Time: 117.273 millis | Max Time: 117.435 millis
Average time to send 10,000,000 messages: 117,342,956 nanos
Messages per second: <font color="blue"><b>85,220,283</b></font>
</pre>
</li>
<li>Without hyper-threading:
<pre>
Passes: 20 | Avg Time: 234.781 millis | Min Time: 233.626 millis | Max Time: 236.014 millis
Average time to send 10,000,000 messages: 234,781,092 nanos
Messages per second: <font color="blue"><b>42,592,867</b></font>
</pre>
</li>
</ul>
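<p>The shape of one measurement pass can be sketched as below. This uses the JDK&#8217;s <code>ArrayBlockingQueue</code> purely as a stand-in for the queue under test, so the number it prints says nothing about CoralQueue itself:</p>

```java
import java.util.concurrent.ArrayBlockingQueue;

public class TransitThroughputSketch {
    public static void main(String[] args) throws InterruptedException {
        final int MESSAGES = 1_000_000;
        ArrayBlockingQueue<Long> queue = new ArrayBlockingQueue<>(1024);

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < MESSAGES; i++) queue.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        long start = System.nanoTime();
        consumer.start();
        for (long i = 0; i < MESSAGES; i++) queue.put(i);
        consumer.join(); // the pass ends when the last message is received
        long elapsed = System.nanoTime() - start;

        // In the real benchmark this pass is repeated 20 times and the
        // pass times are averaged before computing the rate.
        long perSecond = (long) MESSAGES * 1_000_000_000L / elapsed;
        System.out.println("messages per second: " + perSecond);
    }
}
```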
]]></content:encoded>
			<wfw:commentRss>https://www.coralblocks.com/index.php/coralqueue-performance-numbers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
