When parallelism beats concurrency

I’m aware that for many of you these two concepts probably mean the same or maybe you’d struggle to explain the differences between them; however, they’re actually two very different concepts and this is quite important to understand the way in which we process data nowadays.

In both cases we try to solve a problem faster by increasing the number of workers dedicated to a given task, but the way in which the distribute the work to do is where their biggest difference stands.

You’ll be able to see why it’s so important to understand their differences for an efficient, safe and error-free processing of data.

Let’s start by explaining these concepts!

Introduction

To start with, let’s take a brief look at what we should be understanding as concurrency and parallelism.

Concurrency

In layman’s terms, concurrency is the situation where in order to solve a problem, we process it in a way that one single task gets processed concurrently by multiple workers; that said, let’s imagine a big array where we have multiple workers and each worker does some work on the next element to be processed in the array until we reach the end of the array.

concurrency

When concurrency takes place, some synchronisation is required in order to access the resource that gets shared among all the existing workers (in our example this is the array).
The complexity and performance overhead involved in this approach could be very significant in some cases; we’ll try to demonstrate this later on in this article.

Parallelism

On the other hand, parallelism is the situation where in order to solve a problem we decide to take a “Divide and Conquer” approach and split the problem in multiple small problems; that allows us to solve multiple smaller problems in parallel.

How would it look like using the same example we’ve shown above? Let’s imagine that we have four parallel workers, then a parallel solution would be as shown below:

parallelism

As you can see, the difference is substantial; we have now four smaller tasks that can be solved independently. Knowing that, we can affirm that the elements get processed sequentially by each worker! When each worker has completed, we can then combine their results to produce one single and final result.
The main benefit of this is that we don’t need to synchronise the access of the workers to a shared resource; now each worker has its independent chunk of work to process.

I hope the difference between these two is clear but what’s left? We have one more case, which is quite obvious: sequential processing.

In a standard sequential process we’d be processing our task in sequential order with a single worker, this is the more common approach and in some cases it could surprisingly be the fastest!

If you need to have a better understanding about concurrency and multi-threading and why we do things in the way we do them nowadays, I’d recommend that you read the book “Java Concurrency in Practice” by Brian Goetz.

How can we solve it?

If we compare these three approaches in terms of performance, we can almost guarantee that in most of the cases the concurrent approach will be the less performant by far. Why is that? As we mentioned before, synchronisation between threads to access shared resources causes a contention which affects performance hugely.

Let’s go through an example to understand the reasons of this performance impact!

Concurrent approach

The problem we’ll be showing is the addition of all the elements in a collection of integers. In the concurrent approach we’ll be using several threads to add each pair of elements to try to speed things up; so how does it look like?

First of all, we are going to create an interface to implement the solution following different approaches.

public interface Processor {
Integer process(List<Integer> input) throws InterruptedException;
}
view raw Processor.java hosted with ❤ by GitHub

Having that, our concurrent implementation would be:

package com.theboreddev.concurrency;
import com.theboreddev.Processor;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAccumulator;
import java.util.stream.IntStream;
public class ConcurrentProcessor implements Processor {
private static final int THREADS = 4;
private final LongAccumulator result = new LongAccumulator(Long::sum, 0L);
private final AtomicInteger position = new AtomicInteger(0);
private final ExecutorService executor = Executors.newFixedThreadPool(THREADS);
@Override
public Integer process(List<Integer> input) throws InterruptedException {
processArrayConcurrently(input.toArray(new Integer[input.size()]));
return result.intValue();
}
private void processArrayConcurrently(Integer[] array) throws InterruptedException {
final Runnable runnable = () -> {
while (position.intValue() < array.length) {
addElements(array);
}
};
IntStream.rangeClosed(1, THREADS)
.forEach(threadNumber -> executor.submit(runnable));
executor.awaitTermination(1, TimeUnit.SECONDS);
}
private void addElements(Integer[] array) {
int current = position.getAndAdd(2);
final int sum = array[current] + array[current + 1];
result.accumulate(sum);
}
}

What has been done is basically create a fixed number of threads where each thread will be adding a pair of elements in the array each time until the end of the array is reached.

You will notice that we’ve made use of two important Java classes available in the java.util.concurrent.atomic package; we’re talking about AtomicInteger and LongAccumulator.
These two classes are very helpful to synchronise the access to the current position and the final result; however, they could also have a considerable impact over the performance of a solution. Let’s see why!

Atomic variables internals

First of all, why have we decided to use AtomicInteger to hold our current position?
Well, AtomicInteger is a good fit for those situations where a variable is going to be modified by multiple threads but more importantly, this variable is going to be read very frequently.

AtomicInteger uses a combination of “compare and swap” and volatile variables.
Compare and swap” is an improvement with respect to the use of “synchronized” blocks; however, if a thread tries to set a value that has changed, its write operation will fail and it’ll have to retry.
On top of that, the use of volatile implies that in many cases when one thread access that variable, its value has to be retrieved from main memory; this is a considerable performance impact, as access to main memory is much more expensive than access to cache levels.

cpu cache

What about LongAccumulator? This type of class is useful when several threads write to a variable but it never gets read until the operation completes. How does it work? LongAccumulator cleverly utilises a non-blocking approach by allowing multiple writes in parallel to individual counters and then they all get merged at the end of the process. Something like this:

So after understanding how our solution works, we could affirm that the main bottleneck in this implementation would be the synchronisation of the current position in the array among the different threads.

Let’s take a look now at our sequential implementation!

Sequential approach

The sequential approach is quite straightforward, we just process each element sequentially using a single thread. Let’s see how it looks like:

package com.theboreddev.sequential;
import com.theboreddev.Processor;
import java.util.List;
public class SequentialProcessor implements Processor {
@Override
public Integer process(List<Integer> input) {
return input
.stream()
.reduce(Integer::sum)
.orElse(0);
}
}

We are iterating through the elements using a Java Stream and making use of reduce method to combine the elements by adding each pair.

We could also use a standard Java loop and iteratively go through each element, keeping the sum in a variable. What would be the problem with this approach? The main problem is that it couldn’t be easily parallelised, it’d need some level of synchronisation using a concurrent approach to be able to get a valid result!

So our sequential solution is not bad, although it could be improved in those situations where adding more threads could reduce the amount of time taken to do the job. That takes us to our last approach, the parallel approach!

Parallel approach

Implementing a parallel approach used to be much more difficult in the past, but with the introduction of Java Streams since JDK 8 things have become much simpler for us.
Actually our parallel implementation will only be different in one line with respect to our sequential approach thanks to Java Streams API:

package com.theboreddev.parallelism;
import com.theboreddev.Processor;
import java.util.List;
public class ParallelProcessor implements Processor {
@Override
public Integer process(List<Integer> input) {
return input
.parallelStream()
.reduce(Integer::sum)
.orElse(0);
}
}

Very simple, right?

So do you think that a parallel execution will always be faster than a sequential execution? Probably most of you will think that definitely it will, but the truth is that it’s not that simple! Why is that?

In order to be able to have an effective parallel processing we have to consider different aspects; that means that only under certain conditions parallelising our code will actually improve the performance of our code. Let’s take a look at these aspects!

Parallel vs sequential

As we have mentioned, a parallel approach is not always the fastest solution to solve a problem. There are several aspects that we have to keep in consideration before parallelising our code:

  • Is the data big enough to make a difference?

In many cases the amount of data that we have to process is so small that a sequential execution with a single thread will be completed by the time the data is split and distributed to the threads!

  • Is the source easily splittable?

If we cannot easily split the data in an even way, this would probably cause an overhead that will make us lose all the benefits of running parallel workers when compared to a sequential execution.
For example, if your source is an Iterator you won’t be able to split it easily; however, Java has introduced Spliterator since JDK 8 to have a splittable iterator.

  • Can it be split in completely independent and isolated chunks of work?

Sometimes each step in a step is dependant on the result from the previous step, creating an interdependency that will be impossible to break. These tasks are inherently sequential and in these cases it’ll be absolutely pointless to run our task in parallel because we won’t see any benefit.

  • Is merging results cheap or expensive?

It won’t be the same having to merge integers as having to merge multiple complex trees generated by different threads; that would be very expensive. We have to always consider this, as we could get unexpected results when parallelising our code.

  • Do we have to keep the order in which we process the elements?

In those cases where we have to keep the order in which we receive the elements, our parallelisation options will be very limited.
Knowing that, we should know that Java Streams are able to do optimisations when it knows that data is unordered, as we explained in my article “Understanding Java Streams“, about how streams use flags to do optimisations.
One way to take advantage of that is using unordered() to express that we don’t care about the order; that will allow optimisations for many short-circuiting operations like findFirst(), findAny() or limit().

  • Cache misses

The way we implement our code could have a significant impact on performance; for example, the use of array-based structures favours the allocation in CPU caches. Our hardware will allocate a chunk of the array when we access the first element, because most of the times we’ll be accessing more elements in the array. This has a very beneficial impact on performance, as cache access is much faster than main memory access.

In the case of parallel executions, if we constantly get cache misses our threads will be blocked waiting for data to be served from memory; this would mean that we’d lose part of the benefit of running parallel threads.

With Java Streams parallel execution is ALMOST magic, but it’s not as simple as it looks like!

Ok, so these are some of the things that you’ll have to keep in mind when using parallel executions! Let’s see now some performance results for our implementations to see if all the theory we’ve gone through does actually make sense.

Performance comparison

In this section we’ll be showing the results of some benchmark executions against our implementations to check if that matches what we’ve studied so far.
JMH has been used to easily create a benchmark test; we have selected to run “Average time” benchmark among the different types of benchmark available in JMH.

In our first run we’re going to select a size of 1,000,000 elements and based on the results we’ll see what to do next.

First benchmark: 1,000,000 elements

The source code to understand how these tests have been run can be found here.
Basically we’ve run two warmup benchmark iterations to give enough time to JIT compiler to do optimisations and then we’ve taken the results on the third iteration.
So let’s get to the point, what were the results?

BenchmarkModeCntScoreErrorsUnits
ConcurrentPerformanceTest.testavgt51038.731±13.232ms/op
ParallelPerformanceTest.testavgt538.655±13.594ms/op
SequentialPerformanceTest.testavgt536.100±11.902ms/op


As expected, the concurrent approach was by far the worst performant; taking 1 second and 38 ms average. On the other hand parallel and sequential approach had similar results, being the sequential approach the winner by more than 2 ms!

As we mentioned earlier, it’s not that easy to write efficient parallel code; I’m thinking that one of the reasons could be that data was not big enough. Let’s run our benchmark again with 10,000,000 instead.

Second benchmark: 10,000,000 elements

The results to add 10,000,000 elements were:

BenchmarkModeCntScoreErrorsUnits
ConcurrentPerformanceTest.testavgt51506.053±285.898ms/op
ParallelPerformanceTest.testavgt5427.291±134.101ms/op
SequentialPerformanceTest.testavgt5547.281±441.518ms/op


In this case the clear winner is the parallel approach! We’ve also noticed that the advantage with respect to the concurrent approach has been reduced considerably.
So we’ve had to process as much as 10,000,000 elements to notice a significant improvement with parallel processing; however, this is just a very simple example and other aspects could be taken in consideration and further improvements could be made.

That’s it from me! I really hope you’ve learned something this time!

Conclusion

We’ve learned that although Java Streams make it really easy to switch our code to be executed in parallel, things are not always as simple as that.

The best thing to do is to always run performance tests when performance is among our business requirements, but please don’t fall into premature optimisations. If making your code faster has no real business benefit in your organisation, then don’t waste your time and efforts.

One more thing that always helps is to have a good understanding of how every layer works; from the CPU to the very top of our application’s source code. Having an understanding about the internals and the effects that this could have in performance always helps to anticipate possible issues in terms of performance.

I hope that you’ve enjoyed this reading! If you liked this article please subscribe/follow to be notified when a new article gets published.
Looking forward to seeing you again soon!

Thanks you very much for reading!

Up ↑

Take a look at our recommended books!

Ok!
X
%d bloggers like this: