After a deep introduction to functional programming in our last article, “A new Java functional style“, we think it’s now time to look at Java Streams in more depth and understand how they work internally. Understanding this becomes very important whenever the performance of our streams matters.
You’ll see how much easier and more efficient it is to process sequences of elements with streams compared to the “old way” of doing things, and how nice it is to write code using fluent interfaces.
You can now say goodbye to error-prone code, full of boilerplate and clutter, that made our lives as developers much more complicated.
Our goal is to turn this article into a kind of Java Streams cheat sheet, but let’s start with a brief introduction to Java Streams first!
What are streams in Java
What are Java Streams? Java Streams are basically a pipeline of aggregate operations that can be applied to process a sequence of elements.
An aggregate operation is a higher-order function that receives a behaviour in the form of a function or lambda, and that behaviour is what gets applied to our sequence.
For example, if we define the following stream:
```java
collection.stream()
    .map(element -> decorateElement(element))
    .collect(toList())
```
In this case, the behaviour we’re applying to each element is the one specified in our “decorateElement” method, which presumably creates a new “enhanced” or “decorated” element based on the existing one.
Streams are built around one main interface, the Stream interface, which was released in JDK 8. Let’s briefly go into a bit more detail!
Characteristics of a stream
As mentioned in my last article, streams have these main characteristics:
- Declarative paradigm
Streams are written specifying what has to be done, but not how.
- Lazily evaluated
This basically means that until we call a terminal operation, our stream won’t be doing anything, we will just have declared what our pipeline will be doing.
- It can be consumed only once
Once we call a terminal operation, a new stream has to be generated in order to apply the same series of stream operations again.
- Can be parallelised
Java Streams are sequential by default, but they can be very easily parallelised.
We should see Java Streams as a series of connected pipes, where in each pipe our data gets processed differently; this concept is very similar to UNIX pipes!
Phases of a stream
A Java Stream pipeline is composed of three main phases:
- Split
Data is collected from a collection, a channel or a generator function, for example. In this step we convert a data source into a Stream in order to process our data; we usually call it the stream source.
- Apply
Every operation in the pipeline is applied to each element in the sequence. Operations in this phase are called intermediate operations.
- Combine
The pipeline completes with a terminal operation, where the stream gets materialised.
Please remember that when defining a stream we only declare the steps our pipeline will follow; they won’t get executed until we call a terminal operation.
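As a quick sketch of this laziness (class and variable names are ours, just for illustration), notice that the map stage runs zero times until the terminal operation is called:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        AtomicInteger mapCalls = new AtomicInteger();

        // Declare the pipeline: nothing runs yet
        Stream<String> pipeline = List.of("a", "b", "c").stream()
                .map(s -> {
                    mapCalls.incrementAndGet();
                    return s.toUpperCase();
                });

        System.out.println("before terminal op: " + mapCalls.get());

        // The terminal operation triggers execution of the whole pipeline
        List<String> result = pipeline.collect(Collectors.toList());
        System.out.println("after terminal op: " + mapCalls.get());
    }
}
```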
There are two interfaces in Java which are very important for the SPLIT and COMBINE phases; these interfaces are Spliterator and Collector.
The Spliterator interface allows two behaviours that are quite important in the split phase: iterating and the potential splitting of elements.
The first of these aspects is quite obvious, we’ll always want to iterate through our data source; what about splitting?
Splitting takes on great importance when running parallel streams, as it’s what’s responsible for splitting the stream so that each thread gets an independent “piece of work”.
Spliterator provides two methods for accessing elements:
```java
boolean tryAdvance(Consumer<? super T> action);
void forEachRemaining(Consumer<? super T> action);
```
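A quick sketch of those two methods in action on a plain List (the printed labels are ours):

```java
import java.util.List;
import java.util.Spliterator;

public class SpliteratorIteration {
    public static void main(String[] args) {
        Spliterator<String> sp = List.of("one", "two", "three").spliterator();

        // tryAdvance consumes a single element and reports whether one existed
        sp.tryAdvance(e -> System.out.println("first: " + e));

        // forEachRemaining bulk-consumes whatever is left
        sp.forEachRemaining(e -> System.out.println("rest: " + e));
    }
}
```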
And one method for splitting our stream source:
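That method is trySplit, declared in Spliterator as `Spliterator<T> trySplit();`. A small sketch of what it does on an ArrayList source; note that the half-and-half split is an implementation detail of ArrayList’s spliterator, not something the interface guarantees:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;

public class TrySplitDemo {
    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>(List.of(1, 2, 3, 4, 5, 6, 7, 8));

        Spliterator<Integer> right = numbers.spliterator();
        // trySplit hands back a new Spliterator covering a prefix of the source
        Spliterator<Integer> left = right.trySplit();

        System.out.println("left covers: " + left.estimateSize());
        System.out.println("right covers: " + right.estimateSize());
    }
}
```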
Since JDK 8, a spliterator method has been included in every collection, so streams use the Spliterator internally to iterate through the elements of a Stream.
Java provides implementations of the Spliterator interface, but you can provide your own implementation of Spliterator if for whatever reason you need it.
Java provides a set of collectors in Collectors class, but you could also do the same with Collector interface if you needed a custom Collector to combine your resulting elements in a different way!
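As a sketch, a custom Collector can be assembled with the Collector.of factory method; the separator and names here are just illustrative:

```java
import java.util.StringJoiner;
import java.util.stream.Collector;
import java.util.stream.Stream;

public class CustomCollectorDemo {
    public static void main(String[] args) {
        // A hand-rolled collector: supplier, accumulator, combiner, finisher
        Collector<String, StringJoiner, String> joining = Collector.of(
                () -> new StringJoiner(" | "),   // supplier: the mutable container
                StringJoiner::add,               // accumulator: fold one element in
                StringJoiner::merge,             // combiner: merge partial results (parallel case)
                StringJoiner::toString);         // finisher: produce the final value

        String result = Stream.of("split", "apply", "combine").collect(joining);
        System.out.println(result);
    }
}
```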
Let’s see now how a Stream pipeline works internally and why this is important.
Java Stream operations are stored internally using a linked-list structure, and each stage in that structure gets assigned a bitmap of flags with the following meanings:
| Flag | Meaning |
| --- | --- |
| SIZED | The size of the stream is known. |
| DISTINCT | The elements in the stream are unique; no duplicates. |
| SORTED | Elements are sorted in a natural order. |
| ORDERED | The stream has a meaningful encounter order; the order in which the elements are streamed should be preserved on collection. |
So each stage can be pictured as carrying its own combination of these flags. Why is this so important? Because this bitmap representation is what allows Java to perform stream optimisations.
Each operation will clear, set or preserve different flags; this is quite important because it means that each stage knows what effect it has on these flags, and that knowledge is what’s used to make the optimisations.
For example, map will clear the SORTED and DISTINCT bits, because the data may have changed; however, it will always preserve the SIZED flag, as the size of the stream is never modified by map. Does that make sense?
Let’s look at another example to clarify things further: filter will clear the SIZED flag, because the size of the stream may change, but it will always preserve the SORTED and DISTINCT flags, because filter never modifies the elements themselves. Is that clear enough?
So how does the Stream use these flags for its own benefit? Remember that operations are stored in a linked list? Each operation combines the flags from the previous stage with its own flags, generating a new set of flags.
Based on this, we will be able to omit some stages in many cases! Let’s take a look at an example:
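A minimal sketch matching the description below (the actual collection contents are our assumption):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DistinctSkipDemo {
    public static void main(String[] args) {
        Set<String> languages = Set.of("Java", "Kotlin", "Scala");

        // The source is a Set, so the DISTINCT flag is already set;
        // the distinct() stage below can be skipped entirely.
        List<String> result = languages.stream()
                .distinct()
                .collect(Collectors.toList());

        System.out.println(result.size());
    }
}
```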
In this example we are creating a Set of String, which will always contain unique elements. Later on in our Stream we make use of distinct to get unique elements from our Stream; Set already guarantees unique elements, so our Stream will be able to cleverly skip that stage making use of the flags we’ve explained above. That’s brilliant, right?
We’ve learned that Java can make transparent optimisations to our streams thanks to the way they’re structured internally; let’s look now at how they get executed!
We already know that a Stream is lazily executed, so when a terminal operation is called, the Stream first selects an execution plan.
There are two main scenarios in the execution of a Java Stream: when all stages are stateless and when NOT all stages are stateless.
To be able to understand this we need to know what stateless and stateful operations are:
- Stateless operations
A stateless operation doesn’t need to know about any other element to be able to emit a result. Examples of stateless operations are: filter, map or flatMap.
- Stateful operations
On the contrary, stateful operations need to know about all the elements before emitting a result. Examples of stateful operations are: sorted, limit or distinct.
What’s the difference between these situations then? Well, if all operations are stateless, the Stream can be processed in one go. On the other hand, if it contains stateful operations, the pipeline is divided into sections, using the stateful operations as delimiters.
Let’s take a look at a simple stateless pipeline first!
Execution of stateless pipelines
We tend to think that streams will be executed exactly in the same order as we write them; that’s not correct, let’s see why.
Let’s consider the following scenario, where we have been asked to give a list with those employees with salaries below $80,000 and update their salaries with a 5% increase. The stream responsible for doing that would be the one shown below:
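A sketch of such a pipeline; the Employee record, its accessors and the salary figures are assumptions of ours, only the employee names come from the logs shown further down:

```java
import java.util.List;
import java.util.stream.Collectors;

public class SalaryPipeline {
    // Hypothetical employee type; any shape with a name and a salary would do
    record Employee(String name, double salary) {}

    public static void main(String[] args) {
        List<Employee> employees = List.of(
                new Employee("John Smith", 60_000),
                new Employee("Susan Johnson", 70_000),
                new Employee("Erik Taylor", 50_000),
                new Employee("Zack Anderson", 75_000),
                new Employee("Sarah Lewis", 90_000));

        List<Employee> updated = employees.stream()
                .filter(e -> e.salary() < 80_000)
                .map(e -> new Employee(e.name(), e.salary() * 1.05))
                .collect(Collectors.toList());

        System.out.println(updated.size());
    }
}
```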
How do you think it’ll be executed? We’d normally think that the collection gets filtered first, then we create a new collection containing the employees with their updated salaries, and finally we collect the result, right?
Unfortunately, that’s not actually how streams get executed; to prove it, we’re going to add a log line to each step of our stream, just by expanding the lambda expressions:
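A sketch of the logged version (the Employee record and the salary figures remain assumptions of ours):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SalaryPipelineLogged {
    // Hypothetical employee type, as before
    record Employee(String name, double salary) {}

    public static void main(String[] args) {
        List<Employee> employees = List.of(
                new Employee("John Smith", 60_000),
                new Employee("Susan Johnson", 70_000),
                new Employee("Erik Taylor", 50_000),
                new Employee("Zack Anderson", 75_000),
                new Employee("Sarah Lewis", 90_000));

        List<Employee> updated = employees.stream()
                .filter(e -> {
                    System.out.println("Filtering employee " + e.name());
                    return e.salary() < 80_000;
                })
                .map(e -> {
                    System.out.println("Mapping employee " + e.name());
                    return new Employee(e.name(), e.salary() * 1.05);
                })
                .collect(Collectors.toList());
    }
}
```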
If our initial reasoning was correct we should be seeing the following:
```
Filtering employee John Smith
Filtering employee Susan Johnson
Filtering employee Erik Taylor
Filtering employee Zack Anderson
Filtering employee Sarah Lewis
Mapping employee John Smith
Mapping employee Susan Johnson
Mapping employee Erik Taylor
Mapping employee Zack Anderson
```
We’d expect to see each element going through the filter first and then, as one of the employees has a salary higher than $80,000, we’d expect four elements to be mapped to a new employee with an updated salary.
Let’s see what actually happens when we run our code:
```
Filtering employee John Smith
Mapping employee John Smith
Filtering employee Susan Johnson
Mapping employee Susan Johnson
Filtering employee Erik Taylor
Mapping employee Erik Taylor
Filtering employee Zack Anderson
Mapping employee Zack Anderson
Filtering employee Sarah Lewis
```
Hmmm, that’s not what you were expecting, right? So Java Streams are actually processed one element at a time, rather than one stage at a time.
That’s quite surprising, right?
In reality, the elements of a Stream are processed individually and only then collected. This is VERY IMPORTANT for both the correctness and the efficiency of Java Streams! Why?
First of all, parallel processing becomes safe and straightforward with this style of processing; that’s why we can convert a stream into a parallel stream so easily!
Another big benefit of doing this is something called short-circuiting terminal operations. We’ll take a brief look at them later!
Execution of pipelines with stateful operations
As we mentioned earlier, the main difference when we have stateful operations is that a stateful operation needs to know about all the elements before emitting a result. So what happens is that a stateful operation buffers all the elements until it reaches the last element and then it emits a result.
That means that our pipeline gets divided into two sections!
Let’s modify the example shown in the last section to include a stateful operation between the two existing stages; we’ll use sorted to show how Stream execution changes. Please notice that, in order to use the sorted method with no arguments, the Employee class now has to implement the Comparable interface.
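A sketch of the new pipeline; the assumed natural order here compares employees by name, and the Employee record and salaries remain our assumptions:

```java
import java.util.List;
import java.util.stream.Collectors;

public class SalaryPipelineSorted {
    // Hypothetical employee type; natural order by name is an assumption
    record Employee(String name, double salary) implements Comparable<Employee> {
        @Override
        public int compareTo(Employee other) {
            return name.compareTo(other.name);
        }
    }

    public static void main(String[] args) {
        List<Employee> employees = List.of(
                new Employee("John Smith", 60_000),
                new Employee("Susan Johnson", 70_000),
                new Employee("Erik Taylor", 50_000),
                new Employee("Zack Anderson", 75_000),
                new Employee("Sarah Lewis", 90_000));

        List<Employee> updated = employees.stream()
                .filter(e -> {
                    System.out.println("Filtering employee " + e.name());
                    return e.salary() < 80_000;
                })
                .sorted() // stateful: buffers every filtered element before emitting
                .map(e -> {
                    System.out.println("Mapping employee " + e.name());
                    return new Employee(e.name(), e.salary() * 1.05);
                })
                .collect(Collectors.toList());
    }
}
```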
How do you think this will be executed? Will it be the same as our previous example with stateless operations? Let’s run it and see what happens.
```
Filtering employee John Smith
Filtering employee Susan Johnson
Filtering employee Erik Taylor
Filtering employee Zack Anderson
Filtering employee Sarah Lewis
Mapping employee Erik Taylor
Mapping employee John Smith
Mapping employee Susan Johnson
Mapping employee Zack Anderson
```
Surprise! The order of execution of the stages has changed!
Why is that?
As we explained earlier, when we use a stateful operation our pipeline gets divided into two sections. That’s exactly what has happened!
The sorted method cannot emit a result until all the elements have been filtered, so it buffers them before emitting any result to the next stage (map).
This is a clear example of how the execution plan changes completely depending on the type of operation; this is done in a way that is totally transparent to us.
Execution of parallel streams
We can execute parallel streams very easily by using parallelStream or parallel. So how does it work internally?
It’s actually pretty simple: Java uses the trySplit method to try to split the collection into chunks that can be processed by different threads.
In terms of the execution plan, it works very similarly, with one main difference: instead of a single set of linked operations, we have multiple copies of it, and each thread applies those operations to the chunk of elements it’s responsible for; once completed, the results produced by each thread get merged into one single, final result!
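A tiny sketch of a parallel stream; each worker thread sums its own chunk and the partial sums are merged into one result:

```java
import java.util.stream.IntStream;

public class ParallelDemo {
    public static void main(String[] args) {
        // The range is split into chunks, summed in parallel, then merged
        int total = IntStream.rangeClosed(1, 100)
                .parallel()
                .sum();

        System.out.println(total); // 100 * 101 / 2 = 5050
    }
}
```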
The best thing is that streams do this transparently for us! That’s great, isn’t it?
One last thing to know about parallel streams is that Java assigns each chunk of work to a thread in the common ForkJoinPool, in the same way as CompletableFuture does.
Now as promised, let’s take a brief look at short-circuiting terminal operations before we complete this section about how Streams work.
Short-circuiting terminal operations
Short-circuiting terminal operations are operations that can “short-circuit” the stream as soon as we’ve found what we were looking for, even if it’s a parallel stream and multiple threads are doing work.
If we take a closer look at certain operations, like limit, findFirst, findAny, anyMatch, allMatch or noneMatch, we’ll see that we don’t need to process the whole collection to get a final result.
Ideally, we’d want to interrupt the processing of the stream and return a result as soon as we find it. That’s easy to achieve given the way streams get processed: elements are processed individually, so, for example, when running a noneMatch terminal operation, we can finish processing as soon as one element matches the criteria. I hope this makes sense!
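A small sketch of short-circuiting with anyMatch; the counter shows how many elements were actually inspected before the stream stopped:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class ShortCircuitDemo {
    public static void main(String[] args) {
        AtomicInteger inspected = new AtomicInteger();

        boolean found = Stream.of(1, 2, 3, 4, 5)
                .peek(n -> inspected.incrementAndGet()) // count elements flowing through
                .anyMatch(n -> n >= 2);                 // short-circuits at the first match

        System.out.println("found: " + found);
        System.out.println("elements inspected: " + inspected.get());
    }
}
```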
One interesting fact to mention in terms of execution is that for short-circuiting operations the tryAdvance method in Spliterator is called; however, for non short-circuiting operations the method called would be forEachRemaining.
That’s it from me! I hope you now have a good understanding of how streams work and that this helps you design stream pipelines more easily!
If you need to improve your understanding of Java Streams and functional programming in Java, I’d recommend reading “Functional Programming in Java: Harnessing the Power Of Java 8 Lambda Expressions”, available on Amazon.
Java Streams have been a massive improvement to the Java language; not only is our code more readable and easier to follow, it’s also less error-prone and more fluent to write. Writing complex loops and juggling variables just to iterate over collections was never the most efficient way of doing things in Java.
However, I think the main benefit is that Java Streams have enabled anyone to do concurrent programming! You no longer need to be an expert in concurrency to write concurrent code, although it’s good to understand the internals to avoid possible issues.
The clever way in which Java processes streams has cleared our path to processing collections and writing concurrent programs, and we should take advantage of it!
I hope that what we’ve gone through in this article has been clear enough to give you a good understanding of Java Streams. I also hope that you’ve enjoyed reading it as much as I enjoyed writing it to share with you!
In the next article we’ll show many examples of how to use Java Streams, so if you’ve liked this one, please subscribe or follow to be notified when a new article gets published. It was a pleasure having you, and I hope to see you again!
Thank you very much for reading!