Apache Hadoop, Apache Spark, Akka, Java 8 streams and Quasar: The classic use cases to the newest concurrency approaches for Java developers
There’s a lot of chatter going around about newer concepts in concurrency, yet many developers haven’t had a chance to wrap their heads around them yet. In this post we’ll go through the things you need to know about Java 8 streams, Hadoop, Apache Spark, Quasar fibers and the Reactive programming approach – and help you stay in the loop, especially if you’re not working with them on a regular basis. It’s not the future, this is happening right now.
What are we dealing with here?
When talking about concurrency, a good way to characterize the issue at hand is answering a few questions to get better feel for it:
- Is it a data processing task? If so, can it be broken down to independent pieces of work?
- What’s the relationship between the OS, the JVM and your code? (Native threads Vs. light-weight threads)
- How many machines and processors are involved? (Single core Vs. Multicore)
Let’s go through each of these and figure out the best use cases to each approach.
1. From Thread Pools to Parallel Streams
Data processing on single machines, letting the Java take care of thread handling
With Java 8, we’ve been introduced to the new Stream API that allows applying aggregate operations like Filter, Sort or Map on streams of data. Another thing Streams allow are parallel operations on multicore machines when applying .parallelStream() – Splitting the work between threads using the Fork/Join framework introduced in Java 7. An evolution from the Java 6 java.util.concurrency library, where we met the ExecutorService which creates and handles our worker thread pools.
Fork/Join is also built on top of the ExecuterService, the main difference from a traditional thread pool is how they distribute the work between threads and thereby multicore machine support. With a simple ExecuterService you’re in full control of the workload distribution between worker threads, determining the size of each task for the threads to handle. With Fork/Join on the other hand, there’s a work-stealing algorithm in place that abstracts workload handling between threads. In a nutshell, this allows large tasks to be divided to smaller ones (forked), and processed in different threads, eventually joining the results – Balancing the the work between threads. However, it’s not a silver bullet.
Sometimes Parallel Streams may even slow you down, so you’ll need to think it through. Adding .parallelStream() to your methods can cause bottlenecks and slowdowns (some 15% slower on this benchmark we ran), the fine line goes through the number of threads. Let’s say we’re already running multiple threads and we’re using .parallelStream() in some of them, adding more and more threads to the pool. This could easily turn into more than our cores could handle, and slow everything down due to increased context switching.
Bottom line: Parallel Streams abstract handling threads on a single machine in a way that distributes the workload between your cores. However, if you want to use them efficiently it’s critical to keep the hardware in mind not spawn more threads than your machine can handle.
2. Apache Hadoop and Apache Spark
Heavy duty lifting: Big data processing across multiple machines
Moving on to multiple machines, petabytes of data, and tasks that resemble pulling all tweets that mention Java from twitter or heavy duty machine learning algorithms. When speaking of Hadoop, it’s important to take another step and think of the wider framework and its components: The Hadoop Distributed File System (HDFS), a resource management platform (YARN), the data processing module (MapReduce) and other libraries and utilities needed for Hadoop (Common). On top of these come other optional tools like a database which runs on top of HDFS (HBase), a platform for a querying language (Pig), and a data warehouse infrastructure (Hive) to name a few of the popular ones.
This is where Apache Spark steps in as a new data processing module, famous for its in-memory performance and the use of fast performing Resilient Distributed Datasets (RDDs), unlike the Hadoop MapReduce which doesn’t employ in-memory (and on-disk) operations as efficiently. The latest benchmark released by Databricks shows that Spark was 3x faster than Hadoop in sorting a petabyte of data, while using 10x less nodes.
The classic use case for Hadoop would be querying data, while Spark is getting famous for its fast runtimes of machine learning algorithms. But this is only the tip of the iceberg, as stated by Databricks: “Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk”.
Bottom line: Spark is the new rising star in Hadoop’s ecosystem. There’s a common misconception that we’re talking about something unrelated or competing, but I believe that what we’re seeing here is the evolution of the framework.
Java / Scala Developers – Know when and why production code breaks – read more
3. Quasar fibers
Breaking native threads to virtual light-weight threads
We’ve had the chance to run through the Hadoop, now let’s back to single machines. In fact, let’s zoom in even further than the standard multithreaded Java application and focus on one single thread. As far as we’re concerned, HotSpot JVM threads are the same as native OS threads, holding one thread and running “virtual” threads within it is what fibers are all about. Java doesn’t have a native fibers support, but no worries, Quasar by Parallel Universe got us covered.
Quasar is an open source JVM library that supports fibers (Also known as light weight threads), and also acts as an Actor Framework, which I’ll mention later. Context switching is the name of the game here. As we’re limited by the number of cores, once the native thread count grows larger we’re subjected to more and more context switching overhead. One way around this is fibers, using a single thread that supports “multithreading”. Looks like a case of threadcepiton.
Fibers can also be seen as an evolution from thread pools, dodging the dangers of thread overload we went through with Parallel Streams. They make it easier to scale threads and allow a significantly larger number of concurrent “light” threads. They’re not intended to replace threads and should be used for code that blocks relatively often, it’s like they’re acting as true async threads.
Bottom line: Parallel Universe is offering a fresh approach to concurrency in Java, haven’t reached v1.0 yet but definitely worth checking out.
4. Actors & Reactive Programming
A different model for handling concurrency in Java
In the Reactive Manifesto, the new movement is described with 4 principles: Responsive, Resilient, Elastic and Message-Driven. Which basically means fast, fault tolerant, scalable and suuports non-blocking communication.
Let’s see how Akka Actors support that. To simplify things, think of Actors as people that have a state and a certain behavior, communicating by exchanging messages that go to each other’s mailbox. An Actor system as a whole should be created per application, with a hierarchy that breaks down tasks to smaller tasks so that each actor has only one supervising actor at most. An actor can either take care of the task, break it down event further with delegation to another actor or in case of failure, escalate it to his supervisor. Either way, messages shouldn’t include behavior or share mutable states, each Actor has an isolated stated and behavior of its own.
It’s a paradigm shift from the concurrency models most developers are used to. And a bit of an off-shoot from the evolution in the first 3 topics we covered here. Although its roots stem back from the 70’s, its been under the radar just until recent years with a revival to better fit modern application demands. Parallel Universe’s Quasar also supports Actor, based on its light-weight threads. The main difference in implementation lies in the fibers/light-weight threads.
Bottom line: Taking on the Actor model takes managing thread pools off your back, leaving it to the toolkit. The revival of interest comes from the kind of problems applications deal with today, highly concurrent systems with much more cores that we can work with.
We’ve ran through 4 methods to solve problems using concurrent or parallel algorithms with the most interesting approaches to tackle today’s challenges. Hopefully this helped pique your interest and get a better view of the hot topics in concurrency today. Going beyond the thread pools, there’s a trend of delegating this responsibility to the language and its tools – Focusing dev resources on shipping new functionality rather than spending countless hours solving race conditions and locks.
Know great Java hacks? Always wanted to share coding insights with fellow developers?
If you want to contribute to OverOps Blog, and reach more than 80,000 developers a month, we’re looking for guest writers! Shoot us an email with your idea for a guest post: firstname.lastname@example.org.