
Don't look at virtual threads, or else...!

Let me start this entry with a dad joke.

Simplified atom model

An electron rides a motorcycle. Suddenly, a police officer stops the electron and says:

‘I had to stop you, because you were speeding, driving exactly 178 and a half kilometres per hour.’

‘Well, THANK YOU VERY MUCH, officer!’ the electron says, ‘Now I have absolutely no idea where I am!!!’

Some dad jokes are terrible, and don’t worry if you don’t get this one. It boils down to the uncertainty principle. In an extremely oversimplified version: some things in the universe can’t be observed in all their aspects at once. If you have an apple, you can measure its mass, volume, size, and (if it falls on your head) velocity, all at the same time, without affecting the apple itself.
However, if you’re interested in electrons, the more accurately you know its speed, the less you know its location, and vice versa, hence atomic orbital. (Gosh, will atomic things stop chasing me? 😅)

Taking a step back

Before we get to virtual threads (because that’s what lured you here in the first place), let’s take a step back and discuss what we already know when it comes to Java.

What you might already know is that Stream.parallel() is discouraged in many situations, and CompletableFuture is usually the preferred solution if you need to squeeze a lot of juice from all the CPU cores you have. That’s because there’s no way to provide your own thread pool to parallel(), hence you’re limited to the common pool.
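To make the contrast concrete, here’s a minimal sketch (not from the original article; the class name is mine): with CompletableFuture you can hand in any executor you like, something parallel() simply has no API for.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class CustomPoolDemo {
    public static void main(String[] args) {
        // Unlike Stream.parallel(), here WE choose the pool and its size
        ExecutorService pool = Executors.newFixedThreadPool(32);
        List<CompletableFuture<Integer>> futures = IntStream.range(0, 100)
                .mapToObj(i -> CompletableFuture.supplyAsync(() -> i * i, pool))
                .toList();
        // join() blocks until each future completes
        int sum = futures.stream().mapToInt(CompletableFuture::join).sum();
        System.out.println(sum); // sum of squares 0..99 = 328350
        pool.shutdown();
    }
}
```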

An example of this behaviour might look like this. First, we have a method which handles a task (and the task itself is computation heavy):

public static void handleTask(int id) {
    startedTasks.incrementAndGet();
    hardWork(5_000);
    logger.info(() -> "FINISHED %3d %s".formatted(id, Thread.currentThread()));
    finishedTasks.incrementAndGet();
}

The actual heavy lifting takes place only in hardWork, obviously. And, as you could guess, it will take more or less five seconds. The rest is here just to allow us to easily run our experiment (with vanilla Java and no 3rd party tools).
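The article doesn’t show hardWork itself, so here’s a hypothetical implementation consistent with the behaviour described: a CPU-bound busy-spin, NOT Thread.sleep (a sleeping virtual thread would unmount; a spinning one won’t).

```java
import java.util.concurrent.ThreadLocalRandom;

public class HardWorkSketch {
    // Hypothetical sketch -- burns CPU for roughly the given duration
    static void hardWork(long millis) {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        long acc = 0;
        while (System.nanoTime() < deadline) {
            acc += ThreadLocalRandom.current().nextLong(); // keep the core busy
        }
        if (acc == 42) System.out.print(""); // defeat dead-code elimination
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        hardWork(200);
        // Did it really take at least 200 ms of wall-clock time?
        System.out.println((System.nanoTime() - start) / 1_000_000 >= 200);
    }
}
```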

And the counters and logging format are declared like this:

static {
    System.setProperty("java.util.logging.SimpleFormatter.format", "[%1$tF %1$tT %1$tZ] [%4$-7s] %5$s %n");
}

static AtomicInteger startedTasks = new AtomicInteger(0);
static AtomicInteger finishedTasks = new AtomicInteger(0);

Let’s say we have:

  • six seconds to complete as many tasks as we possibly can
  • OpenJDK Runtime Environment (build 21-ea+34-2500)
  • 16 CPU cores

And we’re using Stream.parallel(), because reasons. The first approach is like this:

Thread.ofPlatform().daemon(true).name("stream-1").start(() -> {
    System.out.println("Started " + Thread.currentThread() + " to do some work");
    IntStream.range(0, 100).parallel().forEach(UncertaintyPrincipleOfVirtualThreads::handleTask);
});
Thread.sleep(Duration.ofSeconds(6));
logger.info("Tasks: started [%d], finished [%d]".formatted(startedTasks.get(), finishedTasks.get()));

What results are you expecting? I’m putting some white space here, so you can guess, before scrolling down.



















The result logged at the end looks like this:

Started Thread[#30,stream-1,5,main] to do some work
[tasks finish here]
Tasks: started [32], finished [16]

And it makes sense, right? If each task takes ~5s, and we can use our cores for max. 6s, and because the common pool grows to its maximum in this scenario, we are able to complete 16 tasks and start (without completing) another 16. (If you run this on your machine, the results may vary. I don’t know how many CPU cores you have.)
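If you want to check the numbers on your own machine, this little sketch (class name is mine) also explains where exactly the 16 comes from on a 16-core box:

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolSize {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int parallelism = ForkJoinPool.commonPool().getParallelism();
        System.out.println("cores=" + cores + " commonPool=" + parallelism);
        // The common pool targets cores - 1 workers by default, because the
        // thread calling forEach (stream-1 above) joins in as the "+1" --
        // that's why the log shows worker-1..worker-15 plus stream-1 itself
        System.out.println(parallelism == Math.max(1, cores - 1));
    }
}
```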

Now, if it takes one cow to deliver one calf in ~280 days, surely two cows will deliver two calves in ~280 days, right? ;-)
Let’s see what this code can produce:

Thread.ofPlatform().daemon(true).name("stream-1").start(() -> {
    System.out.println("Started " + Thread.currentThread() + " to do some work");
    IntStream.range(0, 100).parallel().forEach(UncertaintyPrincipleOfVirtualThreads::handleTask);
});

Thread.ofPlatform().daemon(true).name("stream-2").start(() -> {
    System.out.println("Started " + Thread.currentThread() + " to do some OTHER work");
    IntStream.range(100, 200).parallel().forEach(UncertaintyPrincipleOfVirtualThreads::handleTask);
});

And we can see:

Started Thread[#30,stream-1,5,main] to do some work
Started Thread[#31,stream-2,5,main] to do some OTHER work
[tasks finish here]
Tasks: started [34], finished [17]

Wait, WAT? We doubled the amount of parallel streams and instead of doubling the results, 2 × 16 = 32, we get only one, just ONE more task finished?

It’s time to see how tasks actually finished in both cases. For a single stream it may look like this:

FINISHED  65 Thread[#30,stream-1,5,main] 
FINISHED  48 Thread[#37,ForkJoinPool.commonPool-worker-7,5,main] 
FINISHED  32 Thread[#31,ForkJoinPool.commonPool-worker-1,5,main] 
FINISHED  42 Thread[#40,ForkJoinPool.commonPool-worker-10,5,main]
FINISHED  40 Thread[#35,ForkJoinPool.commonPool-worker-5,5,main] 
FINISHED  90 Thread[#34,ForkJoinPool.commonPool-worker-4,5,main] 
FINISHED  28 Thread[#38,ForkJoinPool.commonPool-worker-8,5,main] 
FINISHED  38 Thread[#36,ForkJoinPool.commonPool-worker-6,5,main] 
FINISHED  15 Thread[#32,ForkJoinPool.commonPool-worker-2,5,main] 
FINISHED  22 Thread[#41,ForkJoinPool.commonPool-worker-11,5,main]
FINISHED  43 Thread[#42,ForkJoinPool.commonPool-worker-12,5,main]
FINISHED   7 Thread[#39,ForkJoinPool.commonPool-worker-9,5,main] 
FINISHED  47 Thread[#44,ForkJoinPool.commonPool-worker-14,5,main]
FINISHED  57 Thread[#45,ForkJoinPool.commonPool-worker-15,5,main]
FINISHED  44 Thread[#33,ForkJoinPool.commonPool-worker-3,5,main] 
FINISHED  82 Thread[#43,ForkJoinPool.commonPool-worker-13,5,main]

And when using two parallel streams, the result is as follows:

FINISHED  97 Thread[#43,ForkJoinPool.commonPool-worker-12,5,main]
FINISHED  65 Thread[#30,stream-1,5,main] 
FINISHED 115 Thread[#35,ForkJoinPool.commonPool-worker-4,5,main] 
FINISHED  82 Thread[#37,ForkJoinPool.commonPool-worker-6,5,main] 
FINISHED 190 Thread[#39,ForkJoinPool.commonPool-worker-8,5,main] 
FINISHED  78 Thread[#44,ForkJoinPool.commonPool-worker-13,5,main]
FINISHED 144 Thread[#42,ForkJoinPool.commonPool-worker-11,5,main]
FINISHED 107 Thread[#36,ForkJoinPool.commonPool-worker-5,5,main] 
FINISHED  15 Thread[#38,ForkJoinPool.commonPool-worker-7,5,main] 
FINISHED 132 Thread[#33,ForkJoinPool.commonPool-worker-2,5,main] 
FINISHED  32 Thread[#32,ForkJoinPool.commonPool-worker-1,5,main] 
FINISHED  85 Thread[#46,ForkJoinPool.commonPool-worker-15,5,main]
FINISHED  90 Thread[#34,ForkJoinPool.commonPool-worker-3,5,main] 
FINISHED  57 Thread[#45,ForkJoinPool.commonPool-worker-14,5,main]
FINISHED 122 Thread[#40,ForkJoinPool.commonPool-worker-9,5,main] 
FINISHED 165 Thread[#31,stream-2,5,main] 
FINISHED  44 Thread[#41,ForkJoinPool.commonPool-worker-10,5,main]

Despite running two parallel streams, we still have the same common worker pool. The one extra result came from the stream-2 thread. And we can see (because some finished task IDs are >= 100) that the second stream is actually “stealing” workers from the pool before the first stream can make them busy.

If we go a little wild and separate both streams by Thread.sleep(10) (which is something we never do in prod code, right? RIGHT?!?), most likely we will see only one task >= 100 processed. The overall number won’t change, because the first stream makes all workers from the common pool busy before the second stream gets a chance; the second stream then completes only one task, on its own stream-2 thread.

So parallel streams can’t scale linearly, no matter how many you toss in.

Wait a second, but there’s this new thing in Java 21, even your boss heard about it, so they tell you:

Virtual threads to the rescue, just use them!

Sadly, you didn’t have a chance to attend this talk from that funny Polish guy, so you all keep hoping that virtual threads are going to squeeze more juice from your CPU!

And the team adds some feature flags, and hides behind them a new implementation, which is more or less the following:

for (int i = 0; i < 100; i++) {
    int taskId = i;
    Thread.ofVirtual().start(() -> {
        handleTask(taskId);
    });
}

And now, because virtual threads are so lightweight, you hope to get more than 16 finished tasks! How many will we get this time?



















No way. This can’t be. Two sprints spent on the rewrite only to get:

Tasks: started [32], finished [16]

WTF? Okay, we spent two weeks on coding, because it was much easier to code than to read the manual for one hour. And the manual says that under the hood, virtual threads are scheduled on a ForkJoin pool! And it also says that by default its parallelism is equal to the number of available processors!
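For the record, here’s a sketch of what the manual is talking about. The property names below are the JDK 21 startup knobs; they can’t be changed at runtime (the class name is mine):

```java
public class SchedulerTuning {
    public static void main(String[] args) throws InterruptedException {
        // At the time of writing (JDK 21) the virtual-thread scheduler
        // defaults to availableProcessors() and can only be tuned at startup:
        //   java -Djdk.virtualThreadScheduler.parallelism=32 \
        //        -Djdk.virtualThreadScheduler.maxPoolSize=32 SchedulerTuning
        Thread vt = Thread.ofVirtual().start(() ->
                // toString() reveals the carrier: ...@ForkJoinPool-1-worker-N
                System.out.println(Thread.currentThread()));
        vt.join();
    }
}
```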

Not all is lost; it can also be tuned. But before tuning, we decide to take a closer look at the actual task-handling implementation, to measure its progress and maybe shave off some time. After all, making it run faster than 5 seconds will make everything work faster, virtual threads or not. So we fall back to good ol' log("ONE") and log("TWO") to trace the progress. We split the task into smaller steps and put in some FINE trace statements. So now our tasks are handled like this:

public static void handleTask(int id) {
    startedTasks.incrementAndGet();
    report(id, "1");
    hardWork(1_000);
    report(id, "2");
    hardWork(1_000);
    report(id, "3");
    hardWork(1_000);
    report(id, "4");
    hardWork(1_000);
    report(id, "5");
    hardWork(1_000);
    logger.info(() -> "FINISHED %3d %s".formatted(id, Thread.currentThread()));
    finishedTasks.incrementAndGet();
}

private static void report(int taskId, String stage) {
    logger.fine(() -> "STEP %s %3d %s".formatted(stage, taskId, Thread.currentThread()));
}

And obviously, we can’t forget to increase the logging precision, so in the static init block:

logger.setLevel(java.util.logging.Level.FINE);

We’re super excited, we hit the run button and…

FINISHED  26 VirtualThread[#58]/runnable@ForkJoinPool-1-worker-8
FINISHED  21 VirtualThread[#52]/runnable@ForkJoinPool-1-worker-4
FINISHED  30 VirtualThread[#62]/runnable@ForkJoinPool-1-worker-2
Tasks: started [100], finished [3] 

Wait, what? WT actual F? WHERE MY PERFORMANCE??!!!1one

Just to be sure we didn’t break things accidentally, we flip the feature flag back and check the results with two parallel streams. And it’s just like it was before: Tasks: started [34], finished [17].

Well, it seems that virtual threads behaved exactly like the electron. We wanted to take a closer look at them, and instead they’re… gone? What is this nonsense? Have they

Introduced Quantum Mechanics to Java?

First things first. Why don’t we see the FINE statements at all? Well, that’s not so difficult to figure out ;-)
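If you’d rather not figure it out yourself, here’s my hedged guess at the riddle’s answer: setting the logger level is only half of the job, because the default ConsoleHandler still filters at INFO. A sketch of the missing half:

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class FineLogging {
    public static void main(String[] args) {
        Logger logger = Logger.getLogger("demo");
        logger.setLevel(Level.FINE);          // what we already did...
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);         // ...and the missing piece
        logger.setUseParentHandlers(false);   // avoid duplicate output
        logger.addHandler(handler);
        logger.fine("now you can see me");    // goes to stderr
        System.out.println("configured");
    }
}
```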

Seeing the logged STEPs is helpful, although not mandatory. After the initial STEP 1s, we start seeing something like:

STEP 1 27 VirtualThread[#61]/runnable@ForkJoinPool-1-worker-9 
STEP 2 10 VirtualThread[#43]/runnable@ForkJoinPool-1-worker-6 
STEP 2 15 VirtualThread[#48]/runnable@ForkJoinPool-1-worker-11 
STEP 1 30 VirtualThread[#64]/runnable@ForkJoinPool-1-worker-11 
STEP 2 16 VirtualThread[#49]/runnable@ForkJoinPool-1-worker-9 
STEP 2 19 VirtualThread[#53]/runnable@ForkJoinPool-1-worker-3 
STEP 1 28 VirtualThread[#62]/runnable@ForkJoinPool-1-worker-6 

and if we keep scrolling, even such output is revealed to our eyes:

STEP 1 47 VirtualThread[#81]/runnable@ForkJoinPool-1-worker-11
STEP 4 20 VirtualThread[#54]/runnable@ForkJoinPool-1-worker-15
STEP 2 21 VirtualThread[#55]/runnable@ForkJoinPool-1-worker-9 
STEP 5  8 VirtualThread[#41]/runnable@ForkJoinPool-1-worker-12 
STEP 5  5 VirtualThread[#38]/runnable@ForkJoinPool-1-worker-14 
STEP 1 29 VirtualThread[#63]/runnable@ForkJoinPool-1-worker-14

That may look strange… In the streams approach things are “as expected”, and the steps are processed in sort-of batches… First we see 17 tasks in STEP 1, then they all move to STEP 2, and so on, and after FINISHing, the workers pick up the next tasks in STEP 1. So as long as a task is running, the workers from the underlying ForkJoin pool keep working on their tasks, from beginning until completion.

However, this is clearly not the case with virtual threads… In the very first output we saw this:

STEP 1 27 VirtualThread[#61]/runnable@ForkJoinPool-1-worker-9 
STEP 2 10 VirtualThread[#43]/runnable@ForkJoinPool-1-worker-6 
STEP 2 15 VirtualThread[#48]/runnable@ForkJoinPool-1-worker-11 
STEP 1 30 VirtualThread[#64]/runnable@ForkJoinPool-1-worker-11 
STEP 2 16 VirtualThread[#49]/runnable@ForkJoinPool-1-worker-9 
STEP 2 19 VirtualThread[#53]/runnable@ForkJoinPool-1-worker-3 
STEP 1 28 VirtualThread[#62]/runnable@ForkJoinPool-1-worker-6 

Task #27 was taken to STEP 1 by VirtualThread[#61], using worker-9. Then, instead of carrying this very task #27 further, worker-9 got assigned to VirtualThread[#49], handling STEP 2 of task #16. According to the output, the same happened to worker-6 and worker-11. And I guess it’s quite safe to say that statistically this happened to every task and virtual thread. The three tasks which got completed were lucky winners, taken through all five steps by some workers, not necessarily the same worker from STEP 1 to STEP 5.

What we see here in action is

The Very Purpose Of Virtual Threads

Whenever a virtual thread encounters an IO operation, it unmounts from its carrier thread. This way the carrier thread can pick up another virtual thread and keep carrying it as long as that virtual thread doesn’t call an IO operation.
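You can watch the unmount/remount happen yourself (a minimal sketch, class name is mine). The carrier shown before the blocking call may differ from the one after it, although that’s not guaranteed on any single run:

```java
public class CarrierSwitch {
    public static void main(String[] args) throws InterruptedException {
        Thread vt = Thread.ofVirtual().start(() -> {
            System.out.println("before sleep: " + Thread.currentThread());
            try {
                Thread.sleep(100); // blocking call -> the VT unmounts here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // After remounting, the carrier worker may (or may not) differ
            System.out.println("after sleep:  " + Thread.currentThread());
        });
        vt.join();
    }
}
```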

It’s pretty much like with taxis or shared rides. Once you get into the car (which you previously summoned somehow), the driver doesn’t kick you out until you reach your destination. And the drivers don’t care if it’s a subsequent ride for you that day, or the first one.
An Uber driver won’t prefer you just because earlier tonight someone from Uber drove you to the restaurant, and now you want to change your location. Statistically, everyone is equal here. So the solution is not to let the taxi go, but to keep it and pay for it. Or simply don’t exit the car ;-)

In our example that means: don’t call unnecessary IO / logging, because this splits your journey into smaller trips. The total mileage for all persons / tasks will be more or less the same, but if you prefer some people / tasks to actually complete the whole journey, this is what you have to take into account.

It’s just important to remember what virtual threads are meant for and good at. If you have some IO operations (not only logging, but also, or maybe even predominantly, DB queries, file operations, network calls), they allow switching CPU power to other tasks, instead of waiting needlessly.
Pretty much like with taxis: if you go into a restaurant to have dinner with your friends, you hardly ever tell the driver ‘Please park here and keep the engine running, I’ll pay for that’. Instead, you release the taxi and call another one when you actually have to go back home; in the meantime other people can use the taxi, which is a costly resource.

However, good luck calling a taxi after a huge concert ;-)

It’s also true that virtual threads are threads, and you can easily verify it on your own:

Thread.ofVirtual().start(() -> {
    if(Thread.currentThread() instanceof Thread) {
        System.out.println("virtual thread is a thread");
    }
});

Therefore, we can use virtual threads in all the places, libraries and frameworks where we used good ol' Threads so far. Only please be aware that IO calls will chop your journeys into smaller segments. And that there’s this pool “which can be tuned”, but (at the moment of writing) providing our own pool for virtual threads is not allowed. For that we have to stick to CompletableFuture.
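And if you just need lots of virtual threads without starting them by hand, the JDK ships an executor for that (a minimal sketch; the class name is mine):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class VirtualExecutorDemo {
    public static void main(String[] args) {
        AtomicInteger done = new AtomicInteger();
        // One fresh virtual thread per submitted task; try-with-resources
        // works because close() waits for all submitted tasks to finish
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            IntStream.range(0, 1_000).forEach(i -> exec.submit(done::incrementAndGet));
        }
        System.out.println(done.get()); // all 1000 tasks completed
    }
}
```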

Hey, but I can force my virtual thread not to leave the cab!

Sure you can: we could make the whole task synchronized. And perhaps in our case that could work without paying too much of a performance penalty (because we’re just logging to the local console).
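A sketch of what “not leaving the cab” could look like (class name is mine). At the time of writing (JDK 21), blocking inside a synchronized block pins the virtual thread to its carrier, and the -Djdk.tracePinnedThreads=full startup flag reports such pinning:

```java
public class PinningSketch {
    static final Object lock = new Object();

    public static void main(String[] args) throws InterruptedException {
        // Run with -Djdk.tracePinnedThreads=full to see the pinning report
        Thread vt = Thread.ofVirtual().start(() -> {
            synchronized (lock) {      // inside synchronized...
                try {
                    Thread.sleep(100); // ...blocking pins the carrier (JDK 21)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            System.out.println("done");
        });
        vt.join();
    }
}
```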

I’d still avoid it, for now, because this mechanism may change in future Java versions. And if the report operation takes longer (say, a blocking remote logging is in place), then it will needlessly pin the carrier thread. Usually it’s one of those Really Bad Ideas™, which I described here and here.

I would suggest monitoring the progress of tasks without relying on IO called from inside virtual threads. And, as demonstrated, log("HERE") counts as one.
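One hedged sketch of IO-free progress reporting (all names are mine): bump atomic counters inside the virtual threads, which doesn’t unmount them, and let some other thread do the actual printing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicIntegerArray;

public class IoFreeProgress {
    // One counter per stage: bumping an atomic is not an IO call,
    // so the virtual thread is NOT unmounted at the reporting point
    static final AtomicIntegerArray stageCounts = new AtomicIntegerArray(5);

    static void report(int stage) {
        stageCounts.incrementAndGet(stage);
    }

    public static void main(String[] args) throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            threads.add(Thread.ofVirtual().start(() -> {
                for (int stage = 0; stage < 5; stage++) report(stage);
            }));
        }
        for (Thread t : threads) t.join();
        // In a real app, a separate platform thread would sample and log
        // these counters periodically, outside the virtual threads
        System.out.println(stageCounts);
    }
}
```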

Different pools

There’s one more thing to finish this entry. In case someone didn’t pay attention either to the manual or to the output…

This happens with streams:

FINISHED  48 Thread[#37,ForkJoinPool.commonPool-worker-7,5,main]

and this happens with virtual threads (now):

FINISHED  26 VirtualThread[#58]/runnable@ForkJoinPool-1-worker-8

Yes, these are two different ForkJoin pools. Hence, we can get rid of the feature flag and run the tasks in both streams and virtual threads at the same time (without FINE logging).

And then we can see any of these results:

Tasks: started [60], finished [18]
Tasks: started [60], finished [19]
Tasks: started [50], finished [7]
Tasks: started [66], finished [30]

Why? Well, it’s yet another entry to write. TL;DR: there are reasons behind “thread per core”.


Whoa, this turned out to be longer than I expected ;-) You’re more than welcome to discuss in social media. Links in the footer.
