Skip to content

Wisp Documentation

yulei edited this page Aug 19, 2020 · 5 revisions

Wisp Documentation

Wisp2 is a stackful symmetric coroutine implemented at the JVM level. Compared with bytecode weaving technology such as Kotlin and Kilim, Wisp has inserted scheduling support on JDK blocking interfaces, thus enabling existing Java applications to achieve performance improvement brought by coroutine without source modification.

The most important idea of Wisp2 is letting the runtime infrastructure provide complete capabilities as much as possible, so that applications can still use thread-based programming models and interfaces. Java thread could transfer to coroutine via JVM parameters. Compared with the solution which needs to modify source code, it has the following benefits:

  • Low migration cost: No source code change and recompilation required, and great compatibility for the legacy code.
  • Reuse existing concurrency mechanisms, such as j.u.c.
  • Troubleshooting capabilities Improvement:
    • There is not any source code change required, hence easy judgment about whether a problem is introduced by the coroutine (by switching between Wisp and the normal flow).
    • If some tools are not compatible while in Wisp (for example, strace), you can fast and temporarily disable Wisp for troubleshooting.

Note: Not all scenarios are suitable for acceleration by using coroutine(such as CPU intensive workloads). With user-transparent usage, it is easy to switch between Wisp and the normal flow.

To start using Wisp

Take a pingpong 1,000,000-time program as an example, which is a blocking and switching-intensive application.

             o      .   _______ _______
              \_ 0     /______//______/|   @_o
                /\_,  /______//______/     /\
               | \    |      ||      |     / |

static final ExecutorService THREAD_POOL = Executors.newCachedThreadPool();

public static void main(String[] args) throws Exception {
    BlockingQueue<Byte> q1 = new LinkedBlockingQueue<>(), q2 = new LinkedBlockingQueue<>();
    THREAD_POOL.submit(() -> pingpong(q2, q1)); // thread A
    Future<?> f = THREAD_POOL.submit(() -> pingpong(q1, q2)); // thread B
    q1.put((byte) 1);
    System.out.println(f.get() + " ms");
}

private static long pingpong(BlockingQueue<Byte> in, BlockingQueue<Byte> out) throws Exception {
    long start = System.currentTimeMillis();
    for (int i = 0; i < 1_000_000; i++) out.put(in.take());
    return System.currentTimeMillis() - start;
}

Run the test to check the execution time:

$java PingPong
13212 ms

// Enable Wisp2
$java -XX:+UseWisp2 -XX:ActiveProcessorCount=1 PingPong
882 ms

After enabling Wisp2, the execution performance has increased by nearly ten times. What happened? The coroutines are on the same carrier, and they can be scheduled without the participation of the OS. The efficiency can be very high.

Background

In the history of Java, J2EE specifications provide guiding suggestions for Java Server Application Programming Models. Some specifications such as Servlet and JDBC are providing concurrency capabilities based on threads: Servlet containers usually use thread pools to handle concurrent services for multiple requests. During request processing, databases CRUD are invoked through JDBC, which requires blocking threads. However, this request is handled in a separate thread, so we don't need to worry about this kind of blocking.

In recent years, we can see that the server-side programming ecosystem has turned to asynchronously. For example, the asynchronous support of Servlet 3.0 and Vert.X all use asynchronous technology. The fundamental reason to promote the development of these technologies is that the rapid increase of the Internet, which rise the extreme requirements for single-machine processing capability. However, thread blocking models are usually limited by the scheduling capability of the OS on threads. The asynchronous model can starts only a small number of threads and perform batch callback processing for different requests, and minimize the intervention of OS in thread scheduling to achieve higher performance.

Java provides the fundamental I/O infrastructure such as NIO and AIO, which can help us create high-performance java server programs in a non-blocking event-driven model. In this context, excellent IO frameworks like Netty gives us a new choice to write pure Asynchronous Service. When programming in this kind of framework, we need to be very careful. Usually, callbacks of IO events occur in the IO thread. If a blocking call is accidentally introduced into the callback, as a result, subsequent I/O events can't be processed in time, which affects the processing of the entire event framework. Therefore, we introduce coroutine to reduce the difficulty of writing asynchronous programs.

Improve performance by using asynchronous and coroutine

private void writeQuery(Channel ch) {
    ch.write(Unpooled.wrappedBuffer("query".getBytes())).sync();
    logger.info("write finish");
}

In the example code sync() blocks current thread, which isn't expected. Since Netty itself is an asynchronous framework, we introduce callbacks:

private void writeQuery(Channel ch) {
    ch.write(Unpooled.wrappedBuffer("query".getBytes()))
      .addListener(f -> logger.info("write finish"));
}

There is only one callback for now. As the program becomes complex multiple layers of callbacks and control flow flipping will be introduced. Therefore, we need to reduce such callbacks with the help of coroutine. Let's take a look at an example of kotlin coroutine about how to help us simplify the program:

suspend fun Channel.aWrite(msg: Any): Int =
    suspendCoroutine { cont ->
        write(msg).addListener { cont.resume(0) }
    }

suspend fun writeQuery(ch: Channel) {
    ch.aWrite(Unpooled.wrappedBuffer("query".toByteArray()))
    logger.info("write finish")
}

Here introduces a special function suspendCoroutine, we can get a reference to the current Continuation, execute a piece of code, and finally suspend the current coroutine. Continuation represents the latter part of the current calculation, and we can restore the execution context through Continuation.resume(). Therefore, we only need to call back cont.resume() when the write operation is completed, and we return to the execution state at suspendCoroutine (including the complete call stack), the program continues to execute, the code returns, and the log is output. Looking into writeQuery, we have completed the asynchronous operation in a synchronous way. When the coroutine is switched by suspendCoroutine, the carrier thread can continue to schedule other executable coroutines for execution.

We only need a mechanism to yield/restore the execution context, and use non-blocking + callback in the blocking library function to yield/restore the coroutine, so that the program written in synchronous form can achieve the same performance as asynchronous.

JVM-based coroutine

Based on the above example, we can see that the following things need to be done to improve performance using coroutines:

  • Use the suspend keyword to mark the frame that needs to support yield.
  • Introduce an asynchronous framework and use the callback interface to encapsulate the coroutine synchronous interface

Based on JVM, we can omit the preceding steps:

  • The JVM clearly knows the layout of the stack and can also perform arbitrary operations on the stack. Therefore, all existing methods can support coroutine switching without recompiling with the state machine mechanism.
  • All blocking functions (Blocking IO, ObjectMonitor) are called through the JDK, so synchronous encapsulation can be made at the JDK level.

Wisp is based on JVM stack operation, combined with Runtime's supports for blocking methods, and added a scheduler to allow applications to obtain asynchronous performance without modification.

Scope

Based on the JKU coroutine of the DaVinci project, Wisp is a native context switching solution of JVM. It does not affect the efficiency of code execution and can suspend the coroutine without maintenance costs, completely transparent to application code.

Loom, as a future standard, proposes the VirtualThread API to allow users to explicitly create coincidences. Wisp2 chooses not to provide APIs, instead, and provides a new implementation mechanism for Thread (the standard does not specify the implementation method of Thread, which can be pthread or user-level Thread), from the user's point of view, it is fully compatible with the current Java standard. Orientations are different from each other from this perspective, and the reason is simple:

  • Loom: Shared stacks consume less memory. A new programming model that supports millions of coroutines.
  • Wisp2: Independent stacks which have a fast switching speed. Based on the existing programming model, performance is improved by reducing OS interventions and efficient scheduling transparency.

Summary:

  • Wisp2 is compatible with existing code writing methods without adding annotations or recompiling.
  • Wisp2 is compatible with java core libs because conversions from all threads to coroutines are required.
  • Wisp2 provides efficient coroutine switching and scheduling, because performance is the biggest demand.
  • Wisp2 does not intend to support a large number of coroutines because our goal is to make compatible with existing Java programming models.

Platform support

The Wisp2 implementation relies on the proprietary features of epoll and currently only supports Linux operating systems.

Terminology

  • Carrier: the actual executor of the code, corresponding to a thread provided by the operating system, including the context information (IO events and scheduled events) and data structures required for scheduling.
  • WispEngine: an executor composed of a group of carriers. After the coroutine is created, it is bound to an engine and is executed alternately by the carriers under the engine.
  • Scheduler: corresponds to a WispEngine. The cooperation between carriers and the steal policy is implemented by the scheduler.
  • Work-stealing: carriers steal tasks from each other to balance the queue length.
  • SysMonitor: SysMonitor monitors the execution status of tasks on each carrier and preempts the coroutines when no switch occurs for a long time.
  • Preemption: allows coincidences that run CPU code for a long time to actively yield CPU
  • Coroutine: the Coroutine provided by JKU, which mainly provides the switching capability.
  • WispTask: the scheduling structure required by and corresponds to a coroutine. Generally speaking, it is mapped to a Thread object.
  • EventPump: the event source, which is generally a fd event on the network. On Linux platforms, an epoll eventloop is used.
  • The AllThreadAsWisp mode: converts all Java threads to coroutines

Design & Implementation

Tiered design

  • Coroutine: an open-source project based on JKU, which provides stack switching capabilities.
  • WispEngine: depends on the yielding ability of coroutine, and provides interfaces such as park/unpark/registerEvent/timer for the core library.
  • Runtime supports: call the interfaces of WispEngine in standard libraries where JDK has a blocking operation or a synchronized keyword is being executed in slowpath.

Worker Threads

The Wisp implementation contains some specific threads:

  • WispSysmon: checks coroutines which are executing for too long a time
  • WispDaemon: maintains the Java daemon semantics after a thread is converted to a coroutine.
  • WispEventPump: listens for distribution events, which are inherited to WispCarrier in most cases.
  • WispCarrier: main worker thread that schedules and executes the business logic in the coroutine.

Scheduling

The carrier thread itself continuously pulls a coroutine from the workQueue, which will be scheduled for execution. In this case, the coroutine is bond to the carrier. When the coroutine needs to be suspended during execution, it will switch to the context of the carrier thread and continue to trigger schedule. When the carrier finds its queue empty, it will consider stealing tasks to help other carriers reduce pressure. Let's take JDBC running in coroutine as an example:

  1. The coroutine reads the socket at the bottom of the JDBC driver. At this time, the data packet has not been received. The coroutine needs to wait, so it
    1. registers the read time of this socket to eventPump
    2. switches to carrier logic to give up control flow
  2. The carrier scans the queue and finds that no tasks need to be executed. It suspends itself to save resources.
  3. The event on eventPump is ready to wake up the ECS instance.
    1. enqueues the coroutine
    2. wakes up the carrier thread
  4. Carrier takes the coroutine from the queue, continues to read the socket, and executes subsequent logic.

By using the preceding approach, coroutines live under the carrier thread share the computing resources of their master.

AllThreadAsWisp

After Wisp2 is enabled with -XX:+UseWisp2, a new thread implementation is provided. User code continues to use the Thread interface, while pthread (kernel-level thread) is replaced with WispTask (user-level Thread). image.png

Preemption

We have implemented a preemption mechanism to solve the problem when collaborative scheduling cannot give up CPU in time. When a coroutine cost too long a time in execution:

  • Sysmon detects that the coroutine execution time is too long.
  • Sysmon indicates that the carrier is preemptible.
  • Trigger a safepoint
  • Before the end of safepoint, the carrier checks whether preemption is required. If needed, the carrier actively inserts a yield operation, and the current coroutine gives up the control flow.

Compatibility

GC

Coroutines should support GC because it is needed for object references on each coroutine stack to be correctly handled by garbage collectors, of which G1 and CMS GC with Wisp have been evaluated on a large scale, yet other collectors are not. Nevertheless, Wisp implements GC APIs, where all objects on coroutine stack frames and the coroutine-local metadata are collected during garbage collecting, so it can support all kinds of garbage collectors in theory.

Threading

The ThreadLocal class is semantically equivalent to "CoroutineLocal" when running in Wisp. Wisp supports almost all APIs of the Thread class and generates a conversion to coroutine semantics for user transparency.

java.lang.Thread APIs

  • public static Thread currentThread(): Returns a reference to the currently executing coroutine object.
  • public StackTraceElement[] getStackTrace(): Returns an array of stack trace elements representing the stack dump of this coroutine.
  • public static void yield(): A hint to the scheduler that the current coroutine is willing to yield its current use of a processor.
  • public static void sleep(long millis) throws InterruptedException: Causes the currently executing coroutine to sleep (temporarily cease execution) for the specified number of milliseconds, subject to the precision and accuracy of system timers and schedulers. The coroutine does not lose ownership of any monitors.
  • public void start(): All newly opened Threads will be transformed to a new Wisp coroutine and the Thread.run() method will be converted to run in a coroutine: causes this coroutine to begin execution; the Java Virtual Machine calls the run method of this coroutine.
  • public void interrupt(): Interrupts this coroutine.
  • public static boolean interrupted(): Tests whether the current coroutine has been interrupted.
  • public boolean isInterrupted(): Tests whether this coroutine has been interrupted.
  • public final void setDaemon(boolean on): Marks this coroutine as either a daemon coroutine or a user coroutine.
  • public final boolean isDaemon(): Tests if this coroutine is a daemon coroutine.
  • public final void setName(String name): Changes the name of this coroutine to be equal to the argument name.
  • public final String getName(): Returns this coroutine's name.
  • public final boolean isAlive(): Tests if this coroutine is alive. A coroutine is alive if it has been started and has not yet died.
  • public final void join() throws InterruptedException: Waits for this coroutine to die.
  • public static boolean holdsLock(Object obj): Returns true if and only if the current coroutine holds the monitor lock on the specified object.
  • public long getId(): Returns the identifier of this coroutine.

Public APIs currently not supported:

  • public static void setDefaultUncaughtExceptionHandler(Thread.UncaughtExceptionHandler eh)
  • public void setUncaughtExceptionHandler(Thread.UncaughtExceptionHandler eh)
  • public static Map<Thread,StackTraceElement[]> getAllStackTraces()
  • public Thread.State getState()

java.lang.ThreadLocal APIs

All uses of methods in the ThreadLocal class are isolated at a coroutine level implicitly -- hence the needlessness for users to change any code.

java.lang.Object APIs

public final void wait() throws InterruptedException: Causes the current coroutine to wait until it is awakened, typically by being notified or interrupted. During this time, the current coroutine will be scheduled to another coroutine. public final void notify(): Wakes up a single coroutine that is waiting on this object's monitor. public final void notifyAll(): Wakes up all coroutines that are waiting on this object's monitor.

Locking

J.U.C

On account of the compatibility with Thread APIs, JUC locks (such as java.util.concurrent.locks.ReentrantLock, etc.) will automatically be converted to a coroutine level while in Wisp, for which users can normally make use of classes inside J.U.C concurrency packages.

ObjectMonitor

Inside the Java Virtual Machine, ObjectMonitor changes from thread-level object locks to coroutine-level object locks for maintaining the semantics, which behaves as follows:

  1. When an ObjectMonitor waits (parks) since it fails to acquire a lock, Wisp will automatically switch the control flow to the next scheduled coroutine to reduce the waiting time.
  2. Wisp supports coroutine-level ObjectMonitor in the interpreter, compilers, and runtime, also the work-stealing when waiting on it, and its deoptimization.

IO

  • Wisp supports BIO, also ServerSocketChannel.accept() and Selector.select() in NIO blocking mode.
  • The connect()/ read()/ write() in NIO blocking mode are not supported, for the above methods will not be used in blocking mode in frameworks such as Netty and Tomcat. We will support them in the near future.

Limitations

Biased Lock

The biased lock, which Wisp does not support due to the complexity of the internal implementation and the removal in the future, will be forcibly closed.

JNI blocking

If the user code is blocking in self-written JNI, which at this point does not have Wisp scheduling logic, it will block the entire carrier thread, resulting in a large number of coroutines that cannot be scheduled. Currently, most IO operations are carried out through Netty, JDBC, HttpClient, etc., hence, it is rare to encounter self-written blocking JNI, nevertheless, there are some exceptions:

  • Netty: Netty has a native-transport mode, which will directly block in Epoll JNI, see faq
  • UNIXProcess: If you use the Process interface to create a child process, the wait system call will be blocked. See exclude list for the solution

Disk IO

IO events are mainly monitored and distributed through epoll, which does not support disk file fd, hence the nonsupport for coroutines among this part of APIs. The carrier will be blocked once blocking happens while reading and writing disk files. In the follow-up, we will consider for support through io_uring.

Multi-core performance

Since most of the existing business scenarios use 4-16 core virtual machines and containers, Wisp has deeply optimized for these configurations. When the number of cores increases (such as 64 cores), a lot of resources will be wasted in terms of scheduling, which is our subsequent optimization goal.

Advanced options

The AllThreadsAsWisp exclude list

The -Dcom.alibaba.wisp.threadAsWisp.black option can prohibit some threads from being converted into coroutines, which can avoid some threads that are blocked in JNI for a long time from affecting coroutine scheduling. Three approaches are supported:

  • Thread name: start with name:, support wildcards, for example: -Dcom.alibaba.wisp.threadAsWisp.black=name:biz-thread-*
  • Package name: start with package:, wildcards are not supported
  • Class name: start with class:, match according to the type of Thread object or Runnable object

Support for combined usages, threads will not be converted into coroutines if they meet any of the conditions which are separated by ;, for example: -Dcom.alibaba.wisp.threadAsWisp.black=name:biz-thread-*;class=a.b.Foo;packge=a.b

Note: ";" is a special character of shell script, which needs to be transferred or quoted

Performance

Switching of Coroutines

Tested on a physical machine with a CPU frequency of 2.5GHz, coroutine switching can happen 14 million times per second on average. In a typical scenario with 2000qps and 10 RPCs per request on an 8-core machine, each core needs to be switched 20000 * 2 * 2/8 = 10000 times per second, every time it occupies about 1/1400 of the CPU, less than one-thousandth.

Web Framework Benchmarks

Techempower Web Framework Benchmarks is the authoritative test framework in the webserver field. Wisp mainly optimizes IO-intensive scenarios, hence, we use this framework for performance verification, and a 10% to 30% growth is obtained. For more details, please check the performance section of dragonwell-jdk.io

Applications for production environments

  • In complex business applications (Tomcat + a large number of Netty-based middlewares) we have achieved a performance improvement of about 10%.
  • In middleware applications (database brokers, MQ, etc.), we get about 20% performance improvement.

Stability

Wisp has undergone 4 years of R&D iteration. There are about 100 thousand instances(containers, etc.) running in the production environment for a long time, and it has been in active development for a long time, hence high stability.

FAQ

Q: Why the scheduling overhead of coroutines is smaller?

A: We have always emphasized that coroutines are suitable for IO-intensive scenarios, which means that usually tasks will be blocked on IO events for a short period of time, and then scheduled. In this case, as long as the system's CPU is not being fully occupied, using a simple FIFO(First-In First-Out) scheduling strategy can basically ensure a fair scheduling. At the same time, we use a completely lock-free scheduling implementation, which considerably reduces the scheduling overhead compared with the cost of kernel.

Q: Netty's native-transport blocks coroutine scheduling

A: You can also use -Dio.netty.noUnsafe=true, despite that, other unsafe functions might be affected. For netty version 4.1.25 and above, it is supported to close jni epoll only through -Dio.netty.transport.noNative=true, see 358249e5

Q: Why doesn't Wisp2 use the ForkJoinPool to implement scheduling

A: ForkJoinPool itself is excellent, nonetheless, it is not suitable for Wisp2 scenarios. For easy understanding, we can regard a coroutine wake up as an Executor.execute() operation. Although ForkJoinPool supports task stealing, the execute() is a random or a current thread queue operation (depending on whether it is asynchronous), which will cause a coroutine to be awakened in a random thread. The cost of a coroutine stealing operation is a bit high while in Wisp, so we need an affinity to keep the coroutine tied to a fixed thread as much as possible. Work-stealing only occurs when the thread is busy. We implemented the work-stealing strategy to support this feature. Judging from various indicators such as scheduling overhead/delay, the performance is basically the same as ForkJoinPool.

Troubleshooting

jstack

Wisp supports coroutine stacktrace display in jstack as the form below:

- Coroutine [0x7f227dae47e0] "http-nio--exec-5" #643 active=2394347 steal=378779 steal_fail=968 preempt=0 park=0/-1
  • http-nio--exec-5 shows the name of the thread converted into this coroutine
  • active indicates the number of times the coroutine is scheduled
  • steal indicates the number of times that work stealing has occurred
  • preempt indicates the number of coroutine preemptions
  • Stacktraces of this coroutine will be shown below
Clone this wiki locally