ZmnSCPxj » CLBOSS » Overall Architecture

Updated: 2022-04-12

Introduction

This writeup serves not only as a description of the overall architecture of CLBOSS, the C-Lightning Automated Node Manager, but also as an introduction to this entire series of writeups about CLBOSS.

First, notice that this document has an “Updated” line at the top. Generally, I think most of the architecture of CLBOSS is unlikely to change, but it can change over time, and this document might thus become dated. Thus, the “Updated” line will be present on all documents in this series. While I do not expect CLBOSS to change much in the meantime, if you are reading this several years after the “Updated” date, you might want to take this with some random number of grains of salt.

I intend to write out a few more documents after this, over the next few weeks or months. So, if as of now you are seeing just this document, stay tuned for more.

In this particular writeup, I will describe the overall architecture of CLBOSS. This document is fairly technical, as it mostly describes software architecture, which is generally of interest only to other programmers. However, I hope laymen may glean some understanding of the overall architecture, some of its advantages, and why CLBOSS is structured this way.

TLDR

In short: CLBOSS is built around a central publish-subscribe message bus, S::Bus, to which independent modules attach and over which they communicate. Concurrency is handled by a CPS type, Ev::Io<a>, which provides cheap cooperative greenthreads on top of a single-threaded event loop. Modules are roughly either “oracle” modules, which talk to the outside world, or “compute” modules, which only react to bus messages; the latter are easy to test in isolation by attaching them to a bus together with dummy modules.

The rest of this writeup is significantly more technical, though the other writeups in this series should not be as technical as the below.

S::Bus

namespace S { class Bus; }

At the very core of CLBOSS is a simple centralized signal bus, S::Bus. This bus is really just a massive centralized publish-subscribe pattern, intended for use in a single process rather than across processes or machines.

As a message signalling bus, the S::Bus lets its various clients .subscribe() to particular message keys, providing a function to be executed. When a message (which must have exactly one message key) is .raise()d, all functions registered to that key will then be invoked, but none of the other functions will.

CLBOSS is composed of modules, and each module is “attached” to the bus. By “attached” I mean that an instance of the module class is constructed, and given a non-const reference to the central S::Bus. The module constructor can then subscribe to particular messages on the bus, and keep a reference to the bus so the module can publish its own messages. The module itself is primarily composed of code and data that the module needs in order to handle its task.

Messages are keyed by C++ type using type introspection with C++ std::type_index. Thus, messages can be of any arbitrary C++ type (whether built-in or class-based); the only requirement imposed is that the type is moveable in C++11 terms (it need not be copyable).

#include <utility>

/* The message type T must be moveable; it need not be copyable. */
template<typename T>
void can_be_raised_on_s_bus(T& t) {
        T moved(std::move(t));
        (void) moved;
}

For CLBOSS specifically, the messages sent over the bus are always passive data with only public data members, not active objects with function members. However, some messages do have data members of type std::function; the typical intent is that the function itself is data to be passed around with the message, and it does not encapsulate the details of the message, only its own captured variables.

As a convention, CLBOSS modules are placed in the namespace Boss::Mod (i.e. all modules have a prefix of Boss::Mod::). Some generic classes that are useful across multiple modules are placed in the Boss::ModG namespace, the G meaning “generic”.

Data structures intended to be used as messages are placed in the namespace Boss::Msg. Such data structures are plain struct-like classes with no function members and only public data members.
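
To make these conventions concrete, below is a small, self-contained sketch of the pattern. Everything in it is illustrative only: the stand-in bus is greatly simplified compared to the real S::Bus, and the Msg::ExamplePing message and Mod::ExamplePingLogger module do not exist in CLBOSS; they merely show how a message struct and a module attached to the bus relate to each other.

#include <functional>
#include <iostream>
#include <typeindex>
#include <unordered_map>
#include <vector>

namespace S {
/* Stand-in bus: subscribers are stored per message type, keyed by
 * std::type_index as described above. */
class Bus {
private:
        std::unordered_map< std::type_index
                          , std::vector<std::function<void(void const*)>>
                          > subscribers;
public:
        template<typename Msg>
        void subscribe(std::function<void(Msg const&)> handler) {
                subscribers[std::type_index(typeid(Msg))].push_back(
                        [handler](void const* p) {
                                handler(*static_cast<Msg const*>(p));
                        }
                );
        }
        template<typename Msg>
        void raise(Msg msg) {
                for (auto const& handler : subscribers[std::type_index(typeid(Msg))])
                        handler(&msg);
        }
};
}

namespace Boss { namespace Msg {
/* Messages are plain structs: public data members, no member functions. */
struct ExamplePing { unsigned int count; };
}}

namespace Boss { namespace Mod {
/* A module keeps a reference to the bus and subscribes in its constructor. */
class ExamplePingLogger {
private:
        S::Bus& bus; /* kept so the module could raise its own messages later */
public:
        explicit ExamplePingLogger(S::Bus& bus_) : bus(bus_) {
                bus.subscribe<Msg::ExamplePing>([](Msg::ExamplePing const& m) {
                        std::cout << "ping #" << m.count << std::endl;
                });
        }
};
}}

int main() {
        S::Bus bus;
        Boss::Mod::ExamplePingLogger logger(bus);
        bus.raise(Boss::Msg::ExamplePing{42});
        return 0;
}

A real module would typically raise further Boss::Msg messages of its own in response, and, as described later, a unit test of a compute module follows essentially the same shape as main() above: construct a bus, attach the module under test, and raise messages at it.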

This central bus makes CLBOSS easily modular. The bus does not care which particular modules are attached. New modules can be added with little impact on modules that have other responsibilities, and old code can be removed by simply removing its module, again with little impact on unrelated modules. Modularity also means that behavior can be modified by injecting messages into the bus.

Execution

namespace Ev { template<typename a> class Io; }

The templated class Ev::Io<a> is used to represent pending executable code which will yield the templated type a later when execution completes.

Ev::Io<a> is a CPS monad, in Haskell terms (and is named after the Haskell IO a type). In JavaScript terms it is a Promise. In C terms, it represents a function that accepts a callback, and is tied to some context pointer for that function.

It uses a syntax inspired by the JavaScript Promise type: the .then() method / member function. The .then() is equivalent to >>= in Haskell. For an object of type Ev::Io<a> with any type a, the .then()/>>= will combine that object with a function that accepts a plain type a, and returns an object of type Ev::Io<b>, where b can be any type (and can be the same as, or different from, a).

The result of the .then() combinator is of type Ev::Io<b>, and executing the result is equivalent to:

  1. Executing the original Ev::Io<a> object.
  2. Extracting the resulting a (in callback terms, getting the resulting a via an argument to the callback).
  3. Executing the given function, which returns an Ev::Io<b>.
  4. Executing the resulting Ev::Io<b>, and yielding the resulting b.

So how do we “execute” an Ev::Io<a>? We invoke its .run() member function, which requires a callback (a function that takes a single argument of type a and returns nothing). When the Ev::Io<a> finishes execution, the callback gets invoked with the result.
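
To illustrate the semantics just described, here is a deliberately tiny, self-contained model of such a CPS type. It is not CLBOSS's actual Ev::Io (the real class is more elaborate; for one thing it must also deal with exceptions, as mentioned later, and it integrates with the main event loop); it only shows how .then() and .run() compose callbacks.

#include <functional>
#include <iostream>
#include <string>
#include <utility>

template<typename A>
class Io {
private:
        /* The pending computation: a function that accepts a callback. */
        std::function<void(std::function<void(A)>)> core;
public:
        explicit Io(std::function<void(std::function<void(A)>)> core_)
                : core(std::move(core_)) { }

        /* Execute the computation; the callback receives the result. */
        void run(std::function<void(A)> cb) const { core(std::move(cb)); }

        /* Sequencing: run this, feed the result to f, run the Io f returns. */
        template<typename B>
        Io<B> then(std::function<Io<B>(A)> f) const {
                auto self = *this;
                return Io<B>([self, f](std::function<void(B)> cb) {
                        self.run([f, cb](A a) {
                                f(std::move(a)).run(cb);
                        });
                });
        }
};

/* Wrap an already-available value into an Io. */
template<typename A>
Io<A> lift(A a) {
        return Io<A>([a](std::function<void(A)> cb) { cb(a); });
}

int main() {
        lift(std::string("hello"))
                .then<std::string>([](std::string s) {
                        return lift(s + ", world");
                })
                .run([](std::string s) { std::cout << s << std::endl; });
        return 0;
}

Executing the chained result in main() above follows exactly the four steps listed earlier: the first Io runs, its result is handed to the given function, and the Io that function returns is then run with the final callback.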

At its heart, this is a CPS monad, i.e. this is basically a nice syntax for C code using callbacks. If you have seen some of the more complex C plugins built into C-Lightning, such as multifundchannel, then you know how much of a PITA callback-using style is to code in C. The .then() syntax, together with C++11 lambda functions, makes it look significantly nicer: there is no need to write multiple short file-local functions, each with its own long declaration boilerplate.

This is important since callback-using style (i.e. continuation-passing style or CPS) allows greenthreads to be implemented in userspace. Greenthreads are like threads, but context switching is cooperative rather than preemptive, and launching a new greenthread is cheap: it is just an Ev::Io<a> having its .run() member function invoked, and it does not involve a context switch from userspace to kernelspace, the OS updating its thread tables, or the allocation of a fresh (and large) C stack. Greenthreads are thus lightweight. In fact, some higher-level languages use greenthreads in their runtime for concurrency, and encourage programmers to launch new greenthreads for all tasks.

CLBOSS makes extensive use of multiple parallel greenthreads running at the same time. All greenthreads in CLBOSS run in a single process-level thread, the main thread of the CLBOSS process.

By using greenthreads, we avoid getting bogged down by the heavy weight of actual preemptive OS-level threads. Preemptive OS threads are not only heavyweight, but also require proper mutex usage. All greenthreads run on the same main process thread, so they do not require mutexes: as long as you do not run an Ev::Io<a> that blocks and returns to the main loop, you have exclusive access to all memory, and variables will not change out from under you.

Another advantage is that it makes integrating into event loops much neater. Suppose we want to wait for an input to arrive on some pipe or socket. If we were to read() directly, then the entire main thread blocks and the rest of CLBOSS will stop executing. However, because Ev::Io<a> accepts a callback, we can instead make a read()-like function that results in an Ev::Io<a>. That Ev::Io<a> object will then take its callback, and register it into the event loop as “waiting for this pipe / socket to be ready for reading”, and then return without invoking any callback. This causes execution to resume back to the top-level main event loop, which then handles the new registration and adds it to its select() or poll() or whatever. Then when the event loop triggers the waiting event, the callback is invoked, and the rest of the greenthread resumes execution. The main event loop can thus handle large numbers of greenthreads executing simultaneously.
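
Below is a minimal, self-contained illustration of that “register a callback instead of blocking” idea, with the Ev::Io machinery stripped out. The Loop class and async_read_line function are purely hypothetical stand-ins; CLBOSS actually uses libev (see below) and wraps such operations in Ev::Io.

#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <utility>

/* A toy "event loop": just a queue of callbacks that are ready to run. */
struct Loop {
        std::queue<std::function<void()>> ready;
        void post(std::function<void()> f) { ready.push(std::move(f)); }
        void run() {
                while (!ready.empty()) {
                        auto f = std::move(ready.front());
                        ready.pop();
                        f();
                }
        }
};

/* A read()-like operation that does not block: it registers its
 * continuation with the loop and returns immediately. */
void async_read_line(Loop& loop, std::function<void(std::string)> cb) {
        loop.post([cb]() {
                /* A real loop would invoke this only once the descriptor
                 * is actually readable. */
                cb("simulated input");
        });
}

int main() {
        Loop loop;
        async_read_line(loop, [](std::string line) {
                std::cout << "got: " << line << std::endl;
        });
        /* Control returned here without blocking; the loop later resumes
         * the waiting greenthread by invoking its callback. */
        loop.run();
        return 0;
}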

This style allows CLBOSS to be coded as if everything were blocking, even though at the low level we actually treat everything as asynchronous. The Ev::Io<a>, plus a small amount of glue code to connect to the main loop, serves as a bridge between our apparently blocking, multithreaded surface syntax and our actual asynchronous, non-blocking, single-threaded event loop implementation.

CLBOSS uses the libev library for its main loop, hence why the Ev::Io<a> type is in the Ev namespace. However, the actual type itself does not strictly require libev and can be trivially adapted to any event system main loop library. Aspiring C++ template metaprogrammers should check out how it implements type introspection of function return types, too (warning: template metaprogramming can drive you insane, learn at your own risk, Ph'nglui mglw'nafh template metaprogramming R'lyeh wgah'nagl fhtagn).

CLBOSS is an automated node manager, and a node manager needs to do many things; even if it fails at one of those things, it should still try its best to handle all the others. Greenthreads make it easy for CLBOSS to handle many things at once, and exceptions are specially handled so that an exception thrown in one greenthread does not bring down other greenthreads. This makes CLBOSS resilient against failures in one of its tasks; its other tasks remain operational.

Oracle Modules And Compute Modules

Roughly speaking, we can classify CLBOSS modules into two large categories:

  1. Oracle modules connect to the outside world and cause messages to occur on the bus in reaction to outside-world events, or respond to bus messages by manipulating the outside world.
  2. Compute modules only “think”: they listen only to on-bus messages, possibly store some local state (or possibly on-database state), perform a computation, and emit messages on the bus, depending on what the module is intended to do.

(OOP: There are no base classes for modules, as I prefer composition over inheritance. Thus, there are also no base classes for oracle or compute modules. Whether a module is oracle or compute is based on whether it is possible to test it only by attaching it to the message bus (i.e. compute) together with dummy modules, or if we need to somehow emulate some real outside world (i.e. oracle).)

An example of an oracle module is Boss::Mod::Timers. This module simply waits for certain set amounts of time, and then launches a new greenthread to raise a timer message. This module is responsible for emitting the periodic timer messages: a 10-minute timer message, a random hourly timer message, and a random daily timer message.

A lot of “oracle” modules actually just listen for one of the timer events, then execute RPC commands to the managed C-Lightning node in order to check its status. Based on that status, the module then emits messages reporting the state of the C-Lightning node.

A new greenthread is launched just to broadcast each of the messages. This makes the Boss::Mod::Timers module robust: any exception thrown by code triggered by the timers will not crash the timer loops, so the timer will always raise its message at the next scheduled time. This makes CLBOSS resilient against temporary failures that were not handled properly by my code; if something triggered by a timer fails now, then Boss::Mod::Timers will trigger it again later, hopefully with the failure already resolved, and other parts of the system will continue to operate normally.

There are random hourly and random daily messages; for example, the random hourly message is broadcast between 30 and 90 minutes after the previous broadcast. The intent is to be resilient against observable time-related behavior that might be exploited by sentient nodes / node managers in attacks. No such attacks are currently known, but it helps to be prepared for them by making at least time-based attacks harder through some randomization. The 10-minute timer always has a fixed duration between broadcasts and is intended for monitoring the node, while changes to the node state (feerates, rebalances, etc.) are triggered by the random hourly and/or random daily messages.

Having a separate “compute” category makes testing of algorithms much simpler, in combination with the central bus architecture. A test of a single compute module involves just instantiating the S::Bus together with the module under test, plus some dummy modules that emulate the other modules the compute module would interact with. Since the oracle modules can be replaced, it is easy to provide arbitrary stimulus to the compute module under test, broadcasting arbitrary messages and checking for arbitrary responses from the module under test. In short, this just showcases the advantage of the dependency inversion principle: instead of compute modules depending directly on oracle modules, or on other compute modules, they all depend on a common interface, the central message signalling bus. The central bus then eases testing of the module, and replacing its dependencies with alternate versions.
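
As a concrete (if simplified) illustration of such a test, the sketch below uses the same kind of stand-in bus as the earlier sketch, with namespaces omitted for brevity. The module under test reacts to a hypothetical ExampleRequest message by raising a hypothetical ExampleResponse message, and the “dummy module” is just a lambda that records what it sees; none of these names exist in CLBOSS.

#include <cassert>
#include <functional>
#include <typeindex>
#include <unordered_map>
#include <vector>

/* Same stand-in bus as in the earlier sketch, compressed. */
class Bus {
        std::unordered_map< std::type_index
                          , std::vector<std::function<void(void const*)>>
                          > subscribers;
public:
        template<typename M>
        void subscribe(std::function<void(M const&)> h) {
                subscribers[std::type_index(typeid(M))].push_back(
                        [h](void const* p) { h(*static_cast<M const*>(p)); }
                );
        }
        template<typename M>
        void raise(M m) {
                for (auto const& h : subscribers[std::type_index(typeid(M))])
                        h(&m);
        }
};

/* Hypothetical messages. */
struct ExampleRequest  { int value; };
struct ExampleResponse { int doubled; };

/* The "compute" module under test: pure bus-in, bus-out logic. */
class Doubler {
private:
        Bus& bus;
public:
        explicit Doubler(Bus& bus_) : bus(bus_) {
                bus.subscribe<ExampleRequest>([this](ExampleRequest const& m) {
                        bus.raise(ExampleResponse{m.value * 2});
                });
        }
};

int main() {
        Bus bus;
        Doubler module_under_test(bus);

        /* "Dummy module": just records responses seen on the bus. */
        std::vector<int> seen;
        bus.subscribe<ExampleResponse>([&seen](ExampleResponse const& m) {
                seen.push_back(m.doubled);
        });

        /* Stimulate the module by raising the message an oracle module
         * would normally raise, then check its reaction. */
        bus.raise(ExampleRequest{21});
        assert(seen.size() == 1 && seen[0] == 42);
        return 0;
}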

Testing the oracle modules is not as straightforward, since we would need to somehow emulate the outside world that the oracle modules talk to. However, it is acceptable, in this context, to use “testing in the field”, i.e. just testing oracle modules by running CLBOSS on a live system. Oracle modules tend to have much, much simpler code; they are effectively “just” translators between the outer world and the inner S::Bus world. If something happens on the node that is supposed to be noticed by an oracle module, and it does not emit a message (i.e. none of the already-well-tested compute modules react to the event), then we know the oracle module is at fault, and scanning through its code is usually sufficient, since an oracle module is typically a straightforward translator.

Non-idealities

The architecture described here evolved during the development of the actual CLBOSS program itself. This means that some code in CLBOSS was written before parts of the architecture had solidified.

For example, the central bus, and the dependency inversion it represents, implies that modules should not have references or pointers to other modules. Modules should have references or pointers to (i.e. depend on) only the abstract bus, not on any concrete modules.

However, the Boss::Mod::Rpc module was written before this principle was extended to all parts of the overall design.

The Boss::Mod::Rpc module exposes the .command() member function, which is accessible only via a direct reference to the actual module. Thus, modules that want to send RPC commands to the C-Lightning node being managed have to acquire a pointer to the RPC object.

A better design would have messages to initiate an RPC command, and a message to represent the RPC command result, and have the Boss::Mod::Rpc instead listen for those messages over the bus. The convenience of the .command() member function can be reacquired by implementing a “proxy” in the Boss::ModG namespace; any RPC-using module could then instantiate this proxy, which would provide a .command() member function that emits the do-this-RPC-command message and listens for the corresponding result.
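
A sketch of what such messages might look like follows. The names and fields are hypothetical, chosen only to illustrate the proposed design; they are not CLBOSS's actual types.

#include <cstdint>
#include <string>

namespace Boss { namespace Msg {

/* Hypothetical: ask the RPC module to execute a command.  The requester
 * generates a unique id so it can match the eventual response. */
struct ExampleRequestRpcCommand {
        std::uint64_t id;
        std::string command;     /* e.g. "listpeers" */
        std::string params;      /* JSON-encoded parameters */
};

/* Hypothetical: raised by the RPC module when the command completes. */
struct ExampleRpcCommandResult {
        std::uint64_t id;        /* matches the request above */
        bool success;
        std::string result;      /* JSON-encoded result, or error on failure */
};

}}

int main() {
        /* Example of constructing a request (purely illustrative). */
        Boss::Msg::ExampleRequestRpcCommand req{1, "listpeers", "{}"};
        (void) req;
        return 0;
}

The Boss::ModG proxy described above would then raise the request message, subscribe to the result message, and match responses to its own requests by id, giving RPC-using modules back a convenient .command()-style interface while keeping all RPC traffic visible on the bus.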

This would make some “oracle” modules into “compute” modules, since they can now be tested independently of the real Boss::Mod::Rpc, due to the dependency decoupling. It would also allow other modules to introspect on what commands are being sent by the rest of CLBOSS, by simply listening for the RPC messages on the bus.