Fighting the Client Spaghetti Monster with Rust Traits

tl;dr: In Rust, “trait composition” is a neat way to keep code clean and avoid spaghettification in places where a lot of components come together and need to be piped up.

Introduction

A major part of my almost two-decade-long career in programming has been spent working on “SDKs” in Rust, by which I mean building and maintaining complex systems as libraries that other developers use to implement applications on top of. I did this back at Immmer (now defunct), for Parity with Substrate Core/Client as well as its inner on-chain application SDK, then on the matrix-rust-sdk and, last but not least, at Acter for the Acter App and then the Zoe (relay) system.

For a while, but especially during the latest iteration, I have been wondering about that highest layer of the architecture. How to design that client, where all these subcomponents are piped together. How to design it in a way that stays flexible for yourself as well as others, yet robust and ideally testable. How to avoid spaghettification of the client, even if the underlying components are complex trait-based systems themselves.

As we have a lot of surface area to cover, I will not be discussing traits themselves too much (check the corresponding chapter in the excellent Rust book if you are looking for that) but assume you understand traits and trait bounds and have implemented them in Rust. I will throw around some almost-real code and examples without asking and expect the reader to be able to parse and understand them without much help, as I want to focus on the higher-level “how do we use this” architecture perspective.

Traits in SDKs

As with any big task, the best way to tackle it is by splitting it into smaller, manageable tasks and implementing these one by one. The same is true for building up large SDKs. Oftentimes they contain various components, like a storage layer; network or communication components; some internal state machine for the actual domain-specific logic; and maybe some developer-facing API or even UI components. To make implementing more manageable, it is commonplace to split them up into separate, independent components, sometimes even separate crates, and provide an outer interface.

In the SDK world you often find that these components need to be pluggable internally themselves, though. A storage component, for example, might be implemented with an embedded SQLite for mobile apps, with some SQL backend service or NoSQL database on the server, and with IndexedDB in the browser (with Wasm). Generally, the outer composed system doesn’t really have to care which of these is being used, and thus it can be up to that component to define that. A common way to provide this abstraction is by defining a trait for that lowest layer and having these various specific parts implement it. Then the higher layer, and also the layers on top, can focus on their specific side of things.
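To make that concrete, here is a minimal sketch of such a lowest-layer trait with two interchangeable backends. All names (Storage, InMemoryStorage, NullStorage, remember_greeting) are made up for illustration and are not taken from any of the SDKs mentioned; the in-memory map merely stands in for a real database.

```rust
use std::collections::HashMap;

/// Hypothetical lowest-layer interface; all names here are made up.
pub trait Storage {
    fn put(&mut self, key: &str, value: Vec<u8>);
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// Stand-in for an embedded backend (think SQLite in a mobile app).
#[derive(Default)]
pub struct InMemoryStorage {
    entries: HashMap<String, Vec<u8>>,
}

impl Storage for InMemoryStorage {
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.entries.insert(key.to_owned(), value);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.get(key).cloned()
    }
}

/// Stand-in for a second backend; this one deliberately keeps nothing.
#[derive(Default)]
pub struct NullStorage;

impl Storage for NullStorage {
    fn put(&mut self, _key: &str, _value: Vec<u8>) {}

    fn get(&self, _key: &str) -> Option<Vec<u8>> {
        None
    }
}

/// Higher-layer logic only knows the trait, never the backend.
pub fn remember_greeting<S: Storage>(store: &mut S) -> Option<Vec<u8>> {
    store.put("greeting", b"hello".to_vec());
    store.get("greeting")
}
```

The higher-level function compiles against either backend unchanged, which is the whole point of putting the trait at the bottom.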

This also nicely allows implementations that bring their own dependencies to be pulled in only when needed, or compiled only for the targets that actually use them, and lets you introduce new implementations into production gradually via feature flags. It’s a pretty neat way of organizing the code. In the Matrix SDK we have that layer for implementing storage, for example, and, though not strictly because of the trait, the SDK even provides a macro that generates the entire test suite for you to run against your custom implementation.

To the mock

Having these traits brings another nice benefit: mocking. As the higher-level components might have their own logic (like caching or ordering or something), testing them often requires setting up the lower-level component(s) as well. If, instead, you defined that interface in a trait, you can implement various mock types to test a range of scenarios for your functions and focus on this specific logic. What sounds tedious at first becomes a breeze with the help of crates like mockall. It’s a lot easier and often faster than setting up that lower-level layer just to test that the component pulls the objects from the store and returns them sorted regardless of the underlying order.
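As a rough sketch of that last scenario, here is a hand-written mock rather than one generated by mockall, so the example stays free of dependencies. ItemStore, SortedView and MockStore are hypothetical names invented for this illustration.

```rust
/// Hypothetical store interface for this sketch.
pub trait ItemStore {
    fn load_all(&self) -> Vec<u32>;
}

/// The logic under test: always hand the items back sorted.
pub struct SortedView<S: ItemStore> {
    store: S,
}

impl<S: ItemStore> SortedView<S> {
    pub fn new(store: S) -> Self {
        Self { store }
    }

    pub fn items(&self) -> Vec<u32> {
        let mut items = self.store.load_all();
        items.sort_unstable();
        items
    }
}

/// Hand-written mock that returns a fixed, deliberately unsorted list.
pub struct MockStore {
    pub canned: Vec<u32>,
}

impl ItemStore for MockStore {
    fn load_all(&self) -> Vec<u32> {
        self.canned.clone()
    }
}
```

A test then only needs something like SortedView::new(MockStore { canned: vec![3, 1, 2] }) and an assertion on items(), with no database anywhere in sight.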

Middleware-ing

Similarly, by having the traits define the interfaces, you can add functionality nicely in a middleware-like fashion, similar to what is done in many web servers. Think of a caching layer on top of the database as an example. That caching layer can wrap anything implementing the trait while also implementing the trait itself. That way you can implement an LRU cache or something, regardless of the underlying storage type. As the interface is just the same trait again, you can mock the lower layer, ensuring good test coverage of exactly what this layer does. Further, you can just plug this “middleware” into the higher-level layer without any further changes. This is how we implemented a storage layer for the Rust SDK that splits off media storage (before that was added to the SDK itself) and keeps it at a different path (in the mobile’s “cache” directory), for example, while passing along everything else to whatever inner database system was being used otherwise (e.g., SQLite).
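A minimal sketch of the wrapping idea, under some simplifying assumptions: the cache below is a plain unbounded map rather than an LRU, and Storage, Cached and CountingStore are hypothetical names. CountingStore plays the mocked lower layer and counts how often it is actually hit.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;

/// Hypothetical read-only storage interface for this sketch.
pub trait Storage {
    fn get(&self, key: &str) -> Option<String>;
}

/// Middleware: wraps any `Storage` and is itself a `Storage`.
/// Simplification: an unbounded map, not a real LRU cache.
pub struct Cached<S: Storage> {
    inner: S,
    cache: RefCell<HashMap<String, Option<String>>>,
}

impl<S: Storage> Cached<S> {
    pub fn new(inner: S) -> Self {
        Self { inner, cache: RefCell::new(HashMap::new()) }
    }

    pub fn inner(&self) -> &S {
        &self.inner
    }
}

impl<S: Storage> Storage for Cached<S> {
    fn get(&self, key: &str) -> Option<String> {
        // Clone the cached entry so the borrow ends before we insert.
        let cached = self.cache.borrow().get(key).cloned();
        if let Some(hit) = cached {
            return hit;
        }
        let value = self.inner.get(key);
        self.cache.borrow_mut().insert(key.to_owned(), value.clone());
        value
    }
}

/// Mock lower layer that counts how often it is asked.
#[derive(Default)]
pub struct CountingStore {
    pub calls: Cell<u32>,
}

impl Storage for CountingStore {
    fn get(&self, key: &str) -> Option<String> {
        self.calls.set(self.calls.get() + 1);
        Some(format!("value-for-{key}"))
    }
}
```

Because Cached<S> implements the same trait it wraps, it can be stacked on any backend or on another middleware, and the counter makes it easy to assert that a second lookup never reaches the inner store.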

But specific, sometimes

Now, on the traits you only want to expose the common interface, of course. But specific implementations sometimes still have APIs to fine-tune or configure certain things, like the path for the SQLite database. You don’t want to put these on the traits, as they are implementation-specific and pointless for other implementations. But as traits are implemented on specific types, your concrete types can still add these helper functions, and as the higher-level API / SDK you often just use feature flags to then expose them or not.
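Sketched in code, that could look roughly like this. SqliteStorage here is only a stand-in that remembers a path (no real database is opened), and the feature name mentioned in the comment is hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical common interface shared by all backends.
pub trait Storage {
    fn backend_name(&self) -> &'static str;
}

/// Stand-in for a SQLite-backed store; no real database is opened here.
pub struct SqliteStorage {
    path: PathBuf,
}

impl SqliteStorage {
    pub fn open(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into() }
    }

    /// Implementation-specific, so it lives on the type, not the trait.
    pub fn database_path(&self) -> &Path {
        &self.path
    }
}

impl Storage for SqliteStorage {
    fn backend_name(&self) -> &'static str {
        "sqlite"
    }
}

pub struct Client<S: Storage> {
    storage: S,
}

impl<S: Storage> Client<S> {
    pub fn new(storage: S) -> Self {
        Self { storage }
    }

    /// Available for every backend.
    pub fn backend_name(&self) -> &'static str {
        self.storage.backend_name()
    }
}

// In a real crate this block would probably sit behind something like
// `#[cfg(feature = "sqlite")]` (a hypothetical feature name).
impl Client<SqliteStorage> {
    /// Only exists when the client is backed by `SqliteStorage`.
    pub fn database_path(&self) -> &Path {
        self.storage.database_path()
    }
}
```

Calling database_path() on a client with any other backend is a compile error, which is exactly the behavior you want from an implementation-specific knob.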

Composing over many traits

Now that you understand the complexity and usage of these subcomponents, think about how you tie them all together in the Client. It needs to connect these components and move messages from one component to another, for example to get the messages that just came in from the network to the internal state machine, and to hand the results from the state machine to the storage layer so that it persists some of these changes. Of course you want the client to be as flexible over the specific implementations as possible: most of that higher-level code doesn’t really differ whether the message comes from LoRa, over QUIC or libp2p. It doesn’t matter to the client whether it will be stored in an SQLite database or IndexedDB either.

But at times you have interdependencies, so the Rust compiler needs to make sure that the message type the network layer returns is the one that the state machine accepts. This is where things often spaghettify.
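To illustrate, a first version of such a client might look roughly like the following. This is a made-up sketch (the traits, QueueNetwork, EvenOnly and VecStorage are hypothetical), with a shared message type parameter tying the three components together.

```rust
use std::marker::PhantomData;

pub trait MessageT {}
pub trait NetworkT<M: MessageT> {
    fn next_message(&mut self) -> Option<M>;
}
pub trait StateMachineT<M: MessageT> {
    /// Returns true when the message should be persisted.
    fn process(&mut self, message: &M) -> bool;
}
pub trait StorageT<M: MessageT> {
    fn store(&mut self, message: &M);
}

/// Four generics already, and the bounds are repeated on every impl.
pub struct Client<M, N, SM, ST>
where
    M: MessageT,
    N: NetworkT<M>,
    SM: StateMachineT<M>,
    ST: StorageT<M>,
{
    pub network: N,
    pub state_machine: SM,
    pub storage: ST,
    _message: PhantomData<M>,
}

impl<M, N, SM, ST> Client<M, N, SM, ST>
where
    M: MessageT,
    N: NetworkT<M>,
    SM: StateMachineT<M>,
    ST: StorageT<M>,
{
    pub fn new(network: N, state_machine: SM, storage: ST) -> Self {
        Self { network, state_machine, storage, _message: PhantomData }
    }

    /// Drains the network and returns how many messages were persisted.
    pub fn pump(&mut self) -> usize {
        let mut stored = 0;
        while let Some(message) = self.network.next_message() {
            if self.state_machine.process(&message) {
                self.storage.store(&message);
                stored += 1;
            }
        }
        stored
    }
}

// Tiny stand-ins so the sketch can be exercised.
impl MessageT for u32 {}

pub struct QueueNetwork(pub Vec<u32>);
impl NetworkT<u32> for QueueNetwork {
    fn next_message(&mut self) -> Option<u32> {
        self.0.pop()
    }
}

pub struct EvenOnly;
impl StateMachineT<u32> for EvenOnly {
    fn process(&mut self, message: &u32) -> bool {
        *message % 2 == 0
    }
}

#[derive(Default)]
pub struct VecStorage(pub Vec<u32>);
impl StorageT<u32> for VecStorage {
    fn store(&mut self, message: &u32) {
        self.0.push(*message);
    }
}
```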

At the beginning that feels reasonable, but over time it grows, and the more things are pluggable, the more generics you need to add. The client needs one generic, then another, then another… Moving from single letters to entire words, then running out of words. Sooner than you think, it becomes incomprehensible to follow. Not to mention that ever-increasing tree of trait bounds you have to keep around everywhere you expose that client, which is your main external API surface area, so you expose it a lot. Brave are those who then need to add another bound (like Send) to any of the lower traits…

“There must be a better way”, you think to yourself …

The three paths of the enlightenment

As always, you have a few options, each with its own benefits and trade-offs, to manage this more nicely. You can Box<dyn Trait> it, use type aliases or compose a trait with associated types. Let’s look at them one by one.

Type alias

The first thing that probably comes to mind is to alias some of the type definitions to make things a bit cleaner. So you’d still have some components that are generic over some sub-traits, like struct GenericStateMachine<S: StateT, M: MessageT>, which implements most of the concrete logic, but then for the production environment you have an alias type NativeClientStateMachine = GenericStateMachine<NativeState, TcpMessage>; that you could use.

Depending on how you organize your code, the final client could really end up being a type NativeTcpClient = GenericClient<NativeClientStateMachine, NativeClientStorage, TcpProtocol>; itself. And you could even have a builder that, depending on the target, returns one or the other type, with both having the same API implemented via the traits.

impl Builder {
    // Rust's wasm targets report `target_arch = "wasm32"`
    #[cfg(target_arch = "wasm32")]
    pub fn build() -> Result<WasmClient>{
        //
    }

    #[cfg(not(target_arch = "wasm32"))]
    pub fn build() -> Result<NativeTcpClient>{
        //
    }
}

impl<StateMachine, Storage, Protocol> GenericClient<StateMachine, Storage, Protocol> {
    pub fn state_machine(&self) -> &StateMachine {
        //
    }

    pub fn storage(&self) -> &Storage {
        //
    }
}

This gives you all the benefits of having the concrete types, including access to the actual types, so the consumer’s code could even do implementation-specific calls, and it would fail to compile if they tried to do that against a type that doesn’t implement those (e.g. because they picked a different target arch). Of course this only works as long as the compiler doesn’t force you to specify which exact type you are expecting but can still infer that itself.

However, you end up with rather lengthy type alias lists you need to manage, especially if you do the wrapping of middlewares I described before, and these can be hard to parse and follow. Just check this ZoeClientAppManager, which itself wraps a bunch of aliases.

pub type ZoeClientStorage = SqliteMessageStorage;
pub type ZoeClientSessionManager = SessionManager<ZoeClientStorage, ZoeClientMessageManager>;
pub type ZoeClientGroupManager = GroupManager<ZoeClientMessageManager>;
pub type ZoeClientAppManager =
    AppManager<ZoeClientMessageManager, ZoeClientGroupManager, ZoeClientStorage>;
pub type ZoeClientMessageManager = MultiRelayMessageManager<ZoeClientStorage>;
pub type ZoeClientBlobService = MultiRelayBlobService<ZoeClientStorage>;
pub type ZoeClientFileStorage = FileStorage<ZoeClientBlobService>;

Navigating this tree isn’t easy. Especially when debugging you can easily end up at the wrong layer and wonder why your changes aren’t showing up.

`dyn Trait`s

A common idea that might come to mind is to wrap the specific implementation in a new type that holds it internally as a dyn Trait, if the trait can be made dyn compatible (formerly known as “object safe”). In practice the type most likely must be wrapped in either Box, Arc or similar; if that is what is happening already anyway, then this might not be a problem. If dynamic dispatch is not too much of an overhead, this could be a viable solution.

This is exactly how the Matrix Rust SDK implements the storage layer: by wrapping the specific implementation into an Arc<dyn StateStore> and then exposing a StateStore interface without any generics.

But dyns come with another drawback: the compiler forgets all notion of the concrete type. While this can be cheaper in terms of code size (as generic functions aren’t repeated for each type), it also means that our specific type “is gone”. Any other methods that this type implements outside of the trait become inaccessible. In the Matrix SDK for storage, that seems to be acceptable, as the only implementation-specific tuning happens in the builder setup before it is passed to the StateStore.

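But something as simple as getting implementation-specific configuration parameters back from that type at runtime is no longer possible through the trait object, even if the concrete type implements it and you know for certain which type it is. The only way back is an explicit downcasting escape hatch (for example via std::any::Any), which the trait has to be designed for up front.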
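A small sketch of both sides of that trade-off. KvStore, MemoryStore and Store are hypothetical names for illustration and not the actual Matrix SDK types; the point is only the shape of the wrapper.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical store trait. It is dyn compatible: its only method
/// takes `&self` and has no generic parameters.
pub trait KvStore: Send + Sync {
    fn get(&self, key: &str) -> Option<String>;
}

pub struct MemoryStore {
    label: String,
    entries: HashMap<String, String>,
}

impl MemoryStore {
    pub fn new(label: &str) -> Self {
        let mut entries = HashMap::new();
        entries.insert("answer".to_owned(), "42".to_owned());
        Self { label: label.to_owned(), entries }
    }

    /// Implementation-specific helper that is not part of the trait.
    pub fn label(&self) -> &str {
        &self.label
    }
}

impl KvStore for MemoryStore {
    fn get(&self, key: &str) -> Option<String> {
        self.entries.get(key).cloned()
    }
}

/// No generics on the outside: the concrete type is erased.
#[derive(Clone)]
pub struct Store {
    inner: Arc<dyn KvStore>,
}

impl Store {
    pub fn new(store: impl KvStore + 'static) -> Self {
        Self { inner: Arc::new(store) }
    }

    pub fn get(&self, key: &str) -> Option<String> {
        // `self.inner.label()` would not compile here: `dyn KvStore`
        // only knows the methods declared on the trait.
        self.inner.get(key)
    }
}
```

Any tuning via label()-style helpers has to happen before the value is handed to Store::new, which mirrors the builder-time configuration described above.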

Trait Composition

If dynamic dispatch isn’t feasible or the specific types still need to be available, and that alias list grows too long and becomes too tedious to update, you might come up with a trait combining all the types; I call this a composing trait. Rather than having a generic client with an ever-growing list of generics, you define a trait that defines the specific types via associated types. This is what we have been doing in the Parity SDK and on-chain wasm state machine.

The idea is to create a new trait Configuration that defines all the requirements as associated types and have a client only reference that trait now. It can still return aliased or sub-types that are generic, but are then for that specific configuration. Like this:

pub trait Configuration {
    type Storage: StorageC;
    type Network: NetworkC;
    type StateMachine: StateMachineC;
}

impl<C: Configuration> Client<C> {
    pub fn state_machine(&self) -> &GenericStateMachine<C::StateMachine> {
        //
    }
    pub fn storage(&self) -> &GenericStorage<C::Storage> {
        //
    }
}

Unfortunately, in reality this is rarely as clean. Often you find yourself needing to define the interdependencies as well. For example: the network needs to give you a specific MessageT that the state machine also actually understands. Even if you use a trait here, the compiler will enforce that you use the same type. As a result, you end up with even very low-level trait definitions popping up on your highest level configuration so that you can cross reference them via the associated types:

trait MessageT: Sized {}
trait StorageC {
    type Message: MessageT;

    fn store(&self, message: &Self::Message) -> bool;
}
trait NetworkC {
    type Message: MessageT;

    fn next_message(&self) -> Option<Self::Message>;
}
trait StateMachineC {
    type Message: MessageT;
    type Storage: StorageC<Message = Self::Message>;

    fn process(&self, message: &Self::Message);
}

trait Configuration {
    type Message: MessageT;
    type Storage: StorageC<Message = Self::Message>;
    type Network: NetworkC<Message = Self::Message>;
    type StateMachine: StateMachineC<Storage = Self::Storage, Message = Self::Message>;

    fn network(&self) -> &GenericNetwork<Self::Network>;
    fn storage(&self) -> &GenericStorage<Self::Storage>;
    fn state_machine(&self) -> &GenericStateMachine<Self::StateMachine>;
}

Nice and clean, but you can already see how it will become more complex when these traits grow in complexity. In particular, when you have to make changes to some of them, they quickly ripple through the entire system, with rather hairy and complex bounds that fail with very verbose error messages. Let’s just add an ErrorT type that our client might yield when any of the inner components yields an error, so the client is meant to wrap all the inner error types. We add

trait ErrorT {}

trait StorageC {
    type Message: MessageT;
    type Error: ErrorT;
    // ... and the same on all the other component traits
}

// and on the config:
trait Configuration {
    // ...

    // gee, this is verbose...
    type Error: ErrorT +
        From<<Self::Storage as StorageC>::Error> +
        From<<Self::StateMachine as StateMachineC>::Error> +
        From<<Self::Network as NetworkC>::Error>;
}
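The reason for those From bounds is that they let the ? operator lift any component error into the client-level error inside generic code. A reduced sketch with only two components and hypothetical types (Offline, DiskFull, ClientError, OneMessage, SmallDisk, TestConfig, sync_once are all invented here):

```rust
use std::fmt::Debug;

pub trait ErrorT: Debug {}

pub trait StorageC {
    type Error: ErrorT;
    fn store(&mut self, value: u32) -> Result<(), Self::Error>;
}
pub trait NetworkC {
    type Error: ErrorT;
    fn next_message(&mut self) -> Result<Option<u32>, Self::Error>;
}

pub trait Configuration {
    type Storage: StorageC;
    type Network: NetworkC;
    type Error: ErrorT
        + From<<Self::Storage as StorageC>::Error>
        + From<<Self::Network as NetworkC>::Error>;
}

/// `?` converts both component errors into `C::Error` via the bounds.
pub fn sync_once<C: Configuration>(
    network: &mut C::Network,
    storage: &mut C::Storage,
) -> Result<bool, C::Error> {
    match network.next_message()? {
        Some(value) => {
            storage.store(value)?;
            Ok(true)
        }
        None => Ok(false),
    }
}

// Hypothetical concrete pieces to exercise the sketch.
#[derive(Debug, PartialEq)]
pub struct Offline;
impl ErrorT for Offline {}

#[derive(Debug, PartialEq)]
pub struct DiskFull;
impl ErrorT for DiskFull {}

#[derive(Debug, PartialEq)]
pub enum ClientError {
    Network(Offline),
    Storage(DiskFull),
}
impl ErrorT for ClientError {}
impl From<Offline> for ClientError {
    fn from(e: Offline) -> Self {
        ClientError::Network(e)
    }
}
impl From<DiskFull> for ClientError {
    fn from(e: DiskFull) -> Self {
        ClientError::Storage(e)
    }
}

/// Network that yields at most one message and never fails.
pub struct OneMessage(pub Option<u32>);
impl NetworkC for OneMessage {
    type Error = Offline;
    fn next_message(&mut self) -> Result<Option<u32>, Offline> {
        Ok(self.0.take())
    }
}

/// Storage that always reports a full disk.
pub struct SmallDisk;
impl StorageC for SmallDisk {
    type Error = DiskFull;
    fn store(&mut self, _value: u32) -> Result<(), DiskFull> {
        Err(DiskFull)
    }
}

pub struct TestConfig;
impl Configuration for TestConfig {
    type Storage = SmallDisk;
    type Network = OneMessage;
    type Error = ClientError;
}
```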

It’s a bit verbose, but reasonable overall. It becomes trickier when you actually try to implement these types, as you need to make sure all the types also match up correctly. Still, this way we are able to reduce the generics on the client from many to just one. Nice. But dragging around this massive Configuration is a pain, especially for the mock-testability we described before, as we have to mock all the associated types, creating a lot of glue code.

So instead, what I end up doing is keeping anything with actual logic generic over the specific traits directly, so you can mock and test each piece on its own, and having the final Client<C: Configuration> be just a holder that passes calls along to the specific internal types, with the associated types passed in as generics.
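A minimal sketch of that split, assuming a reduced Configuration with just a network and a storage. The function persist_incoming and the concrete types (ScriptedNetwork, VecStorage, TestConfig) are hypothetical; the logic only sees the two small traits, while the client merely forwards.

```rust
pub trait MessageT {}

pub trait StorageC {
    type Message: MessageT;
    fn store(&mut self, message: &Self::Message);
}
pub trait NetworkC {
    type Message: MessageT;
    fn next_message(&mut self) -> Option<Self::Message>;
}

/// The actual logic: generic over the small traits only, so a test
/// needs two tiny mocks instead of a whole `Configuration`.
pub fn persist_incoming<N, S>(network: &mut N, storage: &mut S) -> usize
where
    N: NetworkC,
    S: StorageC<Message = N::Message>,
{
    let mut stored = 0;
    while let Some(message) = network.next_message() {
        storage.store(&message);
        stored += 1;
    }
    stored
}

pub trait Configuration {
    type Message: MessageT;
    type Storage: StorageC<Message = Self::Message>;
    type Network: NetworkC<Message = Self::Message>;
}

/// Thin holder: one generic, no logic of its own.
pub struct Client<C: Configuration> {
    pub network: C::Network,
    pub storage: C::Storage,
}

impl<C: Configuration> Client<C> {
    pub fn sync(&mut self) -> usize {
        // Associated types are passed along as plain generics.
        persist_incoming(&mut self.network, &mut self.storage)
    }
}

// Hypothetical concrete pieces to exercise the sketch.
impl MessageT for String {}

pub struct ScriptedNetwork(pub Vec<String>);
impl NetworkC for ScriptedNetwork {
    type Message = String;
    fn next_message(&mut self) -> Option<String> {
        self.0.pop()
    }
}

#[derive(Default)]
pub struct VecStorage(pub Vec<String>);
impl StorageC for VecStorage {
    type Message = String;
    fn store(&mut self, message: &String) {
        self.0.push(message.clone());
    }
}

pub struct TestConfig;
impl Configuration for TestConfig {
    type Message = String;
    type Storage = VecStorage;
    type Network = ScriptedNetwork;
}
```

Testing persist_incoming directly needs no Configuration at all, and the compiler checks that the message types of the plugged-in components line up when Client::sync forwards to it.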

In practice it can become even trickier if you have some of these configurations on several layers. In the Parity Substrate codebase, for example, there is a Service that can construct your client, so that all clients can build on reusable CLI tooling. That service requires a Configuration for the network and the like, but only a subset of what a full node needs; as a result, the full node’s configuration needs to be a superset of the service’s. But that is a really advanced scenario, and if you have any good ideas to improve that situation, I am all ears.

Conclusion: Combined Composition

As so often, enlightenment isn’t picking one solution but combining wisely.

What you probably end up doing is a combination of these composition types. Like in the Rust Matrix SDK, where on a lower level the pluggable storage is held via a dyn Trait, while on a higher level you might compose a client with a “trait composition” that allows any other (Rust) developer to plug in and replace any of the components as they please, including yourself for platform- or target-specific implementations.

By keeping any actual logic in the separate components, with specific traits for easily mocked testing, and using the “client” merely as the place where all these pipes come and plug together, you can rely on the compiler’s type checks to ensure the correctness of the types being piped, while you have the mock tests for all the actual logic. And integration tests should cover the end-to-end functionality of the client regardless.

To wrap things up nicely, you can hide that Client<C> inside a type alias that itself is held by a struct FfiClient(NativeClient); on which you expose a completely typed, generics-free API for consumers outside of Rust. Put on a bow and ship it :) .
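As a closing sketch of that wrapper, under the assumption of a deliberately tiny Configuration with a single storage component: NativeConfig, VecStorage and the note-taking methods are hypothetical and only there to give the wrapper something to forward to.

```rust
/// Hypothetical storage interface for this sketch.
pub trait Storage: Default {
    fn put(&mut self, note: String);
    fn count(&self) -> usize;
}

pub trait Configuration {
    type Storage: Storage;
}

pub struct Client<C: Configuration> {
    storage: C::Storage,
}

impl<C: Configuration> Client<C> {
    pub fn new() -> Self {
        Self { storage: Default::default() }
    }

    pub fn add_note(&mut self, note: String) {
        self.storage.put(note);
    }

    pub fn note_count(&self) -> usize {
        self.storage.count()
    }
}

#[derive(Default)]
pub struct VecStorage(Vec<String>);

impl Storage for VecStorage {
    fn put(&mut self, note: String) {
        self.0.push(note);
    }

    fn count(&self) -> usize {
        self.0.len()
    }
}

/// The one place where the concrete components are picked.
pub struct NativeConfig;
impl Configuration for NativeConfig {
    type Storage = VecStorage;
}

pub type NativeClient = Client<NativeConfig>;

/// Fully concrete wrapper: no generics leak into the external API.
pub struct FfiClient(NativeClient);

impl FfiClient {
    pub fn new() -> Self {
        Self(NativeClient::new())
    }

    /// Stores a note and returns the new total as a fixed-width integer.
    pub fn add_note(&mut self, note: &str) -> u64 {
        self.0.add_note(note.to_owned());
        self.0.note_count() as u64
    }
}
```

Swapping the backend for another target then means changing NativeConfig (or selecting a different alias per target), while the FfiClient surface stays the same.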

Credits: Photo taken by Gabriel (who is available for hire) and published on unsplash.com under a free license