Fiberus - a new Haxe target with a native, fiber-based runtime for multiprocessing

Hi! Everybody seems to be having fun with AI, so I decided to chime in & speedrun one of my old, crazy ideas:

Enter Fiberus, a native Haxe compiler target with a fully fiber-based runtime. It is designed specifically for workloads that can benefit from native, lightweight, cooperative concurrency using fibers/coroutines combined with a work-stealing scheduler and tight GC integration.

Fiberus exists to let you write clean, idiomatic Haxe code while enjoying cooperative concurrency that feels like a natural part of the language and runtime - without the friction of high-level async/await constructs that always seemed like an afterthought layered on top in other languages and runtimes.

Eventually I want to use this to build all kinds of things, from web services (see GitHub - fiberus-hx/warp10) to game engines (maybe fiber-cortex? :P).

It’s all alpha software: only Linux is supported and there is little documentation right now. So you’d better expect some nice explosions!

GitHub: fiberus-hx · GitHub
Discord: Project Post on Discord

9 Likes

Really nice :wink:

The concurrency model you’ve chosen essentially blocks the current fiber on IO, right? Are there primitives that allow firing off multiple blocking calls in parallel, i.e. something roughly equivalent to const [users, posts] = await Promise.all([getUsers(), getPosts()]) in JS?

I’ve also kinda grown tired of the noise from async IO, so solutions that try to get rid of it are very appealing. But then again I wonder if the noise is not a feature? It signals to me clearly that I’m yielding control, potentially for hundreds of milliseconds, and that this is the place where race conditions are most likely to creep in. Or do you have plans for something like actors or some other approach to deal with this?

Anyway, super cool. Can’t wait for the Windows version :smiley:

1 Like

Hey, thanks for the feedback!

A fiber can yield at different points, like at function entries and in for/while loops, whenever there are other fibers waiting in the current thread’s queue/inbox (strong fairness!). For IO we use io_uring to get OS-level async operations at the scheduler level (we never want to block a scheduler thread), and a fiber requesting an operation will suspend execution until there is a wakeup to continue.
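To sketch what that looks like from user code (Fiber.spawn and File.read here are just placeholder names, not the final API):

```haxe
// Minimal sketch, not the actual Fiberus API: IO reads as blocking code,
// but only the calling fiber parks; the scheduler thread keeps running
// other fibers while io_uring completes the request.
Fiber.spawn(() -> {
    var data = File.read("config.json"); // this fiber suspends here, the thread does not
    trace('got ${data.length} bytes');   // resumes after the io_uring wakeup
});
```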

Spawning & syncing fibers can be done with a simple counter. I plan to add more utility/sugar around that (e.g. channels).
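So your Promise.all example would roughly become the following counter pattern (WaitGroup and the getUsers/getPosts helpers are placeholders, not the actual utility names):

```haxe
// Hypothetical sketch: one fiber per parallel op, joined on a simple counter.
var wg = new WaitGroup(2);
var users:Array<User> = null;
var posts:Array<Post> = null;

Fiber.spawn(() -> { users = getUsers(); wg.done(); });
Fiber.spawn(() -> { posts = getPosts(); wg.done(); });

wg.wait(); // the current fiber suspends until the counter reaches zero
```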

But then again I wonder if the noise is not a feature?

Valid point. "Colorless" async can feel weird. But I think it’s fine, since you should be aware of your data access patterns/mutability in any case, and for debugging there is heavy instrumentation and profiling available via the Tracy Profiler. If it really turns out to be a problem, there are multiple ways to address it (e.g. async/await macros).

We’ll see what happens. :smiley:

Ah, ok … so essentially one would spawn one fiber per parallel op and then suspend until all of them are done. Neat :slight_smile:

It’s not that it feels weird … on the contrary: it’s what feels natural. And it’s super convenient too, until multiple concurrent processes / threads contend over shared resources :smiley:

I just remember when I got excited about async/await in JS, only to realize that my problems didn’t go away, they only started to look nicer. So even though the flow of the code becomes more easily understood with await, I learned that it’s also a marker for where things can go wrong - especially on a server where you may have to deal with thousands of concurrent requests. Hence my paranoia :wink:

Up until there’s indirection, that is. A function type that returns a plain result (or an interface with a method that returns a plain result) essentially means that the code is going to return synchronously, at least in runtimes where threads are isolated (JS) or cannot run in parallel (Python with the GIL). Annoying as coloring may be, it does mean that you can make sync vs. async a contract. With blocking vs. non-blocking, it’s trickier. Anyway, I’m not even suggesting you should do anything about this. It might wind up as annoying as checked exceptions. I was just wondering if you had any synchronization approaches in place (or even in mind) that are not so loudly advertised ^^
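To make the contract point concrete, in Haxe-flavored types (User and Promise<T> are just assumed stand-ins for any result and future type):

```haxe
// Coloring as a contract: the return type tells you whether a call can suspend.
typedef GetUserSync  = (id:Int) -> User;           // must return right away
typedef GetUserAsync = (id:Int) -> Promise<User>;  // may suspend before resolving

// With blocking fibers both calls share the first signature, so "this may
// park for hundreds of milliseconds" is no longer visible in the types.
```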

Ok, the Fibrix GC is no longer a full STW mark-sweep; instead we run a concurrent mark-sweep with SATB barriers (Go-like). Things are pointing in the right direction!
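Conceptually the SATB barrier does something like this (it lives in the generated C; all names below are illustrative, not the actual runtime API):

```haxe
// On every reference-field overwrite during concurrent marking, the old value
// is logged so the marker still sees the heap as of the marking snapshot.
inline function writeRef(obj:Dynamic, field:String, newVal:Dynamic):Void {
    if (GC.markingActive) {                       // hypothetical flag
        var old = Reflect.field(obj, field);
        if (old != null) GC.satbBuffer.push(old); // keep the snapshot reachable
    }
    Reflect.setField(obj, field, newVal);
}
```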

Oh! I’m super thankful for every bit of feedback I get! No, I really mean it, we will have to see. I literally have not built anything big yet and it might very well be the case that I missed something that invalidates my mission directly. I will find that out sooner rather than later :smiley:

2 Likes

Ok, so the more I tested and the more complex the applications became (beyond the thousands of tests), the clearer it became that the Immix-inspired GC was either super fast or super slow. The whole STW and synchronization kung fu really dragged it down and made me question my architecture.

The good news is that I found several interesting research papers that gave me ideas for a complete overhaul. We now have a MaPLe-inspired Hierarchical Heap GC (per-fiber heaps, depth-based write barriers, a Cheney copying collector, TypeDis extensions, …) and we no longer have to pay the costly price of waits & pauses.
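Very roughly, the core trick is this (all names below are hypothetical, the MaPLe papers have the real details): every object lives in a heap with a depth in the fiber hierarchy, and only the rare writes that create down-pointers need bookkeeping.

```haxe
// Depth-based write barrier, conceptually: pointers from a shallower heap into
// a deeper one ("down-pointers") are the only ones the GC must remember.
inline function writeRef(src:Dynamic, dst:Dynamic):Void {
    if (heapDepth(src) < heapDepth(dst)) // src's heap is an ancestor of dst's
        recordDownPointer(src, dst);     // rare case: GC bookkeeping needed
    // common case: same-heap and up-pointer writes proceed barrier-free
}
```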

It actually works so well that I don’t quite trust it atm. But I’m still happy to share the progress here. Take it with a grain of salt; it seems like we play in the league of Rust frameworks now?! :stuck_out_tongue: I guess I will set up the TechEmpower benchmarks soon to get a clearer picture.

Here you can see the distribution of collections:

4 Likes

I’m really curious to see the performance boost this will bring to my backend framework, HxWell.

2 Likes

Really nice. The numbers look awesome :wink:

I wonder if something like Nim’s "Optimistic Reference Counting" might be worth looking into: in a nutshell, it uses automatic reference counting (not unlike Swift’s) with a cycle collector: Introduction to ARC/ORC in Nim - Nim Blog. It’s basically a bet on the idea that most object graphs are cycle-free. Plus it’s my understanding that they also avoid the overhead of atomic RC for thread-local objects, which your hierarchical heap might lend itself to.

Then again your numbers look so good it’s quite possibly not even worth the hassle ^^

Michael, it’s a really impressive benchmark, competing with Rust web frameworks. Looks like we have a talented engineer who could create a performant Haxe target we’ve never seen before :)

@dazKind Can you tell us a bit more about the Fiberus architecture? Does it use the hxcpp target or generate C/C++ code via Haxe macros?

Thanks for the love, but you give me way too much credit. I’m not talented, just a little relentless :smiley:

In essence it’s a new compiler target (similar to hxcpp; in fact I use the same buildXml system) that emits C code. The compiler transforms the Haxe AST into a C AST and emits code from that, with some tweaks (e.g. escape analysis, shadow stack frames) to play into the nature of fibers. The runtime is built around a central lock-free scheduler with work stealing, work submission and fast IO (io_uring).
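To give an idea of the escape-analysis tweak, this is the kind of plain Haxe it targets (the optimization itself happens during C emission, so this is just an illustration):

```haxe
class Point {
    public var x:Float;
    public var y:Float;
    public function new(x:Float, y:Float) { this.x = x; this.y = y; }
}

function length(x:Float, y:Float):Float {
    var p = new Point(x, y); // p never escapes this function, so the compiler
                             // may place it in the fiber's frame instead of
    return Math.sqrt(p.x * p.x + p.y * p.y); // allocating it on the GC heap
}
```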

Some years ago I watched https://gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine and fell in love with the idea. But raw C & manual memory management suck, and I wondered if this could work with Haxe using a special GC.

But I never had the time to start or even think about making it work (I would have had to learn and master OCaml first, for one). That changed with the advent of AI tooling. I was finally able to make this a multi-month experiment in my free time instead of a multi-year full-time endeavor.

If you wanna know more about fibers and the nitty-gritty low-level details, I can recommend reading the code on GH or the amazing blogpost Fibers, Oh My!

3 Likes

Are you planning to create a merge request for the Fiberus Haxe version into the official branch? @Simn what do you think about it?

Sorry for the late reply; I wanted to look into the linked material before I open my mouth:

Pretty interesting material. Personally I don’t like refcounting at all. inc/dec on almost every pointer? It looks simple, but it is easy to miss an edge case, and you end up with super weird behavior that is hard to catch, especially in long-running apps. So I skipped it as a concept from the start.

The way I understood it, Nim’s ARC/ORC is an excellent single/shared-heap ref-counting system. Yet I feel it would invalidate most of the advantages we gained from HH+Disentanglement that play into the strength of our fibers:

  • Near-zero syncs for GC between concurrent fibers as long as disentanglement holds. There is some work when it comes to heap joins, but otherwise that’s cool. This enables almost linear scaling of GC work with fiber count! We can run a million fibers at full blast with very little contention.
  • Provable task-local allocation and collection freedom (obliviousness). No shared heap; fibers don’t care or need to know about other fibers’ allocations. This is where HHs/MaPLe really shine.
  • Static barrier elision: I only have this partially working atm (stack objects), write barriers need a little more work though. But what it means is that the compiler can statically prove that many writes cannot create down-pointers/old-young pointers, eliminating a ton of expensive barriers (see the sketch after this list).
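As promised, a sketch of a write where the barrier can be elided (illustrative only, not actual compiler output):

```haxe
// Both anonymous objects are freshly allocated in the current fiber's own
// heap, so the compiler can prove these writes never create a down-pointer
// into a deeper heap and may drop the write barriers entirely.
function makePair(a:String, b:String) {
    var left  = { value: a };
    var right = { value: b };
    return { left: left, right: right }; // fiber-local writes, no barriers
}
```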

So yeah, the numbers look good so far. And they can only go down as things develop, but the buffer we have seems ok :smiley:

Once I have built some more serious projects and gained more confidence I will definitely offer it as a MR/PR.

1 Like

The idea of ARC, both in Objective-C / Swift (where it originated) and in Nim, is to actually avoid refcounting whenever possible. Proponents claim this is possible for the majority of cases. For the Apple stuff, I have it on good authority that it tends to work as advertised. The Nim people make the same claims for their solution:

ARC

ARC is Nim’s pure reference-counting GC, however, many reference count operations are optimized away: Thanks to move semantics, the construction of a data structure does not involve RC operations. And thanks to "cursor inference", another innovation of Nim’s ARC implementation, common data structure traversals do not involve RC operations either! The performance of both ARC and ORC is independent of the size of the heap.

– ORC - Vorsprung durch Algorithmen - Nim Blog

What’s kinda interesting about ORC specifically is that it should work out of the box with arbitrary Haxe code, whereas mere ARC would leak cyclic object graphs.
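For illustration, the kind of trivial cycle that pure ARC would leak:

```haxe
class Node {
    public var next:Null<Node>;
    public function new() {}
}

function makeCycle():Void {
    var a = new Node();
    var b = new Node();
    a.next = b;
    b.next = a; // a and b keep each other's refcount above zero, so pure
                // reference counting never frees them; ORC's cycle collector
                // (or a tracing GC) reclaims them once they're unreachable
}
```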

It looks like you’re getting pretty close to the practical performance limits, which is why I thought I’d bring this up - after all it does sound great on paper. But honestly, I cannot promise it would even produce measurable improvements :wink: