Hi! Everybody seems to have fun with AI so I decided to chime in & speedrun one of my old, crazy ideas:
Enter Fiberus, a native Haxe compiler target with a fully fiber-based runtime. It is designed specifically for workloads that can benefit from native, lightweight, cooperative concurrency using fibers/coroutines combined with a work-stealing scheduler and tight GC integration.
Fiberus exists to let you write clean, idiomatic Haxe code while enjoying cooperative concurrency that feels like a natural part of the language and runtime - without the friction of high-level async/await constructs that always seemed like an afterthought layered on top in other languages and runtimes.
Eventually I wanna use this to build all kinds of things, from web services (see GitHub - fiberus-hx/warp10 ) to game engines (maybe fiber-cortex? :P).
It's all alpha software, only Linux is supported and there is little documentation right now. So you better expect some nice explosions!
The concurrency model you've chosen essentially blocks the current fiber on IO, right? Are there primitives that allow firing off multiple blocking calls in parallel, i.e. something roughly equivalent to const [users, posts] = await Promise.all([getUsers(), getPosts()]) in JS?
I've also kinda grown tired of the noise from async IO, so solutions that try to get rid of that are very appealing. But then again I wonder if the noise is not a feature? It signals to me clearly that I'm yielding control, potentially for hundreds of milliseconds, and that this is the place where race conditions are most likely to creep in. Or do you have plans for something like actors or some other approach to deal with this?
Anyway, super cool. Can't wait for the Windows version!
A fiber can yield at different points, like at function entries and in for/while loops, in cases where other fibers are waiting in the current thread's queue/inbox (strong fairness!). For IO we use io_uring to get OS-level async operations at the scheduler level (we never wanna block a scheduler thread), and a fiber requesting an operation will suspend execution until there is a wake-up to continue.
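For intuition, here is a toy sketch of that model: Python generators stand in for fibers, and a plain list stands in for io_uring's completion queue. All names are invented; this is not the actual Fiberus scheduler, just the shape of the idea:

```python
from collections import deque

class Scheduler:
    """Toy cooperative scheduler: fibers are plain generators that yield
    at checkpoints (think function entries / loop back-edges) or to
    request IO; a list stands in for io_uring's completion queue."""
    def __init__(self):
        self.ready = deque()    # runnable fibers
        self.waiting = {}       # fiber-id -> fiber suspended on IO
        self.completions = []   # "io_uring" completions, drained each tick

    def spawn(self, fiber):
        self.ready.append(fiber)

    def run(self):
        finished = []
        while self.ready or self.waiting:
            for fiber_id in self.completions:        # wake-ups first
                self.ready.append(self.waiting.pop(fiber_id))
            self.completions.clear()
            if not self.ready:
                break                                # all stuck on IO
            fiber = self.ready.popleft()
            try:
                request = next(fiber)
            except StopIteration as done:
                finished.append(done.value)
                continue
            if request == "io":                      # suspend until wake-up
                self.waiting[id(fiber)] = fiber
            else:                                    # checkpoint: strong
                self.ready.append(fiber)             # fairness, go to back
        return finished

def worker(n):
    total = 0
    for i in range(n):
        total += i
        yield "checkpoint"       # a fiber yield point inside a loop
    return total

sched = Scheduler()
sched.spawn(worker(3))
sched.spawn(worker(4))
results = sched.run()            # interleaves the two fibers round-robin
```

The round-robin re-enqueue after every checkpoint is what "strong fairness" buys: no fiber can starve its siblings inside a hot loop.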
Spawning & syncing fibers can be done with a simple counter. I plan to add more utility/sugar around that (e.g. channels).
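A rough sketch of what that counter-based sync could look like (Python threads standing in for fibers; `FiberCounter`, `fetch` and friends are all hypothetical names, not the real API):

```python
import threading

class FiberCounter:
    """WaitGroup-style join counter: each spawned task increments,
    each finished task decrements; wait() suspends until zero."""
    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def add(self, n=1):
        with self._cond:
            self._count += n

    def done(self):
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait(self):
        with self._cond:
            while self._count:
                self._cond.wait()

# spawn one task per parallel op, then suspend until all are done
def fetch(name, out, counter):
    out[name] = f"result-of-{name}"   # stands in for a blocking IO call
    counter.done()

counter, results = FiberCounter(), {}
for op in ("users", "posts"):
    counter.add()
    threading.Thread(target=fetch, args=(op, results, counter)).start()
counter.wait()  # resumes once both ops finished, like Promise.all
```

Which is exactly the Promise.all shape from earlier in the thread: N parallel blocking calls, one suspension point.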
But then again I wonder if the noise is not a feature?
Valid point. "Colorless" async can feel weird. But I think it's fine, since you should be aware of your data access patterns/mutability in any case, and for debugging there is heavy instrumentation and profiling available via Tracy-Profiler. If it really turns out to be a problem, there are multiple ways to address it (e.g. async/await macros).
Ah, ok… so essentially one would spawn one fiber per parallel op and then suspend until all of them are done. Neat!
It's not that it feels weird… on the contrary: it's what feels natural. And it's super convenient too, until multiple concurrent processes/threads contend over shared resources.
I just remember when I got excited about async/await in JS, only to realize that my problems didn't go away, they only started to look nicer. So even though the flow of the code becomes more easily understood with await, I learned that it's also a marker for where things can go wrong, especially on a server where you may have to deal with thousands of concurrent requests. Hence my paranoia.
Right up until there's indirection. A function type that returns a plain result (or an interface with a method that returns a plain result) essentially means that the code is going to return synchronously, at least in runtimes where threads are isolated (JS) or cannot run in parallel (Python with the GIL). Annoying as coloring may be, it does mean that you can make sync vs. async a contract. With blocking vs. non-blocking, it's trickier. Anyway, I'm not even suggesting you should do anything about this. It might wind up as annoying as checked exceptions. I was just wondering if you had any synchronization approaches in place (or even in mind) that are not so loudly advertised ^^
Ok, my Fibrix GC is no longer a full STW mark-sweep; instead we run a concurrent mark-sweep with SATB barriers (Go-like). Things are pointing in the right direction!
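For anyone unfamiliar with SATB: the barrier's core logic fits in a few lines. A toy Python sketch (invented names; the real barrier is emitted C, this just shows the idea):

```python
class SATBBarrier:
    """Snapshot-at-the-beginning write barrier, sketched. While a
    concurrent mark is active, the *old* value of every overwritten
    reference field is recorded, so everything reachable at the start
    of marking (the snapshot) stays reachable for the marker."""
    def __init__(self):
        self.marking = False
        self.satb_buffer = []   # overwritten refs, drained by the marker

    def write_ref(self, obj, field, new_ref):
        old = obj.get(field)
        if self.marking and old is not None:
            self.satb_buffer.append(old)  # keep the snapshot alive
        obj[field] = new_ref

barrier = SATBBarrier()
node = {"next": "A"}
barrier.write_ref(node, "next", "B")   # marking off: nothing recorded
barrier.marking = True
barrier.write_ref(node, "next", "C")   # marking on: old ref "B" buffered
```

The buffer is what lets mutators run concurrently with the marker without the marker losing objects that get unlinked mid-mark.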
Oh! I'm super thankful for every bit of feedback I get! No, I really mean it, we will have to see. I literally have not built anything big yet, and it might very well be the case that I missed something that invalidates my mission directly. Will find that out sooner rather than later.
Ok, so the more I tested and the more complex the applications became (beyond the thousands of tests), the clearer it became that the Immix-inspired GC was either super fast or super slow. The whole STW and synchronization kung-fu really dragged this down and made me question my architecture.
Good news is that I found several interesting research papers that gave me ideas for a complete overhaul. We now have a MaPLe-inspired hierarchical heap GC (per-fiber heaps, depth-based write barriers, a Cheney copying collector, TypeDis extensions, …) and we no longer have to pay the costly price of waits & pauses.
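Cheney's copying collection, which the per-fiber collectors mentioned above build on, is compact enough to sketch in full. A toy Python version (not Fiberus code):

```python
class Obj:
    def __init__(self, name, refs=()):
        self.name, self.refs = name, list(refs)
        self.forward = None          # forwarding pointer once copied

def cheney_collect(roots):
    """Cheney's algorithm over a from-space object graph: copy the
    roots into to-space, then scan to-space left to right, copying
    each referenced object exactly once. Forwarding pointers make
    repeated references (and cycles) resolve to the same copy."""
    to_space = []

    def copy(obj):
        if obj.forward is None:
            clone = Obj(obj.name, obj.refs)
            obj.forward = clone
            to_space.append(clone)
        return obj.forward

    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):      # the "scan pointer" chases the
        o = to_space[scan]           # "allocation pointer" (list end)
        o.refs = [copy(c) for c in o.refs]
        scan += 1
    return new_roots, to_space

a, b, c = Obj("a"), Obj("b"), Obj("c")
a.refs = [b]; b.refs = [a]           # a<->b cycle; c is garbage
roots, live = cheney_collect([a])    # c is simply never copied
```

Garbage is never touched at all, which is why copying collectors pair so well with small per-fiber heaps full of short-lived objects.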
It actually works so well that I don't quite trust it atm. But still happy to share the progress here. Take it with a grain of salt: seems like we play in the league of Rust frameworks now?! I guess I will set up the TechEmpower benchmarks soon to get a clearer picture.
I wonder if something like Nim's "Optimistic Reference Counting" might be worth looking into. In a nutshell, it uses automatic reference counting (not unlike Swift's) with a cycle collector: Introduction to ARC/ORC in Nim - Nim Blog . It's basically a bet on the idea that most object graphs are cycle-free. Plus it's my understanding that they also avoid the overhead of atomic RC for thread-local objects, which your hierarchical heap might lend itself to.
Then again, your numbers look so good it's quite possibly not even worth the hassle ^^
Michael, it's a really impressive benchmark, competing with Rust web frameworks. Looks like we have a talented engineer who could create a performant Haxe target we've never seen before :)
Thanks for the love, but you give me way too much credit. I'm not talented, just a little relentless.
In essence it's a new compiler target (similar to hxcpp; in fact I use the same buildXml system) that emits C code. The compiler transforms the Haxe AST to a C AST and handles code emission, with some tweaks (e.g. escape analysis, shadow stack frames) to play into the nature of fibers. The runtime is built around a central lock-free scheduler + work stealing + work submission and fast IO (io_uring).
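To make the AST-to-C step concrete, here is a heavily simplified toy in Python: a fake three-node expression AST lowered straight to C source. The real pipeline builds a proper C AST before emission; this collapses both steps, and every name here is invented:

```python
from dataclasses import dataclass

# A toy expression AST (stand-in for a sliver of the Haxe AST)
@dataclass
class Const: value: int
@dataclass
class Var: name: str
@dataclass
class Add: left: object; right: object

def emit_expr(node):
    """Recursively lower one expression node to a C expression string."""
    if isinstance(node, Const):
        return str(node.value)
    if isinstance(node, Var):
        return node.name
    if isinstance(node, Add):
        return f"({emit_expr(node.left)} + {emit_expr(node.right)})"
    raise TypeError(f"unhandled node: {node!r}")

def emit_function(name, params, body):
    """Wrap a lowered expression in a C function definition."""
    args = ", ".join(f"int {p}" for p in params)
    return f"int {name}({args}) {{ return {emit_expr(body)}; }}"

c_src = emit_function("add3", ["a", "b"],
                      Add(Var("a"), Add(Var("b"), Const(3))))
# → "int add3(int a, int b) { return (a + (b + 3)); }"
```

The real thing also has to thread fiber-related bookkeeping (shadow stack frames, yield checkpoints) through the emitted C, which is where it stops being this simple.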
But I never had the time available to start, or even think about making it work (I would have had to learn and master OCaml first). That changed with the advent of AI tooling. I was finally able to make this a multi-month experiment in my free time instead of a multi-year full-time endeavor.
If you wanna know more about fibers and the nitty-gritty low-level details, I can recommend reading the code on GH or the amazing blogpost Fibers, Oh My!
Sorry for the late reply, I wanted to look into the linked material before I open my mouth:
Pretty interesting material. Personally I don't like refcounting at all. inc/dec on almost every pointer? It looks simple, but it is easy to miss an edge case and end up with super weird behavior that is hard to catch, esp. in long-running apps. So I skipped it as a concept from the start.
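For context, the inc/dec traffic in question looks roughly like this in a naive refcounting scheme (Python sketch, hypothetical names):

```python
class RCObj:
    def __init__(self):
        self.rc = 1              # the creating reference

def rc_assign(holder, field, new):
    """Naive refcounted pointer store: every write pays an inc and a
    dec. Miss either one on any code path and the counts silently
    drift, which only shows up much later as a leak or use-after-free."""
    old = holder.get(field)
    if new is not None:
        new.rc += 1              # inc first, in case old is new
    if old is not None:
        old.rc -= 1
        if old.rc == 0:
            pass                 # here: free(old), dec its children, ...
    holder[field] = new

a, b = RCObj(), RCObj()
slot = {}
rc_assign(slot, "p", a)          # a.rc: 1 -> 2
rc_assign(slot, "p", b)          # a.rc: 2 -> 1, b.rc: 1 -> 2
```

The inc-before-dec ordering is itself one of those easy-to-miss edge cases (self-assignment), which is rather the point being made above.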
The way I understood it, Nim's ARC/ORC is an excellent single/shared-heap ref-counting system. Yet I feel it would invalidate most of the advantages we gained from HH + disentanglement that play into the strengths of our fibers:
Near-zero syncs for GC between concurrent fibers as long as disentanglement holds. There is some work when it comes to heap joins, but otherwise that's cool. This enables almost linear scaling of GC work with fiber count! We can run a million fibers at full blast with very little contention.
Provable task-local allocation and collection freedom (obliviousness). No shared heap; fibers don't care or need to know about other fibers' allocations. This is where HHs/MaPLe really shine.
Static barrier elision: I only have this partially working atm (stack objects); write barriers need a little more work. What it means is that the compiler can statically prove that many writes cannot create down-pointers/old-to-young pointers, eliminating a ton of expensive barriers.
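My (possibly rough) reading of the depth-based barrier those points rely on, as a toy Python sketch with invented names:

```python
class HHObj:
    """Object tagged with the depth of the hierarchical heap it lives
    in (root heap = depth 0, spawned fibers' heaps deeper)."""
    def __init__(self, depth):
        self.depth = depth
        self.fields = {}

remembered = []                  # down-pointers the GC must track

def hh_write(src, field, dst):
    # Depth-based write barrier: only a pointer from a shallower heap
    # into a strictly deeper heap is a "down-pointer" and gets recorded.
    # Writes the compiler can prove never cross depths downward need
    # no barrier at all -- that is the static elision opportunity.
    if dst is not None and dst.depth > src.depth:
        remembered.append((src, field, dst))
    src.fields[field] = dst

root = HHObj(depth=0)
child = HHObj(depth=2)           # e.g. an object in a fiber's heap
peer = HHObj(depth=0)

hh_write(root, "x", child)       # down-pointer: remembered
hh_write(root, "y", peer)        # same depth: barrier is a no-op
```

So the cheap common case is the depth comparison, and the elided case is not even paying for that.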
So yeah, the numbers look good so far. And they can only go down as things develop, but the buffer we have seems ok.
Once I have built some more serious projects and gained more confidence I will definitely offer it as a MR/PR.
The idea of ARC, both in Objective-C / Swift (where it originated) and in Nim, is to actually avoid refcounting whenever possible. Proponents claim this is possible for the majority of cases. For the Apple stuff, I have it on good authority that it tends to work as advertised. The Nim people make the same claims for their solution:
ARC
ARC is Nim's pure reference-counting GC, however, many reference count operations are optimized away: Thanks to move semantics, the construction of a data structure does not involve RC operations. And thanks to "cursor inference", another innovation of Nim's ARC implementation, common data structure traversals do not involve RC operations either! The performance of both ARC and ORC is independent of the size of the heap.
What's kinda interesting about ORC specifically is that it should work out of the box with arbitrary Haxe code, whereas mere ARC would leak cyclic object graphs.
It looks like you're getting pretty close to the practical performance limits, which is why I thought I'd bring this up - after all, it does sound great on paper. But honestly, I cannot promise it would even produce measurable improvements.