On starting over

(Don’t panic, I’m not personally starting over.)

How many of you have heard of the band Wintergatan?

Martin Molin, the face of Wintergatan, is a musician/engineer who rose to Internet fame after building the “Marble Machine” you see above. I assume you can tell why – that thing is really cool.

Sadly, from an engineering standpoint, the machine was a nightmare. Only capable of playing songs very similar to the one it was built for, and held together with elbow grease and wishful thinking. Marbles could collide if it tried to play adjacent notes, it was hard to time the notes properly, and the marbles kept clogging up or spilling out.

So he started over.

Marble Machine X

On January 1 2017, Martin announced the Marble Machine X, an entirely new device that would fix all the flaws in the original. Over the next four and a half years, he posted regular development updates. Even if – like me – you only watched some of the videos, you’d still learn a lot about both the MMX and mechanical engineering.

Martin went all out this time around, using CAD and CNC to build parts with millimeter precision, prototyping five different versions of a single piece so he could test them side-by-side, taking hours of video footage and building enormous spreadsheets with the data, measuring exactly how many milliseconds early or late the notes were, and taking design suggestions from a worldwide community of engineers. Most of all, he was unafraid to remove parts that he’d spent weeks or months on, if they weren’t quite working right.

It’s not really awesome to spend one and a half weeks building something that you have to redo, but I’m really used to that, and I’m actually good at starting over… I’m not so interested in this machine if it doesn’t play good music.

-Part of Martin’s heartfelt speech. (Make sure to watch the video for the rest.)

He sure did start over. Often enough that his angle grinder and “pain is temporary” catchphrase became a community meme, and then ended up on merchandise.

Was it worth it? Oh yeah. Looking at his last edited video before he switched to raw streams, the MMX ended up as an engineering marvel. Not only does it look great, it can drop thousands of marbles without error. When there is an error (5:26), he can instantly diagnose the problem and swap out the parts needed to fix it, no angle grinder necessary. Immediately after fixing it, he tried again and dropped thirty thousand in a row with zero errors. Four years well spent, I’d say!

So, just this January, it occurred to me that I hadn’t heard from Martin since that last video. The one posted all the way back in June. I didn’t mean to forget about him. In fact, I’m subscribed. Sure I skipped all the streams, but why did he stop posting edited videos?

A Lesson in Dumb Design

Wintergatan’s latest video, posted last September, has the answers. It’s titled “Marble Machine X – A Lesson in Dumb Design,” and in it, Martin discusses “dumb requirements” in the MMX.

First, make your requirements less dumb. Your requirements are definitely dumb. It does not matter who gave them to you; it’s particularly dangerous if a smart person gave you the requirements, because you might not question them enough. […] It’s very common; possibly the most common error of a smart engineer is to optimize the thing that should not exist.

-Elon Musk

Leaving aside Elon Musk himself, this seems like good advice. Martin gives an example of how it applies to the MMX at 5:49: he’s built the machine based off the fundamental assumption that marbles should always follow constrained single-file pathways. All the situations he’s encountered over the years where marbles would clog up, or apply pressure to a piece of tubing and burst out, or clog up, or jump over dividers, or clog up – all of these situations resulted from trying to constrain the marbles more than necessary.

Most were fixable, of course. He’s got well over a hundred videos’ worth of solved problems. But as he graduated from testing a few marbles per second to playing entire songs, he discovered more and more things wrong. Eventually, he concluded that the MMX, despite all the work put into it, wasn’t fixable. Now, he’s planning to produce one complete song with it, and then – once again – start over.

Judging by the YouTube comments, the community did not take this news well.

Drummers lose drum sticks. Violinists break bows. Guitarists lose picks. The marble machine can drop a marble.

-Thomas R

The MMX is literally almost complete and could be complete if only you allowed for a margin of error and stopped reading into all these awful awful self-help books.

-rydude998

“Make requirements less dumb” is a fantastic approach, but please don’t forget that “looks cool” is not a dumb requirement for your project.

-David H (referring to when Martin talked about form vs. function)

The perfect marble machine isn’t going to happen unless you seriously water down the beautiful artistic aspects that made the MMX so special to begin with. If that’s what it takes, then what’s the point? You’ll have a soulless husk of what was previously a wonderful and inspiring piece of art.

-BobFoley1337

This is the story of an artist who became an engineer to build his art and in so doing forgot the meaning of art.

-Nick Uhland

Note: If I quoted you and you’d rather I didn’t, let me know and I’ll take it down.

All good points. I’m not necessarily on board with the tone some of them took, but can you blame them? The project seemed so close, and so many people were excited to see the culmination of all this work, and then Martin pulls the rug out from under everyone.

But before we judge, let’s hear Martin’s side of the story:

I got many ideas for how to design a simpler, functional Machine, and I can’t stop thinking about it. I have heard that when you build your third house, you get it right. I think the same goes for Marble Machines.

[…]

I do know the generous response and objections that most crowdfunders have when I describe the Marble Machine X as a failure, and you are all correct. Its not a complete failure. What I have learned in the MMX process is the necessary foundation for the success of MMX-T.

[…]

If it’s hard to understand this decision let me provide some context: The MMX looks like it is almost working, but it isn’t. The over-complex flawed design makes the whole machine practically unusable. I now have the choice between keeping patching up flaws, or turn a page and design a machine which can be 10X improved in every aspect.

[Emphasis mine.]

This may not be surprising coming from the guy who built four entire game engines between the three Run games, but I’m sympathetic to Martin. I know all too well that a thing that looks almost perfect from an outsider’s perspective can be a mess inside.

The Codeless Code: Case 105: Navigation

The Codeless Code is a collection of vaguely-Zen-like stories/lessons about programming. The premise is odd at first, but just go with it.

A young nun approached master Banzen and said:

“When first presented with requirements I created a rough design document, as is our way.

“When the rough design was approved I began a detailed design document, as is our way. In so doing I realized that my rough design was ill-considered, and thus I discarded it.

“When the detailed design was approved I began coding, as is our way. In so doing I realized that my detailed design was ill-considered, and thus I discarded it.

“My question is this:

“Since we must refactor according to need, and since all needs are known only when implementation is underway, can we not simply write code and nothing else? Why must we waste time creating design documents?”

Banzen considered this. Finally he nodded, saying:

“There is no more virtue in the documents than in a handful of leaves: you may safely forgo producing either one. Before master Mugen crossed the Uncompiled Wasteland he made eight fine maps of the route he planned to take. Yet when he arrived at the temple gates he burned them on the spot.”

The nun took her leave in high spirits, but as she reached the threshold Banzen barked: “Nun!”

When the nun turned around, Banzen said:

“Mugen was only able to burn the maps because he had arrived.”

I hope the analogy here is clear. When Martin built the original Marble Machine, he produced a single song and retired it. He then built the Marble Machine X, and plans to produce a single song before retiring it too. Now he’s working on the Marble Machine X-T, and he’s hoping that “when you build your third house, you get it right” applies here too.

He could never have made it this far if not for the first two machines. If he hadn’t built the original, he wouldn’t have known where to start on the second. If not for spending years on the MMX fixing all kinds of issues and making it (seemingly) almost work, he wouldn’t know where to start designing the third. Years of building the machine gave him a clearer picture than any amount of planning, and that picture is the only reason he can perform the “first” step of making his requirements less dumb.

I don’t think Martin could have gotten the requirements right on his first or second try, but it’s good that he tried. That was the other point of the “Navigation” parable. Mugen was only able to burn the maps because he had arrived. If Martin hadn’t started by making a solid plan, the MMX could not have been as good as it ended up being. If the MMX hadn’t reached the point of “almost working,” its greatest flaws wouldn’t have been exposed.

The Codeless Code: Case 91: The Soul of Wit

And now we arrive at how this relates to my own work. As I said at the beginning, I’m not starting anything over. However, I recently realized I needed to pivot a little.

I had built my code one feature at a time. Like Martin testing 30,000 marbles, I tested simple cases, over and over, and they worked. Then, like Martin livestreaming actual music, I devised a real-world example. It was basic, but it was something someone might actually want to do.

And that led to a cascade of problems. Things I hadn’t thought of while planning but which were obvious in retrospect. Problems with easy solutions. Problems with hard solutions. All kinds of stuff.

I was capable of fixing these problems. In fact, I had a couple different avenues to explore; at least one would certainly have worked. How could I be so certain I was on the wrong track?

Wangohan […] emailed his predicament to the telecommuting nun.

“I know nothing of this framework,” the nun wrote back. “Yet send me your code anyway.”

Wangohan did as he was asked. In less than a minute his phone rang.

“Your framework is not right,” said Zjing. “Or else, your code is not right.”

This embarrassed and angered the monk. “How can you be so certain?” he demanded.

“I will tell you,” said the nun.

Zjing began the story of how she had been born in a distant province, the second youngest of six dutiful daughters. Her father, she said, was a lowly abacus-maker, poor but shrewd and calculating; her mother had a stall in the marketplace where she sold random numbers. In vivid detail Zjing described her earliest days in school, right down to the smooth texture of the well worn teak floors and the acrid yet not unpleasant scent of the stray black dog that followed her home in the rain one day.

“Enough!” shouted the exasperated Wangohan when a full hour had passed, for the nun’s narrative showed no sign of drawing to a close. “That is no way to answer a simple question!”

“How can you be so certain?” asked Zjing.

I was writing a tutorial as I went, and that’s what tipped me off.

Each time I came up with a workaround, I had to imagine explaining it in the tutorial: “If you’re using web workers and passing a class instance from your main class to your web worker, you’ll need to add that class to the worker’s header, and then call restoreInstanceMethods() after the worker receives it. This is enough if that’s the only class you’re using but fails if you’re using subclasses that override any of the instance methods, so in that case you also need to these five other steps…”

Which is a terrible tutorial! Way too complicated. My framework was not right, or else, my code was not right. It was time to step back and rethink my requirements.

A mistaken assumption

When this all began, I had one goal: fulfill Lime issue #1081: Add asynchronous worker support for the HTML5 target. This core goal led to all my other requirements:

  1. Use web workers.
  2. Maintain backwards compatibility.
  3. Match Lime’s coding style.
  4. Write easy-to-use code.

Clearly, I was already violating requirement #4. And a couple weeks ago, I’d also realized that #1 and #2 were incompatible. Web workers were always going to break existing code, which is why I’d made them opt-in. You can’t use them by accident; you have to turn them on by hand. I was arguably also violating #3: web-worker-specific code was now taking up the majority of two files that weren’t supposed to be web-worker-specific. (Which could be ok in another context, but it’s not how Lime likes to handle these situations.)

No other feature in Lime requires reading so much documentation just to get started. Nothing else in Lime has this many platform-specific “gotchas.” Very few other things in Lime require opting in the way this does. This new code…

This new code never belonged in Lime.

That was my faulty assumption. I’d assumed that because the feature was on Lime’s wishlist, it belonged in Lime. But Lime is about making code work the same on all platforms, and web workers are just too different, no matter how much I try to cover them up with syntax sugar. In reality, the feature doesn’t belong in Lime or on Lime’s wishlist, a fact that became clear only after months of work.

Once again, I’m not starting over here. For a time, I thought I had to, but in fact my code is pretty much fine. My mistake was trying to put that code where it didn’t belong. The correct place would be a standalone library, which is the new plan. (As for Lime issue #1081, I’ve come up with a promising single-threaded option. Not quite the same, but still good.)

I’m confident I’m making the right decision here. The pieces finally fit together and the finish line is in sight.

Hopefully, Martin is making the right decision too. His finish line is farther off, but he’s made a good map to guide him there. Whether he burns that map upon arrival remains to be seen.

Guide to threads in Lime

Disclaimer: this guide focuses on upcoming features, currently only available via my pull request.

This post has a lot of overlap with my previous post, but is meant to be more accessible, putting the focus on practice instead of theory. Start here, then read the other if you want to learn more.

When to use threads

Threads are a powerful tool, allowing you to run multiple pieces of code at the same time. But before you use them, you want to be sure they’re right for your task. Threads are most appropriate when you have a large but self-contained task, such as:

  • Downloading a video. (Or downloading anything, really.)
  • Parsing a multiple-megabyte XML file.
  • Procedurally generating a level of a game.

Threads are less appropriate for small tasks, as spinning up a new thread is likely to take longer than just doing the task normally. They also aren’t great for things like physics engines, which take input and give output multiple times a frame.

Choosing a thread class

Once you decide to use threads, you need to decide which class(es) you’re going to use. These classes fall into two categories: thread pools and not-thread-pools. If you only need to perform a job once, you probably don’t need a thread pool. But if you’re going to run several related jobs, you probably want a thread pool to avoid the overhead of shutting down a thread only to spin it back up.

Haxe and Lime each provide both options.

Haxe’s classes:

  • Are lower-level. They provide more freedom, but less protection from thread-related crashes.
  • Are more portable. Since they don’t depend on Lime, they’ll work with Kha and Heaps.
  • Only work on certain targets (C#, Neko, C++, Java, Python, and HashLink).

Haxe’s classes include Thread (the base class on which everything else is built, designed to run individual jobs) and two thread pool implementations: FixedThreadPool and ElasticThreadPool. Unless you expect to need a constant number of threads, ElasticThreadPool will usually be better since it adjusts based on demand.

Lime’s classes:

  • Are higher-level. They include convenience features, safety features, and some restrictions.
  • Include support for JavaScript threads. (Or at least, they will once my pull request is done. If you know your way around Git you can use JavaScript threads now.)
  • Fall back to synchronous execution on unsupported targets. (So your code will always compile, but performance will suffer in Lua, PHP, and Flash.)

Lime’s classes are BackgroundWorker (the more basic option, designed to run a single job) and ThreadPool (which combines features from Haxe’s two thread pool classes).

Since you’re here reading this (and to keep the post to a reasonable length), I’m going to assume you’re using Lime. If you know which one you want, you can skip ahead to the BackgroundWorker or ThreadPool section.

Example: texture generation

Let’s suppose you want to generate textures using libnoise. You install it using haxelib install libnoise, glance at the provided samples, and write some test code.

Keep moving your mouse after clicking the example, and you’ll see how badly it lags. And believe me, the lag gets far worse in fullscreen.

We still want to do this computation, but we also don’t want the app to hang in the process.

Using Thread

The most basic solution, as mentioned above, is to use Haxe’s Thread class. If we didn’t care about thread safety, it could be as simple as passing our function to Thread.create():

-generateNoise(generators[index]);
+Thread.create(generateNoise.bind(generators[index]));

Even though this isn’t good code, it’s enough for this simplified example. Thanks to the background thread, the app keeps responding. Downside is, each job takes up to 1.5 seconds longer, for reasons that I can’t quite pin down. Might just be how my machine schedules threads. Fortunately it’s only obvious for functions that are normally very fast, which don’t actually have to go on a background thread.

I’d love to show all this in action, but Thread doesn’t work in HTML5.

Using BackgroundWorker

Instead of Thread, I recommend BackgroundWorker for one-off tasks. To use it, you’ll need a function with the signature Dynamic -> Void. That is, it needs to take one argument and return nothing.

  • If the function took multiple values, you’ll need to combine them into a single object. Probably an anonymous structure, but if you want to put in the effort of making a class, go right ahead.
  • If the function took zero values, you still have to add a parameter, but you don’t have to use it. I like to call it _ in that case, in reference to switch statements where a single underscore indicates an unused value.
  • Instead of returning a value, pass that value to sendComplete().
  • Instead of receiving a return value, listen for the BackgroundWorker‘s onComplete event.

Initializing a BackgroundWorker should look like this:

bgWorker = new BackgroundWorker();

//Add listeners that you've defined elsewhere.
bgWorker.onComplete.add(completeListener);
bgWorker.onProgress.add(progressListener); //If applicable
bgWorker.onError.add(errorListener); //If applicable

//Run a function that you've defined elsewhere, passing up to one argument.
bgWorker.run(doWorkFunction, message);

My texture generation example already uses a Dynamic -> Void function, but that function modifies the bitmap directly. A background thread should avoid modifying the main thread’s objects, so first I want to separate the code that does the heavy lifting from the code that refers to the bitmap.

Now I have a function to run on the background thread:
private function generateNoiseBytes(width:Int, height:Int, generator:Generator):ByteArray. You’ll note that this very much isn’t a Dynamic -> Void function, so next I’ll follow the checklist above. Merge the three arguments into one, replace return bytes with bgWorker.sendComplete(bytes), initialize bgWorker, add an onComplete listener, and call run().

With these changes in place, my example is up and running. There’s still the 1.5 second delay, but the code is far more thread-safe than above.

Bonus round: I could check bgWorker.canceled occasionally, and return if it’s true. Plus, I still haven’t added JavaScript support. This code will function in JavaScript, but the whole thing will freeze, as if it wasn’t using BackgroundWorker at all. (See the original demo for what this looks like.)

Using ThreadPool

For functions that you plan to run repeatedly, I recommend ThreadPool. To use it, you’ll need a function with the signature Dynamic -> Void. That is, it needs to take one argument and return nothing.

  • If the function took multiple values, you’ll need to combine them into a single object. Probably an anonymous structure, but if you want to put in the effort of making a class, go right ahead.
  • If the function took zero values, it might not be suitable for this. You’re going to be running it multiple times, with the only difference being the input.
  • Instead of returning a value, pass that value to sendComplete().
  • Instead of receiving the return values, listen for the ThreadPool‘s onComplete event. Note that all listeners will be called when any job in the pool calls sendComplete(); you can’t listen for a specific job. If it matters, make sure to include distinguishing information in the event itself.

But hang on a second, that’s how you run the same function over and over. What if you want to run several different functions? The trick in that case is to write a Dynamic -> Void function that takes another function as an argument and runs that function. (To see this in action, look at FutureWork.)

In any case, initializing a ThreadPool should look like this:

//Provide a function that you've defined elsewhere.
threadPool = new ThreadPool(doWorkFunction);

//Add listeners that you've defined elsewhere.
threadPool.onComplete.add(completeListener);
threadPool.onProgress.add(progressListener); //If applicable
threadPool.onError.add(errorListener); //If applicable

//Optionally, adjust the number of threads. These numbers are just
//examples, so don't read into them.
threadPool.minThreads = 1;
threadPool.maxThreads = 3;

//Run your previously-specified function, passing up to one argument.
threadPool.queue(message);

My texture generation example already uses a Dynamic -> Void function, but that function modifies the bitmap directly. A background thread should avoid modifying the main thread’s objects, so first I want to separate the code that does the heavy lifting from the code that refers to the bitmap.

Now I have a function to run on the background thread:
private function generateNoiseBytes(width:Int, height:Int, generator:Generator):ByteArray. You’ll note that this very much isn’t a Dynamic -> Void function, so next I’ll follow the checklist above. Merge the three arguments into one, replace return bytes with threadPool.sendComplete(bytes), initialize threadPool, add an onComplete listener, and call queue().

With these changes in place, my example is up and running. There’s still the 1.5 second minimum, but the code is far more thread-safe than above. Setting threadPool.minThreads = 1 doesn’t seem to help with this delay, but I’m doing it anyway because in theory it’s more efficient.

I still haven’t added JavaScript support. This code will function in JavaScript, but the whole thing will freeze, as if it wasn’t using BackgroundWorker at all. (See the original demo for what this looks like.)

I’ve been doing my best to write thread-safe code, but I’ve also glossed over the details. Now, it’s time for a closer look.

Improving thread safety

Both BackgroundWorker and ThreadPool provide tools for thread safety, but you have to make sure to use them.

The biggest problem with Haxe’s threads is the fact that they’re just so convenient. You can access any variable from any thread, so that function you designed for the main thread will still work if you run it on a background thread. That’s great, except sharing memory across threads has a nasty habit of creating race conditions. For instance, imagine you’re calling Assets.getBitmapData(), which contains this code:

if (useCache && cache.enabled && cache.hasBitmapData(id))
{
    var bitmapData = cache.getBitmapData(id);

    if (isValidBitmapData(bitmapData))
    {
        return bitmapData;
    }
}

//...

if (useCache && cache.enabled)
{
    cache.setBitmapData(id, bitmapData);
}

If two threads call this function at once, they might both end up adding the same bitmap to the cache, since it wasn’t there when either of them started, and neither checked again. Best-case scenario, one overwrites the other and some memory goes to waste. Worst-case scenario, the setBitmapData() calls happen at once, and completely breaking the underlying Map. (Yes, that link is discussing Java, but I have no reason to believe Haxe’s maps are safer.)

Problems like this abound when you’re sharing memory, and you’re going to need to comb through your code looking for references to outside variables. Even ones hidden behind function calls, like in Assets.getBitmapData(). While I’m far from an expert on race conditions, here are the guidelines I follow:

  • Reading a value has manageable risks.
    • If the value is never going to be modified, you’re golden. You can read it as much as you like, on whichever thread you like. (Same thing if it isn’t modified until the thread is done. For instance, setting bgWorker = null in the onComplete/onError event listener. If that’s the only place you ever set bgWorker = null, then your background thread doesn’t have to worry about it being null.)
    • If other threads are going to modify the value, be aware that they could do it at any time. This is mainly a problem if you’re accessing the variable multiple times in a row, expecting it to be the same both times. For instance, if the background thread runs if(x != null && x.y != null) /* ... */, the main thread could set x to null after the first check but before the second, and then you have a null pointer error on your hands. Solution? Cache it: var localX = x; if(localX != null && localX.y != null) /* ... */.
  • Setting a value is dangerous. Take precautions.
    • One option is to declare that the variable should only be modified by the background thread, that the main thread is only allowed to read it. Assuming you only have the one background thread, this should work just as well as variables that only the main thread modifies. (Only this time, the main thread has to follow the “reading a value” guidelines above.)
    • Other than that, the simplest safe option is to use the built-in sendProgress() function. Both BackgroundWorker and ThreadPool provide this, and it will safely send any value you like back to the main thread. For example, if you’re trying to add bitmapData to an array, you’d call bgWorker.sendProgress(bitmapData), allowing the main thread to store it:
      bgWorker.onProgress.add(function(data:Dynamic) {
          if(data is BitmapData)
              bitmaps.push(bitmapData);
      })
    • If you’re an advanced user, you can use Haxe’s built-in tools, like Mutexes and Deques. Just be warned that these won’t work in JavaScript, if you care about that.

Going back to Assets, we need to disable useCache to avoid an unsafe operation. Optionally, the background thread could make its own cache, to make up for this.

With these adjustments, you should have safe threads on most targets. However, if you want threads in JavaScript, there’s more you need to know.

Supporting JavaScript

And now we come to the final boss: web workers. My other post covers these in greater detail, but here I’m going to focus on the how instead of the why. (And as a reminder, these are only available as a pull request for now.)

If you want threads in JavaScript, first you have to enable web workers. (They’re disabled by default because they break backwards compatibility.) If you don’t opt in, BackgroundWorker and ThreadPool will execute synchronously, meaning when you run them, the app will hang until the function finishes. To opt in, add this to project.xml above your <haxelib /> tags:

<haxedef name="lime-web-workers" if="html5" />

With this enabled, Lime will attempt to run your code in a web worker, and it will almost certainly crash. A lot of features you take for granted in Haxe development simply won’t work, all because memory isn’t shared. Briefly:

  • You can’t access static or instance variables. You can’t even access this.
  • You can’t run static or instance functions.
  • You can’t create instances of Lime or OpenFL’s classes, or your own classes. Only built-in JavaScript classes are available.
  • You can’t pass functions to or from a worker.

Fortunately, sendProgress(), sendComplete(), and sendError() will still work exactly as they used to.

For everything else, there are workarounds.

  • If the worker calls an outside function, consider whether you can inline that function. Inline functions get copied over.
  • If the worker will refer to the class, add that class to the worker’s headerCode. For instance, running bgWorker.headerCode.addClass(MyClass) on the main thread will make a copy of MyClass available to the worker. You might need to do some troubleshooting:
    • If you get errors about missing classes, you’ll need to add those classes to the header as well. (If this starts to spiral out of control, rethink whether you need the original class.)
    • Some or all of the class’s static variables may not be initialized, which can also cause errors. The worker will have to initialize these variables. (Reminder: this isn’t the same static variable as on the main thread. It’s a copy, and isn’t shared.)
  • If the worker tries to refer to an outside variable, instead pass that variable as a message. For instance, starting the worker with bgWorker.run(myFunction, x) will pass a copy of x to myFunction. (Recall that if you want to copy multiple values, you’ll need to add them to an anonymous structure: bgWorker.run(myFunction, { x: x, y: y }).)
    • You may find that some instance methods are unavailable. For instance, x.getLength() might not exist. In that case, either inline the functions or call ThreadFunction.restoreInstanceMethods(x) at the beginning of the worker function.
  • If you want to pass a function as a message (as often happens when using ThreadPool), first convert it to a ThreadFunction. Note that explicit casts won’t work; you must use an implicit cast:
    //Explicit cast (will not work)
    var t = cast(myFunction, ThreadFunction<() -> Void>);
    
    //Implicit cast
    var t:ThreadFunction<() -> Void> = myFunction;
    
    //Implicit cast
    var t = (myFunction:ThreadFunction<() -> Void>);

Sadly, the restrictions imposed by JavaScript mean there’s no one-size-fits-all solution. You’ll have to look over your code and decide what compromises you can make. You may need to dig deeper.

(Hopefully coming soon: a working HTML5 version of that texture generation example.)

Good luck out there!

Web Workers in Lime

If you haven’t already read my guide to threads, I suggest starting there.

I’ve spent the last month implementing web worker support in Lime. (Edit: and then I spent another month after posting this.) It turned out to be incredibly complicated, and though I did my best to include documentation in the code, I think it’s worth a blog post too. Let’s go over what web workers are, why you might want to use them, and why you might not want to use them.

To save space, I’m going to assume you’ve heard of threads, race conditions, and threads in Haxe.

About BackgroundWorker and ThreadPool

BackgroundWorker and ThreadPool are Lime’s two classes for safely managing threads. They were added back in 2015, and have stayed largely unchanged since. (Until this past month, but I’ll get to that.)

The two classes fill different roles. BackgroundWorker is ideal for one-off jobs, while ThreadPool is a bit more complex but offers performance benefits when doing multiple jobs in a row.

BackgroundWorker isn’t too different from calling Thread.create() – both make a thread and run a single job. The main difference is that BackgroundWorker builds in safety features.

Recently, Haxe added its own thread pool implementations: FixedThreadPool has a constant number of threads, while ElasticThreadPool tries to add and remove threads based on demand. Lime’s ThreadPool does a combination of the two: you can set the minimum and maximum number of threads, and it will vary within that range based on demand. Plus it offers structure and safety features, just like BackgroundWorker. On the other hand, ThreadPool lacks ElasticThreadPool‘s threadTimeout feature, so threads will exit instantly if they don’t have a job to do.

I always hate reinventing the wheel. Why does Lime need a ThreadPool class when Haxe already offers two? (Ignoring the fact that Lime’s came first.) Just because of thread safety? There are other ways to achieve that.

If only Haxe’s thread pools worked in JavaScript…

Web workers

Mozilla describes web workers as “a simple means for web content to run scripts in background threads.” “Simple” is a matter of perspective, but they do allow you to create background threads in JavaScript.

Problem is, they have two fundamental differences from Haxe’s threads, which is why Haxe doesn’t include them in ElasticThreadPool and FixedThreadPool.

  • Web workers use source code.
  • Web workers are isolated.

Workers use source code

Web workers execute a JavaScript file, not a JavaScript function. Fortunately, it is usually possible to turn a function back into source code, simply by calling toString(). Usually. Let’s start with how this works in pure JavaScript:

function add(a, b) {
    return a + b;
}

console.log(add(1, 2)); //Output: 3
console.log(add.toString()); //Output:
//function add(a, b) {
//    return a + b;
//}

That first log() call is just to show the function working. The second shows that we get the function source code as a string. It even preserved our formatting!

If we look at the examples, we find that it goes to great lengths to preserve the original formatting.

toString() input toString() output
function f(){} "function f(){}"
class A { a(){} } "class A { a(){} }"
function* g(){} "function* g(){}"
a => a "a => a"
({ a(){} }.a) "a(){}"
({ [0](){} }[0]) "[0](){}"
Object.getOwnPropertyDescriptor({ get a(){} }, "a").get "get a(){}"
Object.getOwnPropertyDescriptor({ set a(x){} }, "a").set "set a(x){}"
Function.prototype.toString "function toString() { [native code] }"
(function f(){}.bind(0)) "function () { [native code] }"
Function("a", "b") "function anonymous(a\n) {\nb\n}"

That’s weird. In two of those cases, the function body – the meat of the code – has been replaced with “[native code]”. (That isn’t even valid JavaScript!) As the documentation explains:

If the toString() method is called on built-in function objects or a function created by Function.prototype.bind, toString() returns a native function string

In other words, if we ever call bind() on a function, we can’t get its source code, meaning we can’t use it in a web worker. And wouldn’t you know it, Haxe automatically calls bind() on certain functions.

Let’s try writing some Haxe code to call toString(). Ideally, we want to write a function in Haxe, have Haxe translate it to JavaScript, and then get its JavaScript source code.

class Test {
    static function staticAdd(a, b) {
        return a + b;
    }
    
    function add(a, b) {
        return a + b;
    }
    
    static function main() {
        var instance = new Test();
        
        trace(staticAdd(1, 2));
        trace(instance.add(2, 3));
        
        #if js
        trace((cast staticAdd).toString());
        trace((cast instance.add).toString());
        #end
    }
    
    inline function new() {}
}

If you try this code, you’ll get the following output:

Test.hx:15: 3
Test.hx:16: 5
Test.hx:18: staticAdd(a,b) {
        return a + b;
    }
Test.hx:19: function() {
    [native code]
}

The first two lines prove that both functions work just fine. staticAdd is printed exactly like it appears in the JavaScript file. But instance.add is all wrong. Let’s look at the JS source to see why:

static main() {
    let instance = new Test();
    console.log("Test.hx:15:",Test.staticAdd(1,2));
    console.log("Test.hx:16:",instance.add(2,3));
    console.log("Test.hx:18:",Test.staticAdd.toString());
    console.log("Test.hx:19:",$bind(instance,instance.add).toString());
}

Yep, there it is. Haxe inserted a call to $bind(), a function that – perhaps unsurprisingly – calls bind().

Turns out, Haxe always inserts $bind() when you try to refer to an instance function. This is in fact required: otherwise, the function couldn’t access the instance it came from. But it also means we can’t use instance functions in web workers. Or can we?

After a lot of frustration and effort, I came up with ThreadFunction. Read the source if you want details; otherwise, the one thing to understand is that it can only remove the $bind() call if you convert to ThreadFunction ASAP. If you have a variable (or function argument) representing a function, that variable (or argument) must be of type ThreadFunction.

//Instead of this...
class DoesNotWork {
    public var threadFunction:Dynamic -> Void;
    
    public function new(threadFunction:Dynamic -> Void) {
        this.threadFunction = threadFunction;
    }
    
    public function runThread():Void {
        new BackgroundWorker().run(threadFunction);
    }
}

//...you want to do this.
class DoesWork {
    public var threadFunction:ThreadFunction<Dynamic -> Void>;
    
    public function new(threadFunction:ThreadFunction<Dynamic -> Void>) {
        this.threadFunction = threadFunction;
    }
    
    public function runThread():Void {
        new BackgroundWorker().run(threadFunction);
    }
}

class Main {
    private static function main():Void {
        new DoesWork(test).runThread(); //Success
        new DoesNotWork(test).runThread(); //Error
    }
    
    private static function test(_):Void {
        trace("Hello from a background thread!");
    }
}

Workers are isolated

Once we have our source code, creating a worker is simple. We take the string and add some boilerplate code, then construct a Blob out of this code, then create a URL for the blob, then create a worker for that URL, then send a message to the worker to make it start running. Or maybe it isn’t so simple, but it does work.

Web workers execute a JavaScript source file. The code in the file can only access other code in that file, plus a small number of specific functions and classes. But most of your app resides in the main JS file, and is off-limits to workers.

This is in stark contrast to Haxe’s threads, which can access anything. Classes, functions, variables, you name it. Sharing memory like this does of course allow for race conditions, but as mentioned above, BackgroundWorker and ThreadPool help prevent those.

For a simple example:

class Main {
    private static var luckyNumber:Float;
    
    private static function main():Void {
        luckyNumber = Math.random() * 777;
        new BackgroundWorker().run(test);
    }
    
    private static function test(_):Void {
        trace("Hello from a background thread!");
        trace("Your lucky number is: " + luckyNumber);
    }
}

On most targets, any thread can access the Main.luckyNumber variable, so test() will work. But in JavaScript, neither Main nor luckyNumber will have been defined in the worker’s file. And even if they were defined in that file, they’d just be copies. The value will be wrong, and the main thread won’t receive any changes made.

So… how do you transfer data?

Passing messages

I’ve glossed over this so far, but BackgroundWorker.run() takes up to two arguments. The first, of course, is the ThreadFunction to run. The second is a message to pass to that function, which can be any type. (And if you need multiple values, you can pass an array.)

Originally, BackgroundWorker was designed to be run multiple times, each time reusing the same function but working on a new set of data. It wasn’t well-optimized (ThreadPool is much more appropriate for that) nor well-tested, but it was very convenient for implementing web workers.

See, web workers also have a message-passing protocol, allowing us to send an object to the background thread. You know, an object like BackgroundWorker.run()‘s second argument:

class Main {
    private static var luckyNumber:Float;
    
    private static function main():Void {
        luckyNumber = Math.random() * 777;
        new BackgroundWorker().run(test, luckyNumber);
    }
    
    private static function test(luckyNumber:Float):Void {
        trace("Hello from a background thread!");
        trace("Your lucky number is: " + luckyNumber);
    }
}

The trick is, instead of trying to access Main.luckyNumber (which is on the main thread), test() takes an argument, which is the same value except copied to the worker thread. You can actually transfer a lot of data this way:

new BackgroundWorker().run(test, {
    luckyNumber: Math.random() * 777,
    imageURL: "https://www.example.com/image.png",
    cakeRecipe: File.getContent("cake.txt"),
    calendar: Calendar.getUpcomingEvents(10)
});

Bear in mind that your message will be copied using the structured clone algorithm, a deep copy algorithm that cannot copy functions. This sets limits on what kinds of messages you can pass. You can’t pass a function without first converting it to ThreadFunction, nor can you pass an object that contains functions, such as a class instance.

Copying your message is key to how JavaScript prevents race conditions: memory is never shared between threads, so two threads can’t accidentally access the same memory location at the wrong time. But if there’s no sharing, how does the main thread get any information back from the worker?

Returning results

Web workers don’t just receive messages, they can send them back. The rules are the same: everything is copied, no functions, etc.

The BackgroundWorker class provides three functions for this, each representing something different. sendProgress() for status updates, sendError() if something goes horribly wrong, and sendComplete() for the final product. (You may recall that workers don’t normally have access to Haxe functions, but these three are inlined. Inline functions work fine.)

It’s at about this point we need to talk about another problem with copying data. One common reason to use background threads is to process large amounts of data. Suppose you produce 10 MB of data, and you want to pass it back once finished. Your computer is going to have to make an exact copy of all that data, and it’ll end up taking 20 MB in all. Don’t get me wrong, it’s doable, but it’s hardly ideal.

It’s possible to save both time and memory using transferable objects. If you’ve stored your data in an ArrayBuffer, you can simply pass a reference back to the main thread, no copying required. The worker thread loses access to it, and then the main thread gains access (because unlike Haxe, JavaScript is very strict about sharing memory).

ArrayBuffer can be annoying to use on its own, so it’s fortunate that all the wrappers are natively available. By “wrappers,” I’m talking about Float32Array, Int16Array, UInt8Array, and so on. As long as you can represent your data as a sequence of numbers, you should be able to find a matching wrapper.

Transferring a buffer looks like this: backgroundWorker.sendComplete(buffer, [buffer]). I know that looks redundant, and at first I thought maybe backgroundWorker.sendComplete(null, [buffer]) could work instead. But the trick is, the main thread will only receive the first argument (a.k.a. the message). If the message doesn’t contain some kind of reference to buffer, then the main thread won’t have any way to access buffer.

But the two arguments don’t have to be identical. You can pass a wrapper (e.g., an Int16Array) as the message, and transfer the buffer inside: backgroundWorker.sendComplete(int16Array, [int16Array.buffer]). The Int16Array numeric properties (byteLength, byteOffset, and length) will be copied, but the underlying buffer will be moved instead.

Great! You now know how to transfer an ArrayBuffer. But what if you have your own class instances you want to use? It’s time to take another look at that.

Using classes in web workers

(Still working on this part; check back later.)

Status update: fall 2021

Since my last status update, I’ve spent about half my time working on Runaway and half working on an Android release.

In Runaway news, I’ve made important design decisions about how to load levels. While I’m proud of the design, it’s complicated enough that I’ll have to save it for a post with the technical tag.

I’m pleased to announce that after a difficult month, Run Mobile has been updated! On Android, at least. We made it with a few days to spare before Google’s November deadline, meaning that they won’t lock the app down.

Finally, I’m continuing to work on the story in the background. I have several cutscene scripts in the works – much like the one I released in June – but it turns out, it’s easier to start new ones than to sit down and finish them. Endings are hard!

Moving forward:

  • I may take a week or so to update some other Android libraries. The past month was less painful than expected, so I might as well get those updated before Google makes another big change to break my workflow. (Update: it took one day.)
  • I’ll get back to work on Run. Add Infinite Mode, adjust the physics, maybe do another pass on the animations.
  • I’ll start rebuilding Run 2 using Runaway.
  • Maybe I’ll even finish a cutscene script.

Supporting 64-bit devices

When I left off last week, I was told that I needed to upload my app in app bundle format, instead of APK format. The documentation there may seem intimidating, but clicking through the links eventually brought me to instructions for building an app bundle with Gradle. (There are other ways to do it, but I’m already using Gradle.) It’s as simple as swapping the command I give to Gradle – instead of assembleRelease, I run bundleRelease. And it seems to work. At least, Google Play accepts the bundle.

But then Google gives me another error. I’ve created a 32-bit app, and from now on Google requires 64-bit support. I do like 64-bit code in theory, but at this stage it’s also kind of scary. I’ll need to mess with the C++ compile process, which I’m not familiar with. And I’m stuck with old versions of Lime and hxcpp, so even if they’ve added 64-bit support, I can’t use that.

Initially, I got a bit of false hope, as the documentation says “Enabling builds for your native code is as simple as adding the arm64-v8a and/or x86_64, depending on the architecture(s) you wish to support, to the ndk.abiFilters setting in your app’s ‘build.gradle’ file.” I did that, and it seemed to work. It compiled and uploaded, at least, but it turned out the app wouldn’t start because it couldn’t find libstd.so.

I knew I’d seen that error ages ago, but wasn’t sure where. Eventually after a lot of trial and error, I tracked it down to the ndk.abiFilters setting. Yep, that was it. Attempting to support 64-bit devices just breaks it for everybody, and the reason is that I don’t actually have any 64-bit shared libraries (a.k.a. .so files). This means I need to:

  1. Track down all the shared libraries the app uses.
  2. Compile 64-bit versions of each one.
  3. Include them in the build.

And I have only about a week to do it.

Tracking down shared libraries

The 32-bit app ends up using seven shared libraries: libadcolony.so, libApplicationMain.so, libjs.so, liblime-legacy.so, libregexp.so, libstd.so, and libzlib.so.

Three of them (regexp, std, and zlib) are from hxcpp, located in hxcpp’s bin/Android folder. lime-legacy is from Lime, naturally, and is found in Lime’s lime-private/ndll/Android folder. ApplicationMain is my own code, and is found inside my project’s bin/android/obj folder. From each of these locations, the shared libraries are copied into my Android project, specifically into app/src/main/jniLibs/armeabi.

The “adcolony” and “js” libraries are slightly different. Both of those are downloaded during when Gradle compiles the Android app. Obviously the former is from AdColony, and I think the latter is too. Since both have 64-bit versions, I don’t think I need to worry about them.

Interestingly, Lime already has a few different versions of liblime-legacy.so, but no 64-bit version. If I had a 64-bit version, it would go in app/src/main/jniLibs/arm64-v8a, where Gradle will look for it. (Though since I’m using an older Gradle plugin than 4.0, I may have to figure out how to include CMake IMPORTED targets, whatever that means.)

As far as I can tell, that’s a complete list. C++ is the main problem when dealing with 32- and 64-bit support. On Android, C++ code has to go in a shared object. All shared objects go inside the lib folder in an APK, and the above is a complete list of the contents of my app’s lib folder. So that’s everything. I hope.

Compiling hxcpp’s libraries

How you compile a shared library varies depending on where it’s from, so I’ll take them one at a time.

I looked at hxcpp first, and found specific instructions. Run neko build.n android from a certain folder to recompile the shared libraries. This created several versions of libstd.so, libregexp.so, and libzlib.so, including 64-bit versions. Almost no effort required.

Next, I got started working on liblime-legacy.so, but it took a while. Eventually, I realized I needed to test a simple hypothesis before I wasted too much time. Let’s review the facts:

  • When I compile for 32-bit devices only, everything works. Among other things, libstd.so is found.
  • When I compile for 32-bit and 64-bit devices, it breaks. libstd.so is not found.
  • Even though it can’t be found, libstd.so is present in the APK, inside lib/armeabi. (That’s the folder with code for 32-bit devices.)
  • lib/arm64-v8a (the one for 64-bit devices) contains only libjs.so and libadcolony.so.

My hypothesis: because the arm64-v8a folder exists, my device looks only in there and ignores armeabi. If I put libstd.so there, the app should find it. If not, then I’m not going to be able to use liblime-legacy.so either.

Test #1: The 64-bit version of libstd.so is libstd-64.so. (Unsurprisingly.) Let’s add it to the app under that name. I don’t think this will work, but I can at least make sure it ends up in the APK. Result: libstd-64.so made it into the APK, and then the app crashed because it couldn’t find libstd.so.

Test #2: Actually name it the correct thing. This is the moment of truth: when the app crashes (because it will crash), what will the error message be? Result: libstd.so made it into the APK, and then the app crashed because it couldn’t find libregexp.so. Success! That means it found the library I added.

Test #3: Add libregexp.so and libzlib.so. This test isn’t so important, but I have the files sitting around, so may as well see what happens. My guess is, liblime-legacy.so is next. Result: could not find liblime-legacy.so, as I guessed.

(For the record, I’m not doing any of this the “right” way, which means if I clear my build folder or switch to another machine, it’ll stop working. But I’ll get to that later.)

Compiling Lime’s library

Like hxcpp, Lime comes with instructions, but unlike hxcpp, they didn’t work first try. From the documentation you’d think lime rebuild android -64 would do it, but that’s for Intel processors (Android typically uses Arm). So the correct command is lime rebuild android -arm64, but even that doesn’t work.

Turns out, AndroidPlatform only compiles for three specific 32-bit architectures, and ignores any others you request. I’m going to need to add a 64-bit option there.

Let’s jump forwards in time and see what the latest version of AndroidPlatform looks like. …What do you know, it now supports 64-bit architectures. Better yet, the rest of the code is practically unchanged (they renamed a class, but that’s about it). Since it’s so similar, I should be able to copy over the new code, adjusting only the class name. Let’s give that a try…

…and I figured out why the 64-bit option wasn’t included yet. The compiler immediately crashes with a message that it can’t find stdint.h. Oh, and the error occurred inside the stdint.h file. So it went looking for stdint.h, and found it, but then stdint.h told it to find stdint.h, and it couldn’t find stdint.h. Makes sense, right?

According to the tech support cheat sheet, what you do is search the web for a few words related to the problem, then follow any advice. When I did, I found someone who had the same bug (including the error message pointing to stdint.h), and the accepted solution was to target Android 21 because that’s the first one that supports 64-bit. Following that advice, I did a find and replace, searching all of Lime’s files for “android-9” and replacing with “android-21”. And it worked.

As expected, fixing one problem just exposed another. I got an error about how casting from a pointer to int loses precision. I’m certain this is only the first of many, many similar errors, since all this code was designed around 32 bit pointers. It should be fixable, in one of several ways. As an example of a bad way to fix it, I tried changing int to long. A long can hold a 64-bit pointer, but it’s overkill on 32-bit devices, and it’s even possible that the mismatch would cause subtle errors.

But hey, with that change, the compile process succeeded. Much to my surprise. I was expecting an endless stream of errors from all the different parts of the code that aren’t 64-bit-compatible, but apparently those all turned into warnings, so I got them all at once instead of one at a time. These warnings ended up falling into three distinct groups.

  • Three warnings came from Lime-specific code. After consulting the modern version of this code, I made some educated guesses about how to proceed. First, cast values to intptr_t instead of int, because the former will automatically adjust for 64 bits. (Actually I went with uintptr_t, but it probably doesn’t matter.) Second, when the pointer is passed to Haxe code, pass it as a Float value, because in Haxe that’s a 64-bit value. Third, acknowledge that step 2 was very weird and proceed anyway, hoping it doesn’t matter.
  • A large number of warnings came from OpenAL (an open-source audio library, much like how OpenGL is an open-source graphics library and OpenFL is an open-source Flash library). I was worried that I’d have to fix them all by hand, but eventually I stumbled across a variable to toggle 32- vs. 64-bit compilation. But luckily, the library already supported 64 bits, and I just had to enable it. (Much safer than letting me implement it.)
  • cURL produced one warning – apparently it truncates a pointer to use as a random seed. I don’t know if that’s a good idea, but I do know the warning is irrelevant. srand works equally well if you give it a full 32-bit pointer or half of a 64-bit pointer.

Ignoring the cURL warning, the build proceeded smoothly. Four down, one to go.

Copying files the correct way

As I mentioned earlier, I copied hxcpp’s libraries by hand, which is a temporary measure. The correct way to copy them into the app is through Lime, specifically AndroidPlatform.hx. Like last time I mentioned that file, this version only supports 32-bit architectures, but the latest version supports more. Like before, my plan is copy the new version of the function, and make a few updates so it matches the old.

Then hit compile, and if all goes well, it should copy over the 64-bit version of the four shared libraries I’ve spent all week creating. And if I’m extra lucky, they’ll even make it into the APK. Fingers crossed, compiling now…

Compilation done. Let’s see the results. In the Android project, we look under jniLibs/arm64-v8a, and find:

  1. libApplicationMain.so
  2. liblime-legacy.so
  3. libregexp.so
  4. libstd.so
  5. libzlib.so

Hey, cool, five out of four libraries were copied successfully. (Surprise!)

I might’ve glossed over this at the start of this section, but AndroidPlatform.hx is what compiles libApplicationMain.so. When I enabled the arm64 architecture to make it copy all the libraries, it also compiled my Haxe code for 64 bits and copied that in. On the first try, too.

Hey look at that, I have what should be a complete APK. Time to install it. Results: it works! The main menu shows up (itself a huge step), and not only that, I successfully started the game and played for a few seconds.

More audio problems

And then it crashed when I turned the music on, because apparently OpenAL still doesn’t work. There was a stack trace showing where the null pointer error happened, but I had to dig around some more to figure out where the variable was supposed to be set. (It was actually in a totally different class, and that class had printed an error message but I’d ignored the error message because it happened well before the crash.)

Anyway, the problem was it couldn’t open libOpenSLES.so, even though that library does exist on the device. And dlerror() returned nothing, so I was stumped for a while. I wrote a quick if statement to keep it from crashing, and resigned myself to a silent game for the time being.

After sleeping on it, I poked around a little more. Tried several things without making progress, but then I had the idea to try loading the library in Java. Maybe Java could give me a better error message. And it did! “dlopen failed: "/system/lib/libOpenSLES.so" is 32-bit instead of 64-bit.” Wait, why does my 64-bit phone not have 64-bit libraries? Let me take another look… aha, there’s also a /system/lib64 folder containing another copy of libOpenSLES.so. I bet that’s the one I want. Results: it now loads the library, but it freezes when it tries to play sound.

It didn’t take too long to track the freeze down to a threading issue. It took a little longer to figure out why it refused to create this particular thread. It works fine in the 32-bit version, but apparently “round robin” scheduling mode is disallowed in 64-bit code. Worse, there’s very little information to be found online, and what little there is seems to be for desktop machines with Intel processors, not Android devices with Arm processors. My solution: use the default scheduling mode/priority instead of round robin + max priority. This seems to work on my device, and the quality seems unaffected. Hopefully it holds up on lower-end devices too.

Conclusion

And that’s about it for this post. It was a lot of work, and I’m very glad it came together in the end.

Looking back… very little of this code will be useful in the future. These are stopgap measures until Runaway catches up. Once that happens, I can use the latest versions of Lime and OpenFL, which support 64-bit apps (and do a better job of it than I did here). I will be happy once I can consign this code to the archives and never deal with it again.

But.

Work like this is never wasted. The code may not be widely useful, but I’ve learned a lot about Android, C++, and how the two interact. I’ve learned about Lime, too, digging deeper into its build process, its native code, and several of its submodules. (Which will definitely come in handy because I’m one of the ones responsible for its future.)

The app is just about ready for release, with about a week to spare. But I still have a couple things to tidy up, and Kongregate would like some more time to test, so we’re going to aim for the middle of next week, still giving us a few days to spare.

Testing the Google Play Billing Library

To recap, I’ve been working to update the Google Play Billing Library, because if I don’t update it by November 1, Google will lock the app and I’ll never be able to update it again. Pretty important.

When I left off last week, I’d rewritten a lot of code, but hadn’t been able to test it due to an infinite loop. This week, I was hoping to fix the infinite loop quickly and move on to the many bugs in my as-yet-untested billing library integration.

Tracking down the infinite loop

This was a tough bug. While it’s far from the toughest I’ve ever tracked down, it seemed mysterious enough that shortly before finding it, I was nearly ready to give up and look for a workaround. (There’s a motivational poster in there somewhere.) But I’m getting ahead of myself.

At first, I had almost nothing to go off of. The app would start, log a few unrelated messages, and then hang with a blank screen, with no clue as to why. The lag (and eventual “app not responding” message) tipped me off that it was an infinite loop, but that isn’t much to go on.

My biggest clue (or so I thought) was knowing what code most recently changed: the billing library. So somehow that must be looping forever, right? Well… I commented out the “connect to billing library” code, and nothing changed. I commented out more and more features of the library, followed by lots of other related and barely-related code, all to no avail.

I finally got the app to run by disabling all internet-related libraries and features (I have a compiler flag for that), but nothing short of that worked. I started to suspect that the mere presence of the billing library caused this error, even without using it in any way. I considered dropping all network features and releasing the app with no logins, purchases, or cloud saves, since that was starting to seem easier than tracking this down. Desperate times and all that.

Then I remembered I wasn’t out of debugging techniques just yet. Next up was version control. I could rewind the clock and get a version that worked, and carefully examine the changes from there. So I reverted the GPBL and my last few weeks of code, and compiled the app the way I’d had it working.

And it still froze. Huh.

This is a lesson I’ve learned many times over, but still sometimes forget. When something makes zero sense, you need to check your assumptions. I’d assumed that since this was a new error, my new code was at fault. But now my old code didn’t work either, and that meant I was looking in the wrong place.

I forget exactly what it was I did, but knowing this, I managed to make the app crash. That’s a good thing: with a crash comes an error message, and (at least in this case) a stack trace. This instantly narrowed the search down to a single line of code, and the line of code had nothing to do with billing.

Teaching my code patience

Turns out, I’ve been overlooking a significant issue for a while now. To make a medium-length story short, cloud save setup happens partway through the game’s setup process. I don’t know (or care) exactly when, because it shouldn’t matter.

At this point in the process, it sends a request to log in to PlayFab. You have to log in eventually if you want to use cloud saves, so I figured, might as well start it immediately so it can happen in the background. What I forgot is that Haxe’s network requests (usually) don’t happen in the background. Instead, the login command puts the rest of the setup process on hold until it connects.

Not only that, the moment you finish connecting, several different classes try to make use of the cloud data. But because game setup wasn’t done, some important variables were still null, including PlayFab.instance. When PlayFab.instance is null, it runs the PlayFab class constructor. The constructor sends a request to log in. After logging in, classes try to make use of the cloud data. But PlayFab.instance is still null… so it runs the constructor again. And again.

There’s a long-running programmer joke that you start out wondering why your code doesn’t work, and end up wondering how it ever worked. That’s me right now. (And I particularly relate to the second comment on that post: “Or if you change your code to make it work better, it doesn’t work, so you load your backup and your original code doesn’t work anymore…”)

The solution? Just don’t log in until the game is fully ready. A little bit of patience goes a long way. (I’ve also pushed a commit in case the infinite loop comes up elsewhere.)

Surprisingly, everything above only took a couple days. It felt like longer, but my computer’s clock disagrees.

Testing the GPBL (finally)

So my plan was, after resolving the infinite loop, I’d spend the rest of the week testing the billing code. This would provide lots more material for the epic conclusion to my “why developing for Android is so annoying” series.

But this plan failed, because I only ran into two problems. The first came with an error message, and the error message told me exactly how to fix it. The second required a bit of effort to figure out why things weren’t happening in order, but it wasn’t bad enough to blog about. And then… purchases just worked.

I spent nearly half the week just kind of waiting to see if Google Play would accept or reject the app. (Answer: reject. Google now insists on app bundles, so that’s what I’m working on next.)

So, uh… maybe mobile development isn’t quite as bad as I remember.

Updating the Google Play Billing Library

As I’ve mentioned a couple posts ago, I need to update this specific library by November. Google was nice enough to provide a migration guide to help with the process, though due to my own setup, it isn’t quite as easy as that.

As I start writing this post, I’ve just spent several hours restructuring Java code, and I anticipate several more. I’ve been tabbing back and forth between the guide linked above and the API documentation, going back and forth as to class structure. I’m trying to track in-app purchases (you know, the ones bought with actual money), but this is made harder by the fact that Google doesn’t, you know, track those purchases.

The problem with purchases on Google Play

Let me introduce you to the concept of “consuming” purchases. On the Play Store, each in-app purchase has two states: “owned” and “not owned.” If owned, you can’t buy it again. This is a problem for repeatable purchases (like in Run Mobile), because you’re supposed to be able to buy each option as much as you like. Google’s solution is this: games should record the purchase, then “consume” it, setting it back to “not owned” and allowing the player to buy it again.

I know for a fact Google keeps track of how many times each player has purchased each item, but that information is not available to the game. I get a distressingly high number of reports that people lost their data; would’ve been nice to have a backup, but nope.

There’s one upside. In the new version of the library, Google allows you to look up the last time each item was bought. So that’s a partial backup; even if all the other data was deleted, the game can award that one purchase. (Actually up to four purchases, one per price point.) That’ll be the plan, anyway.

The problem with templates

Templates are a neat little feature offered by Haxe and enhanced by Lime. As I’ve mentioned before, Lime doesn’t create Android apps directly; it actually creates an Android project and allows the official Android SDK to build the app. Normally, Lime has its own set of templates to build the project from, but you can set your own if you prefer.

That’s how I write Java code. I write Java files and then tell Lime “these are templates; copy them into the project.” Now, Lime doesn’t do a direct copy; it actually does some processing first. This processing step means I get to adjust settings. Sometimes I mark code as being only for debug builds, or only for the Amazon version of the app. (Yeah, it’s on Amazon.)

Essentially, this functions as conditional compilation, which is a feature I use extensively in Haxe code. When I saw the opportunity to do the same thing in Java, I jumped on it.

Problem is, the template files are not (quite) valid Java code. This makes most code helpers nigh-unusable. Well, as far as I know. Since I never expected any to work, I didn’t try very hard to make it happen. Instead, I just coded without any quality of life features (code completion, instant error checking, etc.). Guess what happens when you don’t have quality of life features? Yep, your life quality suffers.

Use the best tools for each part: Xcode to make iOS code, Android Studio to make Android code and VSCode (or your other favourite Haxe editor) to make Haxe code

—Some quality advice that I never followed.

You know, I’ve always hated working on the “mobile” parts of Run Mobile. I’d do it, but only reluctantly, and it’d always be slow going, no matter how simple and straightforward everything seemed on paper. When I was done and could get back to Haxe, it’d feel like a weight off my chest. In retrospect, I think the lack of code completion was a big part of it. (The other part being that billing is outside my comfort zone.)

I’m not going to change up my development process just yet. There’s too much riding on me getting this done, and not nearly enough time to change my development process. But eventually, I’m going to see about writing the Java code in Android Studio.

Conclusion

After even more hours of work, I’ve started to get the hang of the new library, rewritten hundreds of lines of code, and fixed a few dozen compile errors. I’ve removed a few features along the way, but hopefully nothing that impacts end users.

In retrospect, the conversion guide wasn’t very helpful. It provided “before” and “after” code examples, but the “before” code looked nothing like my code, so I couldn’t be sure what to replace with what. The API docs were far more useful – since everything centers on BillingClient, I could always start reading there.

As of right now, the app compiles, but that’s it. When launched, it immediately hangs, probably because of an infinite loop somewhere. Once that’s fixed, it’s on to the testing phase.

Coding in multiple languages

Quick: what programming language is Run 3 written in? Haxe, of course. (It’s even in the title of my blog.) But that isn’t all. Check out this list:

  • Haxe
  • Neko
  • Java
  • JavaScript
  • ActionScript
  • C++
  • Objective-C
  • Python
  • Batch
  • Bash
  • Groovy

Those are the programming languages that are (or were) involved in the development of Run 3/Run Mobile. Some are on the list because of Haxe’s capabilities, others because of Haxe’s limits.

Haxe’s defining feature is its ability to compile to other languages. This is great if you want to write a game for multiple platforms. JavaScript runs in browsers, C++ runs on desktop, Neko compiles quickly during testing, ActionScript… well, we don’t talk about ActionScript anymore. And that’s why those four are on the list.

Batch and Bash are good at performing simple file operations. Copying files, cleaning up old folders, etc. That’s also why Python is on the list: I have a Python file that runs after each build and performs simple file operations. Add 1 to the build count, create a zip file from a certain folder, etc. Honestly it doesn’t make much difference which language you use for the simple stuff, and I don’t remember why I chose Python. Nowadays I’d definitely choose Haxe for consistency.

The rest are due to mobile apps. Android apps are written in Java or Kotlin, then compiled using Groovy. Haxe does compile to Java, but it has a reputation for being slow. Therefore, OpenFL tries to use as much C++ as possible, only briefly using Java to get the game up and running.

iOS is similar: apps are typically written in Objective-C or Swift, and Haxe doesn’t compile to either of those. But you can have a simple Objective-C file start the app, then switch to C++.

Even leaving aside Python, Batch, and Bash, that’s a lot of languages. Some of them are independent, but others have to run at the same time and even interact. How does all that work?

Source-to-source compilation

Let’s start with source-to-source compilation (the thing I said was Haxe’s defining feature) and what it means. Suppose I’m programming in Haxe and compiling to JavaScript.

Now, by default Haxe code can only call other Haxe code. Say there’s a fancy JavaScript library that does calligraphy, and I want to draw some large shiny letters. If I was writing JavaScript, I could call the drawCalligraphy() function no problem, but not in Haxe.

To accomplish the same thing in Haxe, I need some special syntax to insert JavaScript code. Something like this:

//Haxe code (what I write)
var fruitOptions = ["apple", "orange", "pear", "lemon"];
var randomFruit = fruitOptions[Std.int(Math.random() * fruitOptions.length)];
js.Syntax.code('drawCalligraphy(randomFruit);');
someHaxeFunction(randomFruit);

//JS code (generated when the above is compiled)
let fruitOptions = ["apple","orange","pear","lemon"];
let randomFruit = fruitOptions[Math.random() * fruitOptions.length | 0];
drawCalligraphy(randomFruit);
someHaxeFunction(randomFruit);

Note how similar the Haxe and JavaScript functions end up being. It almost feels like I shouldn’t need the special syntax at all. As you can see from the final line of code, function calls are left unchanged. If I typed drawCalligraphy(randomFruit) in Haxe, it would become drawCalligraphy(randomFruit) in JavaScript, which would work perfectly. Problem is, it doesn’t compile. drawCalligraphy isn’t a Haxe function, so Haxe throws an error.

Well, that’s where externs come in. By declaring an “extern” function, I tell Haxe “this function will exist at runtime, so don’t throw compile errors when you see it.” (As a side-effect, I’d better type the function name right, because Haxe won’t check my work.)

tl;dr: Since Haxe creates code in another programming language, you can talk to other code in that language. If you compile to JS, you can talk to JS.

Starting an iOS app

Each Android or iOS app has a single defined “entry point,” which has to be a Java/Kotlin/Objective-C/Swift file. Haxe can compile to (some of) these, but there’s really no point. It’s easier and better to literally type out a Java/Kotlin/Objective-C/Swift file, which is exactly what Lime does.

I’ve written about this before, but as a refresher, Lime creates an Xcode project as one step along the way to making an iOS app. At this point, the Haxe code has already been compiled into C++, in a form usable on iOS. Lime then copy-pastes in some Objective-C files and an Xcode project file, which Xcode compiles to make a fully-working iOS app. (And it’s a real project; you could even edit it in Xcode, though that isn’t recommended.)

And that’s enough to get the app going. When compiled side-by-side, C++ and Objective-C++ can talk to one another, as easily as JavaScript can communicate with JavaScript. Main.mm (the Objective-C entry point) calls a C++ function, which calls another C++ function, and so on until eventually one of them calls the compiled Haxe function. Not as simple as it could be, but it has the potential to be quite straightforward.

Unlike Android.

Shared libraries

A shared library or shared object is a file that is intended to be shared by executable files and further shared object files. Modules used by a program are loaded from individual shared objects into memory at load time or runtime, rather than being copied by a linker when it creates a single monolithic executable file for the program.

Traditionally, shared library/object files are toolkits. Each handles a single task (or group of related tasks), like network connections or 3D graphics. The “shared” part of the name means many different programs can use the library at once, which is great if you have a dozen programs connecting to the net and don’t want to have to download a dozen copies of the network connection library.

I mention this to highlight that Lime does something odd when compiling for Android. All of your Haxe-turned-C++ code goes in one big shared object file named libApplicationMain.so. But this “shared” object never gets shared. It’s only ever used by one app, because, well, it is that app. Everything outside of libApplicationMain.so is essentially window dressing; it’s there to get the C++ code started. I’m not saying Lime is wrong to do this (in fact, the NDK documentation tells you to do it), I’m just commenting on the linguistic drift.

To get the app started, Lime loads the shared object and then passes the name of the main C++ function to SDL, which loads the function and then calls it. Bit roundabout, but whatever works.

tl;dr: A shared library is a pre-compiled group of code. Before calling a function, you need two steps: load the library, then load the function. On Android, one of these functions is basically “run the entire app.”

Accessing Android/iOS features from Haxe

If your Haxe code is going into a shared object, then tools like externs won’t work. How does a shared object send messages to Java/Objective-C? I’ve actually answered this one before with examples, but I didn’t really explain why, so I’ll try to do that.

  • On Android, you call JNI.createStaticMethod() to get a reference to a single Java function, as long as the Java function is declared publicly. Once you have this reference, you can call the Java function. If you want more functions, you call JNI (Java Native Interface) multiple times.
  • On iOS, you call CFFI.load() to get a reference to a single C (Objective or otherwise) function, as long as the C function is a public member of a shared library. Once you have this reference, you can call the C function. If you want more functions, you call CFFI (C Foreign Function Interface) multiple times.

Gotta say, there are a lot of similarities, and I’m guessing that isn’t a coincidence. Lime is actually doing a lot of work under the hood in both cases, with the end goal of keeping them simple.

But wait a minute. Why is iOS using shared libraries all of a sudden? We’re compiling to C++ and talking to Objective-C; shouldn’t extern functions be enough? In fact, they are enough. Shared libraries are optional here, though recommended for organization and code consistency.

You might also note that last time I described calling a shared library, it took extra steps (load the library, load the function, call the function). This is some of the work Lime does under the hood. The CFFI class combines the “load library” and “load function” steps into one, keeping any open libraries for later use. (Whereas C++ doesn’t really do “convenience.”)

tl;dr: On Android, Haxe code can call Java functions using JNI. iOS extensions are designed to mimic this arrangement, though you use CFFI instead of JNI.

Why I wrote this post

Looking back after writing this, I have to admit it’s one of my less-informative blog posts. I took a deep dive into how Lime works, yes, but very little here is useful to an average OpenFL/Lime user. If you want to use CFFI or JNI, you’d be better off reading my old blog post instead.

Originally, this post was supposed to be a couple paragraphs leading into to another Android progress report. (And I’d categorized it under “development,” which is hardly accurate.) But the more I wrote, the clearer it became that I wasn’t going to get to the progress report. I almost abandoned this post, but I was learning new things, so I decided to put it out there.

(For instance, it had never occurred to me that CFFI was optional on iOS. It may well be the best option, but since it is just an option rather than mandatory, I’ll want to double-check.)

Why I haven’t updated Run Mobile in ages (part 1)

Google has announced a November 1 deadline to update Play Store apps, and I’ve been keeping an eye on that one. We’re now getting close enough to the deadline that I’m officially giving up on releasing new content with the update, and instead I’ll just release the same version with the new required library.

But why did it take this long for me to decide that? Why didn’t I do this a year ago when Google made their announcement, and keep working in the meantime? To answer that question, this blog post will document my journey to make a single small change to Run Mobile. The first step, of course, is to make sure it still compiles.

I should mention that I have two computers that I’ve used to compile Android apps. A Linux desktop, and a Windows laptop.

The Linux machine:

  • Is where I do practically all my work nowadays.
  • Performs well.
  • Is the one you’ve seen if you’ve watched my streams.
  • Has never, if I recall correctly, compiled a release copy of Run Mobile.

The Windows machine:

  • Hasn’t seen use in years.
  • Is getting old and slow; probably needs a fresh install.
  • Has the exact code I used to compile the currently-released version of Run Mobile.

Compiling on Windows

I tried this second, so let’s talk about it first. That makes sense, right?

Well, I found that I had some commits I hadn’t uploaded. Figured I’d do that real quick, and it turns out that Git is broken somehow. Not sure why, but it always rejects my SSH key. I restarted the machine, reuploaded the key to GitHub, tried both command line and TortiseGit, and even tried GitHub’s app which promises that you don’t have to bother with SSH keys. Everything still failed. At some point I’ll reinstall Git, but that’s for later. My goal here is to compile.

Fortunately, almost all of my code was untouched since the last time I compiled, and so I compiled to Neko. No dice. There were syntax errors, null pointers, and a dozen rendering crashes. Oh right, I never compiled this version for Neko, because I always targeted Flash instead.

So I stopped trying to fix Neko, and compiled for Android. And… well, there certainly were errors, but I’ll get to them later. Eventually, I fixed enough errors that it compiled. Hooray!

But for some reason I couldn’t get the machine to connect to my phone, so I couldn’t install and test the app. Tried multiple cables, multiple USB ports… nothing. And that was the last straw. This laptop was frustrating enough when Git and adb worked.

Compiling on Linux

Since Git does work on this machine, I was easily able to roll my code back. I’d even made a list of what libraries needed to be rolled back, and how far. (This is far from the first time I’ve had to do this.)

With all my code restored, I tried compiling. Result: lime/graphics/cairo/Cairo.hx:35: characters 13-21 : Unexpected operator. A single error, presumably hiding many more. Out of curiosity, I checked the file in question, expecting to see a plus sign in a weird spot, or an extra pair of square brackets around something, or who knows. Instead I found the word “operator”. (Once again I have fallen victim to the use-mention distinction.) Apparently Haxe 4.0.0 made “operator” a keyword, and Lime had to work around it.

Right, right. I’d gone back to old versions of my code, but I hadn’t downgraded Haxe. I’d assumed doing so would be difficult, possibly requiring me to download an untrustworthy file. This was the point in the process when I tried to compile on Windows instead. As explained above, that fell through, so I came back and discovered I could get it straight from the Haxe Foundation. (I’d been looking in the wrong place.) Once I reverted Haxe, that first error went away.

But that was only error #1. Fixing it revealed a new one, and fixing that revealed yet another. Rinse and repeat for an unknown number of times. Plus side, it’s simpler to keep track of a single error at a time than 20 at once. Minus side, there were a lot of errors.

  1. Type not found: haxe.Exception – Apparently I hadn’t downgraded all my libraries. After some file searching, I found two likely culprits and downgraded both.
  2. Cannot create closure with more than 5 arguments – I’ve never seen this one before, and neither has Google. I never even knew that function bindings were closures. Also, I’m not sure how addRectangle.bind(104, 0x000000, 0) has more than 5 arguments (perhaps it counts the optional arguments). But this wasn’t worth a lot of time, so I used an anonymous function to do the same thing.
  3. Invalid field access : lastIndexOf – This often comes up when a string is null. Can’t search for the last index of a letter if the string isn’t there. Fortunately I’d already run into this bug on Windows and knew the solution. Haxe 3.4 tells you to use Sys.programPath() instead of Sys.executablePath(), except programPath is broken.
  4. Extern constructor could not be inlined – Another error stemming from an old version of Haxe, this comes up when you cache data between compilations. It can be fixed by updating Haxe (not an option in my case) or turning off VSHaxe’s language server.
  5. Invalid field access : __s – Another null pointer error I’d already seen. But it was at this point that I remembered not to try to compile for Neko, so I turned my focus to Android instead.
  6. You need to run "lime setup android" before you can use the Android target – So of course that wasn’t going to be easy either. Apparently I’d never told Lime where to find some crucial files. (Also apparently I’d never downloaded the [NDK](https://developer.android.com/ndk/), meaning I’ve never used this machine to compile for Android.)
  7. Type not found : Jni – Wait, I (vaguely) remember that class. Why is it missing? _One search later…_ Aha, it’s still there, it’s just missing some tweaks I made on the Windows computer. This is all for some debug info that rarely matters, so I removed it for the time being.
  8. arm-linux-androideabi-g++: not found – Uh oh. This is an error in hxcpp, a library that I try very hard to avoid dealing with. Android seems to have retired the “standalone toolchains” feature this old version of hxcpp uses, and I’ve long since established that newer versions of hxcpp are incompatible. Well, I tried using HXCPP_VERBOSE to get more info, and while it helped with a few hidden errors, I spent way too long digging into hxcpp without making much progress. Instead, I went all the way back to NDK r18.
  9. 'typeinfo' file not found – Another C++ error, great. Seems I’m not the first OpenFL dev to run into this one, which is actually good because it lets me know which NDK version I actually need: r15c. The Android SDK manager only goes as far back as r16, so I did a manual download.
  10. gradlew: not found – It might be another “not found” error, but make no mistake, this is huge progress. All the C++ files (over 99% of the game) compiled successfully, and it had reached the Gradle/Java portion, something I’m far more familiar with. Not that I needed to be because, someone else already fixed it. The only reason I was still seeing the error is because I couldn’t use newer versions of OpenFL. One quick copy-paste later, and…
  11. Cannot convert URL '[Kongregate SDK file]' to a file. – No kidding it can’t find the Kongregate SDK; I hard-coded a file path that only exists on Windows. In retrospect I should have used a relative path, but for now I hard-coded a different path. Then, to make absolutely certain I had the right versions, I copied the Kongregate SDK (and other Java libraries) from my laptop.
  12. Could not GET 'https://dl.bintray.com/[...]'. Received status code 403 from server: Forbidden – There were nine or ten of these errors, each with a different url. Apparently Bintray has been shut down, which means everyone has found somewhere else to host their files. I looked up the new URLs and plugged them in. Surprisingly, the new urls worked first try.

And, finally, it compiled.

Closing thoughts

And that’s why I haven’t updated Run Mobile in ages. Every time I try to compile, I have to wade through a slew of bugs. Not usually this many, but there’s always something, and I’ve learned to associate updating mobile with frustration.

I was hoping to avoid this whole process. I’d hoped to finish Runaway, allowing me to use the latest versions of OpenFL, hxcpp, and the Android SDK. But there just wasn’t enough time.

Don’t get me wrong, it feels good to have accomplished this. But as a reminder, I’ve made zero improvements thus far. I haven’t copied over any new content, I haven’t updated any libraries, I haven’t even touched the Google Play Billing Library (you know, the one that must be updated by November). I’ve spent two weeks just trying to get back what I already had.

Maybe I’m being too pessimistic here. I have, in fact, made progress since February 2018. My code now compiles on Linux, unlike in 2018. My 2018 code relied on Bintray, which is now gone. And it’s possible that new content may have been included without me even trying.

And that’s enough for today. Join me next time, on my journey to make a single small change to Run Mobile.

Haxe has too many ECS frameworks

Ever since I learned about them, I’ve wanted to use the entity component system (ECS) pattern to make games. Used properly, the pattern leads to clean, effective, and in my opinion cool-looking code. So when I was getting started my game engine, I looked for an ECS library that to build off of. And I found plenty.

The Haxe community is prolific and enthusiastic, releasing all kinds of libraries completely for free. That’s great, but it’s also a bit of a problem. Instead of working together to build a few high-quality libraries, everyone decided to reinvent the wheel.

xkcd: Standards

It did occur to me that I was preparing to reinvent the wheel, but no one had built a game engine capable of what I wanted, so I went ahead with it. Eventually I realized that’s probably what all the other developers were thinking too. Maybe there’s a reason for the chaos.

Let’s take a look at (eleven of) the available frameworks. What distinguishes each one?

Or if you want to see the one I settled on, skip to Echoes.

Ash

Let’s start at the beginning. Ash was one of the first ECS frameworks for Haxe, ported from an ActionScript 3 library of the same name. Makes sense: Haxe was originally based on AS3, and most of the early developers came from there.

Richard Lord, who developed the AS3 version, also wrote some useful blog posts on what an ECS architecture is and why you might want to use it.

Objectively, Ash is a well-designed engine. However, it’s held back by having started in ActionScript. Good design decisions there (such as using linked list nodes for performance) became unnecessary in Haxe, but the port still kept them in an effort to change as little as possible. This means it takes a bunch of typing to do anything.

//You have to define a "Node" to indicate which components you're looking for; in this case Position and Motion.
class MovementNode extends Node<MovementNode>
{
    public var position:Position;
    public var motion:Motion;
}
//Then systems use this node to find matching entities.
private function updateNode(node:MovementNode, time:Float):Void
{
    var position:Position = node.position;
    var motion:Motion = node.motion;

    position = node.position;
    motion = node.motion;
    position.position.x += motion.velocity.x * time;
    position.position.y += motion.velocity.y * time;
    //...
}

It honestly isn’t that bad for one example, but extra typing adds up.

ECX

ECX seems to be focused on performance, though I can’t confirm or debunk this.

As far as usability goes, it’s one step better than Ash. You can define a collection of entities (called a “Family” instead of a “Node”) in a single line of code, right next to the function that uses it. Much better organized.

class MovementSystem extends System {
    //Define a family.
    var _entities:Family<Transform, Renderable>;
    override function update() {
        //Iterate through all entities in the family.
        for(entity in _entities) {
            trace(entity.transform);
            trace(entity.renderable);
        }
    }
}

Eskimo

Eskimo is the programmer’s third attempt at a framework, and it shows in the features available. You can have entirely separate groups of entities, as if they existed in different worlds, so they’ll never accidentally interact. It can notify you when any entity gains or loses components (and you can choose which components you want to be notified of). Like in ECX, you can create a collection of components (here called a View rather than a Family) in a single line of code:

var viewab = new View([ComponentA, ComponentB], entities);
for (entity in viewb.entities) {
    trace('Entity id: ${entity.id}');
    trace(entity.get(ComponentB).int);
}

The framework has plenty of flaws, but its most notable feature is the complete lack of macros. Macros are a powerful feature in Haxe that allow you to run code at compile time, which makes programming easier and may save time when the game is running.

Lacking macros (as well as the “different worlds” thing I mentioned) slows Eskimo down, and makes it so you have to type out more code. Not as much code as in Ash, but it’s still inconvenient.

Honestly, though, I’m just impressed. Building an ECS framework without macros is an achievement, even though the framework suffers for it. Every single one of the other frameworks on this list uses macros, for syntax sugar if nothing else. Even Ash uses a few macros, despite coming from AS3 (which has no macros).

edge

edge (all lowercase) brings an amazing new piece of syntax sugar:

class UpdateMovement implements ISystem {
    function update(pos:Position, vel:Velocity) {
        pos.x += vel.vx,
        pos.y += vel.vy;
    }
}

You no longer have to create a View yourself, or iterate through that view, or type out entity.get(Position) every time you want to access the Position component. Instead, just define an update function with the components you want. edge will automatically give you each entity’s position and velocity. You don’t even have to call entity.get(Position) or anything; that’s already done. This saves a lot of typing when you have a lot of systems to write.

edge also provides most of the other features I’ve mentioned so far. Like in Eskimo, you can separate entities into different “worlds” (called Engines), and you can receive notifications when entities gain or lose components. You can access Views if needed/preferred, and it only takes a line of code to set up. Its “World” and “Phase” classes are a great way to organize systems, and the guiding principles are pretty much exactly how I think the ECS pattern should work.

Have I gushed enough about this framework yet? Because it’s pretty great. Just one small problem.

A system’s update function must be named update. A single class can only have one function with a given name. Therefore, each system can only have one update function. If you want to update two different groups of entities, you need two entire systems. So the syntax sugar doesn’t actually save that much typing, because you have to type out an entire new class declaration for each function.

Eventually, the creator abandoned edge to work on edge 2. This addresses the “one function per system” problem, though sadly in its current state it loses all the convenience edge offered. (And the lack of documentation makes me think it was abandoned midway.)

Baldrick

Baldrick is notable because it was created specifically in response to edge. Let’s look at the creator’s complaints, to see what others care about.

• It requires thx.core, which pulls a lot of code I don’t need

That’s totally fair. Unnecessary dependencies are annoying.

• It hasn’t been updated in a long time, and has been superceded by the author by edge2

It’s always concerning when you see a library hasn’t been updated in a while. This could mean it’s complete, but usually it means it’s abandoned, and who knows if any bugs will be fixed. I don’t consider this a deal-breaker myself, nor do I think edge 2 supersedes it (yet).

• Does a little bit too much behind-the-scenes with macros (ex: auto-creating views based on update function parameters)

Oh come on, macros are great! And the “auto-creating views” feature is edge’s best innovation.

• Always fully processes macros, even if display is defined (when using completion), slowing down completion

I never even thought about this, but now that they mention it, I have to agree. It’s a small but significant oversight.

• Isn’t under my control so isn’t easily extendable for my personal uses

It’s… open source. You can make a copy that you control, and if you can submit your changes back to the main project. If you’re making changes that aren’t worth submitting, you’re probably making the wrong changes. Probably.

• Components and resources are stored in an IntMap (rather than a StringMap like in edge)

This is actually describing what Baldrick does, but it still mentions something edge does wrong. StringMap isn’t terrible, but Baldrick’s IntMap makes a lot more sense.

Anyway, Baldrick looks well-built, and it’s building on a solid foundation, but unfortunately it’s (quite intentionally) missing the syntax sugar that I liked so much.

var movingEntities:View<{pos:Position, vel:Velocity}> = new View();

public function process():Void {
    for(entity in positionEntities) {
        entity.data.pos.x += entity.data.vel.x;
        entity.data.pos.y += entity.data.vel.y;
    }
}

That seems like more typing than needed – entity.data.pos.x? Compare that to edge, which only requires you to type pos.x. I suppose it could be worse, but that doesn’t mean I’d want to use it.

Oh, and as far as I can tell, there’s no way to get delta time. That’s inconvenient.

exp-ecs

Short for “experimental entity component system,” exp-ecs is inspired by Ash but makes better use of Haxe. It does rely on several tink libraries (comparable to edge’s dependency on thx.core). The code looks pretty familiar by now, albeit cleaner than average:

@:nodes var nodes:Node<Position, Velocity>;

override function update(dt:Float) {
    for(node in nodes) {
        node.position.x += node.velocity.x * dt;
        node.position.y += node.velocity.y * dt;
    }
}

Not bad, even if it isn’t edge.

Under the hood, it looks like component tracking is slower than needed. tink_core‘s signals are neat and all, but the way they’re used here means every time a component is added, the entity will be checked against every node in existence. This won’t matter in most cases, but it could start to add up in large games with lots of systems.

Ok, I just realized how bad that explanation probably was, so please enjoy this dramatization of real events instead, featuring workers at a hypothetical entity factory:

Worker A: Ok B, we just added a Position component. Since each node needs to know which entities have which components, we need to notify them.

Worker B: On it! Here’s a node for entities with Hitboxes; does it need to be notified?

Worker A: Nope, the entity doesn’t have a Hitbox.

Worker B: Ok, here’s a node that looks for Acceleration and Velocity; does it need to be notified?

Worker A: No, the entity doesn’t have Acceleration. (It has a Velocity, but that isn’t enough.)

Worker B: Next is a node that looks for Velocity and Position; does it need to be notified?

Worker A: Yes! The entity has both Velocity and Position.

Worker B: Here’s a node that needs both Position and Appearance; does it need to be notified?

Worker A: No, this is an invisible entity, lacking an Appearance. (It has a Position, but that isn’t enough.)

Worker B: Ok, next is a node for entities with Names; does it need to be notified?

Worker A: It would, but it already knows the entity’s Name. No change here.

Worker B: Next, we have…

This process continues for a while, and most of it is totally unnecessary. We just added a Position component, so why are we wasting time checking dozens or hundreds of nodes that don’t care about Position? None of them will have changed. Sadly, exp-ecs just doesn’t have any way to keep track. It probably doesn’t matter for most games, but in big enough projects it could add up.

(Please note that exp-ecs isn’t the only framework with this issue, it’s just the one I checked to be sure. I suspect the majority do the same thing.)

On the plus side, I have to compliment the code structure. There’s no ECS framework in existence whose code can be understood at a glance, but in my opinion exp-ecs comes close. (Oh, and the coding style seems to perfectly match my own, a coincidence that’s never happened before. There was always at least one small difference. So that’s neat.)

Cog

Cog is derived from exp-ecs, and calls itself a “Bring Your Own Entity” framework. You’re supposed to integrate the Components class into your own Entity class (and you can call your class whatever you like), and now your class acts like an entity. I don’t buy it. Essentially their Components class is the Entity class, they’re just trying to hide it.

As far as functionality, it unsurprisingly looks a lot like exp-ecs:

@:nodes var movers:Node<position, velocity>;
override public function step(dt:Float) {
    super.step(dt);
    for (node in movers) {
        node.position.x += node.velocity.x * dt;
        node.position.y += node.velocity.y * dt;
    }
}

I was pleasantly surprised to note that it has component events (the notifications I talked about for Eskimo and edge). If Cog had existed when I started building Runaway, I would have seriously considered using it. In the end I’d probably have rejected it for lack of syntax sugar, but only barely.

Awe

Awe is a pseudo-port of Artemis, an ECS framework written in Java. I’m not going to dig deep into it, because this is the example code:

var world = World.build({
    systems: [new InputSystem(), new MovementSystem(), new RenderSystem(), new GravitySystem()],
    components: [Input, Position, Velocity, Acceleration, Gravity, Follow],
    expectedEntityCount: ...
});
var playerArchetype = Archetype.build(Input, Position, Velocity, Acceleration, Gravity);
var player = world.createEntityFromArchetype(playerArchetype);

Java has a reputation for being verbose, and this certainly lives up to that. I can look past long method names, but I can’t abide by having to list out every component in advance, nor having to count entities in advance, nor having to define each entity’s components when you create that entity. What if the situation changes and you need new components? Just create a whole new entity I guess? This engine simply isn’t for programmers like me.

That said, the README hints at something excellent that I haven’t seen elsewhere…

@Packed This is a component that can be represented by bytes, thus doesn’t have any fields whose type is not primitive.

…efficient data storage. With all the restrictions imposed above, I bet it takes up amazingly little memory. Sadly this all comes at the cost of flexibility. It reminds me of a particle system, packing data tightly, operating on a set number of particles, and defining the limits of the particles’ capabilities in advance.

OSIS

OSIS combines entities, components, systems, and network support. The networking is optional, but imposes a limitation of 64 component types that applies no matter what. (I’ve definitely already exceeded that.) I don’t have the time or expertise to discuss the can of worms that is networking, so I’ll leave it aside.

Also notable is the claim that the library “avoids magic.” That means nothing happens automatically, and all the syntax sugar is gone:

var entitySet:EntitySet;

public override function init()
    entitySet = em.getEntitySet([CPosition, CMonster]);

public override function loop()
{
    entitySet.applyChanges();

    for(entity in entitySet.entities)
    {
        var pos = entity.get(CPosition);
        pos.x += 0.1;
    }
}

I have to admit this is surprisingly concise, and the source code seems well-written. The framework also includes less-common features like component events and entity worlds (this time called “EntityManagers”).

I still like my syntax sugar, I need more than 64 components, and I don’t need networking, so this isn’t the library for me.

GASM

According to lib.haxe.org, GASM is the most popular haxe library with the “ecs” tag. However, I am an ECS purist, and as its README states:

Note that ECS purists will not consider this a proper ECS framework, since components contain the logic instead of systems. If you are writing a complex RPG or MMO, proper ECS might be worth looking in to, but for more typical small scale web or mobile projects I think having logic in components is preferable.

Listen, if it doesn’t have systems, then don’t call it “ECS.” Call it “EC” or something.

It seems to be a well-built library, better-supported than almost anything else on this list. However, I’m not interested in entities and components without systems, so I chose to keep looking.

Ok, so what did I go with?

Echoes

Echoes’ original creator described it as a practice project, created to “learn the power of macros.” Inspired by several others on the list, it ticked almost every single one of my boxes.

It has syntax sugar like edge’s (minus the “one function per system” restriction), no thx or tink dependencies, yes component events, convenient system organization, and a boatload of flexibility. Despite deepcake’s (the creator’s) modesty, this framework has a lot to it. It received 400+ commits even before I arrived, and is now over 500. (Not a guarantee of quality, but it certainly doesn’t hurt.)

Echoes’ performance

I haven’t seriously tested Echoes’ speed, but deepcake (the original dev) made speed a priority, and I can tell that it does several things right. It uses IntMap to store components, it keeps track of which views care about which components (meaning it’s the first one I’m sure doesn’t suffer from the problem I dramatized in the exp-ecs section), and it does not let you separate entities into “worlds.” The latter is a good feature to have, but incurs a small performance hit. I’m not saying it’s good to be missing a feature, I’m just saying it’s faster (and I haven’t needed it yet).

Echoes’ flexible components

Let’s talk about how components work. In every other framework I’ve discussed thus far, a component must be a class, and it must extend or implement either “Component” or “IComponent,” respectively. There’s a very specific reason for these restrictions, but they still get in the way.

For instance, say you wanted to work with an existing library, such as—oh, I don’t know—Away3D. Suppose that Away3D had a neat little Mesh class, representing a 3D model that can be rendered onscreen. Suppose you wanted an entity to have a Mesh component. Well, Mesh already extends another class and cannot extend Component. It can implement IComponent, but that’s inconvenient, and you’d have to edit Mesh.hx. (Which falls squarely in the category of “edits you shouldn’t have to make.”) Your best bet is to create your own MeshComponent class that wraps Mesh, and that’s just a lot of extra typing.

In Echoes, almost any Haxe type can be a component. That Mesh? Valid complement, no extending or implementing necessary. An abstract type? Yep, it just works. That anonymous structure? Well, not directly, but you can wrap it in an abstract type. Or if wrapping it in an abstract is too much work, make a typedef for it. (Note: typedefs don’t work in deepcake’s build, but they were the very first thing I added, specifically because wrapping things in abstracts is too much work.)

All this is accomplished through some slightly questionable macro magic. Echoes generates a lot of extra classes as a means of data storage. For instance, Position components would be stored in a class named ContainerOfPosition. Echoes does this both to get around the “extend or implement” restriction, and because it assumes that it’ll make lookups faster. This may well be true (as long as the compiler is half-decent), it’s just very unusual.

Echoes: conclusion

I settled on Echoes for the syntax sugar and the component events. At the time, the deciding factor was component events, and I hadn’t realized any other libraries offered those. So… whoops.

I don’t regret my choice, at all. The syntax sugar is great, abstract/typedef support is crucial, and the strange-seeming design decisions hold up better than I first though. I know I found one I’m happy with, but it’s a shame that all the others fall short in one way or another…?