My day job involves working on GNATSAS, a static analysis suite for the Ada programming language. GNATSAS aggregates four different engines: CodePeer (a sound static analyzer relying on abstract interpretation), Infer (Facebook's lightweight static analyzer), GNAT (the Ada compiler based on GCC) and GNATcheck (a custom "programming language" allowing pattern matching on ASTs, similar to CodeQL or Semgrep).
This aggregation is performed by a wrapper written in Ada. A long time ago, its only job was to run CodePeer itself. Over the years, with the new engines, new features and new this-behavior-is-terrible-but-removing-it-would-break-backward-compatibility, the wrapper became less and less ergonomic. It was decided that the wrapper would be rewritten and work toward this goal started at the beginning of 2022.
One of the decisions made at the time was that the wrapper would be rewritten in OCaml. Ada has a lot of nice features, but when you don't have any performance or real-time constraints, a garbage-collected language is often a better choice. We already had OCaml in our toolchain due to the work we're doing with Infer, so OCaml seemed like a nice fit.
Things progressed nicely through the year and by November we had almost as many passing tests on the revamp branch as on the main branch. And then, out of nowhere, the revamp branch regressed, but only on Windows. The failing tests all worked in the same way: they were simple POSIX shell scripts that echoed messages, ran the wrapper, echoed some more, ran the wrapper again and so on. They also failed in the same way: some messages were echoed before a command had run! For example, the following test:
echo "The wrapper does something:"
wrapper --switch
echo "And then does something else:"
wrapper --flag
Would be expected to have the following output:
The wrapper does something:
[wrapper --switch output]
And then does something else:
[wrapper --flag output]
But in fact had the following one:
The wrapper does something:
And then does something else:
[wrapper --switch output]
[wrapper --flag output]
Now, like any expert programmer, my first reaction was to assume that something other than my program had a bug. In this case, I decided to blame the shell running the script. You see, on Windows, the testsuite's shell scripts aren't run with bash but with GSH, a POSIX shell tailored to Windows in order to get better performance.
Surely GSH was to blame? I even built a nice theory: there could be a race condition between GSH's execution of built-ins (echo in this specific case) and "regular programs". I was even able to confirm that theory: replacing the calls to echo with env echo was enough to make the problem disappear. Case closed, somebody else fucked up, I'll just open a ticket and forget about it all.
Except... There hadn't been a single commit in GSH for more than a year. Nothing could explain this change in behavior. On top of that, GSH is written in Ada and the title of this article blames OCaml, so Chekhov's gun has yet to be fired. The investigation must continue.
GSH really wasn't to blame: the failures could be reproduced with a Cygwin bash. But I'm still an expert programmer and I still had to blame someone other than me. Armed with this knowledge, I decided to go further and blame the testsuite runner itself. It was an even better fall guy: a commit that changed the way a script that sets up the test environment was generated had been merged recently! Overjoyed and with the energy of someone who still had the hope of being able to stop working on Friday at 6PM, I reverted the test runner's commit and re-launched one of the tests.
Now, you can probably guess what happened. The test still failed. Out of energy and out of fall guys, I decided to take responsibility. The problem had to be in the wrapper. I decided to think really hard and then it dawned on me: could the wrapper itself be exiting before its children had time to emit their output? That's still a race condition, as I first suspected, and this time it was shell-agnostic!
Armed with this theory, I opened wrapper.ml, still secretly looking for a way to blame something other than the wrapper's code. After spending a while reviewing our uses of OCaml's Unix.waitpid and imagining the wildest theories, I decided to settle for a much more innocent-looking line:
Unix.execvp tool (Array.of_list args)
Could OCaml's Unix.execvp be to blame? I knew Windows handled processes differently from Linux, but the documentation still made a promise:
val execvp : string -> string array -> 'a
Same as Unix.execv, except that the program is searched in the path.
val execv : string -> string array -> 'a
execv prog args execute the program in file prog, with the arguments args, and the current process environment. These execv* functions never return: on success, the current program is replaced by the new one.
So, really, it's impossible for the wrapper to exit before its child. But I decided to trust my gut instead of the documentation and replaced the execvp call with a Unix.create_process + Unix.waitpid call, just to be sure. And the bug suddenly disappeared.
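For the curious, here is a minimal sketch of what such a replacement can look like. It is not the wrapper's actual code: the names spawn_and_wait and exit_code_of_status are made up for this example, and mapping a signaled or stopped child to exit code 1 is an arbitrary choice.

```ocaml
(* Sketch: run a tool as a child process and wait for it, instead of
   calling Unix.execvp. [spawn_and_wait] and [exit_code_of_status] are
   hypothetical names, not taken from the real wrapper. *)

(* Map a process status to an exit code. Using 1 for a child that was
   signaled or stopped is an arbitrary choice for this sketch. *)
let exit_code_of_status = function
  | Unix.WEXITED code -> code
  | Unix.WSIGNALED _ | Unix.WSTOPPED _ -> 1

(* Like execvp, create_process searches the path for [tool]. [args] is the
   full argument vector, program name included. The child inherits our
   standard input, output and error descriptors. *)
let spawn_and_wait (tool : string) (args : string list) : int =
  let pid =
    Unix.create_process tool (Array.of_list args)
      Unix.stdin Unix.stdout Unix.stderr
  in
  (* Block until the child is done; retry if the wait is interrupted. *)
  let rec wait () =
    match Unix.waitpid [] pid with
    | _, status -> exit_code_of_status status
    | exception Unix.Unix_error (Unix.EINTR, _, _) -> wait ()
  in
  wait ()
```

At the call site, something like exit (spawn_and_wait tool args) would then take the place of the execvp call, so that the wrapper only exits once the tool has finished and still reports the tool's exit code.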
The satisfaction almost made working on a Saturday morning worth it. Almost. And I still didn't understand what was going on. But I was fed up and thus submitted my patch for review. This was the right call, as my reviewer's fear of the unknown pushed him to figure out the missing pieces:
The OCaml runtime calls _wexecv after a call to an exec* function on Windows. And Stack Overflow, its contents be blessed, reveals that:
Using the _wexec* family of functions (or _P_OVERLAY mode with _wspawn*) in Windows is generally a bad idea, especially for console applications (the default link target for the [w]main entry point). NT has no equivalent to the exec* family implemented for Windows processes, so the CRT simply spawns a new process and exits the current process. If a console-based shell is waiting on the current process, it will resume its standard I/O REPL, and now we have a mess on our hands, with two processes writing to the console and competing for access to console input.
I was not aware of this at all. At this stage, sending a clarification PR on Unix.mli is probably the kind thing to do, for the poor soul that will be bitten next...
And so, calling CreateProcess and waiting on its return code (which our 'spawn' function does) is indeed the right thing to do.
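To make the behavior described in that quote concrete, here is a rough model of it written with OCaml's Unix module. This is an illustration only, not the CRT's actual code: the function names are hypothetical and the exit code 0 is an assumption of the sketch.

```ocaml
(* Illustration only: a rough model of the Windows CRT's exec emulation,
   written with OCaml's Unix module. Not the CRT's actual code. *)

(* Start the program and return its pid without waiting for it.
   [spawn_without_waiting] is a hypothetical name. *)
let spawn_without_waiting (tool : string) (args : string list) : int =
  Unix.create_process tool (Array.of_list args)
    Unix.stdin Unix.stdout Unix.stderr

(* Model of "exec" on Windows: spawn, then exit right away. Whoever was
   waiting on us (a shell, a test runner) resumes immediately, while the
   child may not have written anything yet. The exit code 0 is an
   assumption of this sketch. *)
let exec_like_windows (tool : string) (args : string list) : 'a =
  let _pid = spawn_without_waiting tool args in
  exit 0
```

Seen this way, the out-of-order test output is no longer surprising: the shell moves on to its next echo as soon as the wrapper process is gone, regardless of whether the spawned tool has printed anything.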
In a perfect future, WSL is available to all of our clients and we can just drop the Windows version of our software. But has the world ever shown any desire to strive for perfection? It's not the first time Windows has hurt me and I just know it won't be the last either.