How stack space usage creeps up on you, and PIMPL in C++
Introduction
In my team, the servers follow design based on thread-pools, where each light-weight thread can be utilized to handle incoming RPCs quickly. In order to hold many individual light-weight handlers, the stack space allocated for each thread is default to 64KB (as opposed to 1MB in typical C++ runtime).
We do not throw any warnings when stack space was depleted. Instead, the whole process will be intentionally crashed.
Usually, this stack space is plenty, but this is not always the case. In fact, one of my favorite outages was triggered by this.
Motivation: Crash-loop outage
This outage took place a couple years ago, when I first joined the team. In short, we were seeing almost all of our production configuration management jobs crash-looping.
After some initial triage through poking around the coredump, query logs, and rollout history, we had some clearer understanding of what’s going on:
- Some user queries seem to have triggered stack overflow. They are all deeply nested statements that require some heavy stack space usage during our compilation. However, the same queries used to pass without any crashes. 
- There was a rollout between previous query evaluation and the current crashes. 
Hmm… this rollout certainly looks suspicious.
An innocous no-op change
After some digging through the changelog for the release and profiling, one specific code change stood out.
The change added a conditional branch along the code path that crashed. Within this conditional branch, a new database wrapper conditional wrapper was added, which is a pretty heavy object taking up around 10KB of stack space, in the form of:
class BadWrapper {
 public:
  BadWrapper(Database db, Options options)
 // ... Generic init and get/set
 // A ton of heavy db wrapper in the local fields
 private:
  Type1Reader reader1_;
  Type2Reader reader2_;
  // ... and quite a few other readers
}To the reviewer, this CL looks innocous at first; after all, the object creation was guarded behind a conditional code path that was not turned on:
// ... existing logic
if (options.wrapper_feature_flag) {
  BadWrapper wrapper(db, options);
  // ... new, turned off code path
}
// ... existing logicThey have forgotten, however, that the conditional here was a runtime variable, as opposed to a compile time constant. The “ignorant“ compiler had no choice but to allocate additional stack space for this new object, even though it would never have been invoked at runtime!
Fix, and PIMPL in C++
How do we fix this? We could try to somehow refactor the code path and rewrite this heavy object to reduce its stack space usage, but that’s gonna be a pretty major undertaking.
Instead, we can utilize something called PIMPL (Pointer to implementation):
"pImpl" is a C++ programming technique that removes implementation details of a class from its object representation by placing them in a separate class, accessed through an opaque pointer
As described here, this technique is usually used to hide the implementation details and removes compilation dependency. As a trade-off, we would incur some pointer access overhead, and additional heap allocation. This is totally acceptable in our case: the code path isn’t very latency sensitive, and moving the huge object from the stack to heap will directly stop the crash loop.
In the end, the code was changed into something like this:
class GoodWrapper {
 public:
  GoodWrapper(Database db, Options options)
 // ... Generic init; get/set now uses the |impl_| field.
 private:
  // This is the "IMPL" in "PIMPL"
  struct Impl{
    Impl(Database db, Options options);
    Type1Reader reader1;
    Type2Reader reader2;
    // ... and quite a few other readers
  };
  // This is the "P" in "PIMPL".
  std::unique_ptr<Impl> impl_;
}Sure enough, after the fix was rolled out, crashes almost immediately stopped.
Besides the immediate fix, we also had to re-examine the general stack usage in these light-weight threads. The fact that we only had less than 10KB stack space wiggle room was concerning, and fixes related to nested query compilation were implemented as well. That might be an interesting topic to go into for another time.
Thanks for reading:)


