How meteoric success led to feature flag flameout

“The company ran into all the problems that Google, Facebook, and Uber had already faced.”

I was the tech lead for Uber’s self-driving group for two and a half years. That gave me two rare opportunities. First, I got to use Uber’s comprehensive internal tool for dynamic configuration. Second, I got to connect with high-level engineers and managers both inside and outside the company who told me their stories. Lekko became my quest to bring the level of dynamic configuration management I had at Uber to every company.

They didn’t think they’d need all that

A few people at one company told me a story I’ve learned is all too common. When they started, there were ten to fifteen engineers. Like most small startups in 2018-19, they figured out that most companies were using the same feature management tool, aptly considered the best available.

They knew that there were some problems with building a feature flagging system. Their reasoning was: at this point, we don't need or want to build a versioning tool. They didn’t want to build complex targeting software themselves. So they adopted the accepted solution.

It worked well until their own product proved a huge hit. Before they knew it, they had multiple organizations within the company that barely spoke to each other. Once you have siloed organizations like these, you start causing outages. There are feature flags that nobody cleaned up. You have to start planning and scheduling cleanup days.

But the big thing that happened was a major outage that took down customers because a PM tried to roll out something that wasn't ready. That's a big no-no, of course. It’s fine if engineers screw something up; that's their job. But a PM should not be taking down prod. Customers were stopped cold. They couldn’t do compute on AWS.

Success brings cascading fiascos

At the same time, the tool was getting too expensive. The vendor charged per seat. With a thousand seats you get an astronomical bill, literally millions of dollars per year. So they decided only the engineering managers would have seats.

Once that happened, the company lost all sense of control, since only a few eng managers had the access to know what was going on or to make changes.

This led to a second major decision: They decided to rip out the tool. They turned everything off and migrated configuration management back into the code.

Of course, the engineers found this too slow and frustrating. They pulled together a hackathon project to develop their own dynamic config system. It was S3-based, it lived in GitHub, all the right ways to architect it. It worked just fine for some engineers, who got to dynamically configure code again.
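To make that concrete, here is a minimal sketch of what a hackathon-grade dynamic config reader could look like. It is not that company's code. The `DynamicConfig` class, the `fetch` callable (a stand-in for downloading one JSON object from a bucket), and the flag names are all assumptions made up for illustration.

```python
import json
from typing import Any, Callable, Dict


class DynamicConfig:
    """Reads flag values from a JSON document, falling back to in-code defaults."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        # `fetch` is a hypothetical callable that returns raw JSON text,
        # for example a function that downloads one object from a bucket.
        self._fetch = fetch
        self._values: Dict[str, Any] = {}

    def refresh(self) -> bool:
        """Reload values; keep the last good snapshot if fetch or parse fails."""
        try:
            parsed = json.loads(self._fetch())
        except (OSError, ValueError):
            # json.JSONDecodeError is a subclass of ValueError.
            return False
        if not isinstance(parsed, dict):
            return False
        self._values = parsed
        return True

    def get(self, key: str, default: Any) -> Any:
        """Return the dynamic value for `key`, or the in-code default."""
        return self._values.get(key, default)


# Usage: the lambda stands in for a real download (hypothetical flag names).
config = DynamicConfig(lambda: '{"new_checkout": true, "max_retries": 5}')
config.refresh()
use_new_checkout = config.get("new_checkout", False)
```

A reader like this is easy to write in a weekend. Notice what it leaves out: nothing here says who may change a value, and nothing tests a change before it reaches production.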

But the company then ran into all the problems that Google, Facebook, and Uber had already faced, forcing them to separately invent the same solution for themselves. The company needed to build all sorts of permissioning. They needed to build integration testing. Like other companies that blow up big, they built a system from scratch that they couldn’t buy at a reasonable price.

But just like at Uber, their solution grew by necessity into a huge distributed system. They not only had multiple developers building it, but dedicated operational staff to maintain it.

Lekko is designed to grow as big as you do

You can see the inevitable sequence of events: First, they spent millions of dollars for a commercial tool that didn’t scale well — certainly not on price. Then they spent millions of dollars dealing with the problems that tool caused and eventually removing it. They then spent millions more reinventing the wheel and maintaining it.

It’s a worst-case scenario, but it’s all too common and happening more often as dynamic configuration becomes a must for most software companies. Six years after Pete Hodgson wrote his viral post on feature toggles (they weren’t yet feature flags to everyone), flags and dynamic configuration aren’t the new thing. They’re the only way to build. As a result, what was one company’s cascading-errors story a few years ago is now a regular occurrence at newer startups whose products catch on as virally as they dared dream.

Lekko was built with those lessons in mind. Not only does dynamic configuration have to be smart and reliable, it has to be architected and productized so that it can scale with a scrappy startup that grows into a global household name.