#MesosCon2015 – Siri

Part 3 of my four-part series on #MesosCon2015: Apple’s Siri, presented by Robert Lacroix and Brian Sumner.

Among all the great examples of companies using Mesos, and all of the projects that improve or simplify using it, one fact alone (Apple uses it) was the eye-opener that told me “okay, this is really good stuff.” Maybe I’m a little bit of a fanboy. I still think it’s a valid data point. Incidentally, I would have loved to have been at the meeting where they decided on the name “Siri”. I imagine it went something like this: “Wow, that’s an incredible idea! What are we going to call it?” “How about Siri?” “Are you kidding?” “No, I’m totally Sirious.”

Siri as we now know it is the third generation of Siri. It is built on Mesos, is dynamic and highly distributed, and includes more than 100 different services.

Siri faced some common challenges in previous generations. It required high elasticity to scale with demand, much of the existing code was Siri-specific, it was complex, and there was a high degree of operational overhead. To overcome these challenges, first they built a working proof-of-concept on Mesos. They then socialized the idea internally, deploying to a full-scale QA environment and testing it out with a friendly audience. Finally, they committed to a timeline and plan to get it to production. This allowed them to set a bar and ship it, rather than eternally iterating on making it better.

The impact of making the move was positive. It gave them bare-metal performance. It was less complex, with fewer failures. The platform was application aware, dynamically allocating resources as needed, without being application specific. They saw faster deployments, cost reductions, and shorter time to production.

The big picture of what they learned is that Mesos scales. Some tips: Don’t change agent attributes. Don’t panic if the master goes down. Keep your agents running. Use newer kernels. Set proper timeouts (I don’t recall whether it was from this talk or another, but I heard more than once that a reasonable timeout is on the scale of months or years).

They recommended a few application considerations: Make few assumptions about the runtime environment. Use service discovery. Provide instrumentation.

Apple created their own scheduler for Mesos before some of the current OSS offerings, such as Singularity, became available. Their scheduler uses Java 8 to take advantage of lambdas and the Stream API. It is completely generic and provides a REST API. Notably, it implements task dependencies between services, which makes it possible to deploy a backend before a frontend, for example.
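To make the task-dependency idea a bit more concrete, here is a minimal sketch of how a deployment order could be derived from declared dependencies. This is my own illustration in Python, not Apple’s scheduler (which is written in Java and was not shown in detail). The function name `deployment_order` and the service names are hypothetical.

```python
from collections import deque


def deployment_order(dependencies):
    """Return services ordered so that each one comes after its dependencies.

    `dependencies` maps a service name to the services that must be
    running before it. This is a plain topological sort (Kahn's algorithm),
    an assumed approach rather than the one described in the talk.
    """
    deps = {svc: set(needed) for svc, needed in dependencies.items()}
    for svc, needed in deps.items():
        unknown = needed - deps.keys()
        if unknown:
            raise ValueError(f"{svc} depends on unknown services: {sorted(unknown)}")

    # Count unmet dependencies and remember who is waiting on whom.
    remaining = {svc: len(needed) for svc, needed in deps.items()}
    dependents = {svc: [] for svc in deps}
    for svc, needed in deps.items():
        for dep in needed:
            dependents[dep].append(svc)

    # Services with nothing left to wait for can be deployed right away.
    ready = deque(sorted(svc for svc, count in remaining.items() if count == 0))
    order = []
    while ready:
        svc = ready.popleft()
        order.append(svc)
        for waiting in sorted(dependents[svc]):
            remaining[waiting] -= 1
            if remaining[waiting] == 0:
                ready.append(waiting)

    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical services: the frontend needs the backend, which needs a database.
services = {
    "frontend": {"backend"},
    "backend": {"database"},
    "database": set(),
}
print(deployment_order(services))
```

A real scheduler would also wait for each service to report healthy before launching its dependents; the sketch only covers the ordering.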

They also implemented a novel strategy for zero downtime with fast-rolling restarts. When an offer is accepted, the task runs. Once health checks pass, capacity is announced and the resources are dynamically shifted from the old processes to the new ones. This allows faster restarts: new services start and old ones stop as fast as the framework will allow individual services to start, pass their tests, and then stop the corresponding deprecated services.
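The ordering of those steps is the important part, so here is a rough sketch of it in Python. It is a simplification under my own assumptions: tasks are replaced one at a time for readability, whereas the talk described doing this as fast as the framework allows, and the callbacks `start_task`, `is_healthy`, and `stop_task` are hypothetical stand-ins for whatever the scheduler and executor actually do.

```python
def rolling_restart(old_tasks, start_task, is_healthy, stop_task):
    """Replace old tasks with new ones without dropping serving capacity.

    An old task is stopped only after its replacement has passed health
    checks and been announced. Returns the tasks serving traffic afterwards
    and a log of events.
    """
    serving = list(old_tasks)
    events = []
    for old in old_tasks:
        new = start_task(old)  # offer accepted, new task runs
        events.append(("started", new))
        if not is_healthy(new):
            # Give up and leave the remaining old tasks in place.
            stop_task(new)
            events.append(("aborted", new))
            break
        serving.append(new)  # health checks passed: announce capacity
        events.append(("announced", new))
        stop_task(old)  # only now retire the old task
        serving.remove(old)
        events.append(("stopped", old))
    return serving, events


# Hypothetical usage with in-memory stand-ins for the real callbacks.
stopped = []
serving, events = rolling_restart(
    ["web-1", "web-2"],
    start_task=lambda old: old + "-new",
    is_healthy=lambda task: True,
    stop_task=stopped.append,
)
print(serving)
print(stopped)
```

Because the new task is announced before the old one is stopped, the number of serving tasks never falls below where it started, which is what makes the restart zero-downtime in this sketch.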

Some of the challenges they faced with their scheduler were performance at scale, reconciliation, making the rolling restart logic reliable, process management in the executor, and maintenance of state. They found that unique task IDs helped immensely, that unit and integration tests were essential, and that it is vitally important to decline offers permanently. We were told that the last of these, declining offers permanently, is a workaround until optimistic offers become available.
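My reading of why unique task IDs help so much: if an ID is never reused, a late status update for a task that has already been replaced cannot be mistaken for its successor. The sketch below illustrates that idea in Python. The `TaskTable` class and the ID format are my own assumptions, and the state strings merely mirror Mesos task state names.

```python
import uuid


def new_task_id(service):
    """Build an ID that is never reused, even when the same service restarts."""
    return f"{service}.{uuid.uuid4().hex}"


class TaskTable:
    """Hypothetical in-memory record of the tasks a scheduler is tracking."""

    def __init__(self):
        self.states = {}

    def launch(self, service):
        task_id = new_task_id(service)
        self.states[task_id] = "TASK_STAGING"
        return task_id

    def forget(self, task_id):
        self.states.pop(task_id, None)

    def on_status_update(self, task_id, state):
        """Apply an update; return False if the task is no longer tracked."""
        if task_id not in self.states:
            return False  # stale update for a replaced task: ignore it
        self.states[task_id] = state
        return True


table = TaskTable()
first = table.launch("backend")
table.forget(first)  # the first instance was replaced
second = table.launch("backend")

# A late update for the old instance does not touch the new one.
print(table.on_status_update(first, "TASK_FAILED"))
print(table.states[second])
```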

Stay tuned for the final post in the series, where I talk about Adrian Cockcroft’s keynote presentation!