#MesosCon2015 – Hubspot

In part 2 of my #MesosCon blog posts, I cover the talk by Tom Petr of Hubspot, who gave an entertaining exposition of their stack and how they got there, with highlights of the trials and tribulations along the way.

The main takeaway was that they built a platform where the engineers own the end-to-end success of their products. The engineers make the tech decisions, and they wear the pagers.

When they started out, Hubspot developers developed locally, provisioned QA hardware, deployed via a local Python script, provisioned production hardware, deployed via a local Python script… and replaced hardware at 4am.

In the new way of doing things, they created a team responsible for the development and maintenance of a PaaS which empowers engineers with good tools and a solid foundation. They use Mesos to abstract away machines and promote a homogeneous environment with the ability to scale out specific processes and a centralized service repository.

Fun fact: The new EC2 M4 class machines were based on feedback from Hubspot!

Moving to a Mesos infrastructure was not universally accepted at first.

Some of the problems they encountered included discovering that some services *did* in fact depend on local filesystem state (after they thought they had eliminated every such dependency), keeping a single process running, dealing with stationary hosts, and isolating memory. To solve the memory problems, they added a simple formula to their scheduler to determine how much memory to allocate to a process:

max heap + (stack size × expected max threads) + (GC overhead pct × max heap) + JVM overhead + extra off-heap memory

The result: no more OOMs!

I may have transcribed that formula incompletely.
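To make the arithmetic concrete, here is a minimal sketch of that estimate in Python. It follows the terms as I noted them in the talk; the class name, field names, units (MB), and sample numbers are my own assumptions for illustration, not Hubspot's actual scheduler code.

```python
from dataclasses import dataclass


@dataclass
class JvmMemorySpec:
    """Inputs to the estimate. All names and units here are hypothetical."""
    max_heap_mb: float
    thread_stack_mb: float
    expected_max_threads: int
    gc_overhead_pct: float        # fraction of max heap, e.g. 0.125 for 12.5%
    jvm_overhead_mb: float
    extra_off_heap_mb: float = 0.0


def estimate_memory_mb(spec: JvmMemorySpec) -> float:
    """Sum the terms of the formula as transcribed from the talk."""
    return (
        spec.max_heap_mb
        + spec.thread_stack_mb * spec.expected_max_threads
        + spec.gc_overhead_pct * spec.max_heap_mb
        + spec.jvm_overhead_mb
        + spec.extra_off_heap_mb
    )


# Made-up sample values, purely to show the calculation.
example = JvmMemorySpec(
    max_heap_mb=1024,
    thread_stack_mb=1,
    expected_max_threads=200,
    gc_overhead_pct=0.125,
    jvm_overhead_mb=128,
    extra_off_heap_mb=64,
)
print(estimate_memory_mb(example))  # 1024 + 200 + 128 + 128 + 64 = 1544.0
```

Note that the process in this example needs roughly 50% more memory than its heap alone, which is the kind of gap that gets a container killed when the allocation is just a guess at the heap size.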

The point was that you can and should estimate the resources your processes will need and allocate for them consistently. Before moving to Mesos, engineers were guessing what size instances their applications would need, with slow turnaround and few options for graceful recovery.

Their biggest fear was inconsistency. Going from a world where everything was really direct (ssh’ing in and running scripts) to one where you create a service and trust that it will just run requires a certain leap of faith. What they learned along the way was that you have to make this leap in order to scale the system and still allow rapid development.

To build confidence in the new system, they implemented a two-phase commit process: Singularity reaches out to Baragon, which adds the new hosts to the load balancer, and the deploy is considered a success only when Baragon reports that all hosts were set successfully.
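The shape of that handshake can be sketched with a toy model. Everything below is hypothetical: `ToyLoadBalancer` and `deploy` are illustrative stand-ins, not the real Singularity or Baragon APIs, and the health check in the first phase is my assumption about what precedes the load balancer step.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLoadBalancer:
    """Stand-in for a load balancer manager; not Baragon's real API."""
    rejected_hosts: set = field(default_factory=set)
    active_hosts: set = field(default_factory=set)

    def add_hosts(self, hosts):
        """Apply all hosts or none; return True only if every host was set."""
        if any(host in self.rejected_hosts for host in hosts):
            return False
        self.active_hosts.update(hosts)
        return True


def deploy(new_hosts, is_healthy, load_balancer):
    """Report success only when both phases pass."""
    # Phase 1 (assumed): every new task must pass its health check.
    if not all(is_healthy(host) for host in new_hosts):
        return "FAILED"
    # Phase 2: the load balancer must confirm that all hosts were set.
    if not load_balancer.add_hosts(new_hosts):
        return "FAILED"
    return "SUCCEEDED"


lb = ToyLoadBalancer()
print(deploy(["host-a", "host-b"], lambda host: True, lb))
```

The property that matters is that a deploy is never marked successful on the scheduler's word alone; the component that actually routes traffic has to confirm it too.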

When they did get buy-in from developers to switch to the new Mesos framework, what they saw was pretty cool. They started transitioning in July 2014 and completed the migration in February 2015. Apologies for my potato photo.

So, in addition to creating a consistent environment that empowered their developers to create a great product, they ended up vastly increasing their efficiency and saving a bunch of money. Nice!

Tom offered one last bit of advice about deploying a new platform:

“Finally, when you’ve worked out the kinks and you’re ready to roll it out, give it a new version number. Engineers love new version numbers.”
