Senior Platform Engineer

Bonsai is hiring a Senior Platform Engineer to help build, scale and support the underlying technical platform that helps us manage thousands of Elasticsearch clusters on AWS and GCP. Applications are open for the next two weeks, with a target start date of mid-June to early July.

About the job

“Hey, we’ve put your add-on in production. Good luck. Don’t crash.” —Heroku

The essence of platform engineering at Bonsai will be to operate and support Elasticsearch at scale. The emphasis here is more on the scale part than the Elasticsearch part, but you’ll definitely become intimate with Elasticsearch and Lucene along the way.

There are several key components involved. First we have Elasticsearch itself. Then a handful of proprietary plugins to enhance its functionality and support its operation. From there, the networking stack that handles connections and does diagnostic tracing. Telemetry and observability across the board. Finally our packaging and deployment, and internal services for fleet orchestration.

If that sounds like more than one person’s job, we agree. Dan is going to be particularly stoked to work with you 😉

You can think of this as similar to an “SRE” position. When there’s an issue with performance or reliability, you’ll dig in, trace requests, and analyze everything from load balancers down to memory managers, then help code and ship a patch to make the problem visible and make it better.

There’s a heavy dose of Java and Linux involved in all of this, but if you have some experience in systems programming in other languages, we can certainly teach all of that.

We’re a small team, but we punch above our weight in systems engineering and operations. Launching at the right place and time dropped us into the deep end of early adopters, and we’ve been scaling ever since. Fortunately, our early team was heavily engineering-minded. Our original founder was a database engineer at Twitter during their years of crazy scaling. We also hosted some massive sites like Pinterest, whose 100x growth on our platform was a true trial by fire.

This position does involve wearing the metaphorical pager in a rotation with other engineers on the team. We’re on call not because we expect to be woken up, but so we’re accountable for shipping systems that never need us to be!

Some example projects

  • Move decentralized server-initiated threshold alerts into a centralized time-series stream analysis service.
  • Build a continuous delivery service that performs gradual fleetwide rollouts of new and updated services, subject to canary stages and operational verification at certain checkpoints.
  • Build and package new versions of Elasticsearch OSS, and update our suite of plugins to use the latest plugin interfaces, including customer-supplied proprietary plugins.
  • Troubleshoot a customer-supplied Elasticsearch plugin with a performance hot-spot, trace the problem to a likely location and provide support and guidance to improve efficiency.
  • Diagnose problematic memory usage in a server-side agent, and port it from Ruby to Crystal to improve performance and resource usage.
  • Collaborate with Product engineers to build a data pipeline to support customer-facing metrics graphs.
  • Assist our customer support by triaging operational incidents and performing incident response.

The ideal person

We’re looking for someone experienced, who’s ready to dive in. You don’t need to be an Elasticsearch expert — you’ll learn all of that on the job. We’ll have plenty of conversations about how Lucene is really a data structures library optimized for disk access.

Experience with Java is most helpful, although C, C++ and Golang would be good starting points. We’ll also be looking for solid fundamentals in networking, disk access, memory management, and schedulers.

Several of our systems make heavy use of Netty, as does Elasticsearch itself. So familiarity with Netty or evented systems will be helpful.