Kafka Summit San Francisco 2017

Aug 28, 2017 | San Francisco

Site Reliability Engineer / DataOps (Treasure Data)

We are seeking a Site Reliability Engineer to join our growing SRE team as the 3rd member, playing an essential role in maturing the company’s approach to service reliability and continuity.

You and your team will be directly responsible for solutions around the reliability of the platform, including availability, latency, performance, efficiency, capacity planning, and incident response.

You will work with engineering teams on complex problems and projects where analyzing a situation or data requires an in-depth evaluation of multiple factors, and you must be able to make wise trade-offs between competing concerns.

You have a passion for helping others and making their lives better: in doing so, you seek to simplify complex systems to make them understandable and operable. You are able to communicate decisions, ideas, designs, and the operation of systems and services to others clearly and concisely.

You are both a generalist, capable of picking up and working with multiple, disparate systems, and an expert, having an ability to dive deep into specific topics and quickly master them. You comfortably move between system, service, and instance level views.

You have a love of stateful systems containing Treasured data, ensuring we continue to protect customer data from loss caused by outages.

Things you will do:

  • Help us measure and improve reliability across the product line by working with product owners and engineering teams.
  • Improve and maintain site reliability, availability, scalability, and system performance.
  • Investigate system performance, errors, and problems.
  • Make wise decisions balancing availability and delivery, and communicate those decisions clearly.
  • With our team, be responsible for the systems you build.
  • Work with engineering teams as a subject matter expert on operating software and systems at scale, teaching them from your experience or know-how, and helping them reach their goals.

Your background and skills will include:

  • A minimum of 5 years of relevant working experience.
  • Excellent knowledge and experience in Systems Engineering, Administration, and Operations.
  • Strong Software Engineering experience, including the ability to work in multiple programming languages.
  • Experience with Distributed Systems and operating them as they scale.
  • Experience operating services running in the cloud (AWS primarily) or virtualized API-driven platforms.
  • Articulate and personable with strong spoken and written English language abilities.
  • Demonstrated ability to work independently and collaboratively as part of a specialized team.

We would be thrilled if you:

  • Have experience automating datastore operations or datastores as a service.
  • Are well versed in PostgreSQL database management.
  • Have experience analyzing system-wide performance: latency, throughput, and efficiency.
  • Have experience working as part of a distributed or partially distributed team and thrive in a highly collaborative and communicative work environment.
  • Pride yourself on giving back to your community: open source contributions, speaking, teaching, mentoring, helping others.

