Site Reliability Engineer
Location: US-WA-Redmond
Responsibilities include but are not limited to:
- Monitor and maintain the Reliability, Availability, and Performance of the Cosmos DB service.
- Design and implement Disaster Recovery and Business Continuity plans.
- Collaborate with engineering teams to build and enhance tooling and automation solutions that help achieve SLOs and improve customer supportability.
- Work closely with customers and customer support teams to understand their pain points around supportability and address recurring issues in a sustainable way.
- Enhance the reliability of the service through proactive alerting based on utilization, trends, resource health, etc.
- Use PowerShell scripting and Windows Services infrastructure to automate day-to-day activities.
- Implement alerts and Geneva automation.
Basic Qualifications:
- Bachelor's degree in computer science, engineering, or related technical field
- 5+ years of experience as a Service Reliability Engineer or Software Engineer, running large-scale cloud services.
- 3+ years of operational experience in improving Service Reliability, Availability, and Performance
- Experience with ARM Templates and Azure PowerShell
- Strong programming skills in Python or C#
- Experience with cloud platforms such as Azure
- Ability to work independently and collaborate effectively with cross-functional teams.
- Strong problem-solving skills and ability to deal with ambiguity in a fast-paced environment.
- Excellent communication skills and ability to communicate on a deep technical level.
- Experience with monitoring and alerting tools such as Jarvis, Grafana or Prometheus preferred.
- Experience with Logic Apps, Azure Data Explorer and authoring Jupyter Notebooks preferred.
- Familiarity with Microsoft internal tools: ICM, Geneva, SAW
T-Stone Technologies Inc