Responsibilities include but are not limited to:

Monitor and maintain the Reliability, Availability, and Performance of the Cosmos DB service.
Design and implement Disaster Recovery and Business Continuity plans.
Collaborate with engineering teams to build and enhance tooling and automation solutions that help achieve SLOs and improve customer supportability.
Work closely with customers and customer support teams to understand their pain points around supportability and address recurring issues in a sustainable way.
Enhance the reliability of the service through proactive alerting based on utilization, trends, resource health, and similar signals.
Use PowerShell scripting and Windows Services infrastructure to automate day-to-day activities.
Implement alerts and Geneva automation.

Basic Qualifications:

Bachelor's degree in computer science, engineering, or a related technical field.
5+ years of experience as a Service Reliability Engineer or Software Engineer, running large-scale cloud services.
3+ years of operational experience improving service reliability, availability, and performance.
Experience with ARM templates and Azure PowerShell.
Strong programming skills in Python or C#.
Experience with cloud platforms such as Azure.
Ability to work independently and collaborate effectively with cross-functional teams.
Strong problem-solving skills and ability to deal with ambiguity in a fast-paced environment.
Excellent communication skills and ability to communicate on a deep technical level.
Experience with monitoring and alerting tools such as Jarvis, Grafana or Prometheus preferred.
Experience with Logic Apps, Azure Data Explorer and authoring Jupyter Notebooks preferred.
Familiarity with Microsoft internal tools: ICM, Geneva, SAW.

T-Stone Technologies Inc
