The Data Infrastructure group within the AI/ML organization powers the analytics, experimentation, and ML feature engineering behind the Machine Learning technologies we all love in our devices. Our mission is to provide cutting-edge, reliable, and easy-to-use infrastructure for ingesting, storing, processing, and interacting with data while keeping users' data private and secure. The Core Infra team sits within the AI/ML Data Infrastructure group and is looking for an engineer who will bring a passion for scalability and efficiency to building world-class data infrastructure, enabling data engineers and scientists to produce world-class ML data products.

Key Qualifications

3+ years of experience scaling and operating distributed systems such as big data processing engines (e.g., Apache Hadoop, Apache Spark), distributed file systems (e.g., HDFS, Ceph, S3), streaming systems (e.g., Apache Flink, Apache Kafka), cluster management systems (e.g., Apache Mesos, Kubernetes), and access management systems (e.g., Apache Ranger, Sentry, OPA)
3+ years of experience with infrastructure as code and systems automation
Fluency in Java or a similar language
Ability to debug complex issues in large-scale distributed systems
Passion for building infrastructure that is reliable, easy to use, and easy to maintain
Excellent communication and collaboration skills
Experience with Spark and ETL processing pipelines is helpful, but not required
Experience with systems security, identity protocols and encryption is helpful, but not required

Description

The ideal candidate will have outstanding communication skills, proven data infrastructure design and implementation capabilities, strong business acumen, and an innate drive to deliver results. They will be a self-starter who is comfortable with ambiguity and enjoys working in a fast-paced, dynamic environment.

Responsibilities include:

Improve the efficiency of the large compute fleet in AI/ML. This compute infrastructure supports various data processing and analytics workloads across AI/ML in the cloud, including widely used engines like Spark, Flink, Kafka, and more.
Collaborate with partner teams to design, develop, and maintain scalable and efficient data infrastructure. This includes deploying Kubernetes clusters and setting up networking, storage, monitoring, and observability tools.
Optimize compute resources with smart autoscaling strategies. Streamline and fine-tune the allocation and utilization of compute resources within the fleet, ensuring maximum efficiency for all workloads.
Identify opportunities to improve the overall performance of the compute fleet, minimizing latency, enhancing throughput, and reducing bottlenecks.
Implement robust monitoring and reporting mechanisms to continuously track performance, resource utilization, and cost. Use this data to identify areas for further optimization and to proactively address any issues or inefficiencies.

VSG Business Solutions LLC
