Infrastructure Engineer III

Date: Apr 15, 2024

Location: LAKE FOREST, IL, US, 60045-5202

Company: Grainger Businesses

About Grainger:

Grainger is a broad line distributor with operations in North America, Japan and the United Kingdom. We achieve our purpose, We Keep the World Working®, by serving more than 4.5 million customers with multiple products that keep their operations running and their people safe. Grainger also delivers services and solutions, such as technical support and inventory management, to save customers time and money.

We're looking for passionate people who can move our company forward. As one of the 100 Best Companies to Work For, we have a welcoming workplace where you can build a career for yourself while fulfilling our purpose to keep the world working. We embrace new ways of thinking and recognize everyone is an individual. Find your way with Grainger today.

Position Details:

The Infrastructure Engineer specializes in managing AWS-hosted Kubernetes (K8s) platforms engineered primarily for machine learning (ML) training, experimentation, and serving. You are tasked with ensuring a robust and scalable infrastructure that supports advanced ML workloads. Additionally, you will be responsible for the implementation and management of our monitoring ecosystem using Grafana, Loki, Prometheus, and Thanos, as well as maintaining continuous deployment via GitOps best practices with ArgoCD and Flux.

You build, test, implement, configure, tune, and support the Kubernetes infrastructure in the Cloud, including server platforms, storage systems, middleware infrastructure, network, and client technologies.

You drive the physical design, implementation, and support of major automation solutions in a multiplatform environment, recommend usability improvements for automated tools, and identify opportunities to increase adoption of orchestration technologies.

At Grainger, our team members have an opportunity to work in one of the largest and most complex SAP-centric, 24x7 e-commerce environments and gain knowledge and experience with many SAP and other application modules running on-prem and in the Cloud. Our Machine Learning Operations team is seeking an experienced Platform or Site Reliability Engineer to support the ML Platform.

You Will:

  • Maintain and tune Kubernetes clusters on AWS for ML workloads, ensuring they perform well for high-performance computing tasks such as training and serving ML models.
  • Deploy and manage monitoring tools including Grafana, Loki, Prometheus, and Thanos to ensure system health and performance visibility.
  • Maintain continuous deployment systems using ArgoCD and Flux, implementing GitOps workflows for operational efficiency.
  • Develop comprehensive documentation, user guides, and tutorials specifically tailored for ML practitioners utilizing the K8s platform.
  • Deliver training sessions and workshops for a diverse user base, enhancing their ability to use the Kubernetes platform and associated tooling effectively.
  • Collaborate closely with development teams to enhance CI/CD pipelines and implement cloud-native solutions.
  • Actively manage Kubernetes resources including networking, storage, and security configurations.
  • Evaluate and implement updates within the Kubernetes ecosystem, applying best practices to maintain and enhance platform robustness.

You Have:

  • Bachelor’s degree in information technology, computer science, or related field.
  • 4+ years’ experience with Kubernetes.
    • 4+ years’ experience with Prometheus is a plus.
    • 2+ years’ experience with GitOps practices is a plus.
  • Self-motivated, with strong teamwork, interpersonal, and communication skills (both verbal and written) and a positive attitude. Works with moderate autonomy and is usually supervised, depending on the project.
  • Strong problem-solving, analytical, and debugging/resolution skills, and the ability to think algorithmically.
  • Solid understanding of mission-critical production environments, including the requirement for high availability and team-oriented 24x7 support capabilities.
  • Strong time management skills, with the ability to prioritize across multiple simultaneous projects in an Agile environment.

Rewards and Benefits:

With benefits starting day one, Grainger is committed to your safety, health and wellbeing. Our programs provide choice and flexibility to meet our team members' individual needs. Check out some of the rewards available to you at Grainger.

  • Medical, dental, vision, and life insurance plans
  • Paid time off (PTO) and 6 company holidays per year
  • Automatic 6% 401(k) company contribution each pay period
  • Employee discounts, parental leave, 3:1 match on donations, and tuition reimbursement
  • A comprehensive set of emotional, financial, physical and social wellbeing programs

DEI Statement:

We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender, gender identity or expression, or veteran status. We are proud to be an equal opportunity workplace.