joblet.ai

AI-powered job search connecting talent with opportunity.

ELEVEN AI, Inc.
200 Continental Drive, Suite 401
Newark, DE 19713

© 2026 ELEVEN AI, Inc. joblet.ai is a product of ELEVEN AI, Inc. All rights reserved.

Overview

Company: Echo IT Solutions
Location: all cities, DC 8
Employment type: On-site
Echo IT Solutions (Verified Employer)

Business Services & Consulting • all cities, DC 8

GCP Supercomputer Solutions Support (8)

all cities, DC 8 • On-site • Posted 1 day ago
Business Services & Consulting

About the Role

Google Cloud Platform Supercomputer Solutions

Google is seeking a supplier to provide engineering, maintenance, and enhancement services for its Google Cloud Platform ("GCP") Supercomputer Solutions. The supplier will be responsible for supporting and enhancing two key product areas: Cluster Toolkit and HyperCompute Cluster Service (HCS). This work involves a combination of ongoing operational tasks, testing, documentation, and specific development deliverables.

Scope of Work & Deliverables

The supplier will be responsible for the services and deliverables detailed below.

Ongoing Maintenance
  • The contractor must provide ongoing maintenance and enhancements for all 6 projects covered under the original Statement of Work.
Cluster Toolkit

Cluster Toolkit is an open-source software solution that simplifies the deployment of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads on Google Cloud.

Ongoing Responsibilities:
  • Stability Testing: Test the stability of new products, beginning with A3U. This includes:
    • Building NVIDIA Collective Communications Library (NCCL) tests on a Slurm cluster.
    • Setting up and running pairwise tests to identify and report bad nodes.
  • Integration Test Triage: Perform rotational duties to manage and triage integration test failures. This includes:
    • Monitoring daily failure chats and flake tools.
    • Reporting on failures and performing advanced handling, such as creating new bug reports and categorizations.
  • Documentation: Improve, organize, and maintain the Cluster Toolkit documentation. This process involves:
    • Gathering existing documents and identifying information gaps.
    • Creating new documentation and updating existing materials.
    • Organizing the information in g3docs, consolidating it in a team Google Drive, and establishing a review process.
  • Project Cleanup: Once a week, clean up the 'hpc-toolkit-dev' project by identifying and deleting unused resources.
  • Security: Triage and address security alerts: check for new alerts, create pull requests (PRs) to resolve them, and apply the necessary updates.
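
As an illustration of the pairwise stability testing described above, the following is a minimal sketch of how pairwise results might be reduced to a list of suspect nodes. It is an assumption-laden example, not the team's actual tooling: the node names, the `flag_bad_nodes` helper, and the flagging rule (a node is suspect if it failed at least twice and never passed with any partner) are all hypothetical, and the pass/fail results are simulated rather than produced by real NCCL test runs on a Slurm cluster.

```python
from collections import defaultdict
from itertools import combinations


def flag_bad_nodes(results):
    """Return nodes that failed in two or more pairings and never passed.

    `results` is an iterable of (node_a, node_b, passed) tuples, one per
    pairwise test. The rule is a hypothetical heuristic: a healthy node
    paired with a bad one also fails that test, so a single failure is
    not enough to flag a node.
    """
    passed = defaultdict(int)
    failed = defaultdict(int)
    for node_a, node_b, ok in results:
        for node in (node_a, node_b):
            if ok:
                passed[node] += 1
            else:
                failed[node] += 1
    return sorted(n for n in failed if failed[n] >= 2 and passed[n] == 0)


# Simulated run: every pair of four hypothetical nodes is tested once,
# and any pair containing the simulated faulty node fails.
nodes = ["node-0", "node-1", "node-2", "node-3"]
faulty = {"node-2"}
results = [
    (a, b, a not in faulty and b not in faulty)
    for a, b in combinations(nodes, 2)
]
bad = flag_bad_nodes(results)
print(bad)
```

In a real setup, the simulated `results` list would be replaced by the outcomes of actual two-node test jobs; how those jobs are launched and parsed is outside what the posting specifies.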
Key Deliverables:
  • HPC VM Image Releases: Deliver 4-6 High-Performance Computing Virtual Machine (HPC VM) image releases during 2025.
  • Software Widget Releases: Release new software widgets every two weeks during 2025, including managing any necessary hotfixes.
HyperCompute Cluster Service (HCS)

HCS is a service that enables the deployment and management of resilient, high-performance AI and HPC systems at scale.

Key Deliverables:
  • API Integration Testing: Add comprehensive integration tests for all HCS Application Programming Interface (API) surfaces. Coverage must include:
    • HypercomputeClusters: Create, Delete, Update, Get, and List requests and responses.
    • Network: NetworkInitialize params.
    • Storage: StorageInitialize, FileStoreInitialize, Filestore tier, ParallelstoreInitialize, and GcsInitialize params.
    • Compute: Resource request, Guest accelerator, Disk, Provisioning model, Reservation affinity and type, Orchestrator, Slurm, Node test, Storage configuration, and Slurm partition.
  • Critical User Journey (CUJ) Validation: Add integration tests to validate the following critical user journeys:
    • Creating a cluster that consumes a reservation.
    • Creating a cluster with a new network and new storage.
    • Creating a cluster using a pre-existing network and storage created both outside of HCS and by a previous HCS deployment.
    • Destroying all components of an HCS-created cluster.
    • Destroying a cluster while leaving the network and storage intact.
    • Updating a Slurm cluster to add a new reservation to both new and existing partitions.
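
To make the shape of such CUJ integration tests concrete, here is a minimal sketch using Python's standard `unittest` module. Everything service-related in it is an assumption: `FakeClusterService`, its methods, and the `keep_dependencies` flag are hypothetical stand-ins written for this example and do not reflect the real HCS API, which the posting does not describe. Real integration tests would call the actual service instead of an in-memory fake.

```python
import unittest


class FakeClusterService:
    """In-memory stand-in for a cluster service (hypothetical, not HCS)."""

    def __init__(self):
        self.clusters = {}
        self.networks = set()
        self.storage = set()

    def create_cluster(self, name, network, storage):
        if name in self.clusters:
            raise ValueError(f"cluster {name!r} already exists")
        # New resources are created; pre-existing ones are simply reused.
        self.networks.add(network)
        self.storage.add(storage)
        self.clusters[name] = {"network": network, "storage": storage}

    def delete_cluster(self, name, keep_dependencies=False):
        cluster = self.clusters.pop(name)
        if not keep_dependencies:
            self.networks.discard(cluster["network"])
            self.storage.discard(cluster["storage"])


class CujTests(unittest.TestCase):
    def setUp(self):
        self.svc = FakeClusterService()

    def test_create_with_new_network_and_storage(self):
        self.svc.create_cluster("c1", "net-1", "fs-1")
        self.assertIn("c1", self.svc.clusters)
        self.assertIn("net-1", self.svc.networks)
        self.assertIn("fs-1", self.svc.storage)

    def test_destroy_all_components(self):
        self.svc.create_cluster("c1", "net-1", "fs-1")
        self.svc.delete_cluster("c1")
        self.assertEqual(self.svc.clusters, {})
        self.assertEqual(self.svc.networks, set())
        self.assertEqual(self.svc.storage, set())

    def test_destroy_keeps_network_and_storage(self):
        self.svc.create_cluster("c1", "net-1", "fs-1")
        self.svc.delete_cluster("c1", keep_dependencies=True)
        self.assertNotIn("c1", self.svc.clusters)
        self.assertIn("net-1", self.svc.networks)
        self.assertIn("fs-1", self.svc.storage)

    def test_reuse_resources_from_previous_deployment(self):
        self.svc.create_cluster("c1", "net-1", "fs-1")
        self.svc.delete_cluster("c1", keep_dependencies=True)
        self.svc.create_cluster("c2", "net-1", "fs-1")
        self.assertEqual(self.svc.clusters["c2"]["network"], "net-1")


def run_cuj_suite():
    """Load and run the CUJ tests, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CujTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_cuj_suite()` runs the four tests and returns a result whose `wasSuccessful()` method reports whether all of them passed. Each test maps to one journey from the list above, which keeps a failure traceable to a single CUJ.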

Skills & Technologies

Business Services & Consulting
