joblet.ai

Overview

Company: Echo IT Solutions
Location: all cities, WA 48
Employment type: On-site
Echo IT Solutions (Verified Employer)

Business Services & Consulting • all cities, WA 48

GCP Supercomputer Solutions Support (48)

all cities, WA 48 • On-site • Posted 1 day ago

About the Role

Google Cloud Platform Supercomputer Solutions

Google is seeking a supplier to provide engineering, maintenance, and enhancement services for its Google Cloud Platform ("GCP") Supercomputer Solutions. The supplier will be responsible for supporting and enhancing two key product areas: Cluster Toolkit and HyperCompute Cluster Service (HCS). This work involves a combination of ongoing operational tasks, testing, documentation, and specific development deliverables.

Scope of Work & Deliverables

The supplier will be responsible for the services and deliverables detailed below.

Ongoing Maintenance
  • The contractor must provide ongoing maintenance and enhancements for all 6 projects covered under the original Statement of Work.
Cluster Toolkit

Cluster Toolkit is an open-source software solution that simplifies the deployment of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads on Google Cloud.

Ongoing Responsibilities
  • Stability Testing: Test the stability of new products, beginning with A3U. This includes:
    • Building NVIDIA Collective Communications Library (NCCL) tests on a Slurm cluster.
    • Setting up and running pairwise tests to identify and report bad nodes.
  • Integration Test Triage: Perform rotational duties to manage and triage integration test failures. This includes:
    • Monitoring daily failure chats and flake tools.
    • Reporting on failures and performing advanced handling, such as creating new bug reports and categorizations.
  • Documentation: Improve, organize, and maintain the Cluster Toolkit documentation. This process involves:
    • Gathering existing documents and identifying information gaps.
    • Creating new documentation and updating existing materials.
    • Organizing the information in g3docs, consolidating it in a team Google Drive, and establishing a review process.
  • Project Cleanup: Once a week, clean up the 'hpc-toolkit-dev' project by identifying and deleting unused resources.
  • Security: Triage and address security alerts by checking for them, creating pull requests (PRs) to resolve them, and applying the necessary updates.
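To illustrate the pairwise testing idea from the Stability Testing item, here is a minimal Python sketch. It assumes each pair test reduces to a pass/fail result; the function names, the node names, and the `min_failures` threshold are hypothetical, and `fake_pair_test` only stands in for launching a real two-node NCCL test through Slurm.

```python
from collections import Counter
from itertools import combinations


def find_suspect_nodes(nodes, run_pair_test, min_failures=2):
    """Return nodes that failed against at least `min_failures` partners.

    `run_pair_test(a, b)` is a caller-supplied function (hypothetical here)
    that returns True when the two-node test between a and b passes.
    """
    failures = Counter()
    for a, b in combinations(nodes, 2):
        if not run_pair_test(a, b):
            # A failed pair implicates both nodes; a truly bad node
            # accumulates failures across many different partners.
            failures[a] += 1
            failures[b] += 1
    return sorted(n for n, count in failures.items() if count >= min_failures)


def fake_pair_test(a, b, bad=frozenset({"node-2"})):
    """Stand-in for a real two-node NCCL run: fails if either node is 'bad'."""
    return a not in bad and b not in bad


suspects = find_suspect_nodes(
    ["node-0", "node-1", "node-2", "node-3"], fake_pair_test
)
print(suspects)
```

Testing every pair grows quadratically with cluster size, so a real rotation might test only a few partners per node; the exhaustive loop is used here only to keep the sketch short.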
Key Deliverables
  • HPC VM Image Releases: Deliver 4-6 High-Performance Computing Virtual Machine (HPC VM) image releases during 2025.
  • Software Widget Releases: Release new software widgets every two weeks during 2025, including managing any necessary hotfixes.
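As a rough illustration of what a two-week cadence implies, the Python sketch below lists release dates within 2025. The first release date (6 January 2025) is an assumption made only for the example; the posting does not state when the cadence starts.

```python
from datetime import date, timedelta


def biweekly_dates(first, year):
    """List dates every 14 days starting at `first`, staying within `year`."""
    dates = []
    current = first
    while current.year == year:
        dates.append(current)
        current += timedelta(days=14)
    return dates


# Assumed start date, chosen for illustration only.
schedule = biweekly_dates(date(2025, 1, 6), 2025)
print(len(schedule), schedule[0], schedule[-1])
```

Hotfixes fall outside this schedule by definition, so the list gives only the planned releases.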
HyperCompute Cluster Service (HCS)

HCS is a service that enables the deployment and management of resilient, high-performance AI and HPC systems at scale.

Key Deliverables
  • API Integration Testing: Add comprehensive integration tests for all HCS Application Programming Interface (API) surfaces. Coverage must include:
    • HypercomputeClusters: Create, Delete, Update, Get, and List requests and responses.
    • Network: NetworkInitialize params.
    • Storage: StorageInitialize, FileStoreInitialize, Filestore tier, ParallelstoreInitialize, and GcsInitialize params.
    • Compute: Resource request, Guest accelerator, Disk, Provisioning model, Reservation affinity and type, Orchestrator, Slurm, Node test, Storage configuration, and Slurm partition.
  • Critical User Journey (CUJ) Validation: Add integration tests to validate the following critical user journeys:
    • Creating a cluster that consumes a reservation.
    • Creating a cluster with a new network and new storage.
    • Creating a cluster using a pre-existing network and storage created both outside of HCS and by a previous HCS deployment.
    • Destroying all components of an HCS-created cluster.
    • Destroying a cluster while leaving the network and storage intact.
    • Updating a Slurm cluster to add a new reservation to both new and existing partitions.
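The Create, Get, List, Update, and Delete coverage described above could be exercised with a lifecycle test along the following lines. This is a sketch only: `FakeClusterService` is a hypothetical in-memory stand-in written for this example, not the HCS API, and the field names `reservation` and `nodes` are likewise assumptions.

```python
import unittest


class FakeClusterService:
    """Hypothetical in-memory stand-in for a cluster API (not the real HCS API)."""

    def __init__(self):
        self._clusters = {}

    def create(self, name, spec):
        if name in self._clusters:
            raise ValueError(f"cluster {name!r} already exists")
        self._clusters[name] = dict(spec)
        return self.get(name)

    def get(self, name):
        # Raises KeyError for unknown clusters.
        return dict(self._clusters[name])

    def list(self):
        return sorted(self._clusters)

    def update(self, name, **changes):
        self._clusters[name].update(changes)
        return self.get(name)

    def delete(self, name):
        del self._clusters[name]


class ClusterLifecycleTest(unittest.TestCase):
    def setUp(self):
        self.svc = FakeClusterService()

    def test_create_get_list_update_delete(self):
        created = self.svc.create("c1", {"reservation": "res-a", "nodes": 2})
        self.assertEqual(created["reservation"], "res-a")
        self.assertEqual(self.svc.get("c1"), created)
        self.assertEqual(self.svc.list(), ["c1"])
        updated = self.svc.update("c1", nodes=4)
        self.assertEqual(updated["nodes"], 4)
        self.svc.delete("c1")
        self.assertEqual(self.svc.list(), [])

    def test_get_missing_cluster_raises(self):
        with self.assertRaises(KeyError):
            self.svc.get("missing")


def run_suite():
    """Load and run the lifecycle tests, returning the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClusterLifecycleTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

A real integration suite would replace the fake with a client for a deployed service and add one test per critical user journey, such as creating a cluster that consumes a reservation and then destroying it while leaving the network and storage intact.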

Skills & Technologies

Business Services & Consulting
