Echo IT Solutions • Verified Employer

Business Services & Consulting • all cities, DC 8

GCP Supercomputer Solutions Support (8)

all cities, DC 8 • On-site • Posted 1 day ago

About the Role

Google Cloud Platform Supercomputer Solutions

Google is seeking a supplier to provide engineering, maintenance, and enhancement services for its Google Cloud Platform ("GCP") Supercomputer Solutions. The supplier will be responsible for supporting and enhancing two key product areas: Cluster Toolkit and HyperCompute Cluster Service (HCS). This work involves a combination of ongoing operational tasks, testing, documentation, and specific development deliverables.

Scope of Work & Deliverables

The supplier will be responsible for the services and deliverables detailed below.

Ongoing Maintenance

The contractor must provide ongoing maintenance and enhancements for all 6 projects covered under the original Statement of Work.

Cluster Toolkit

Cluster Toolkit is an open-source software solution that simplifies the deployment of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads on Google Cloud.

Ongoing Responsibilities:

Stability Testing: Test the stability of new products, beginning with A3U. This includes:

Building NVIDIA Collective Communications Library (NCCL) tests on a Slurm cluster.
Setting up and running pairwise tests to identify and report bad nodes.
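As an illustration of the pairwise approach, here is a minimal sketch that assumes each NCCL test run has already been reduced to a (node, node, passed) record. The function name, record shape, and node names are hypothetical and are not part of Cluster Toolkit or the NCCL tests. A node is flagged only if every pair it took part in failed, which is why each node should be tested with more than one partner.

```python
from collections import Counter
from typing import Iterable


def find_suspect_nodes(results: Iterable[tuple[str, str, bool]]) -> list[str]:
    """Return nodes that failed every pairwise test they took part in.

    `results` holds hypothetical (node_a, node_b, passed) records, one per
    pairwise run. A node paired with a bad partner will also see a failure,
    so a node is only reported if it never passed with any partner.
    """
    runs: Counter = Counter()
    failures: Counter = Counter()
    for node_a, node_b, passed in results:
        for node in (node_a, node_b):
            runs[node] += 1
            if not passed:
                failures[node] += 1
    return sorted(node for node in runs if failures[node] == runs[node])


# Two rounds with different pairings (made-up node names and outcomes).
sample_results = [
    ("n1", "n2", True),
    ("n3", "n4", False),
    ("n1", "n3", True),
    ("n2", "n4", False),
]
suspects = find_suspect_nodes(sample_results)
print(suspects)
```

In this sample, n3 fails once but passes with a different partner, so only n4 is reported.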

Integration Test Triage: Perform rotational duties to manage and triage integration test failures. This includes:

Monitoring daily failure chats and flake tools.
Reporting on failures and performing advanced handling, such as creating new bug reports and categorizations.
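One way to picture the categorization step is the sketch below, which sorts a test into a bucket based on its recent pass/fail history. The category names and the function are assumptions made for illustration; they do not describe the team's actual flake tools or bug categories.

```python
def classify_test(history: list[bool]) -> str:
    """Classify a test from its recent outcomes (True = pass), oldest first.

    The labels are hypothetical: "healthy" never failed, "broken" always
    failed, and "flaky" shows a mix of passes and failures.
    """
    if not history:
        return "no-data"
    failures = history.count(False)
    if failures == 0:
        return "healthy"
    if failures == len(history):
        return "broken"
    return "flaky"


# Made-up histories for three tests.
print(classify_test([True, True, True]))
print(classify_test([False, False, False]))
print(classify_test([True, False, True]))
```

A test labeled "broken" would typically warrant a new bug report, while a "flaky" one may need tracking over more runs before filing.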

Documentation: Improve, organize, and maintain the Cluster Toolkit documentation. This process involves:

Gathering existing documents and identifying information gaps.
Creating new documentation and updating existing materials.
Organizing the information in g3docs, consolidating it in a team Google Drive, and establishing a review process.

Project Cleanup: Once a week, clean up the 'hpc-toolkit-dev' project by identifying and deleting unused resources.
Security: Triage and address security alerts by checking for them, creating pull requests (PRs) to resolve them, and applying the necessary updates.

Key Deliverables:

HPC VM Image Releases: Deliver 4-6 High-Performance Computing Virtual Machine (HPC VM) image releases during 2025.
Software Widget Releases: Release new software widgets every two weeks during 2025, including managing any necessary hotfixes.

HyperCompute Cluster Service (HCS)

HCS is a service that enables the deployment and management of resilient, high-performance AI and HPC systems at scale.

Key Deliverables:

API Integration Testing: Add comprehensive integration tests for all HCS Application Programming Interface (API) surfaces. Coverage must include:

HypercomputeClusters: Create, Delete, Update, Get, and List requests and responses.
Network: NetworkInitialize params.
Storage: StorageInitialize, FileStoreInitialize, Filestore tier, ParallelstoreInitialize, and GcsInitialize params.
Compute: Resource request, Guest accelerator, Disk, Provisioning model, Reservation affinity and type, Orchestrator, Slurm, Node test, Storage configuration, and Slurm partition.

Critical User Journey (CUJ) Validation: Add integration tests to validate the following critical user journeys:

Creating a cluster that consumes a reservation.
Creating a cluster with a new network and new storage.
Creating a cluster using a pre-existing network and storage, covering both resources created outside of HCS and resources created by a previous HCS deployment.
Destroying all components of an HCS-created cluster.
Destroying a cluster while leaving the network and storage intact.
Updating a Slurm cluster to add a new reservation to both new and existing partitions.
