
High Performance Computer Clusters

High Performance Computing, or HPC, refers to computers running in clusters to handle workloads that demand large amounts of CPU, memory, and/or GPU resources. Clusters are set up so that users get access to as many of those resources as possible without resource contention, which is accomplished through job scheduling software. They are often interconnected with a high-speed, low-latency Infiniband back end to allow for fast storage access and/or multi-node MPI jobs that require Remote Direct Memory Access (RDMA).
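
As a rough illustration, the sketch below shows the kind of multi-node MPI job such a cluster exists to run. It assumes Python with the mpi4py package (C and Fortran MPI codes are at least as common); on an Infiniband fabric, the underlying MPI library can move the message between nodes over RDMA.

  # hello_mpi.py - minimal two-rank message exchange (assumes mpi4py is installed)
  from mpi4py import MPI

  comm = MPI.COMM_WORLD   # communicator spanning every rank in the job
  rank = comm.Get_rank()  # this process's ID within the job
  size = comm.Get_size()  # total number of ranks, possibly spread across nodes

  if size < 2:
      raise SystemExit("Run with at least 2 ranks, e.g. mpirun -n 2 python hello_mpi.py")

  if rank == 0:
      # Rank 0 sends a payload; over Infiniband this transfer can use RDMA.
      comm.send({"greeting": "hello from rank 0"}, dest=1, tag=11)
  elif rank == 1:
      msg = comm.recv(source=0, tag=11)
      print(f"rank 1 received: {msg}")

Launched with something like mpirun -n 2 python hello_mpi.py, or more typically through the cluster's job scheduler, each rank runs as its own process, potentially on a different node.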

Infrastructure Nerds has experience managing HPC clusters using several job schedulers running on multiple operating systems, at scales ranging from two nodes to thousands. Our DevOps approach ensures consistency across the computing environment, and our storage experience keeps your data available where you need it to be.

Use Cases

Here are a few of the possible use cases for HPC clusters:

  • Scientific Research
  • Video Encoding
  • Artificial Intelligence (AI) / Machine Learning (ML)

Job Schedulers

There are many job schedulers out there. Here are the ones that Infrastructure Nerds has managed (a brief submission sketch follows the list):

  • Slurm
  • Sun Grid Engine (SGE)
  • TORQUE (a fork of OpenPBS)
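
To show what interacting with a scheduler looks like, here is a hedged sketch of submitting a simple Slurm job from Python. It assumes sbatch is on the PATH, and the partition name "compute" is a placeholder that will differ per cluster.

  # submit_job.py - sketch of submitting a Slurm job programmatically
  import subprocess

  # Ask Slurm for 2 nodes, 32 tasks per node, and a 1-hour limit; the
  # partition name is an assumption and varies from cluster to cluster.
  result = subprocess.run(
      [
          "sbatch",
          "--job-name=example",
          "--nodes=2",
          "--ntasks-per-node=32",
          "--time=01:00:00",
          "--partition=compute",
          "--wrap=srun hostname",
      ],
      capture_output=True,
      text=True,
      check=True,
  )

  # On success, sbatch prints something like "Submitted batch job 12345".
  print(result.stdout.strip())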

Our Approach

What sets us apart from most cluster managers is our commitment to DevOps practices within the HPC space. Many cluster operators manage their nodes from system images, which can go stale and introduce security and compliance risks. We prefer to treat HPC nodes and their management servers as just another 'role' of server within our configuration management environment, subject to the same regular compliance checks and update rules.

HPC cluster nodes aren't special snowflakes that need their own methodologies, at least not until you scale beyond a thousand nodes. They're just servers with a specific use case, and should be treated as such.
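
As a purely hypothetical illustration of that role-based mindset (not any specific configuration-management tool), the sketch below keeps HPC compute and head nodes in the same inventory and role definitions as every other server class, so the same compliance reporting applies across the board.

  # roles.py - hypothetical sketch: HPC nodes share the same role machinery
  # as every other server class; the names below are illustrative only.

  ROLES = {
      "webserver":   {"packages": ["nginx"],             "patch_window": "weekly"},
      "database":    {"packages": ["postgresql"],        "patch_window": "weekly"},
      "hpc_compute": {"packages": ["slurm", "openmpi"],  "patch_window": "weekly"},
      "hpc_head":    {"packages": ["slurm", "slurmdbd"], "patch_window": "weekly"},
  }

  def compliance_report(inventory: dict[str, str]) -> None:
      """Print the expected package set and patch cadence for every host.

      `inventory` maps hostname -> role name; in a real environment this
      would come from the configuration-management system's inventory.
      """
      for host, role in sorted(inventory.items()):
          spec = ROLES[role]
          print(f"{host}: role={role} packages={spec['packages']} "
                f"patch_window={spec['patch_window']}")

  if __name__ == "__main__":
      compliance_report({
          "web01":   "webserver",
          "node001": "hpc_compute",
          "node002": "hpc_compute",
          "head01":  "hpc_head",
      })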