Voyager is an innovative AI system designed specifically for science and engineering research at scale. It supports research in science and engineering that increasingly depends on artificial intelligence and deep learning as a critical element of the experimental and/or computational work.
Login to SDSC Voyager AI System
Voyager uses ssh key pairs for access.
Approved users will need to send their SSH public key to consult@sdsc.edu to gain access to the system.
To log in to Voyager from the command line, use the hostname:
login.voyager.sdsc.edu
The following are examples of Secure Shell (ssh) commands that may be used to log in:
ssh <your_username>@login.voyager.sdsc.edu
ssh -l <your_username> login.voyager.sdsc.edu
Notes and hints
- Voyager does not maintain local passwords; your public key must be appended to your ~/.ssh/authorized_keys file to enable access from authorized hosts. RSA, ECDSA, and ed25519 keys are accepted. Make sure the private key on your local machine is protected by a strong passphrase.
- You can use ssh-agent or keychain to avoid repeatedly typing the private key passphrase (see the sketch after this list).
- Hosts that connect over SSH more than ten times per minute may be blocked for a short period of time.
- Do not use the login node for computationally intensive processes, as a host for running workflow management tools, as a primary data transfer node for large or numerous data transfers, or as a server providing other services accessible to the Internet. The login nodes are meant for file editing, simple data analysis, and other tasks that use minimal compute resources. All computationally demanding jobs should be run through Kubernetes.
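As a minimal sketch using standard OpenSSH tools (the key file name is an illustrative choice, not a Voyager requirement), generating a key and loading it into an agent looks like this; the public key is what you send to consult@sdsc.edu:

```
# Generate an ed25519 key pair; choose a strong passphrase when prompted.
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_voyager

# Start an agent for this shell session and add the key once,
# so the passphrase is not requested on every login.
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519_voyager

# This is the public key to send to consult@sdsc.edu.
cat ~/.ssh/id_ed25519_voyager.pub
```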
MFA
Voyager does not maintain local passwords and relies entirely on SSH key pairs. Public keys must be appended to ~/.ssh/authorized_keys. RSA, ECDSA, and ed25519 keys are accepted, and a strong passphrase on the private key is required.
SSH Login
$ ssh <your_username>@login.voyager.sdsc.edu
File Transfer
| Supported Methods | Data Transfer Node | URL |
|---|---|---|
| GLOBUS (COMING SOON) | — | — |
Storage
File System
| Directory | Path | Quota | Purge | Backup | Notes |
|---|---|---|---|---|---|
| home | /home/username | 200 GB | - | /home and /voyager/projects file systems ARE NOT backed up | The home directory is limited in space and should be used only for source code storage. Users have access to 200 GB in /home and should keep $HOME usage under that quota. |
| scratch | - | - | - | Users are responsible for backing up all important data to protect against data loss at SDSC. | The compute nodes on Voyager have access to fast flash storage. The latency to the SSDs is several orders of magnitude lower than that of spinning disk (<100 microseconds vs. milliseconds), making them ideal for user-level checkpointing and applications that need fast I/O. |
| ceph | /voyager/ceph/users/username | 3 PB | - | System is NOT backed up | Every Voyager node has access to a 3 PB Ceph parallel file system with 140 GB/s performance (/voyager/ceph/users/$USER). This IS NOT an archival file system. |
| projects | /voyager/projects/project/username | 153 TB | - | Users are responsible for backing up all important data to protect against data loss at SDSC. | NFS-mounted project space |
External Storage
- Ceph Parallel File System (/voyager/ceph): This is the primary high-performance storage for large datasets. It is "external" in that it resides on a dedicated storage cluster accessible by all nodes.
- Project Storage (NFS): Shared storage used for collaborative projects, providing a single scalable namespace accessible across multiple SDSC systems.
- Home Directory (/home): Persistent network storage for source code and small files, limited to 200 GB (see the usage check below).
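As a minimal sketch (assuming the paths listed above and standard coreutils on the login node), you can check how much of each area you are using before launching jobs:

```
# Summarize usage of the home directory (200 GB quota).
du -sh "$HOME"

# Summarize usage of your Ceph area; the path follows the storage table above.
du -sh /voyager/ceph/users/"$USER"

# Show overall capacity and free space of the underlying file systems
# (assumes /voyager/ceph is a mount point visible from the login node).
df -h "$HOME" /voyager/ceph
```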
Jobs
Voyager runs Kubernetes. Kubernetes is an open-source platform for managing containerized workloads and services. A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. The application workloads are executed by placing containers into Pods to run on nodes. The resources required by the Pods are specified in YAML files.
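As a minimal sketch of what such a YAML file looks like (the pod name, container image, and resource amounts here are placeholders, not Voyager defaults), a simple Pod that requests CPU and memory can be written and submitted like this:

```
# Write a minimal Pod manifest; names, image, and resource amounts are illustrative only.
cat <<'EOF' > example-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  restartPolicy: Never
  containers:
  - name: main
    image: ubuntu:22.04
    command: ["bash", "-c", "echo hello from voyager"]
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        cpu: "2"
        memory: 8Gi
EOF

# Submit the Pod, watch its status, read its output, then clean up.
kubectl apply -f example-pod.yaml
kubectl get pods
kubectl logs example-pod
kubectl delete pod example-pod
```

Gaudi training and inference jobs additionally request the Habana accelerator resource exposed by the device plugin; the exact resource name and recommended container images are covered in the job examples linked below.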
For compute, inference, or gaudi examples, see: Basic Jobs
Queue specifications
| Name | Purpose | CPUs | GPUs | RAM | Jobs (30 days) | Wait Time (30-day trend) | Wall Time (30-day trend) |
|---|---|---|---|---|---|---|---|
| inference | Dedicated for Habana Gaudi model inference, utilizing 2 first-generation nodes. | 2 | - | 3.2TB | — | — | — |
| gaudi | Designed for high-performance AI training using 42 Intel Habana Gaudi nodes, each with 8 training processors | 2 | - | 6.4TB | — | — | — |
| compute | Includes 36 Intel x86 nodes for general-purpose pre/post-data processing. | 2 | - | 3.2TB | — | — | — |
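The queue names above correspond to different node types in the cluster. As a hedged sketch (the exact node labels and allocatable resource names are site-specific and should be verified on the system), you can inspect what the cluster exposes with standard kubectl commands:

```
# List the nodes in the cluster and their status.
kubectl get nodes

# Show the labels attached to each node; labels distinguish node types
# (e.g., Gaudi training vs. x86 compute) and can be used in a Pod's nodeSelector.
kubectl get nodes --show-labels

# Describe a single node to see its allocatable CPUs, memory, and accelerator resources.
kubectl describe node <node-name>
```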