PSC Neocortex CS-2

Neocortex CS-2 provides newer and more powerful Cerebras CS-2 systems compared to the original CS-1 deployment, designed for larger models and more intensive AI training tasks. It supports faster experimentation with higher-performance hardware.


File Transfer

Supported Methods      Data Transfer Node URL
RSYNC (recommended)    data.neocortex.psc.edu
SCP                    data.neocortex.psc.edu
SFTP                   data.neocortex.psc.edu
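
For example, a typical rsync transfer from a local machine to Neocortex project storage might look like the following (PSC_USERNAME, GRANT_ID, and the paths are illustrative placeholders, not fixed values):

rsync -avP ./my_dataset/ PSC_USERNAME@data.neocortex.psc.edu:/ocean/projects/GRANT_ID/PSC_USERNAME/my_dataset/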

Storage

File System

Directory   Path                                   Quota             Purge   Backup   Notes
Ocean       /ocean                                 Allocation based  -       -        Primary large-scale shared storage
Project     /ocean/projects/<allocation>/<user>    Allocation based  -       -        User project storage
Jet         /jet                                   Allocation based  -       -        Secondary shared filesystem
Local       /local{1,2,3,4}                        Job only          -       -        High-speed temporary storage
Root        /                                      N/A               -       -        System filesystem (not for jobs)
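
The examples later in this guide use the $PROJECT environment variable; on PSC systems it typically resolves to your project directory under Ocean. A quick sanity check (the path shown in the comment is illustrative):

echo $PROJECT    # e.g. /ocean/projects/<allocation>/<user>
cd $PROJECT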

External Storage

/local{1,2,3,4} Storage

  • Neocortex provides node-local high-speed storage on SDFlex compute nodes.
  • Each node contains four local disks: /local1, /local2, /local3, /local4.
  • Users should always use the $LOCAL environment variable, which automatically points to the optimal disk for the job.

Accessing local storage:

echo $LOCAL
cd $LOCAL

  • $LOCAL is dynamically assigned based on CPU and NUMA affinity for best performance.
  • Local storage is not persistent and is cleared after jobs complete.
  • Best used for:
    • Temporary data
    • High-performance I/O
    • Caching datasets from shared storage (e.g., Ocean)
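
A common pattern (a sketch only; the dataset path mirrors the placeholder used in the batch script further below) is to stage input data from Ocean into $LOCAL at the start of a job and do all training I/O against the local copy:

# Stage a dataset from project storage (Ocean) into node-local storage for fast I/O
cp -r ${PROJECT}/shared/dataset ${LOCAL}/dataset
ls ${LOCAL}/dataset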

Jobs

More information is available at Running Jobs - Neocortex Documentation. For even more detail on standard batch jobs, refer to the Bridges-2 User Guide.

Notes:

  • SDF nodes are full compute nodes; use them for CPU-heavy workloads or Cerebras training jobs.
  • Max running jobs or submission limits are enforced dynamically; Slurm will queue jobs if limits are reached.

Pre-compile your model

  • Reserve a CPU node and run the Cerebras Singularity container:

srun --pty --cpus-per-task=28 --kill-on-bad-exit \
    singularity shell --cleanenv \
    --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT \
    /ocean/neocortex/cerebras/cbcore_latest.sif

  • Neocortex recommends saving this command in a file, such as salloc_node (see the sketch after this list). These are the settings:
    • 1 SDF node: 28 cores, 2 threads per core
    • --bind binds the listed folders so that they are accessible from inside the Singularity shell.
    • The .sif container here is a symlink to the latest version of the container provided by the Cerebras team. Please use ll for more details about this container.
  • From inside the singularity shell, for validation only mode:

    python run.py --mode train --validate_only --model_dir validate

    where -o (equivalent to --model_dir) sets the output directory and --mode selects the mode, i.e. compile_only, validate_only, train, or eval.

  • From inside the singularity shell, for compile only mode:

    python run.py --mode train --compile_only --model_dir compile

  • You can also start an interactive session on SDF nodes with interact
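
A minimal salloc_node wrapper, as referenced in the list above, could simply contain the interactive srun command; the bind paths mirror the earlier example and should be adjusted to your own data:

#!/usr/bin/bash
# salloc_node: reserve one SDF node and open a shell inside the Cerebras container
srun --pty --cpus-per-task=28 --kill-on-bad-exit \
    singularity shell --cleanenv \
    --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT \
    /ocean/neocortex/cerebras/cbcore_latest.sif

Make it executable with chmod +x salloc_node and start a session with ./salloc_node.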

Training your Model

Neocortex recommends using the following wrapper scripts for training your model. The first script sets up the parameters for the container, while the second contains the static Python command that starts the training.

srun_train

#!/usr/bin/bash
srun --gres=cs:cerebras:1 --ntasks=7 --cpus-per-task=14 --kill-on-bad-exit \
    singularity exec \
    --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT \
    /local1/cerebras/cbcore_latest.sif ./run_train "$@"

run_train

#!/usr/bin/bash
python run.py --cs_ip ${CS_IP_ADDR} --mode train "$@"

  • Make sure both files have executable permissions:

    chmod +x srun_train run_train

Now run the following command from your project directory to launch training from scratch:

./srun_train --model_dir OUTPUT_DIR

where --model_dir OUTPUT_DIR sets the location where the training output will be written. If you want to restart from a checkpoint, just reuse the same output directory with this parameter.

--model_dir is the same as using -o

To launch training from pre-compiled artifacts, specify --model_dir with the output directory you used while compiling, compile in this case (refer to the subsection above).

./srun_train --model_dir compile

Batch Jobs

  • Neocortex asks that its users use a modifiable model script, such as the following, for their batch jobs.

neocortex_model.sbatch

#!/usr/bin/bash
#SBATCH --gres=cs:cerebras:1
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=14

newgrp GRANT_ID
cp ${0} slurm-${SLURM_JOB_ID}.sbatch

# This should be the path in which you are storing your own dataset, if applicable.
# For example, ${PROJECT}/shared/dataset (that would point to your shared folder under /ocean/project/GRANT_ID/)
YOUR_DATA_DIR=${LOCAL}/cerebras/data

# This should be the path in which you are storing your own model.
YOUR_MODEL_ROOT_DIR=${PROJECT}/modelzoo/

# This should be the place in which the run.py file is located.
YOUR_ENTRY_SCRIPT_LOCATION=${YOUR_MODEL_ROOT_DIR}/fc_mnist/tf

# These paths are the ones that contain the input dataset and the code files required for your model to run.
BIND_LOCATIONS=/local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,${YOUR_DATA_DIR},${YOUR_MODEL_ROOT_DIR}

CEREBRAS_CONTAINER=/ocean/neocortex/cerebras/cbcore_latest.sif

cd ${YOUR_ENTRY_SCRIPT_LOCATION}

# This should be a single process (1 total number of tasks).
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --validate_only --model_dir validate

# This should be a single process (1 total number of tasks).
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --compile_only --model_dir compile

# This command will use the default guidance used at the top of this file. In this case, 7 tasks.
srun --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --model_dir train --cs_ip ${CS_IP_ADDR}
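
Submit the batch script with sbatch:

sbatch neocortex_model.sbatch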

You can check the status of your submitted job via the squeue command:

squeue -u PSC_USERNAME

Queue specifications

Name   Purpose           Hardware
SDF    Regular compute   850,000 Linear Algebra Compute Cores, 40 GB SRAM on-chip memory