PSC Neocortex CS-2

Neocortex CS-2 provides newer and more powerful Cerebras CS-2 systems compared to the original CS-1 deployment, designed for larger models and more intensive AI training tasks. It supports faster experimentation with higher-performance hardware.


File Transfer

Supported Methods      Data Transfer Node URL
RSYNC (recommended)    data.neocortex.psc.edu
SCP                    data.neocortex.psc.edu
SFTP                   data.neocortex.psc.edu
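
For example, a typical rsync transfer from a local machine to Neocortex project storage might look like the following (PSC_USERNAME, GRANT_ID, and the paths are illustrative placeholders, not fixed values):

rsync -avP ./my_dataset/ PSC_USERNAME@data.neocortex.psc.edu:/ocean/projects/GRANT_ID/PSC_USERNAME/my_dataset/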

Storage

File System

Directory   Path                                   Quota             Purge   Backup   Notes
Ocean       /ocean                                 Allocation based  -       -        Primary large-scale shared storage
Project     /ocean/projects/<allocation>/<user>    Allocation based  -       -        User project storage
Jet         /jet                                   Allocation based  -       -        Secondary shared filesystem
Local       /local{1,2,3,4}                        Job only          -       -        High-speed temporary storage
Root        /                                      N/A               -       -        System filesystem (not for jobs)
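
The examples later in this guide use the $PROJECT environment variable; on PSC systems it typically resolves to your project directory under Ocean. A quick sanity check (the path shown in the comment is illustrative):

echo $PROJECT    # e.g. /ocean/projects/<allocation>/<user>
cd $PROJECT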

External Storage

/local{1,2,3,4} Storage

  • Neocortex provides node-local high-speed storage on SDFlex compute nodes.
  • Each node contains four local disks: /local1, /local2, /local3, /local4.
  • Users should always use the $LOCAL environment variable, which automatically points to the optimal disk for the job.

Accessing local storage:

echo $LOCAL
cd $LOCAL

  • $LOCAL is dynamically assigned based on CPU and NUMA affinity for best performance.
  • Local storage is not persistent and is cleared after jobs complete.
  • Best used for:
    • Temporary data
    • High-performance I/O
    • Caching datasets from shared storage (e.g., Ocean)
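
A common pattern (a sketch only; the dataset path mirrors the placeholder used in the batch script further below) is to stage input data from Ocean into $LOCAL at the start of a job and do all training I/O against the local copy:

# Stage a dataset from project storage (Ocean) into node-local storage for fast I/O
cp -r ${PROJECT}/shared/dataset ${LOCAL}/dataset
ls ${LOCAL}/dataset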

Jobs

More information is available at Running Jobs - Neocortex Documentation. For even more detail on standard batch jobs, refer to the Bridges-2 User Guide.

Notes:

  • SDF nodes are full compute nodes; use them for CPU-heavy workloads or Cerebras training jobs.
  • Max running jobs or submission limits are enforced dynamically; Slurm will queue jobs if limits are reached.

Pre-compile your model

  • Reserve a CPU node and run the Cerebras Singularity container:

srun --pty --cpus-per-task=28 --kill-on-bad-exit \
    singularity shell --cleanenv \
    --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT \
    /ocean/neocortex/cerebras/cbcore_latest.sif

  • Neocortex recommends saving this command in a file, such as salloc_node (see the sketch after this list). These are the settings:
    • 1 SDF node: 28 cores, 2 threads per core
    • --bind binds the listed folders so that they are accessible from inside the Singularity shell.
    • The .sif container here is a symlink to the latest version of the container provided by the Cerebras team. Please use ll for more details about this container.
  • From inside the singularity shell, for validation only mode:

    python run.py --mode train --validate_only --model_dir validate

    where -o (equivalent to --model_dir) sets the output directory and --mode selects the mode, i.e. compile_only, validate_only, train, or eval.

  • From inside the singularity shell, for compile only mode:

    python run.py --mode train --compile_only --model_dir compile

  • You can also start an interactive session on SDF nodes with interact
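
A minimal salloc_node wrapper, as referenced in the list above, could simply contain the interactive srun command; the bind paths mirror the earlier example and should be adjusted to your own data:

#!/usr/bin/bash
# salloc_node: reserve one SDF node and open a shell inside the Cerebras container
srun --pty --cpus-per-task=28 --kill-on-bad-exit \
    singularity shell --cleanenv \
    --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT \
    /ocean/neocortex/cerebras/cbcore_latest.sif

Make it executable with chmod +x salloc_node and start a session with ./salloc_node.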

Training your Model

Neocortex recommends using the following wrapper scripts for training your model. The first script sets up the parameters for the container, while the second contains the static Python command that starts the training.

srun_train

#!/usr/bin/bash
srun --gres=cs:cerebras:1 --ntasks=7 --cpus-per-task=14 --kill-on-bad-exit \
    singularity exec \
    --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT \
    /local1/cerebras/cbcore_latest.sif ./run_train "$@"

run_train

#!/usr/bin/bash
python run.py --cs_ip ${CS_IP_ADDR} --mode train "$@"

  • Make sure both files have executable permissions:

    chmod +x srun_train run_train

Now run the following command from your project directory to launch training from scratch:

./srun_train --model_dir OUTPUT_DIR

where --model_dir OUTPUT_DIR sets the location where the training output will be written. If you want to restart from a checkpoint, just reuse the same output directory with this parameter.

--model_dir is the same as using -o

To launch training from pre-compiled artifacts, specify --model_dir with the output directory you used while compiling, compile in this case (refer to the subsection above).

./srun_train --model_dir compile

Batch Jobs

  • Neocortex asks that its users use a modifiable model script, such as the following, for their batch jobs.

neocortex_model.sbatch

#!/usr/bin/bash
#SBATCH --gres=cs:cerebras:1
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=14

newgrp GRANT_ID
cp ${0} slurm-${SLURM_JOB_ID}.sbatch

# This should be the path in which you are storing your own dataset, if applicable.
# For example, ${PROJECT}/shared/dataset (that would point to your shared folder under /ocean/project/GRANT_ID/)
YOUR_DATA_DIR=${LOCAL}/cerebras/data

# This should be the path in which you are storing your own model.
YOUR_MODEL_ROOT_DIR=${PROJECT}/modelzoo/

# This should be the place in which the run.py file is located.
YOUR_ENTRY_SCRIPT_LOCATION=${YOUR_MODEL_ROOT_DIR}/fc_mnist/tf

# These paths are the ones that contain the input dataset and the code files required for your model to run.
BIND_LOCATIONS=/local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,${YOUR_DATA_DIR},${YOUR_MODEL_ROOT_DIR}

CEREBRAS_CONTAINER=/ocean/neocortex/cerebras/cbcore_latest.sif

cd ${YOUR_ENTRY_SCRIPT_LOCATION}

# This should be a single process (1 total number of tasks).
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --validate_only --model_dir validate

# This should be a single process (1 total number of tasks).
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --compile_only --model_dir compile

# This command will use the default guidance used at the top of this file. In this case, 7 tasks.
srun --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --model_dir train --cs_ip ${CS_IP_ADDR}
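
Submit the batch script with sbatch:

sbatch neocortex_model.sbatch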

You can check the status of your submitted job via the squeue command:

squeue -u PSC_USERNAME

Queue specifications

Name   Purpose           Hardware
SDF    Regular compute   850,000 Linear Algebra Compute Cores, 40 GB SRAM on-chip memory