Neocortex SDFlex

RP account needed

Neocortex SDFlex is the HPE Superdome Flex shared-memory host system associated with PSC Neocortex. It is suitable for memory-heavy CPU workloads, preprocessing, data staging, validation, compilation, and workflows that benefit from very large shared RAM.

The SDFlex system features 32 Intel Xeon Platinum 8280L processors, with 28 cores and 56 threads each, 24 TiB of RAM, 4.5 TB/s aggregate memory bandwidth, and 204.6 TB of aggregate local NVMe storage.
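
As a quick check on these figures, the system-wide totals can be worked out with shell arithmetic. This is only a sketch: the per-processor memory value assumes RAM is spread evenly across the 32 processors, which the description above does not state.

```shell
# Figures taken from the system description above.
sockets=32
cores_per_socket=28
threads_per_core=2      # 56 threads / 28 cores per processor
ram_tib=24

total_cores=$(( sockets * cores_per_socket ))
total_threads=$(( total_cores * threads_per_core ))
# Assumption: RAM is distributed evenly across processors.
ram_per_socket_gib=$(( ram_tib * 1024 / sockets ))

echo "cores=${total_cores} threads=${total_threads} GiB/processor=${ram_per_socket_gib}"
# prints: cores=896 threads=1792 GiB/processor=768
```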

SDFlex also supports Neocortex CS-2 workflows by serving as the host environment for tasks such as preprocessing, validation, compilation, and launching Cerebras jobs.

Jobs

More information is available at Running Jobs - Neocortex Documentation. For more detail on standard batch jobs, refer to the Bridges-2 User Guide.

Notes:

- sdf nodes are full compute nodes; use them for CPU-heavy workloads or Cerebras training jobs.
- Limits on running and submitted jobs are enforced dynamically; Slurm queues jobs when a limit is reached.

Pre-compile your model

Reserve a CPU node and run the Cerebras Singularity container:

srun --pty --cpus-per-task=28 --kill-on-bad-exit singularity shell --cleanenv --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT /ocean/neocortex/cerebras/cbcore_latest.sif

Neocortex recommends saving this in a file, such as salloc_node. These are the settings:
- 1 sdf node: 28 cores, 2 threads per core
- --bind makes the listed folders accessible from inside the Singularity shell.
- The .sif container here is a symlink to the latest version of the container provided by the Cerebras team. Use ll on that path for more details about this container.

From inside the Singularity shell, for validation-only mode:
python run.py --mode train --validate_only --model_dir validate
where --model_dir (the same as -o) is the output directory and --mode selects the run mode, such as train or eval. The --validate_only and --compile_only flags restrict the run to validation or compilation.

From inside the Singularity shell, for compile-only mode:
python run.py --mode train --compile_only --model_dir compile

You can also start an interactive session on SDF nodes with interact.
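
One way to follow the recommendation to keep the reservation command in a file is sketched below. The file name salloc_node comes from the text above; creating it with a heredoc is just one convenient option, not a required method.

```shell
# Write the reservation command from above into a reusable script.
# The quoted 'EOF' keeps $PROJECT unexpanded until the script is run.
cat > salloc_node <<'EOF'
#!/usr/bin/bash
srun --pty --cpus-per-task=28 --kill-on-bad-exit \
  singularity shell --cleanenv \
  --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT \
  /ocean/neocortex/cerebras/cbcore_latest.sif
EOF

# Make the script executable so it can be started as ./salloc_node
chmod +x salloc_node
```

Running ./salloc_node on Neocortex then requests the node and opens the Singularity shell in one step.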

Training your Model

Neocortex recommends using the following wrapper scripts for training your model. The first script sets up the parameters for the container, while the second contains the fixed python command that starts the training.

srun_train

#!/usr/bin/bash
srun --gres=cs:cerebras:1 --ntasks=7 --cpus-per-task=14 --kill-on-bad-exit singularity exec --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT /local1/cerebras/cbcore_latest.sif ./run_train "$@"

run_train

#!/usr/bin/bash
python run.py --cs_ip ${CS_IP_ADDR} --mode train "$@"

Make sure both scripts are executable:
chmod +x srun_train run_train

Now run the following command from your project directory to launch training from scratch:

./srun_train --model_dir OUTPUT_DIR

where --model_dir OUTPUT_DIR is the directory in which the training output is written. To restart from a checkpoint, pass the same output directory with this parameter.

--model_dir is the same as using -o.

To launch training from pre-compiled artifacts, set --model_dir to the output directory you used while compiling (compile in the "Pre-compile your model" subsection):

./srun_train --model_dir compile

For evaluation:
python run.py --mode eval --model_dir train
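
The run.py invocations used in this guide differ only in their flags. As an illustration, a small helper can map a stage name to the flags shown above. The name run_args is hypothetical and is not part of the Cerebras or Neocortex tooling.

```shell
# Hypothetical helper: print the run.py flags for a given stage.
# Flags and default output directories are the ones used in this guide.
run_args() {
  local stage="$1" dir="${2:-}"
  case "$stage" in
    validate) echo "--mode train --validate_only --model_dir ${dir:-validate}" ;;
    compile)  echo "--mode train --compile_only --model_dir ${dir:-compile}" ;;
    train)    echo "--mode train --model_dir ${dir:-train}" ;;
    eval)     echo "--mode eval --model_dir ${dir:-train}" ;;
    *)        echo "unknown stage: $stage" >&2; return 1 ;;
  esac
}

run_args compile   # prints: --mode train --compile_only --model_dir compile
```

Inside the container this could be used as python run.py $(run_args validate), left unquoted so the flags split into separate words. Training on the CS-2 additionally needs --cs_ip, as in run_train.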

Batch Jobs

Neocortex asks users to base their batch jobs on a model script that they modify for their own data and model.

neocortex_model.sbatch

#!/usr/bin/bash
#SBATCH --gres=cs:cerebras:1
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=14

newgrp GRANT_ID
cp ${0} slurm-${SLURM_JOB_ID}.sbatch

# This should be the path in which you are storing your own dataset, if applicable. For example, ${PROJECT}/shared/dataset (that would point to your shared folder under /ocean/projects/GRANT_ID/)
YOUR_DATA_DIR=${LOCAL}/cerebras/data

# This should be the path in which you are storing your own model.
YOUR_MODEL_ROOT_DIR=${PROJECT}/modelzoo/

# This should be the place in which the run.py file is located.
YOUR_ENTRY_SCRIPT_LOCATION=${YOUR_MODEL_ROOT_DIR}/fc_mnist/tf

# These paths are the ones that contain the input dataset and the code files required for your model to run.
BIND_LOCATIONS=/local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,${YOUR_DATA_DIR},${YOUR_MODEL_ROOT_DIR}
CEREBRAS_CONTAINER=/ocean/neocortex/cerebras/cbcore_latest.sif

cd ${YOUR_ENTRY_SCRIPT_LOCATION}

# This should be a single process (1 total number of tasks).
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --validate_only --model_dir validate

# This should be a single process (1 total number of tasks).
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --compile_only --model_dir compile

# This command will use the default guidance used at the top of this file. In this case, 7 tasks.
srun --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --model_dir train --cs_ip ${CS_IP_ADDR}

You can check the status of your submitted job via the squeue command:

squeue -u PSC_USERNAME
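
Batch scripts are normally submitted with Slurm's sbatch, which prints a line of the form "Submitted batch job <id>". The sketch below extracts that ID so squeue can be limited to a single job. Both function names are hypothetical, and the wrapper is only defined here, not called, because it needs a Slurm installation.

```shell
# Hypothetical helper: extract the numeric job ID from sbatch's
# "Submitted batch job <id>" message.
job_id_from_sbatch() {
  printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Hypothetical wrapper: submit a batch script and list only that job.
submit_and_show() {
  local out id
  out=$(sbatch "$1") || return 1
  id=$(job_id_from_sbatch "$out")
  squeue -j "$id"
}

job_id_from_sbatch "Submitted batch job 12345"   # prints: 12345
```

On Neocortex, this would be called as submit_and_show neocortex_model.sbatch.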

Queue specifications

Metrics updated 2026-06-16

Name	Purpose	CPUs	GPUs	RAM	Jobs 30 days	Wait Time 30-day trend	Wall Time 30-day trend
sdf	large-memory compute	28 Intel Xeon Platinum 8280L		204.6 TB aggregate local storage	7	0 hours

Storage

File System

Directory	Path	Quota	Notes
Ocean	/ocean	Allocation Based	Primary large-scale shared storage
Project	/ocean/projects/<allocation>/<user>	Allocation Based	User project storage
Jet	/jet	Allocation Based	Secondary shared filesystem
Local	$LOCAL (/local{1,2,3,4})	Job Only	High-speed temporary storage
Root	/	N/A	System filesystem (not for jobs)

External Storage

/local{1,2,3,4} Storage

Neocortex provides node-local high-speed storage on SDFlex compute nodes.
Each node contains four local disks: /local1, /local2, /local3, /local4.
Users should always use the $LOCAL environment variable, which automatically points to the optimal disk for the job.

Accessing local storage:

echo $LOCAL
cd $LOCAL

$LOCAL is dynamically assigned based on CPU and NUMA affinity for best performance.
Local storage is not persistent and is cleared after jobs complete.
Best used for:
- Temporary data
- High-performance I/O
- Caching datasets from shared storage (e.g., Ocean)
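
A common pattern that follows from these points is to copy a dataset from shared storage into $LOCAL at the start of a job. The sketch below shows one way to do that. The function name stage_to_local is an assumption, and cp is used only to keep the example dependency-free.

```shell
# Hypothetical helper: copy a directory into $LOCAL and print the new path.
# Fails early if $LOCAL is not set (for example, outside a job).
stage_to_local() {
  local src="$1"
  if [ -z "${LOCAL:-}" ]; then
    echo "LOCAL is not set; run this inside a job" >&2
    return 1
  fi
  local dest="${LOCAL}/$(basename "$src")"
  # Copy the directory contents so repeated calls do not nest directories.
  mkdir -p "$dest" && cp -R "$src"/. "$dest"/ || return 1
  echo "$dest"
}
```

In a job script this might be used as DATA_DIR=$(stage_to_local "${PROJECT}/shared/dataset"). Because local storage is cleared after the job completes, any results written under $LOCAL need to be copied back to Ocean before the job ends.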

File Transfer

Use the Neocortex data transfer node for moving files to and from Neocortex. PSC recommends rsync as the preferred method. SFTP is also supported. SCP is available but not recommended for large transfers.

rsync -PaL --chmod u+w /local-path/to/dataset PSC_USERNAME@data.neocortex.psc.edu:/ocean/projects/GRANT_ID/shared/

Supported Methods	Data Transfer Node	URL
RSYNC (recommended)	data.neocortex.psc.edu
SFTP	data.neocortex.psc.edu
SCP	data.neocortex.psc.edu
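
Before a large transfer it can help to preview what rsync would send. The sketch below only prints the command for review rather than running it. The function name rsync_cmd and the optional -n (rsync's dry-run option) are illustrative additions; the remaining options, host, and destination path are those given in this section.

```shell
# Hypothetical helper: print the rsync command for a dataset upload.
# Pass "dry" as the 4th argument to add rsync's -n (dry-run) option.
rsync_cmd() {
  local src="$1" user="$2" grant="$3" mode="${4:-}"
  local opts="-PaL"
  if [ "$mode" = "dry" ]; then
    opts="${opts}n"
  fi
  echo "rsync ${opts} --chmod u+w ${src} ${user}@data.neocortex.psc.edu:/ocean/projects/${grant}/shared/"
}

rsync_cmd /local-path/to/dataset PSC_USERNAME GRANT_ID dry
```

Once the printed command looks right, it can be run without the dry argument to perform the actual transfer.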

Login to Neocortex SDFlex

Use your PSC credentials to access Neocortex SDFlex. After logging in, check your available allocation groups with:

projects
groups

If you have access to multiple allocations, switch to the correct group before using project storage or submitting jobs:

newgrp GROUPID
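
Because newgrp starts a new shell, it can be worth confirming group membership first. A minimal sketch follows, in which in_group is a hypothetical helper and GROUPID stands for your allocation group:

```shell
# Hypothetical helper: succeed if the current user belongs to the given group.
in_group() {
  id -Gn | tr ' ' '\n' | grep -qxF -- "$1"
}

if in_group "GROUPID"; then
  echo "member of GROUPID"
else
  echo "not a member of GROUPID; check the output of 'groups'"
fi
```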

SSH Login

$ ssh PSC_USERNAME@neocortex.psc.edu

Neocortex User Guide - Connecting to the System

ACCESS OnDemand Login

NEOCORTEX OPEN ONDEMAND

How do I use ACCESS OnDemand?