Neocortex SDFlex is the HPE Superdome Flex shared-memory host system associated with PSC Neocortex. It is suitable for memory-heavy CPU workloads, preprocessing, data staging, validation, compilation, and workflows that benefit from very large shared RAM.
The SDFlex system features 32 Intel Xeon Platinum 8280L processors, with 28 cores and 56 threads each, 24 TiB of RAM, 4.5 TB/s aggregate memory bandwidth, and 204.6 TB of aggregate local NVMe storage.
SDFlex also supports Neocortex CS-2 workflows by serving as the host environment for tasks such as preprocessing, validation, compilation, and launching Cerebras jobs.
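As a quick sanity check, the aggregate core and thread counts follow directly from the per-processor figures above. The even split of RAM across processors in this sketch is an assumption made for illustration, not a documented memory layout.

```shell
# Figures quoted above for the SDFlex system.
sockets=32
cores_per_socket=28
threads_per_socket=56
ram_tib=24

total_cores=$(( sockets * cores_per_socket ))
total_threads=$(( sockets * threads_per_socket ))
# 1 TiB = 1024 GiB. Assumes RAM is split evenly across processors.
ram_gib_per_socket=$(( ram_tib * 1024 / sockets ))

echo "total cores:          ${total_cores}"
echo "total threads:        ${total_threads}"
echo "RAM per socket (GiB): ${ram_gib_per_socket}"
```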
Jobs
More information is available at Running Jobs - Neocortex Documentation. For further details on standard batch jobs, refer to the Bridges-2 User Guide.
Notes:
- sdf nodes are full compute nodes; use them for CPU-heavy workloads or Cerebras training jobs.
- Limits on running and submitted jobs are enforced dynamically; Slurm queues jobs once a limit is reached.
Pre-compile your model
- Reserve a CPU node and run the Cerebras Singularity container:
srun --pty --cpus-per-task=28 --kill-on-bad-exit singularity shell --cleanenv --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT /ocean/neocortex/cerebras/cbcore_latest.sif
- Neocortex recommends saving this command in a file, such as salloc_node. These are the settings:
  - 1 sdf node: 28 cores, 2 threads per core
  - --bind binds the listed folders so that they are accessible from inside the Singularity shell.
  - The .sif container here is a symlink to the latest version of the container provided by the Cerebras team. Use ll for more details about this container.
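The comma-separated value passed to --bind can be assembled from a list of directories rather than typed by hand. This is a minimal sketch: join_by_comma is a hypothetical helper, and /path/to/project is a placeholder used only when $PROJECT is unset.

```shell
# join_by_comma: hypothetical helper that joins its arguments with commas,
# which is the list format used with `singularity ... --bind` above.
# The function body runs in a subshell so the IFS change does not leak out.
join_by_comma() (
  IFS=,
  echo "$*"
)

BIND_LOCATIONS=$(join_by_comma \
  /local1/cerebras/data \
  /local2/cerebras/data \
  /local3/cerebras/data \
  /local4/cerebras/data \
  "${PROJECT:-/path/to/project}")

echo "--bind ${BIND_LOCATIONS}"
```

Paths that themselves contain commas would not survive this joining, so the sketch assumes ordinary directory names.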
From inside the Singularity shell, for validation-only mode:
python run.py --mode train --validate_only --model_dir validate
where --model_dir (equivalent to -o) is the output directory and --mode is the mode, i.e. compile_only, validate_only, train, eval.
From inside the Singularity shell, for compile-only mode:
python run.py --mode train --compile_only --model_dir compile
- You can also start an interactive session on SDF nodes with
interact
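The validation and compile steps above can be chained so that compilation only runs when validation succeeds. The sketch below assumes a hypothetical helper named precompile that takes the command prefix as its arguments; inside the Singularity shell that prefix would be python run.py.

```shell
# precompile: hypothetical helper. Its arguments form the command prefix,
# e.g. `precompile python run.py` inside the Singularity shell.
precompile() {
  # Validate first; stop here if validation fails.
  "$@" --mode train --validate_only --model_dir validate || return 1
  # Compile only after a successful validation.
  "$@" --mode train --compile_only --model_dir compile
}

# Dry run: prefix the command with `echo` to print what would be executed.
precompile echo python run.py
```

Passing the prefix as arguments keeps the helper usable both for a dry run and for the real command without editing it.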
Training your Model
Neocortex recommends using the following wrapper scripts for training your model. The first script sets up the parameters for the container, while the second contains the Python command that starts the training.
srun_train
#!/usr/bin/bash
srun --gres=cs:cerebras:1 --ntasks=7 --cpus-per-task=14 --kill-on-bad-exit singularity exec --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT /local1/cerebras/cbcore_latest.sif ./run_train "$@"
run_train
#!/usr/bin/bash
python run.py --cs_ip ${CS_IP_ADDR} --mode train "$@"
Make sure both files have executable permissions:
chmod +x srun_train run_train
Now run the following command from your project directory to launch training from scratch:
./srun_train --model_dir OUTPUT_DIR
Here, --model_dir OUTPUT_DIR is the directory where the training output is written. To restart from a checkpoint, pass the same output directory again.
--model_dir is equivalent to -o.
To launch training from pre-compiled artifacts, set --model_dir to the output directory you used while compiling (compile in this case; refer to the pre-compile subsection above).
./srun_train --model_dir compile
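Both wrapper scripts end in "$@", so any options given to srun_train are passed through run_train to run.py. The sketch below illustrates that forwarding with stand-in functions: fake_srun_train and fake_run_train are hypothetical names, and CS_IP stands in for ${CS_IP_ADDR}. It does not call srun, Singularity, or Python.

```shell
# Stand-in for run_train: prints the command it would run.
fake_run_train() {
  echo "python run.py --cs_ip CS_IP --mode train $*"
}

# Stand-in for srun_train: forwards every argument unchanged.
fake_srun_train() {
  fake_run_train "$@"
}

fake_srun_train --model_dir OUTPUT_DIR
```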
For evaluation purposes:
python run.py --mode eval --model_dir train
Batch Jobs
- Neocortex asks users to base their batch jobs on a modifiable model script:
neocortex_model.sbatch
#!/usr/bin/bash
#SBATCH --gres=cs:cerebras:1
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=14

newgrp GRANT_ID
cp ${0} slurm-${SLURM_JOB_ID}.sbatch

# This should be the path in which you are storing your own dataset, if applicable.
# For example, ${PROJECT}/shared/dataset (that would point to your shared folder under /ocean/projects/GRANT_ID/)
YOUR_DATA_DIR=${LOCAL}/cerebras/data

# This should be the path in which you are storing your own model.
YOUR_MODEL_ROOT_DIR=${PROJECT}/modelzoo/

# This should be the place in which the run.py file is located.
YOUR_ENTRY_SCRIPT_LOCATION=${YOUR_MODEL_ROOT_DIR}/fc_mnist/tf

# These paths are the ones that contain the input dataset and the code files required for your model to run.
BIND_LOCATIONS=/local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,${YOUR_DATA_DIR},${YOUR_MODEL_ROOT_DIR}
CEREBRAS_CONTAINER=/ocean/neocortex/cerebras/cbcore_latest.sif

cd ${YOUR_ENTRY_SCRIPT_LOCATION}

# This should be a single process (1 total number of tasks).
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --validate_only --model_dir validate

# This should be a single process (1 total number of tasks).
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --compile_only --model_dir compile

# This command will use the default guidance used at the top of this file. In this case, 7 tasks.
srun --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --model_dir train --cs_ip ${CS_IP_ADDR}
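A script like this is typically submitted with sbatch. The sketch below captures the job ID at submission time so it can be matched to the slurm-${SLURM_JOB_ID}.sbatch copy the script makes of itself. It assumes the script is saved as neocortex_model.sbatch in the current directory; parse_job_id is a hypothetical helper.

```shell
# parse_job_id: hypothetical helper. `sbatch --parsable` prints either
# "JOBID" or "JOBID;CLUSTER"; keep only the part before the first ';'.
parse_job_id() {
  echo "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
  # Submit the batch script and remember its job ID.
  job_id=$(parse_job_id "$(sbatch --parsable neocortex_model.sbatch)")
  echo "Submitted job ${job_id}; script copy: slurm-${job_id}.sbatch"
else
  echo "sbatch not found; run this on a system with Slurm installed." >&2
fi
```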
You can check the status of your submitted job via the squeue command:
squeue -u PSC_USERNAME
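When several jobs are queued, a per-state count can be easier to read than the full listing. This is a minimal sketch: summarize_states is a hypothetical helper, and it assumes your PSC username matches the name reported by id -un on the system where it runs.

```shell
# summarize_states: hypothetical helper that counts job states read from
# stdin, one state per line (the format produced by `squeue -h -o "%T"`).
summarize_states() {
  sort | uniq -c | awk '{ printf "%s=%d\n", $2, $1 }'
}

if command -v squeue >/dev/null 2>&1; then
  # -h drops the header line; %T prints the job state (e.g. PENDING, RUNNING).
  squeue -h -u "$(id -un)" -o "%T" | summarize_states
fi
```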
Queue specifications
Metrics updated 2026-06-16
| Name | Purpose | CPUs | GPUs | RAM | Jobs (30 days) | Wait Time (30-day trend) | Wall Time (30-day trend) |
|---|---|---|---|---|---|---|---|
| sdf | large-memory compute | 28 Intel Xeon Platinum 8280L | | 204.6 TB aggregate local storage | 7 | 0 hours | |
Storage
File System
| Directory | Path | Quota | Purge | Backup | Notes |
|---|---|---|---|---|---|
| Ocean | /ocean | Allocation Based | | | Primary large-scale shared storage |
| Project | /ocean/projects/<allocation>/<user> | Allocation Based | | | User project storage |
| Jet | /jet | Allocation Based | | | Secondary shared filesystem |
| Local | | Job Only | | | High-speed temporary storage |
| Root | / | N/A | | | System filesystem (not for jobs) |
External Storage
/local{1,2,3,4} Storage
- Neocortex provides node-local high-speed storage on SDFlex compute nodes.
- Each node contains four local disks: /local1, /local2, /local3, /local4.
- Users should always use the $LOCAL environment variable, which automatically points to the optimal disk for the job.
Accessing local storage:
echo $LOCAL
cd $LOCAL
- $LOCAL is dynamically assigned based on CPU and NUMA affinity for best performance.
- Local storage is not persistent and is cleared after jobs complete.
- Best used for:
- Temporary data
- High-performance I/O
- Caching datasets from shared storage (e.g., Ocean)
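Caching a dataset from shared storage could look like the following sketch. stage_to_local is a hypothetical helper, the paths in the usage comment are placeholders, and the fallback to a temporary directory exists only so the function can be tried where $LOCAL is not set.

```shell
# stage_to_local: hypothetical helper that copies a dataset directory into
# node-local scratch and prints the staged path.
# The body runs in a subshell so its variables stay private.
stage_to_local() (
  src=$1
  # Use $LOCAL when set; otherwise fall back to a fresh temporary directory.
  scratch=${LOCAL:-$(mktemp -d)}
  dest="${scratch}/$(basename "$src")"
  mkdir -p "$dest"
  # Copy the directory contents recursively, preserving modes and timestamps.
  cp -Rp "${src}/." "${dest}/"
  echo "$dest"
)

# Usage inside a job (placeholder path):
#   DATA_DIR=$(stage_to_local "${PROJECT}/shared/dataset")
```

Because local storage is cleared after the job completes, any results written there need to be copied back to shared storage before the job ends.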
File Transfer
Use the Neocortex data transfer node for moving files to and from Neocortex. PSC recommends rsync as the preferred method. SFTP is also supported. SCP is available but not recommended for large transfers.
rsync -PaL --chmod u+w /local-path/to/dataset PSC_USERNAME@data.neocortex.psc.edu:/ocean/projects/GRANT_ID/shared/
| Supported Methods | Data Transfer Node | URL |
|---|---|---|
| RSYNC (recommended) | data.neocortex.psc.edu | |
| SFTP | data.neocortex.psc.edu | |
| SCP | data.neocortex.psc.edu | |
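To avoid typos in the destination, the rsync command can be generated from its three variable parts and reviewed before it is run. build_rsync_cmd is a hypothetical helper; it only prints the command and transfers nothing.

```shell
# build_rsync_cmd: hypothetical helper that prints the transfer command.
# Arguments: local source path, PSC username, grant ID.
# Flags: -P keeps partial files and shows progress, -a is archive mode,
# -L copies the files that symlinks point to, --chmod u+w keeps the
# transferred files writable by their owner.
build_rsync_cmd() {
  printf 'rsync -PaL --chmod u+w %s %s\n' \
    "$1" "${2}@data.neocortex.psc.edu:/ocean/projects/${3}/shared/"
}

build_rsync_cmd /local-path/to/dataset PSC_USERNAME GRANT_ID
```

The sketch does no shell quoting, so a source path containing spaces would need to be quoted by hand before the printed command is pasted into a terminal.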
Login to Neocortex SDFlex
Use your PSC credentials to access Neocortex SDFlex. After logging in, check your available allocation groups with:
projects
groups
If you have access to multiple allocations, switch to the correct group before using project storage or submitting jobs:
newgrp GROUPID
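Before switching, it can help to confirm that your account actually belongs to the target group. This is a minimal sketch: in_group is a hypothetical helper, GROUPID is the same placeholder used above, and group names are assumed to contain no spaces.

```shell
# in_group: hypothetical helper; succeeds when the current user belongs to
# the Unix group named in $1. `id -nG` lists group names separated by spaces.
in_group() {
  case " $(id -nG) " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# GROUPID is a placeholder for your allocation's group name.
if in_group "GROUPID"; then
  echo "Member of GROUPID; run: newgrp GROUPID"
else
  echo "Not a member of GROUPID; check the output of 'groups'."
fi
```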