Neocortex CS-2 provides newer and more powerful Cerebras CS-2 systems compared to the original CS-1, designed for larger models and more intensive AI training tasks. The upgraded hardware supports faster experimentation.
Login to Neocortex CS-2
Setting Up Your PSC Account
- When you are granted a PSC account, you will receive an email with your PSC username and a link to set your initial password.
- You can now use SSH with your PSC username and password.
- If you forget your password, you can recover it using the following link: Password Reset
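Once your account is active, logging in is a standard SSH session. A minimal sketch, assuming the usual Neocortex login hostname (`neocortex.psc.edu`) and a placeholder username:

```shell
# Placeholder values -- replace with your own PSC username.
PSC_USERNAME=janedoe
LOGIN_HOST=neocortex.psc.edu   # assumed Neocortex login node

# Print the login command; run it yourself (without the echo) to connect.
# You will be prompted for the PSC password you set earlier.
echo "ssh ${PSC_USERNAME}@${LOGIN_HOST}"
```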
Additional Account Information
- Adding new users to your projects: Accounts - Neocortex Documentation
File Transfer
| Supported Method | Notes | Data Transfer Node |
|---|---|---|
| rsync | Recommended | data.neocortex.psc.edu |
| scp | | data.neocortex.psc.edu |
| sftp | | data.neocortex.psc.edu |
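As a sketch of the recommended rsync method, the following composes a transfer of a local dataset directory to project storage via the data transfer node. `my_dataset`, `GRANT_ID`, and `PSC_USERNAME` are placeholders; substitute your own values.

```shell
# Placeholder paths -- replace with your own directory, allocation, and username.
SRC=./my_dataset/
DEST="PSC_USERNAME@data.neocortex.psc.edu:/ocean/projects/GRANT_ID/PSC_USERNAME/"

# -a preserves permissions and timestamps, -v is verbose, -P shows
# progress and allows resuming interrupted transfers.
# Drop the 'echo' to actually run the transfer.
echo rsync -avP "$SRC" "$DEST"
```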
Storage
File System
| Directory | Path | Quota | Purge | Backup | Notes |
|---|---|---|---|---|---|
| Ocean | /ocean | Allocation based | — | — | Primary large-scale shared storage |
| Project | /ocean/projects/&lt;allocation&gt;/&lt;user&gt; | Allocation based | — | — | User project storage |
| Jet | /jet | Allocation based | — | — | Secondary shared filesystem |
| Local | $LOCAL | Job only | — | — | High-speed temporary storage |
| Root | / | N/A | — | — | System filesystem (not for jobs) |
External Storage
/local{1,2,3,4} Storage
- Neocortex provides node-local high-speed storage on SDFlex compute nodes.
- Each node contains four local disks: /local1, /local2, /local3, /local4.
- Users should always use the $LOCAL environment variable, which automatically points to the optimal disk for the job.
Accessing local storage:
```shell
echo $LOCAL
cd $LOCAL
```
- $LOCAL is dynamically assigned based on CPU and NUMA affinity for best performance.
- Local storage is not persistent and is cleared after jobs complete.
- Best used for:
- Temporary data
- High-performance I/O
- Caching datasets from shared storage (e.g., Ocean)
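A minimal staging sketch along these lines: copy data into $LOCAL at the start of a job and read it from there. Since $LOCAL only exists inside a job, the sketch falls back to /tmp so it runs anywhere; the Ocean path in the comment is a placeholder.

```shell
# $LOCAL is set by the scheduler inside a job; fall back to /tmp
# so this sketch also runs outside a job allocation.
LOCAL="${LOCAL:-/tmp}"
STAGE_DIR="${LOCAL}/dataset"
mkdir -p "$STAGE_DIR"

# Inside a job you would copy from shared storage, e.g. (placeholder path):
#   cp -r /ocean/projects/GRANT_ID/PSC_USERNAME/dataset/. "$STAGE_DIR"
echo "Staging directory ready: $STAGE_DIR"
```

Point your training code at `$STAGE_DIR` instead of the Ocean path; the local copy is discarded when the job ends.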
Jobs
More information is available at Running Jobs - Neocortex Documentation. Need even more information on standard batch jobs? Refer to the Bridges-2 User Guide.
Notes:
- SDF nodes are full compute nodes; use them for CPU-heavy workloads or Cerebras training jobs.
- Max running jobs or submission limits are enforced dynamically; Slurm will queue jobs if limits are reached.
Pre-compile your model
- Reserve a CPU node and run the Cerebras Singularity container:

```shell
srun --pty --cpus-per-task=28 --kill-on-bad-exit \
  singularity shell --cleanenv \
  --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT \
  /ocean/neocortex/cerebras/cbcore_latest.sif
```
- Neocortex recommends saving this command in a file, such as `salloc_node`.
- The settings request 1 SDF node: 28 cores, 2 threads per core.
- `--bind` makes the listed folders accessible from inside the Singularity shell.
- The `.sif` path here is a symlink to the latest version of the container provided by the Cerebras team. Please use `ll` on it for more details about this container.
From inside the Singularity shell, for validate-only mode:

```shell
python run.py --mode train --validate_only --model_dir validate
```

where `-o` (an alias for `--model_dir`) is the output directory and `--mode` selects the mode, i.e. compile_only, validate_only, train, or eval.

From inside the Singularity shell, for compile-only mode:

```shell
python run.py --mode train --compile_only --model_dir compile
```

- You can also start an interactive session on SDF nodes with `interact`.
Training your Model
Neocortex recommends using the following wrapper scripts for training your model. The first script sets up the parameters for the container, while the second contains the static Python command that starts the training.
srun_train

```shell
#!/usr/bin/bash
srun --gres=cs:cerebras:1 --ntasks=7 --cpus-per-task=14 --kill-on-bad-exit \
  singularity exec \
  --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT \
  /local1/cerebras/cbcore_latest.sif ./run_train "$@"
```
run_train

```shell
#!/usr/bin/bash
python run.py --cs_ip ${CS_IP_ADDR} --mode train "$@"
```
Make sure both of them have file executable permissions.
```shell
chmod +x srun_train run_train
```
Now run the following command from your project directory to launch training from scratch:
```shell
./srun_train --model_dir OUTPUT_DIR
```
where OUTPUT_DIR is the location where the training output will be written. To restart from a checkpoint, reuse the same output directory with this parameter.
--model_dir is the same as using -o
To launch training from pre-compiled artifacts, specify --model_dir with the output directory you used while compiling (`compile` in this case; refer to the subsection above).
```shell
./srun_train --model_dir compile
```
Batch Jobs
- Neocortex asks users to adapt the following batch script template for their jobs.
neocortex_model.sbatch
```shell
#!/usr/bin/bash
#SBATCH --gres=cs:cerebras:1
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=14

newgrp GRANT_ID
cp ${0} slurm-${SLURM_JOB_ID}.sbatch

# This should be the path in which you are storing your own dataset, if applicable.
# For example, ${PROJECT}/shared/dataset (that would point to your shared folder
# under /ocean/projects/GRANT_ID/).
YOUR_DATA_DIR=${LOCAL}/cerebras/data

# This should be the path in which you are storing your own model.
YOUR_MODEL_ROOT_DIR=${PROJECT}/modelzoo/

# This should be the place in which the run.py file is located.
YOUR_ENTRY_SCRIPT_LOCATION=${YOUR_MODEL_ROOT_DIR}/fc_mnist/tf

# These paths contain the input dataset and the code files required for your model to run.
BIND_LOCATIONS=/local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,${YOUR_DATA_DIR},${YOUR_MODEL_ROOT_DIR}
CEREBRAS_CONTAINER=/ocean/neocortex/cerebras/cbcore_latest.sif

cd ${YOUR_ENTRY_SCRIPT_LOCATION}

# This should be a single process (1 total number of tasks).
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --validate_only --model_dir validate

# This should be a single process (1 total number of tasks).
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --compile_only --model_dir compile

# This command uses the defaults from the #SBATCH directives at the top of this file: 7 tasks.
srun --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --model_dir train --cs_ip ${CS_IP_ADDR}
```
You can check the status of your submitted job via the squeue command:
```shell
squeue -u PSC_USERNAME
```
Queue specifications
| Name | Purpose | CPUs | GPUs | RAM | Jobs (30 days) | Wait Time (30-day trend) | Wall Time (30-day trend) |
|---|---|---|---|---|---|---|---|
| SDF | regular compute | 850,000 Linear Algebra Compute Cores | 40 GB SRAM on-chip memory | — | — | — | — |