CHSI Cluster User Guide

CHSI Resources on the DCC

Basic Usage

Configure SSH Key-Based Authentication

Storing data on the cluster

Running TensorFlow in Jupyter on the DCC

CHSI Resources on the DCC

The Duke Compute Cluster consists of machines that the University has provided for community use and that researchers have purchased to conduct their research. The CHSI has high-priority access to 31 servers on the Duke Compute Cluster, comprising 8 virtual nodes configured for GPU computation (RTX 2080 Ti GPU, 1 CPU core [2 threads], 16 GB RAM) and 23 non-GPU virtual nodes for CPU and/or memory intensive tasks (42 CPU cores [84 threads], 700 GB RAM). All nodes utilize Intel Xeon Gold 6252 CPUs @ 2.10GHz.

The operating system and software installation and configuration are standard across all nodes (barring license restrictions); the current operating system is Red Hat Enterprise Linux 7. SLURM is the scheduler for the entire system. Software is managed with “module” and, increasingly, through Singularity containers, which package an entire software environment and greatly improve reproducibility.
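As a quick illustration of the “module” workflow, the commands below list the available modules and load one. The module name shown (“R”) is only an example; run module avail on a login node to see what is actually installed.

module avail        # list software modules available on the DCC
module load R       # load an example module (the name is illustrative)
module list         # show the modules currently loaded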

Below are a few useful resources for getting started using the Duke Compute Cluster (DCC):

DCC User Guide
https://rc.duke.edu/dcc/dcc-user-guide/

DCC General Help
https://oit-rc.pages.oit.duke.edu/rcsupportdocs/dcc/

DCC Open OnDemand portal
https://oit-rc.pages.oit.duke.edu/rcsupportdocs/OpenOnDemand/

Register for an upcoming DCC training session
https://rc.duke.edu/dcc-training/

Dr. Granek's guide to running RStudio / Jupyter using Singularity
https://github.com/duke-chsi-informatics/singularity-rnaseq

Basic Usage

Connecting to the DCC

To connect to the cluster, open a Secure Shell (ssh) session from a terminal using your Duke NetID (replace "NetID" below with your NetID).

ssh NetID@dcc-login.oit.duke.edu

After running the command you will be prompted for your password. If you are on campus (University or DUHS networks) or connected to the Duke network via the Duke VPN, you will be connected to a login node. If you are off campus and not connected via the Duke VPN, logging in requires two-factor authentication, and you will also be prompted for a second password: your Duo passcode.

Running Jobs on CHSI Nodes

To run commands on the compute-intensive nodes, include the command line options -A chsi -p chsi to specify the "chsi" account (-A chsi) and the "chsi" partition (-p chsi). For example:

srun -A chsi -p chsi --pty bash -i

To run commands on GPU nodes, include the command line options -A chsi -p chsi-gpu --gres=gpu:1 --mem=14595 -c 2, where

  • -A chsi specifies the "chsi" account
  • -p chsi-gpu specifies the "chsi-gpu" partition (the nodes that have GPUs)
  • --gres=gpu:1 requests a GPU
  • --mem=14595 -c 2 requests all available memory and CPUs on the node, since only one job can run on a GPU node at a time

For example:

srun -A chsi -p chsi-gpu --gres=gpu:1 --mem=14595 -c 2 --pty bash -i
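If you would rather submit work as a batch job than start an interactive shell, a minimal sbatch script along the following lines should work. The job name, CPU, memory, and time values, and the command at the end, are placeholders to adjust for your own workload.

#!/bin/bash
# Minimal sketch of a batch script for the "chsi" account and partition.
# The job name and resource values below are placeholders.
#SBATCH -A chsi
#SBATCH -p chsi
#SBATCH --job-name=myjob
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00

# Replace this with the command(s) you actually want to run
hostname

Save the script (for example as myjob.sh, a placeholder name) and submit it with sbatch myjob.sh.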

Useful SLURM Commands

Show the specs of all nodes in "chsi" partition:

sinfo -o "%17N %10T %13C %10m" -p chsi -N

Show the specs of all nodes in "chsi-gpu" partition:

sinfo -o "%17N %10T %13C %10m %10G" -p chsi-gpu -N

Show allocated and total memory in "chsi-gpu" partition:

scontrol -o show nodes=dcc-chsi-gpu-0[1-8] | awk '{ print $1, $23, $24}'

Show allocated and total memory in "chsi" partition:

scontrol -o show nodes=dcc-chsi-[0-2][0-9] | awk '{ print $1, $23, $24}'
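A few other standard SLURM commands are useful for keeping track of your own work (replace "NetID" with your NetID and "JOBID" with the job ID reported by squeue):

squeue -u NetID     # list your running and pending jobs
scancel JOBID       # cancel a job
sacct -j JOBID      # show accounting information for a job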

Configure SSH Key-Based Authentication

Using SSH keys for authentication avoids the need for multi-factor authentication (MFA) and for entering a password to log in to the DCC. These instructions work out of the box if your local computer runs macOS or Linux. On Windows 10 you may need to install an SSH client.

  1. If you don’t already have an SSH key, make one (see the example command just after this list): https://docs.gitlab.com/ee/ssh/README.html#generating-a-new-ssh-key-pair
  2. Add your ssh key to DCC:
    1. Go to https://idms-web.oit.duke.edu/portal
    2. Click on “Advanced User Options”, then “Update your SSH public keys”
    3. Paste in your SSH key and click “Save Changes”
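If you need to generate a key for step 1, a typical invocation on macOS or Linux looks like the following (accept the default file location when prompted; the email address is just a label for the key):

ssh-keygen -t ed25519 -C "NetID@duke.edu"
cat ~/.ssh/id_ed25519.pub    # this is the public key to paste into the portal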

Once this is set up, you can SSH to the DCC without multi-factor authentication, without the VPN, and without even entering a password!

Set up SSH shortcuts

Add the following to .ssh/config in your home directory (or create the file if it doesn’t already exist) using a text editor. Substitute your NetID for “NETID”.

Host dcc
    HostName dcc-login.oit.duke.edu
    User NETID
Host dcc1
    HostName dcc-login-01.oit.duke.edu
    User NETID
Host dcc2
    HostName dcc-login-02.oit.duke.edu
    User NETID
Host dcc3
    HostName dcc-login-03.oit.duke.edu
    User NETID

Once this is set up you can ssh to the DCC login nodes with the command “ssh dcc” (or to a specific DCC login node with “ssh dcc1”, etc).
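These shortcuts also work with file transfer tools that run over SSH, such as scp and rsync. For example, to copy files into your group storage directory (described below), where myresults.csv and mydata/ are placeholder names and NETID is again your NetID:

scp myresults.csv dcc:/hpc/group/chsi/NETID/
rsync -av mydata/ dcc:/hpc/group/chsi/NETID/mydata/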

Storing data on the cluster

There are several different storage options on the DCC. Most are discussed at https://rc.duke.edu/dcc/cluster-storage/, but there are a few CHSI-specific details below. Please read this and the cluster storage information carefully and be mindful of how you use storage on the cluster.

Shared Scratch Space

The Shared Scratch Space is mounted at /work. To use the shared scratch space, make a subdirectory of /work with your NetID and store your files in that sub-directory. For example, if your NetID was "jdoe28" you would use the command mkdir /work/jdoe28

Group Storage

CHSI has 1 TB of storage at /hpc/group/chsi. While 1 TB seems like a lot, it fills up fast, so please be mindful of how you use this space. Group storage is NOT appropriate for long term storage of large datasets. To use the group storage, make a subdirectory of /hpc/group/chsi with your NetID and store your files in that sub-directory. For example, if your NetID was "jdoe28" you would use the command mkdir /hpc/group/chsi/jdoe28

Local Scratch

Each of the nodes in the chsi partition has an 8 TB SSD mounted at /scratch. This is in addition to (and different from) the Shared Scratch Space that is at /work. Because /scratch is local to the node, it is potentially faster than the DCC shared storage (Group Storage, Home Directory, and Shared Scratch). However, because /scratch is local to a node, anything stored there is only available on that node. In other words, if you run a job on dcc-chsi-01 and save output to /scratch, it will not be accessible from dcc-chsi-02. As with Shared Scratch, /scratch is not backed up and files are automatically deleted after 75 days.
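A common pattern is to stage input data onto /scratch at the start of a job, run the analysis there, and copy the results back to group storage at the end. A rough sketch (the file and directory names are placeholders) looks like this:

# run these inside a job on a chsi node
mkdir -p /scratch/$USER/myjob
cp /hpc/group/chsi/$USER/input.dat /scratch/$USER/myjob/
cd /scratch/$USER/myjob
# ... run your analysis here ...
cp results.out /hpc/group/chsi/$USER/
rm -rf /scratch/$USER/myjob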

Archival Storage

Currently the best archival storage option for CHSI users is Duke Data Service (DDS), which offers free, unlimited storage. It is not mounted on the DCC, but there is a command line tool for moving data to and from DDS. DDS is also a convenient way to move data around campus.
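As a sketch of what that can look like, assuming the DukeDSClient tool (which provides the ddsclient command) is installed and "my-project" is a placeholder project name; check ddsclient --help for the current subcommands and options:

ddsclient upload -p my-project mydata/     # copy a local folder up to the DDS project
ddsclient download -p my-project           # copy the project contents back down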

None of the other options discussed above are appropriate for archival storage. Local and Shared Scratch are for short-term storage during computation. Our group storage at /hpc/group/chsi/ is limited to 1 TB, which fills up quickly. It is possible to purchase archival storage on Data Commons, but we do not currently have plans to do this.

How Much Space Am I Using?

The du command tells you how much space is being used by a directory and its sub-directories. The following command will show the usage of jdoe28's sub-directory on the group storage and each of its sub-directories:

du --all --human-readable --max-depth 1 /hpc/group/chsi/jdoe28

The following will tell you how much space is used and available on the group storage:

df -h | egrep 'chsi|Filesystem'

Running TensorFlow in Jupyter on the DCC

  1. From your computer run this to connect to DCC:

    ssh NetID@dcc-login-03.oit.duke.edu

  2. Once you are connected run this to start a tmux session:

    tmux new -s jupyter

  3. Once you have started a tmux session you can start up Jupyter with this command:

    srun -A chsi -p chsi --mem=20G --cpus-per-task=10 singularity run docker://jupyter/tensorflow-notebook

    Running this command will take a while and will print a lot of output. You can ignore everything except the last two lines, which will say something like:

    http://dcc-chsi-01:8889/?token=08172007896ad29bb5fbd92f6f3f516a8b2f7303ed7f1df3 or http://127.0.0.1:8889/?token=08172007896ad29bb5fbd92f6f3f516a8b2f7303ed7f1df3

    You need this information for the next few steps. For the next step you need the “dcc-chsi-01:8889” part.
    “dcc-chsi-01” is the compute node that Jupyter is running on and “8889” is the port it is listening on. You may get different values each time you start the container.

  4. Run the following command in another terminal on your computer to set up port forwarding.

    ssh -L PORT:NODE.rc.duke.edu:PORT NetID@dcc-login-03.oit.duke.edu

    In this command, replace “PORT” with the port that the srun command printed and replace “NODE” with the compute node it printed. So for the example above, the SSH port forwarding command would be:

    ssh -L 8889:dcc-chsi-01.rc.duke.edu:8889 NetID@dcc-login-03.oit.duke.edu

  5. Now you can paste the last line that the srun command printed (the http://127.0.0.1 URL) into your web browser, and it should open your Jupyter instance running on the DCC.

A few notes:

  1. The Jupyter session keeps running until you explicitly shut it down. If the port forwarding SSH connection drops, you will need to restart SSH with the same command, but you don’t need to restart Jupyter.
  2. To explicitly shut down Jupyter, press Control-C twice in the terminal where you started Jupyter. If that connection has dropped, you can reattach to the tmux session with:

    ssh NetID@dcc-login-03.oit.duke.edu tmux a -t jupyter

  3. If you need more memory or more CPUs, you can change the values for --mem or --cpus-per-task in the “srun” command.
  4. All the CHSI GPUs are on their own nodes (the chsi-gpu partition), so if you want to use a GPU, the command is slightly different:

    srun -A chsi -p chsi-gpu --gres=gpu:1 --mem=14595 --cpus-per-task 2 singularity run --nv docker://jupyter/tensorflow-notebook

    The GPU nodes have limited memory and CPUs, so these values are the maximums.