This technical report is intended as a users' guide for the Rice Blue Biou Power7 Cluster. Though most users could benefit from at least some of the content of this report, it is expected that the reader is familiar with moderate to advanced OS and programming topics, including multi-threading, SMP, memory allocation, and debugging.
Because of the combination of what is, to some, an unfamiliar architecture with an InfiniBand-connected cluster of large SMP nodes, porting applications to Blue Biou can be a challenge. Even if a code compiles and runs, it may not be able to take advantage of the large number of processors and threads, or of the large memory space, on the Biou nodes. Unless the code is very well understood, it may be necessary to experiment with different settings to find one that takes advantage of the Power7's multithreading. With small changes to job submission scripts, and in some cases a simple recompile with new library flags, the performance of the average code can be increased significantly.
SMT Intro
The first target for optimization is processor/thread affinity. Incorrect use of the Power7's multithreading can have the largest detrimental effect on application performance, so the largest gains may be had here.
Each compute node in the Blue Biou cluster is composed of four Power7 chips. When we refer to sockets, we are talking about an entire Power7 chip. Each socket contains eight cores, and each core supports up to four simultaneous multithreading (SMT) threads. The threads show up in the operating system as individual processors. Each socket connects to its own region of system RAM; on a node with 256GB, each region is 64GB. Cores on a socket share a level 3 (L3) cache and a memory controller that accesses the socket's RAM region. Threads on a core share its execution units, registers, cache, and VSX unit (double-precision vector processor).
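This layout can be verified with standard Linux tools (nothing Biou-specific is assumed here); note that the number of online processors reported will depend on the node's current SMT mode:

# Sockets, cores per socket, threads per core, and NUMA layout
lscpu
# Memory attached to each NUMA region (one region per socket)
numactl --hardware
# Count of logical processors currently online
grep -c '^processor' /proc/cpuinfo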
The SMT feature of the cores can be modified in cases where an application does not gain an advantage from having multiple threads per core. Highly CPU-bound and highly memory-bound applications are examples of codes that may not benefit from SMT. It is worth testing the different modes to determine which best suits your particular application. The mode with all four threads active is SMT=4 or SMT=on. An intermediate mode is available in which only two threads are enabled on each core, SMT=2. SMT=off or SMT=1 corresponds to one active thread per core. The per-thread performance difference between SMT modes can be pronounced in an idealized application: single threads in SMT=4 mode typically perform at 45% of the speed of the single thread per core in SMT=1 mode, while single threads in SMT=2 mode can perform at 75% of the speed of an SMT=1 thread.
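For reference, the current SMT mode can be inspected with the ppc64_cpu utility from the powerpc-utils package, assuming it is installed on the node; changing the mode requires root privileges, which on Blue Biou is handled for you by the scheduler directives described later in this report:

# Display the current SMT setting (4, 2, or off)
ppc64_cpu --smt
# Changing the mode is root-only; on Blue Biou use the #PBS -T directives
# described below rather than running this by hand:
# ppc64_cpu --smt=2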
Controlling Processor Affinity on Blue Biou
The take-home point of the above SMT discussion is that it can be quite important which particular threads your application utilizes. By default, the Linux kernel assigns tasks to idle "processors" seemingly at random. Since it is NUMA aware, it will try to keep running processes in their own memory region to avoid remote memory accesses where possible, but the task-to-thread assignment is otherwise arbitrary. For example, an OpenMPI run with np=32 will not seek to place one task per core; in many cases it will utilize multiple threads on some cores and skip other cores entirely. We recommend one of three possible methods for creating a task-to-CPU mapping, typically referred to as CPU affinity. For OpenMPI programs, a rank file sets an explicit task-to-thread affinity that is the most specific of the three methods, and also the most complex to implement. A slightly more general method utilizes task sets: pools of threads that force the processes of a program to run on a restricted subset of threads, according to a bitmask you supply. For programs that scale well at 32 or 64 threads (or perhaps slightly fewer), setting the per-node SMT mode is the easiest of the three methods to implement.
Setting SMT Modes on Blue Biou
On Blue Biou, we use prologue and epilogue scripts within the Torque scheduler to enable users to control SMT settings on a per-job basis.
Please note that any job that changes the SMT settings should be run with the SINGLEJOB node access policy within the job submission script:
#PBS -W x=NACCESSPOLICY:SINGLEJOB
To set the node to SMT=2 mode, use the following directive in the job submission script:
#PBS -T set_ppc64_smt2
To set the node to SMT=1 mode, use the following directive in the job submission script:
#PBS -T set_ppc64_smt1
Each of these prologue scripts has an associated epilogue script that returns the node to SMT=4 when your job finishes, even if the job terminates abnormally.
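As a sketch of how the pieces fit together, a minimal job script for an SMT=1 run might look like the following; the resource request, walltime, and application name are illustrative placeholders rather than recommended values:

#!/bin/bash
#PBS -N smt1_example
#PBS -l nodes=1:ppn=32,walltime=01:00:00
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -T set_ppc64_smt1

cd $PBS_O_WORKDIR
# With SMT=1, only the first thread of each of the 32 cores is online,
# so one MPI task per core falls out naturally.
mpirun -np 32 ./my_mpi_application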
OpenMPI Rank Files
In lieu of SMT settings, there is a method for setting CPU affinity on a per-MPI-task basis using a rank file. Essentially, you run in the default SMT=4 mode and assign consecutive MPI tasks to the first thread on every core (up to 32 cores, counting by four). Functionally speaking, an MPI task of rank i gets assigned to cpu slot i*4. In the case where you want to emulate SMT=2 behavior, you must assign tasks to the first two threads on each core, e.g. slots 0, 1, 4, 5, …, i.e. slot 2i - (i mod 2) for rank i. OpenMPI rank files are discussed in the OpenMPI FAQ here: http://www.open-mpi.de/faq/?category=tuning#using-paffinity
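As a concrete sketch, the fragment below writes a rank file that pins four MPI tasks to the first thread of the first four cores (cpu slots 0, 4, 8, and 12) and passes it to mpirun; the host name biou-node001 and the application name are placeholders, and the slot syntax follows the OpenMPI FAQ linked above:

# Write a rank file pinning rank i to cpu slot i*4 (SMT=1-style placement)
cat > my_rankfile <<EOF
rank 0=biou-node001 slot=0
rank 1=biou-node001 slot=4
rank 2=biou-node001 slot=8
rank 3=biou-node001 slot=12
EOF

# -rf (--rankfile) tells mpirun to apply the mapping in the rank file
mpirun -np 4 -rf my_rankfile ./my_mpi_application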
Task Sets
The taskset utility is a basic Unix utility that allows a user to retrieve or set a process's CPU affinity. It can be used to change the affinity of a running process, but it is more useful to us in its default mode, where you give it the command line and arguments to run with a certain affinity mask. The affinity mask is a hexadecimal bitmask representing which processors (with processor 0 in the least significant bit) should be enabled for the process. On Biou, each core is conveniently represented by an individual hexadecimal digit, and the entire mask should be 32 digits long. Useful per-core digit values, corresponding to our SMT modes, are 1 (SMT=1), 3 (SMT=2, threads 0 and 1), and f (SMT=4). An example of using taskset for a multi-threaded program follows:
taskset 11111111111111111111111111111111 threaded_application --threads=32
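To emulate SMT=2 placement instead, substitute the per-core digit 3 throughout the mask; as before, the application name and its --threads flag are placeholders:

# Restrict the run to the first two threads of every core (digit 3 per core)
taskset 33333333333333333333333333333333 threaded_application --threads=64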