The goal of the research in my group is to influence the architecture of future-generation microprocessors and GPUs that will be in the marketplace five to ten years from now. Detailed descriptions of past and ongoing research projects are listed below.
However, note this page is chronically out of date. Please look at my publications page for the most up-to-date information on what my research group is doing.
Funding from the following companies and organizations is greatly appreciated (in alphabetical order):
Summary of Recent and Ongoing Research
High level descriptions of recent research undertaken by my group are provided below.
Before describing individual projects, it may be useful to first understand the methodology employed in the type of research done by my group; a short summary appears below.
Since it is possible today to build a single chip with over a billion transistors, influencing future architectures means evaluating hardware designs that correspond to chips with tens of billions of transistors. Following the practice of processor architects in industry at companies such as Intel, AMD, IBM, Sun, and NVIDIA, we typically do this by modeling chips at a higher level of abstraction than the hardware design approaches taught in undergraduate electrical and computer engineering curricula. For example, we develop C++ based discrete event simulators that predict the number of clock cycles it takes to run a software application on a novel processor architecture. Specifically, over the past few years my group has developed GPGPU-Sim, the first detailed academic timing simulator for GPU computing architectures. Note that clock frequency is not the same thing as performance: modern microprocessors overlap many operations in parallel to improve performance.

The accuracy of these simulators is sufficient to predict important trends so that architecture design decisions affecting the entire chip can be made early in the design process. This is enough to decide, for example, whether adding a cache will help performance (and how much it will help). I use a cache as an example only because many people have heard the term and have at least a vague notion of what a cache does. Many other features of the architecture and microarchitecture of modern microprocessors and GPUs determine the performance that can be obtained with a given semiconductor process technology and a particular set of circuit design techniques.
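To make the methodology concrete, here is a minimal sketch of the discrete-event simulation style such timing simulators use. This is an illustrative toy, not GPGPU-Sim's actual code, and the event/class names and the 200-cycle miss latency are assumptions for the example:

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// One timestamped event: a callback scheduled to fire at a given cycle.
struct Event {
    uint64_t cycle;
    std::function<void()> action;
    bool operator>(const Event& other) const { return cycle > other.cycle; }
};

// A minimal discrete-event engine: events execute in cycle order, and an
// event may schedule further events (e.g. a cache miss scheduling the
// eventual arrival of data from DRAM). The simulator predicts elapsed
// cycles without modeling gate-level hardware detail.
class Simulator {
public:
    void schedule(uint64_t cycle, std::function<void()> action) {
        events_.push({cycle, std::move(action)});
    }
    void run() {
        while (!events_.empty()) {
            Event e = events_.top();
            events_.pop();
            now_ = e.cycle;  // advance simulated time to this event
            e.action();
        }
    }
    uint64_t now() const { return now_; }

private:
    uint64_t now_ = 0;
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> events_;
};
```

A model of, say, a load that misses in the cache schedules the issue of the load at one cycle and the return of its data some assumed latency later; the engine's clock jumps directly between interesting events rather than ticking every cycle.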
Graphics Processor (GPU) Architecture for non-graphics applications
Recently there has been much interest in using graphics processors (GPUs) for accelerating non-graphics applications.
My group has explored several approaches to enabling a wider range of applications to benefit from GPUs or otherwise enhancing GPUs as described below:
We introduced dynamic warp formation (see: C.5, J.2), a mechanism that improves performance for compute shaders that make use of branch instructions. Recent GPUs use a form of instruction execution called single instruction, multiple thread (SIMT), in which a single instruction is executed on several function units in parallel. Here the term "warp" refers to the collection of individual threads executing in lock step on these function units (the concept is very similar to SIMD, but not exactly the same). When a branch is encountered and threads within a warp take different paths, some function units are "masked off", leading to inefficiency. Dynamic warp formation exploits fine-grained multithreading to regroup threads from these "diverged" warps into new warps that better utilize the function units. Here is a video of a talk I gave at Microsoft Research describing this work. Wilson Fung recently presented a follow-up paper on a much improved version of dynamic warp formation, called thread block compaction, at HPCA 2011.
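The core idea can be sketched in a few lines. The toy below uses 4-wide warps (real GPUs use 32) and ignores the lane-alignment constraint a real design must respect (each thread staying in its home lane); the names and structure are illustrative, not the paper's actual mechanism:

```cpp
#include <bitset>
#include <vector>

constexpr int WARP_SIZE = 4;  // small width for illustration; real GPUs use 32
using Mask = std::bitset<WARP_SIZE>;

// A warp: a program counter plus a mask of which lanes are active.
struct Warp { int pc; Mask active; };

// SIMT lane utilization: fraction of function units doing useful work.
double utilization(const std::vector<Warp>& warps) {
    int active = 0;
    for (const auto& w : warps) active += static_cast<int>(w.active.count());
    return static_cast<double>(active) / (warps.size() * WARP_SIZE);
}

// Dynamic warp formation (simplified): gather all threads that branched to
// the same PC, across warps, and repack them into as few warps as possible.
std::vector<Warp> form_warps(int pc, const std::vector<Mask>& diverged) {
    int lanes = 0;
    for (const auto& m : diverged) lanes += static_cast<int>(m.count());
    std::vector<Warp> out;
    while (lanes > 0) {
        Mask m;
        for (int i = 0; i < WARP_SIZE && lanes > 0; ++i, --lanes) m.set(i);
        out.push_back({pc, m});
    }
    return out;
}
```

For example, two warps that each have only two active threads after a branch (50% utilization) can be repacked into a single fully occupied warp (100% utilization).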
On-Chip Interconnection Networks
In collaboration with John Kim at KAIST, we have also explored how best to design the interconnection network of GPUs, and Ali Bakhoda will be presenting some of our work at MICRO-43 (see C.14, C.13). In this work we proposed a novel interconnect with a checkerboard pattern that takes advantage of the observed many-to-few traffic pattern in GPU compute workloads to deliver higher overall throughput per unit area.
Memory Access Scheduling
In 2009 we presented work exploring the impact of the DRAM memory system on GPU performance (see C.10). Reading and writing data to/from DRAM involves greater overheads than reading/writing SRAM. The principal concern is the need to transfer data from the DRAM cells into a "row buffer" before reading or writing, and back again afterwards. These overheads can be reduced by scheduling DRAM read/write operations. As Moore's Law continues and the number of transistors grows much faster than off-chip pin bandwidth, performance is becoming increasingly limited by effective memory bandwidth. This paper explores how the on-chip interconnection network makes effective use of off-chip bandwidth more challenging, and proposes modifying the on-chip interconnect to avoid having to increase the complexity (area, cycle time) of the memory controller scheduler as more "cores" are integrated on a single chip.
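To illustrate why scheduling matters, the sketch below compares strict in-order service against a row-hit-first policy (a simplified version of the well-known FR-FCFS idea, not the paper's specific scheduler). The timing constants are assumed for illustration and do not correspond to any particular DRAM part:

```cpp
#include <cstddef>
#include <deque>
#include <optional>

// Assumed illustrative timings (cycles), not from any specific DRAM part.
constexpr int T_CAS = 4;      // column access to an already-open row
constexpr int T_RP_RCD = 20;  // precharge + activate to open a new row

struct Request { int row; };

// Strict in-order (FCFS) service: every row switch pays the full penalty.
int serve_fcfs(std::deque<Request> q) {
    std::optional<int> open_row;
    int cycles = 0;
    for (const auto& r : q) {
        bool hit = open_row && r.row == *open_row;
        cycles += hit ? T_CAS : T_RP_RCD + T_CAS;
        open_row = r.row;
    }
    return cycles;
}

// Row-hit-first service: prefer the oldest request that hits the open row,
// batching accesses to the same row to amortize the open/close overhead.
int serve_row_hit_first(std::deque<Request> q) {
    std::optional<int> open_row;
    int cycles = 0;
    while (!q.empty()) {
        std::size_t pick = 0;
        bool hit = false;
        for (std::size_t i = 0; i < q.size(); ++i) {
            if (open_row && q[i].row == *open_row) { pick = i; hit = true; break; }
        }
        cycles += hit ? T_CAS : T_RP_RCD + T_CAS;
        open_row = q[pick].row;
        q.erase(q.begin() + static_cast<std::ptrdiff_t>(pick));
    }
    return cycles;
}
```

A request stream that alternates between two rows, as interleaved traffic from many cores tends to do, pays the row-open penalty on every access in order, but the reordering scheduler serves each row's requests back to back.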
Complex data structure support on Graphics Processor Units
This research focuses on optimizations for reducing the off-chip bandwidth bottleneck (financial support from AMD). More details soon!
Transactional Memory for GPU Architectures
The goal of this work is to make it easier to develop applications on GPUs by supporting improved programming models. A key focus of our efforts is exploration of how to apply the transactional memory programming model to GPUs. Our first paper on this research was presented in Porto Alegre, Brazil in December 2011 (details here). This work was selected as one of 12 top computer architecture papers in 2011 by IEEE Micro.
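To convey the programming model (not our hardware design), here is a toy value-based software transaction: the program reads and writes through a transaction object, and the commit succeeds only if every location read still holds the value observed. All names here are hypothetical and the sketch is single-threaded for clarity:

```cpp
#include <functional>
#include <unordered_map>

struct Memory { std::unordered_map<int, int> words; };

// A toy transaction with value-based conflict detection (illustrative only).
struct Tx {
    Memory& mem;
    std::unordered_map<int, int> read_set;   // addr -> value observed
    std::unordered_map<int, int> write_set;  // addr -> value to commit

    int read(int addr) {
        if (write_set.count(addr)) return write_set[addr];  // read-own-write
        int v = mem.words[addr];
        read_set.emplace(addr, v);
        return v;
    }
    void write(int addr, int value) { write_set[addr] = value; }

    // Commit only if every location read still holds the value we saw;
    // otherwise abort so the caller can retry.
    bool commit() {
        for (auto& [addr, v] : read_set)
            if (mem.words[addr] != v) return false;
        for (auto& [addr, v] : write_set) mem.words[addr] = v;
        return true;
    }
};

// Retry loop: roughly what an "atomic { ... }" construct expands to.
void atomic_do(Memory& mem, const std::function<void(Tx&)>& body) {
    for (;;) {
        Tx tx{mem};
        body(tx);
        if (tx.commit()) return;
    }
}
```

The appeal for GPUs is that thousands of threads can speculate in parallel and only conflicting transactions pay a cost, rather than the programmer reasoning about fine-grained locks across thousands of threads.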
This project is supported via an NSERC Strategic Projects grant with supporting organizations Intel, AMD, and NVIDIA and research collaborators at the University of Toronto.
You may also want to read about our simulator, GPGPU-Sim. GPGPU-Sim runs unmodified CUDA (and OpenCL) applications by emulating NVIDIA's PTX virtual instruction set. The following paper introduces this version of GPGPU-Sim and describes an analysis of several CUDA applications. Our group will be presenting another paper on a visualizer tool, AerialVision, that enables researchers to visualize the dynamics of CUDA applications running on GPGPU-Sim.
We have explored heterogeneous multicore processor architectures, including one called Pangaea, a novel heterogeneous architecture developed with researchers at Intel's Microarchitecture Research Lab. Pangaea integrates production-quality RTL of a GPU and an x86 CPU on a single FPGA and demonstrates that the latency with which the CPU can start a thread on the GPU can be reduced to tens of cycles.
We followed this work up with a limit study exploring the impact on performance of integrating a CPU and a GPU into a single chip. This study uncovered counter-intuitive results in that the performance gains from integration were fairly modest (compared with what we were expecting). The intuition seems to be that real programs either have enough parallelism to hide the latency of switching between CPU and GPU, or they are too serial so that such switching is not particularly beneficial. Since this was a limit study, we feel the topic deserves more research.
Analytical Modeling of Microprocessor Performance
Another area of research is analytical modeling of processor architectures. In 2008 we published work on accurately accounting for memory level parallelism and data prefetching effects in superscalar out-of-order microprocessors (see C.7) and followed this up with a technique for accurately modeling the throughput of fine-grained multithreaded architectures such as Sun Microsystem's Niagara processor (see C.8).
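As a flavor of what analytical modeling means here, consider a generic first-order CPI model (this is a textbook-style sketch, not the specific models from the cited papers): long-latency misses add stall cycles, but overlapping misses, i.e. memory-level parallelism (MLP), amortize the latency.

```cpp
// Generic first-order CPI model (illustrative, not the cited papers' model):
//   CPI = base CPI + (misses per instruction) * (miss latency) / (average MLP)
// Overlapped misses divide the effective penalty by the degree of overlap.
double estimate_cpi(double base_cpi,
                    double misses_per_instr,
                    double miss_latency,
                    double avg_mlp) {
    double stall_cpi = misses_per_instr * miss_latency / avg_mlp;
    return base_cpi + stall_cpi;
}
```

For example, with a base CPI of 1.0, one miss per 100 instructions, a 200-cycle miss latency, and two misses outstanding on average, the model predicts a CPI of about 2.0; such closed-form estimates run in microseconds, versus hours for a detailed simulation.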
As part of my graduate research I developed an analytical model for use in optimizing helper threads (see the section below on helper threads). This analytical model is now patented technology (assignee Intel Corporation)!
It is growing increasingly difficult to find all hardware bugs before a chip "tapes out". The process of finding design bugs after a chip is manufactured is known as post-silicon debug. This project is a collaboration with researchers in computer science and circuit testing on a framework called "Backspace". The idea of Backspace is to start from the state where a bug is detected and, using formal verification techniques, construct a possible execution trace that would lead to this erroneous state. Some of our initial work in this area explored how to identify which of many possible predecessor states is the most likely one when using formal methods to construct a trace backwards from a crash state. More recently we proposed a way to avoid needing to explicitly search all predecessor states consistent with the hardware RTL description. For details on this project, see: [C.11, J.5, W.8].
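The backward-trace idea can be sketched abstractly. In the toy below (all names hypothetical), `predecessors` stands in for the formal pre-image computation over the RTL, and `likelihood` stands in for ranking candidates, e.g. by agreement with signatures recorded on-chip:

```cpp
#include <algorithm>
#include <functional>
#include <vector>

using State = int;  // stand-in for a full hardware state vector

// Sketch of constructing a trace backwards from a crash state: at each step,
// enumerate states that could transition into the current one and keep the
// most plausible candidate. Real Backspace uses formal methods for the
// pre-image step; this is only the control skeleton.
std::vector<State> backspace_trace(
    State crash,
    int steps,
    const std::function<std::vector<State>(State)>& predecessors,
    const std::function<double(State)>& likelihood) {
    std::vector<State> trace{crash};
    State cur = crash;
    for (int i = 0; i < steps; ++i) {
        std::vector<State> preds = predecessors(cur);
        if (preds.empty()) break;  // no consistent predecessor found
        cur = *std::max_element(preds.begin(), preds.end(),
                                [&](State a, State b) {
                                    return likelihood(a) < likelihood(b);
                                });
        trace.push_back(cur);
    }
    std::reverse(trace.begin(), trace.end());  // oldest state first
    return trace;
}
```

The difficulty the text alludes to is precisely that `predecessors` can return enormous sets, which motivates both the likelihood ranking and the later work on avoiding explicit predecessor search.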
Older Research (Prior to Tor joining UBC)
My PhD research explored a topic known as helper threads. First, I focused on what I called "prescient instruction prefetch" helper threads. A description of that work can be found in the following two papers: [C.2, C.3]. A later study applying related techniques for modeling program execution to data prefetch helper threads can be found in [C.4].

The idea behind helper threads is to leverage multithreaded hardware to run short threads, consisting of a few tens to hundreds of instructions, that do microarchitectural work on behalf of a single-threaded application's 'main thread'. (An example of 'multithreaded hardware' that I had in mind at the time is 'Hyper-Threading' on the Intel Pentium 4, and to a somewhat lesser extent, single-chip multiprocessors such as the Intel 'Core Duo'.)
Prescient instruction prefetch helper threads improve single-threaded application performance primarily by prefetching instructions in a timely and judicious manner (i.e., 'presciently'). The idea of prefetching is to bring instructions or data into the cache before the application needs them.
Why go to the trouble of using an entire thread just to eliminate some cache misses? At the time we were doing this research, industry roadmaps called for processor clock frequencies to reach tens of GHz (that didn't happen, due to power considerations). This increase in clock frequency causes the memory latencies of cache misses, measured in cycles, to increase. As these miss latencies grow, it becomes increasingly difficult for traditional techniques to accurately predict which instructions need to be prefetched sufficiently early to tolerate the memory latencies of future-generation microprocessors, especially when executing programs with large instruction memory footprints and complex control flow. It is not inconceivable that a cache miss all the way to main memory will one day take thousands of clock cycles to finish, during which the hardware is largely idle. If the only meaningful work in the system comes from a thread that frequently stalls waiting for such long-latency cache misses, it makes sense to use idle hardware resources to reduce the time this important work must spend simply waiting for information to move from one place to another. In current generation microprocessors a significant amount of performance is already lost due to cache misses.

Prescient instruction prefetch helper threads overcome these limitations by 'mimicking' a subset of the behavior of the application's main thread in enough detail to accurately anticipate which instructions the main thread will execute, before they are actually needed. The first paper, presented at SIGMETRICS-2003, outlines an analytical framework for optimizing the use of these helper threads. This framework applies Tarjan's path expression algorithm to perform a statistical analysis of program behavior to help optimize the selection of helper threads.
The second paper, presented at HPCA-10, describes two alternative implementation techniques for prescient instruction prefetch along with a few associated hardware support mechanisms used to improve their impact. This work resulted in five issued patents.
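The spawn-and-run-ahead structure can be distilled into a few lines. In this toy (function and parameter names are hypothetical, not from the papers), the helper thread is spawned when the main thread reaches a spawn point, skips ahead `distance` basic blocks along the predicted control-flow path to the target point, and prefetches the next `span` blocks so they are cache-resident when the main thread arrives:

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Conceptual sketch of prescient instruction prefetch: the helper thread
// mimics the main thread's control flow far ahead of it and issues
// instruction prefetches for the blocks the main thread is predicted to
// reach. Each int stands in for a basic-block address.
std::set<int> helper_prefetches(const std::vector<int>& predicted_path,
                                std::size_t distance,  // spawn-to-target gap
                                std::size_t span) {    // blocks to prefetch
    std::set<int> prefetched;
    for (std::size_t i = distance;
         i < predicted_path.size() && i < distance + span; ++i) {
        prefetched.insert(predicted_path[i]);  // would issue a prefetch here
    }
    return prefetched;
}
```

The hard part, which the analytical framework addresses, is choosing the spawn and target points: `distance` must be large enough to hide the miss latency but small enough that the path prediction remains accurate.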
Floating-point to fixed-point conversion
A description of my M.A.Sc. thesis research on automatic floating-point to fixed-point translation can be found in the following journal paper: [J.1]. The thesis itself can be found here. The main highlights of this work are a profile-based conversion algorithm able to convert floating-point arithmetic (including division) into "fixed-point" starting from an ANSI C description of a program, as well as a novel instruction set extension and an associated code generation algorithm for improving the signal-to-noise ratio of the resulting fixed-point executable. By fixed-point I generally mean an integer arithmetic approximation that is not guaranteed to exactly reproduce the original floating-point algorithm, but is far more efficient than emulating IEEE floating-point precisely in software when floating-point hardware is not available, such as in many embedded applications. The conversion algorithm uses dataflow analysis techniques to handle pointer accesses to floating-point types.
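For readers unfamiliar with fixed-point arithmetic, here is a standard Q15 example (a common format, shown for background; it is not the thesis's specific representation or conversion algorithm): a value in [-1, 1) is scaled by 2^15 and stored in a 16-bit integer, and a multiply uses a 32-bit intermediate with rounding.

```cpp
#include <cmath>
#include <cstdint>

// Q15 fixed-point: 1 sign bit, 15 fractional bits. A value x in [-1, 1)
// is approximated by round(x * 2^15) stored in an int16_t.
constexpr int FRAC_BITS = 15;

int16_t to_q15(double x) {
    return static_cast<int16_t>(std::lround(x * (1 << FRAC_BITS)));
}

double from_q15(int16_t q) {
    return static_cast<double>(q) / (1 << FRAC_BITS);
}

// Fixed-point multiply: the product of two Q15 values is Q30, so round
// and shift right by 15 to return to Q15.
int16_t q15_mul(int16_t a, int16_t b) {
    int32_t prod = static_cast<int32_t>(a) * b;      // Q30 intermediate
    prod += 1 << (FRAC_BITS - 1);                    // round to nearest
    return static_cast<int16_t>(prod >> FRAC_BITS);  // back to Q15
}
```

The conversion problem the thesis addresses is choosing such scalings automatically, per variable and guided by profiling, so that integer operations like the multiply above replace floating-point operations while keeping quantization error, and hence the signal-to-noise ratio, acceptable.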