Topics in
Distributed Systems: Harnessing Massively Parallel Processors
Instructor: Matei Ripeanu
TA: Samer Al-Kiswany
Schedule: Monday 5:00-8:00
Location: KAIS4018
[12/01] Project ideas posted here
(please revisit this page as I’ll keep updating it).
[10/01] Subscribe to the class
mailing list either by emailing sympa@ece.ubc.ca
with “subscribe eece571R@ece.ubc.ca” in the body of the message or by visiting https://oldlists.ece.ubc.ca/sympa
Virtually all semiconductor market domains, including PCs,
game consoles, mobile handsets, servers, and supercomputers, are converging to
concurrent platforms. There are two important reasons for this trend. First,
these concurrent processors can potentially offer more effective use of chip
space and power than traditional monolithic microprocessors for many demanding
applications. Second, an increasing number of applications that traditionally
used Application Specific Integrated Circuits (ASICs)
can be implemented using concurrent processors to improve functionality and
reduce engineering cost. The real challenge is to develop applications software
that effectively uses these concurrent processors to achieve efficiency and
performance goals.
The aim of this course is to provide students with
knowledge and hands-on experience in developing applications massively parallel
processors as well, an understanding of current research topics in this area,
and an opportunity to write and critically review research papers. Many
commercial offerings from NVIDIA, AMD, and Intel already offer concurrency at
massive scale. Effectively programming these processors requires in-depth
knowledge about parallel programming principles, as well as the parallelism
models, communication models, and resource limitations of these processors. The
target audiences of the course are students who aim to develop exciting
applications for these processors, as well as those who aim to develop
programming tools and future implementations for these processors.
The course will cover fundamental issues related to
architecting and building applications that harness massively parallel
processors, with a focus on graphical processing units (GPUs):
programming models for massive parallelism, application optimization
techniques, performance modeling, auto-tuning, compile and runtime support, as
well as case studies for specific application-domains.
The course is structured to provide (i) an in-depth
understanding of current topics in distributed/parallel system research; (ii) experience
with reviewing and presenting advanced technical material; (iii) exercising
writing and critically reviewing research papers. The class workload has a participation component and a final project.
1. Summarize the main contribution of
the paper
2. Critique the main
contribution.
a.
Rate
the significance of the paper on a scale of 5 (breakthrough), 4
(significant contribution), 3 (modest contribution), 2 (incremental
contribution), 1 (no contribution or negative contribution). More importantly:
Motivate your rating in a paragraph or two.
b.
Discuss
how convincing the methodology is. You may consider some of the
following questions (use what is relevant): Do the claims and conclusions
follow from the experiments? Are the assumptions realistic? Are the experiments
well designed? Are there different experiments that would be more convincing?
Are there other alternatives the authors should have considered? And, of
course, is the paper free of methodological errors?
c.
Discuss
the most important limitations of the approach?
3. What are the two strongest and/or
most interesting ideas in the paper?
4. What are the two most striking
weaknesses in the paper?
5. Name two questions that you would
like to ask the authors.
6. Detail an interesting extension of
the work not mentioned in the future work section.
7. Optional comments on the paper that
you’d like to see discussed in class.
Reviews
must be submitted by midnight the day before the class to the
relevant ‘rotisserie’ on H2O.
Papers and your reviews are then discussed in class. Discussions will be lead
by one or more students and may include a brief (10-minute) presentation of the
paper. Discussion leaders do not need to submit
reviews, but they need to prepare a discussion plan.
Schedule for Spring
2011
Previous years’
course schedules can be found here: 2010-Autonomic Computing, 2009-Massively Distributed/Parallel
Computing Platforms; 2008-Quality
of Service; 2007-Data-intensive
Computing Systems.
|
|
Topic / Project
steps |
Research Papers /
Other links (More to come. To be split as
required/optional) |
|
[W1] 01/10 |
Introduction. Overview of current research problems.
Amdahl’s Law. [slides] |
•
A View of the Parallel Computing Landscape, K. Asanović
et al., Communications of the ACM, 2009 [link] •
The Unreasonable Effectiveness of Data, Alon
Halevy, Peter Norvig, Fernando Pereira,
Communication of the ACM, 2009. [link] •
Amdahl's Law in the Multicore Era, Mark D. Hill, Michael R. Marty, IEEE Computer, July
2008. [pdf] •
Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf] •
Mitigating Amdahl’s Law through EPI Throttling, Murali Annavaram,
Ed Grochowski, John Shen,
298-309, ISCA'05 [pdf] •
Balance Principles for
Algorithm-Architecture Co-Design, Kent Czechowski, Casey Battaglino, Chris McClanahan, Aparna
Chandramowlishwaran, Richard Vuduc,
HotPar’11 [link] |
|
[W2] 01/17 |
Limits of GPU computation. [slides] [Project: discussion of project themes.] Discussion leader: Abdullah |
Required •
Debunking the 100X GPU vs. CPU
myth: an evaluation of throughput computing on CPU and GPU,
ISCA'10, [pdf] Related reading •
On the Limits of GPU Acceleration,
HotPar'10, [pdf][slides] •
Believe it or Not: Multi-core
CPUs Can Match GPU Performance for FLOP-intensive Application!,
Rajesh Bordawekar et al., IBM TechReport
[pdf] •
nVidia blog
answer •
12 Ways to fool the masses …, D. Bailey,
Supercomputing Review, 1991 [paper] [update] |
|
[W3] 01/24 |
Regular/irregular parallelism [Project
discussion: 5-min
presentation of your project idea] |
Required •
How Much Parallelism is There in
Irregular Applications?, Milind Kulkarni, Martin Burtscher, R. Inkulu, Keshav Pingali, Calin Cascaval, PPoPP’09 [pdf] Related reading •
Amorphous parallelism: 1, 2,
3,
4,
5, Galois
project, slides |
|
[W4] 01/31 |
Optimizations: tiling, privatization, regularization,
compaction, binning, data-layout transformation, thread management.
Applications to regular applications (e.g., stencil computations). [slides] {Tech: Programming
models, data parallelism, GPU architecture} Discussion leader: Emalayan |
Required •
Tuned and Wildly Asynchronous
Stencil Kernels for Heterogeneous CPU/GPU Platforms, •
Small-ruleset
regular expression matching on GPGPUs: quantitative
performance analysis and optimization, Naghmouchi
et al., ICS’10 [pdf] Related •
Improving Parallelism and Locality with Asynchronous Algorithms, Lixia Liu, Zhiyuan
Li, PPoPP’10 [pdf] •
Benchmarking GPUs
to tune dense linear algebra, V. Volkov, J. Demmel, SC08 [pdf] •
Implementing sparse matrix-vector multiplication on
throughput-oriented processors, Nathan Bell, Michael Garland,
SC’09. [link][slides][project page] •
Performance Modeling and
Automatic Ghost Zone Optimization for Iterative Stencil Loops on the Tesla
Architecture, J. Meng, K. Skadron, ICS’09. [pdf] •
A Tuning Framework for
Software-Managed Memory Hierarchies, M. Ren, J. Y.Park, M. Houston, A. Aiken, W. Dally, PACT 2008 [pdf] •
Data Layout Transformation
Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications, •
Tiled MapReduce:
Optimizing Resource Usages of Data-parallel Applications on Multicore with Tiling, R. Chen, H.
Chen, B. Zang, PACT’10 [pdf] • Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters, SC08 [pdf][slides] •
Does Cache Sharing on Modern CMP Matter to the Performance of
Contemporary Multithreaded Programs?, Eddy Z. Zhang, Yunlian
Jiang, Xipeng Shen,
PPoPP’10 [pdf] •
Other
optimization links: 1,
2, 3 GPU Architecture Links •
Top 10 Innovations in nVidia Fermi [pdf] |
|
[W5] 02/07 |
Auto-tuning {Tech: CUDA program
structure, CUDA threads, memory, performance considerations} Discussion leader: Lauro [Project: submit a two-page proposal by
Sunday (02/06) midnight]
|
Required •
Stencil computation optimization and auto-tuning on state-of-the-art
multicore architectures, K. Datta
et al., SC’08, [pdf] •
Partitioning Streaming Parallelism for Multi-cores: A Machine
Learning Based Approach, Z. Wang, M. O'Boyle, PACT’10. [pdf]
[slides] Related readings •
Model-driven Autotuning of Sparse
Matrix-Vector Multiply on GPUs, Jee Choi, Amik Singh, Richard Vuduc,
PPoPP’10 [pdf] •
Auto-tuning of Fast Fourier Transform on Graphics Processors, Yuri Dotsenko,
Sara S. Baghsorkhi, Brandon Lloyd, Naga K. Govindaraju, PPoPP 2011 •
Mapping Parallelism to Multi-cores: A Machine Learning Based
Approach, Zheng Wang, Michael F.P. O'Boyle, PPoPP’09. [pdf] •
Automatic Tuning of Discrete Fourier Transforms Driven by Analytical
Modeling. Basilio Fraguela, Yevgen Voronenko, Markus Puschel, PACT’09 [pdf] •
OpenMPC: Extended OpenMP Programming and Tuning
for GPUs, Seyong Lee, Rudolf Eigenmann, SC’10 [link] |
|
02/14 |
SPRING BREAK |
|
|
[W6] 02/21 |
Runtime infrastructure.
Discussion leader: Ildar {Tech: Tools
for emulation, debugging, profiling} |
Required •
StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines, Cédric Augonnet,
Samuel Thibault, and Raymond Namyst.
TR-7240, INRIA, March 2010. [link] •
An Asymmetric Distributed Shared Memory Model for Heterogeneous
Parallel Systems,
Isaac Gelado, Javier Cabezas,
John Stone, Sanjay Patel, Nacho Navarro, Wen-mei Hwu, ASPLOS’10 [pdf] Recommended •
Supporting GPU Sharing in Cloud
Environments with a Transparent Runtime Consolidation Framework, V. Ravi et al., HPCD’11 [link] •
TimeGraph: GPU
Scheduling for Real-Time Multi-Tasking Environments, Shinpei
Kato, Karthik Lakshmanan and Ragunathan
Rajkumar, , Yutaka Ishikawa, Abstract
| Full
paper See also implementation •
Pegasus:
Coordinated Scheduling for Virtualized Accelerator-based Systems Vishakha
Gupta and Karsten Schwan, Niraj Tolia,
Vanish Talwar Parthasarathy
Ranganathan, Abstract
| Full
paper •
Cohesion: a hybrid memory model
for accelerators,
J. Kelm et al., ISCA’10 [pdf] •
A Task-centric
Memory Model for Scalable Accelerator Architectures, •
• More GPU related projects: StarPU, CristalGPU, MAGMA |
|
[W7] 02/28 |
Transitioning between runtimes and language support [slides] {Tech: OpenCL
vs. CUDA} Discussion leader: Mohammad |
Required •
Achieving a Single Compute Device Image in OpenCL
for Multiple GPUs, Jungwon Kim, Honggyu Kim, Joo Hwan Lee, Jaejin Lee, PPoPP’11 [pdf] Related readings •
An analytical model for a GPU architecture with memory-level and
thread-level parallelism awareness, •
The program dependence graph and
its use in optimization, J. Ferrante, K. Ottenstein,
J. Warren, ACM Transactions on Programming Languages and Systems, 1987 [pdf] •
An Analysis of Linux Scalability to Many Cores, Silas Boyd-Wickizer,
Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai
Zeldovich, OSDI’10 [pdf] •
Corey: An Operating System for
Many Cores,
Silas Boyd-Wickizer, Haibo
Chen, Rong Chen, Yandong
Mao, Frans Kaashoek,
Robert Morris, Aleksey Pesterev; Lex Stein, Ming Wu, Yuehua Dai,
Yang Zhang, Zheng Zhang, OSDI’08 [pdf] •
Tapping into the Fountain of CPUs
- On Operating Systems Support for Programmable Devices, Pete Wyckoff, Muli Ben-Yehuda,
Yaron Weinsberg, Danny Dolev, Tal Anker, ASPLOS’08 [pdf] |
|
03/07 |
NO CLASS
|
|
|
[W8] 03/14 |
Project: Midterm report and
presentations. |
Submit an up to four-page midterm report. This should
include: methodology, related work,
progress to date. Deadline: Sunday 03/13 |
|
[W9] 03/21 |
Performance modeling. [slides] Discussion leader: Mohammad |
Required •
An adaptive performance modeling tool for GPU architectures, PPoPP'10 [link] Related readings •
The program dependence graph and
its use in optimization, J. Ferrante, K. Ottenstein,
J. Warren, ACM Transactions on Programming Languages and Systems, 1987 [pdf] |
|
[W10] 03/28 |
Language
support. Design patterns. Discussion leader: Jenny [slides] |
Required: •
Language virtualization for heterogeneous parallel computing, Hassan Chafi
et al., OOPSLA’10 [pdf][slides] •
A domain specific approach to heterogeneous parallelism, Hassan Chafi,
Arvind Sujeeth, Kevin
Brown, Hyouk-Joong Lee, Anand
Atreya, Kunle Olukotun, PPoPP’11 [pdf][slides] Related •
A View of the Parallel Computing Landscape, K. Asanović
et al., Communications of the ACM, 2009 [link] •
Merge: A Programming Model for
Heterogeneous Multi-core Systems, Michael D. Linderman, Jamison D. Collins, Hong Wang, Teresa H. Meng, ASPLOS’08 [pdf] • Programming the Memory Hierarchy
Revisited: Supporting Irregular Parallelism in Sequoia, Michael Bauer, John Clark, Eric
Schkufza, Alex Aiken, PPoPP’11 [pdf] • Automatic Optimization of Parallel Dataflow Programs, Christopher Olston, Benjamin Reed, Adam Silberstein, Utkarsh Srivastava, USENIX’08 [html] |
|
[W11] 04/04 |
Applications: graph analysis for massive scale graphs Discussion leader: Tim & Bo [slides] |
Required: • Accelerating CUDA Graph Algorithms at
Maximum Warp,
Sungpack Hong, Sang Kyun
Kim, Tayo Oguntebi, Kunle Olukotun, PPoPP’11 [pdf]
[slides] •
Scalable
Graph Exploration on Multicore Processors, Related •
Asynchronous Algorithms in MapReduce, Karthik Kambatla, Naresh Rapolu, Suresh Jagannathan, Ananth Grama, CLUSTER 2010. [pdf][slides] •
Massive Social Network Analysis:
Mining Twitter for Social Good, David Ediger et al., ICPP’10 [pdf] •
Crunching Large Graphs with Commodity Processors, Jacob
Nelson, Brandon Myers, A.H. Hunter, Preston Briggs, Luis Ceze,
Carl Ebeling, Dan Grossman, Simon Kahan, Mark Oskin, HotPar’11 [link] Others •
GraphAnalysis.org;
Graph500.org •
Center
for Adaptive Supercomputing Software |
|
[W12] |
Applications: bioinformatics / sequence alignment |
•
CUDAlign: Using GPUs
to accelerate the comparison of megabase genomic
sequences, Edans Flavius O. Sandes, Alba Cristina M.A. de Melo,
PPoPP’10 [link] • Size Matters: Space/Time
Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu,
IEEE/ACM Supercomputing, [pdf] |
|
[W13] 04/27 |
[Project: final presentations and wrap-up] 4pm in
KAIS 4018 |
|
Exam
period: April 11th – 28th.
Textbook
(recommended): Programming
Massively Parallel Processors: A Hands-on Approach, David B. Kirk, Wen-mei WHwu, 2010 [link]
Other links: