Topics in Distributed Systems: Harnessing Massively Parallel Processors

 

Instructor: Matei Ripeanu 

TA: Samer Al-Kiswany

Schedule: Monday 5:00-8:00

Location: KAIS4018

 

Announcements

[12/01] Project ideas posted here (please revisit this page as I’ll keep updating it).

[10/01] Subscribe to the class mailing list either by emailing sympa@ece.ubc.ca with “subscribe eece571R@ece.ubc.ca” in the body of the message or by visiting https://oldlists.ece.ubc.ca/sympa 

Course description

Virtually all semiconductor market domains, including PCs, game consoles, mobile handsets, servers, and supercomputers, are converging to concurrent platforms. There are two important reasons for this trend. First, these concurrent processors can potentially offer more effective use of chip space and power than traditional monolithic microprocessors for many demanding applications. Second, an increasing number of applications that traditionally used Application Specific Integrated Circuits (ASICs) can be implemented using concurrent processors to improve functionality and reduce engineering cost. The real challenge is to develop applications software that effectively uses these concurrent processors to achieve efficiency and performance goals.

 

The aim of this course is to provide students with knowledge and hands-on experience in developing applications massively parallel processors as well, an understanding of current research topics in this area, and an opportunity to write and critically review research papers. Many commercial offerings from NVIDIA, AMD, and Intel already offer concurrency at massive scale. Effectively programming these processors requires in-depth knowledge about parallel programming principles, as well as the parallelism models, communication models, and resource limitations of these processors. The target audiences of the course are students who aim to develop exciting applications for these processors, as well as those who aim to develop programming tools and future implementations for these processors.

 

The course will cover fundamental issues related to architecting and building applications that harness massively parallel processors, with a focus on graphical processing units (GPUs): programming models for massive parallelism, application optimization techniques, performance modeling, auto-tuning, compile and runtime support, as well as case studies for specific application-domains.

Course format

The course is structured to provide (i) an in-depth understanding of current topics in distributed/parallel system research; (ii) experience with reviewing and presenting advanced technical material; (iii) exercising writing and critically reviewing research papers. The class workload has a participation component and a final project.

Participation. In each class we discuss two or more research papers. You will have to read the papers before class (be an efficient reader!) and write a review for each paper that includes the following:

1. Summarize the main contribution of the paper

2. Critique the main contribution. 

a.        Rate the significance of the paper on a scale of 5 (breakthrough), 4 (significant contribution), 3 (modest contribution), 2 (incremental contribution), 1 (no contribution or negative contribution). More importantly: Motivate your rating in a paragraph or two.

b.        Discuss how convincing the methodology is. You may consider some of the following questions (use what is relevant): Do the claims and conclusions follow from the experiments? Are the assumptions realistic? Are the experiments well designed? Are there different experiments that would be more convincing? Are there other alternatives the authors should have considered? And, of course, is the paper free of methodological errors?

c.         Discuss the most important limitations of the approach?

3. What are the two strongest and/or most interesting ideas in the paper?

4. What are the two most striking weaknesses in the paper?

5. Name two questions that you would like to ask the authors.

6. Detail an interesting extension of the work not mentioned in the future work section.

7. Optional comments on the paper that you’d like to see discussed in class.

Reviews must be submitted by midnight the day before the class to the relevant ‘rotisserie’ on H2O. Papers and your reviews are then discussed in class. Discussions will be lead by one or more students and may include a brief (10-minute) presentation of the paper. Discussion leaders do not need to submit reviews, but they need to prepare a discussion plan.

Project: The final project is an opportunity for hands-on research in parallel/distributed systems. It involves literature survey, programming, running experiments or analytical modeling, analyzing results and writing an up to 8-page report (IEEE conference format). A list of project ideas will be posted, but students are highly encouraged to propose topics of their own interest that match the course focus. Teams of two students are highly recommended. Please see me if you want to form a larger team.  Some past project reports are available here.

 

Schedule for Spring 2011

Previous years’ course schedules can be found here: 2010-Autonomic Computing, 2009-Massively Distributed/Parallel Computing Platforms; 2008-Quality of Service; 2007-Data-intensive Computing Systems.

 

 

Topic / Project steps

Research Papers / Other links

(More to come.  To be split as required/optional)

[W1]

01/10

Introduction. Overview of current research problems. Amdahl’s Law. [slides]

 

 

    A View of the Parallel Computing Landscape, K. Asanović et al., Communications of the ACM, 2009 [link]

    The Unreasonable Effectiveness of Data, Alon Halevy, Peter Norvig, Fernando Pereira, Communication of the ACM, 2009. [link]

    Amdahl's Law in the Multicore Era, Mark D. Hill, Michael R. Marty, IEEE Computer, July 2008. [pdf]

    Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]

    Mitigating Amdahl’s Law through EPI Throttling, Murali Annavaram, Ed Grochowski, John Shen, 298-309, ISCA'05 [pdf]

    Balance Principles for Algorithm-Architecture Co-Design, Kent Czechowski, Casey Battaglino, Chris McClanahan, Aparna Chandramowlishwaran, Richard Vuduc, HotPar’11 [link]

 

[W2]

01/17

Limits of GPU computation.

[slides]

 

[Project: discussion of project themes.]

 

Discussion leader: Abdullah

 

 

Required

    Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU, ISCA'10, [pdf]

Related reading

    On the Limits of GPU Acceleration, HotPar'10, [pdf][slides]

    Believe it or Not: Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!, Rajesh Bordawekar et al., IBM TechReport [pdf]

    nVidia blog answer

    12 Ways to fool the masses …, D. Bailey, Supercomputing Review, 1991  [paper] [update]

 

[W3]

01/24

Regular/irregular parallelism

 

[Project discussion: 5-min presentation of your project idea]

 

 

Required

    How Much Parallelism is There in Irregular Applications?, Milind Kulkarni, Martin Burtscher, R. Inkulu, Keshav Pingali, Calin Cascaval, PPoPP’09 [pdf]

Related reading

    Amorphous parallelism: 1, 2, 3, 4, 5, Galois project, slides

 

[W4]

01/31

Optimizations: tiling, privatization, regularization, compaction, binning, data-layout transformation, thread management. Applications to regular applications (e.g., stencil computations).

[slides]

 

{Tech: Programming models, data parallelism, GPU architecture}

 

Discussion leader: Emalayan

 

Required

    Tuned and Wildly Asynchronous Stencil Kernels for Heterogeneous CPU/GPU Platforms, S. Venkatasubramanian, R. Vuduc, ICS’09 [pdf]

    Small-ruleset regular expression matching on GPGPUs: quantitative performance analysis and optimization, Naghmouchi et al., ICS’10 [pdf]

Related Reading

    Improving Parallelism and Locality with Asynchronous Algorithms, Lixia Liu, Zhiyuan Li, PPoPP’10 [pdf]

    Benchmarking GPUs to tune dense linear algebra, V. Volkov, J. Demmel, SC08 [pdf]

    Implementing sparse matrix-vector multiplication on throughput-oriented processors, Nathan Bell, Michael Garland, SC’09. [link][slides][project page]

    Performance Modeling and Automatic Ghost Zone Optimization for Iterative Stencil Loops on the Tesla Architecture, J. Meng, K. Skadron, ICS’09. [pdf]

    A Tuning Framework for Software-Managed Memory Hierarchies, M. Ren, J. Y.Park, M. Houston, A. Aiken, W. Dally, PACT 2008 [pdf]

    Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications, I. Sung, J. Stratton, W. Hwu, PACT’10 [pdf]

    Tiled MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore with Tiling, R. Chen, H. Chen, B. Zang, PACT’10 [pdf]

    Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters, SC08 [pdf][slides]

    Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs?, Eddy Z. Zhang, Yunlian Jiang, Xipeng Shen, PPoPP’10 [pdf]

    Other optimization links: 1, 2, 3

 

GPU Architecture Links

    Top 10 Innovations in nVidia Fermi [pdf]

 

[W5]

02/07

Auto-tuning

 

{Tech: CUDA program structure, CUDA threads, memory, performance considerations}

 

 

Discussion leader: Lauro

 

[Project: submit a two-page proposal by Sunday (02/06) midnight]



Required

    Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, K. Datta et al., SC’08, [pdf]

    Partitioning Streaming Parallelism for Multi-cores: A Machine Learning Based Approach, Z. Wang, M. O'Boyle, PACT’10. [pdf] [slides]

Related readings

    Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs, Jee Choi, Amik Singh, Richard Vuduc, PPoPP’10 [pdf]

    Auto-tuning of Fast Fourier Transform on Graphics Processors, Yuri Dotsenko, Sara S. Baghsorkhi, Brandon Lloyd, Naga K. Govindaraju, PPoPP 2011

    Mapping Parallelism to Multi-cores: A Machine Learning Based Approach, Zheng Wang, Michael F.P. O'Boyle, PPoPP’09. [pdf]

    Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling. Basilio Fraguela, Yevgen Voronenko, Markus Puschel, PACT’09 [pdf]

    OpenMPC: Extended OpenMP Programming and Tuning for GPUs, Seyong Lee, Rudolf Eigenmann, SC’10 [link]

    More links: 1, 2, 3, 4, 5

 

02/14

SPRING BREAK

 

[W6]

02/21

Runtime infrastructure. 

 

Discussion leader: Ildar

 

{Tech: Tools for emulation, debugging, profiling}

 

 

Required

    StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines, Cédric Augonnet, Samuel Thibault, and Raymond Namyst. TR-7240, INRIA, March 2010. [link]

    An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems, Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro, Wen-mei Hwu, ASPLOS’10 [pdf]

Recommended

    Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework, V. Ravi et al., HPCD’11 [link]

    TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments, Shinpei Kato, Karthik Lakshmanan and Ragunathan Rajkumar, , Yutaka Ishikawa, Abstract | Full paper See also implementation

    Pegasus: Coordinated Scheduling for Virtualized Accelerator-based Systems Vishakha Gupta and Karsten Schwan, Niraj Tolia, Vanish Talwar Parthasarathy Ranganathan, Abstract | Full paper

    Cohesion: a hybrid memory model for accelerators, J. Kelm et al., ISCA’10 [pdf]

    A Task-centric Memory Model for Scalable Accelerator Architectures, John Kelm, Daniel Johnson, Steven Lumetta, Matthew Frank, Sanjay Patel, PACT’09 [pdf]

    Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors, J. Gummaraju et al., PACT 2010. [pdf]

    More GPU related projects: StarPU, CristalGPU, MAGMA

 

[W7]

02/28

Transitioning between runtimes and language support

 

[slides]

 

{Tech: OpenCL vs. CUDA}

 

Discussion leader: Mohammad

 

Required

    Achieving a Single Compute Device Image in OpenCL for Multiple GPUs, Jungwon Kim, Honggyu Kim, Joo Hwan Lee, Jaejin Lee, PPoPP’11 [pdf]

Related readings

    An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, S. Hong, H. Kim, ISCA’09 [pdf]

    The program dependence graph and its use in optimization, J. Ferrante, K. Ottenstein, J. Warren, ACM Transactions on Programming Languages and Systems, 1987 [pdf]

    An Analysis of Linux Scalability to Many Cores, Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich, OSDI’10 [pdf]

    Corey: An Operating System for Many Cores, Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev; Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, Zheng Zhang, OSDI’08 [pdf]

    Tapping into the Fountain of CPUs - On Operating Systems Support for Programmable Devices, Pete Wyckoff, Muli Ben-Yehuda, Yaron Weinsberg, Danny Dolev, Tal Anker, ASPLOS’08 [pdf]

 

03/07

NO CLASS

 

[W8]

03/14

Project: Midterm report and presentations.

 

Submit an up to four-page midterm report. This should include:  methodology, related work, progress to date. Deadline: Sunday 03/13

[W9]

03/21

Performance modeling.

 

[slides]

 

Discussion leader: Mohammad

 

Required

    An adaptive performance modeling tool for GPU architectures, PPoPP'10 [link]

Related readings

    The program dependence graph and its use in optimization, J. Ferrante, K. Ottenstein, J. Warren, ACM Transactions on Programming Languages and Systems, 1987 [pdf]

 

[W10]

03/28

Language support.  Design patterns.

 

Discussion leader: Jenny

 

[slides]

 

Required:

    Language virtualization for heterogeneous parallel computing, Hassan Chafi et al., OOPSLA’10 [pdf][slides]

    A domain specific approach to heterogeneous parallelism, Hassan Chafi, Arvind Sujeeth, Kevin Brown, Hyouk-Joong Lee, Anand Atreya, Kunle Olukotun, PPoPP’11 [pdf][slides]

Related

    A View of the Parallel Computing Landscape, K. Asanović et al., Communications of the ACM, 2009 [link]

    Merge: A Programming Model for Heterogeneous Multi-core Systems, Michael D. Linderman, Jamison D. Collins, Hong Wang, Teresa H. Meng, ASPLOS’08 [pdf]

    Programming the Memory Hierarchy Revisited: Supporting Irregular Parallelism in Sequoia, Michael Bauer, John Clark, Eric Schkufza, Alex Aiken, PPoPP’11 [pdf]

    Automatic Optimization of Parallel Dataflow Programs, Christopher Olston, Benjamin Reed, Adam Silberstein, Utkarsh Srivastava, USENIX’08 [html]

 

[W11]

04/04

Applications: graph analysis for massive scale graphs

 

Discussion leader: Tim & Bo

 

[slides]

 

 

Required:

    Accelerating CUDA Graph Algorithms at Maximum Warp, Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, Kunle Olukotun, PPoPP’11 [pdf] [slides]

    Scalable Graph Exploration on Multicore Processors, Virat Agarwal et al, SC’10 [pdf]

Related

    Asynchronous Algorithms in MapReduce, Karthik Kambatla, Naresh Rapolu, Suresh Jagannathan, Ananth Grama, CLUSTER 2010. [pdf][slides]

    Massive Social Network Analysis: Mining Twitter for Social Good, David Ediger et al., ICPP’10 [pdf]

    Crunching Large Graphs with Commodity Processors, Jacob Nelson, Brandon Myers, A.H. Hunter, Preston Briggs, Luis Ceze, Carl Ebeling, Dan Grossman, Simon Kahan, Mark Oskin, HotPar’11 [link]

Others

    GraphAnalysis.org; Graph500.org

    Center for Adaptive Supercomputing Software

    Presentations: 2,3,1 4,

    MTAAP workshops: 2011, 2010

 

[W12]

 

Applications: bioinformatics / sequence alignment

 

    CUDAlign: Using GPUs to accelerate the comparison of megabase genomic sequences,   Edans Flavius O. Sandes, Alba Cristina M.A. de Melo, PPoPP’10 [link]

    Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, IEEE/ACM Supercomputing, [pdf]

 

[W13]

04/27

 

[Project: final presentations and wrap-up] 4pm in KAIS 4018

 

 

Exam period: April 11th – 28th.

 

Textbook (recommended): Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk, Wen-mei WHwu, 2010 [link]

 

Other links:

    HotPar workshops: 2011, 2010

    CS193