Parallel Computer Architecture (English Edition)
Authors: David E. Culler, Jaswinder Pal Singh, Anoop Gupta
Series: Computer Science Series
Publication date: 1999-09-01
ISBN: 7-111-07440-8
Price: 88.00 RMB
Additional Information
Language: English
Pages: 1026
Format: 16K (16开)
Original title: Parallel Computer Architecture: A Hardware/Software Approach
Original publisher: Morgan Kaufmann Publishers
Category: Textbook
Includes CD:
Availability: Out of print
Book Description

The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architectures across the full range of modern designs, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on an understanding of hardware-software interactions.

About the Authors

David E. Culler: Received his Ph.D. from MIT in 1989 and is currently Professor of Computer Science at the University of California, Berkeley, and Associate Chair of the Department of Electrical Engineering and Computer Sciences. He heads the Millennium project, which is building a campus-wide cluster system. His research interests include computer architecture, communication, programming languages, operating systems, and performance analysis. He led the Berkeley Network of Workstations (NOW) project, which ignited the current worldwide commercial revolution in high-performance cluster systems. Dr. Culler is known for his work on Active Messages for fast communication, the LogP model of parallel performance, the Split-C parallel language, the TAM threaded abstract machine, and dataflow architectures. He has received the NSF Presidential Faculty Fellowship and the Presidential Young Investigator Award.
Jaswinder Pal Singh: Assistant Professor of Computer Science at Princeton University. His research focuses on parallel applications and multiprocessor systems, including architecture, software, and performance evaluation. He led the development of the SPLASH and SPLASH-2 suites of parallel programs, which are widely used in the parallel systems community. While pursuing his master's and doctoral degrees at Stanford University, he participated in the DASH and FLASH multiprocessor projects, leading the application efforts for those projects; technology developed in the DASH project is now widely used in commercial products. At Princeton he leads the PRISM group, which takes an application-driven approach to studying programming models supported on different communication architectures and applies parallel computing to a variety of domains. He has received the Presidential Early Career Award for Scientists and Engineers (PECASE) and a Sloan Research Fellowship.
Anoop Gupta: Received his Ph.D. from Carnegie Mellon University in 1986 and is currently Associate Professor of Electrical Engineering and Computer Science at Stanford University, a senior researcher at Microsoft, and the Robert Noyce Faculty Scholar at Stanford. His research interests include computer architecture, operating systems, programming languages, performance debugging tools, and parallel applications. With John Hennessy he co-led the design and construction of the Stanford DASH machine, one of the first scalable distributed-shared-memory multiprocessors, and he also participated in the subsequent FLASH project. Technology developed in the DASH project is now widely used in commercial products. Professor Gupta has received the NSF Presidential Young Investigator Award.

Table of Contents

1 Introduction
1.1 Introduction
1.2 Why Parallel Architecture
1.2.1 Application Trends
1.2.2 Technology Trends 
1.2.3 Architectural Trends 
1.2.4 Supercomputers
1.2.5 Summary 
1.3 Convergence of Parallel Architectures 
1.3.1 Communication Architecture 
1.3.2 Shared Memory
1.3.3 Message-Passing
1.3.4 Convergence 
1.3.5 Data Parallel Processing
1.3.6 Other Parallel Architectures
1.3.7 A Generic Parallel Architecture
1.4 Fundamental Design Issues
1.4.1 Communication Abstraction
1.4.2 Programming Model Requirements 
1.4.3 Naming 
1.4.4 Ordering 
1.4.5 Communication and Replication 
1.4.6 Performance
1.5 Concluding Remarks 
1.6 References
1.7 Exercises
2 Parallel Programs 
2.1 Introduction
2.2 Parallel Application Case Studies
2.2.1 Simulating Ocean Currents
2.2.2 Simulating the Evolution of Galaxies
2.2.3 Visualizing Complex Scenes using Ray Tracing 
2.2.4 Mining Data for Associations
2.3 The Parallelization Process 
2.3.1 Steps in the Process 
2.3.2 Parallelizing Computation versus Data 
2.3.3 Goals of the Parallelization Process
2.4 Parallelization of an Example Program 
2.4.1 A Simple Example: The Equation Solver Kernel 
2.4.2 Decomposition
2.4.3 Assignment
2.4.4 Orchestration under the Data Parallel Model
2.4.5 Orchestration under the Shared Address Space Model
2.4.6 Orchestration under the Message Passing Model
2.5 Concluding Remarks
2.6 References
2.7 Exercises
3 Programming for Performance
3.1 Introduction
3.2 Partitioning for Performance
3.2.1 Load Balance and Synchronization Wait Time
3.2.2 Reducing Inherent Communication
3.2.3 Reducing the Extra Work
3.2.4 Summary
3.3 Data Access and Communication in a Multi-Memory System
3.3.1 A Multiprocessor as an Extended Memory Hierarchy
3.3.2 Artifactual Communication in the Extended Memory Hierarchy
3.4 Orchestration for Performance
3.4.1 Reducing Artifactual Communication
3.4.2 Structuring Communication to Reduce Cost
3.5 Performance Factors from the Processors' Perspective
3.6 The Parallel Application Case Studies: An In-Depth Look
3.6.1 Ocean
3.6.2 Barnes-Hut
3.6.3 Raytrace
3.6.4 Data Mining
3.7 Implications for Programming Models 
3.8 Concluding Remarks
3.9 References
3.10 Exercises
4 Workload-Driven Evaluation
4.1 Introduction
4.2 Scaling Workloads and Machines
4.2.1 Why Worry about Scaling
4.2.2 Key Issues in Scaling
4.2.3 Scaling Models
4.2.4 Scaling Workload Parameters
4.3 Evaluating a Real Machine
4.3.1 Performance Isolation using Microbenchmarks
4.3.2 Choosing Workloads
4.3.3 Evaluating a Fixed-Size Machine
4.3.4 Varying Machine Size
4.3.5 Choosing Performance Metrics
4.3.6 Presenting Results
4.4 Evaluating an Architectural Idea or Tradeoff
4.4.1 Multiprocessor Simulation
4.4.2 Scaling Down Problem and Machine Parameters for Simulation
4.4.3 Dealing with the Parameter Space: An Example Evaluation
4.4.4 Summary
4.5 Illustrating Workload Characterization
4.5.1 Workload Case Studies
4.5.2 Workload Characteristics
4.6 Concluding Remarks
4.7 References
4.8 Exercises
5 Shared Memory Multiprocessors
5.1 Introduction
5.2 Cache Coherence
5.2.1 The Cache Coherence Problem
5.2.2 Cache Coherence Through Bus Snooping
5.3 Memory Consistency
5.3.1 Sequential Consistency
5.3.2 Sufficient Conditions for Preserving Sequential Consistency
5.4 Design Space for Snooping Protocols
5.4.1 A 3-state (MSI) Write-back Invalidation Protocol
5.4.2 A 4-state (MESI) Write-Back Invalidation Protocol
5.4.3 A 4-state (Dragon) Write-back Update Protocol
5.5 Assessing Protocol Design Tradeoffs
5.5.1 Workloads
5.5.2 Impact of Protocol Optimizations
5.5.3 Tradeoffs in Cache Block Size
5.5.4 Update-based vs. Invalidation-based Protocols
5.6 Synchronization
5.6.1 Components of a Synchronization Event
5.6.2 Role of User, System Software and Hardware
5.6.3 Mutual Exclusion
5.6.4 Point-to-point Event Synchronization
5.6.5 Global (Barrier) Event Synchronization
5.6.6 Synchronization Summary
5.7 Implications for Software
5.8 Concluding Remarks
5.9 References
5.10 Exercises
6 Snoop-based Multiprocessor Design
6.1 Introduction
6.2 Correctness Requirements
6.3 Base Design: Single-level Caches with an Atomic Bus
6.3.1 Cache controller and tags
6.3.2 Reporting snoop results
6.3.3 Dealing with Writebacks
6.3.4 Base Organization
6.3.5 Non-atomic State Transitions
6.3.6 Serialization
6.3.7 Deadlock 
6.3.8 Livelock and Starvation
6.3.9 Implementing Atomic Primitives
6.4 Multi-level Cache Hierarchies
6.4.1 Maintaining inclusion
6.4.2 Propagating transactions for coherence in the hierarchy
6.5 Split-transaction Bus
6.5.1 An Example Split-transaction Design
6.5.2 Bus Design and Request-response Matching
6.5.3 Snoop Results and Conflicting Requests
6.5.4 Flow Control
6.5.5 Path of a Cache Miss
6.5.6 Serialization and Sequential Consistency
6.5.7 Alternative design choices
6.5.8 Putting it together with multi-level caches
6.5.9 Supporting Multiple Outstanding Misses from a Processor
6.6 Case Studies: SGI Challenge and Sun Enterprise SMPs
6.6.1 SGI Powerpath-2 System Bus
6.6.2 SGI Processor and Memory Subsystems
6.6.3 SGI I/O Subsystem
6.6.4 SGI Challenge Memory System Performance
6.6.5 Sun Gigaplane System Bus
6.6.6 Sun Processor and Memory Subsystem
6.6.7 Sun I/O Subsystem
6.6.8 Sun Enterprise Memory System Performance
6.6.9 Application Performance
6.7 Extending Cache Coherence
6.7.1 Shared-Cache Designs
6.7.2 Coherence for Virtually Indexed Caches
6.7.3 Translation Lookaside Buffer Coherence
6.7.4 Cache coherence on Rings
6.7.5 Scaling Data Bandwidth and Snoop Bandwidth
6.8 Concluding Remarks
6.9 References
6.10 Exercises
7 Scalable Multiprocessors
7.1 Introduction
7.2 Scalability
7.2.1 Bandwidth Scaling
7.2.2 Latency Scaling
7.2.3 Cost Scaling
7.2.4 Physical scaling
7.2.5 Scaling in a Generic Parallel Architecture
7.3 Realizing Programming Models
7.3.1 Primitive Network Transactions
7.3.2 Shared Address Space
7.3.3 Message Passing
7.3.4 Common challenges
7.3.5 Communication architecture design space
7.4 Physical DMA
7.4.1 A Case Study: nCUBE/2
7.5 User-level Access
7.5.1 Case Study: Thinking Machines CM-5
7.5.2 User Level Handlers
7.6 Dedicated Message Processing
7.6.1 Case Study: Intel Paragon
7.6.2 Case Study: Meiko CS-2
7.7 Shared Physical Address Space
7.7.1 Case study: Cray T3D
7.7.2 Cray T3E
7.7.3 Summary
7.8 Clusters and Networks of Workstations
7.8.1 Case Study: Myrinet SBus Lanai
7.8.2 Case Study: PCI Memory Channel
7.9 Comparison of Communication Performance
7.9.1 Network Transaction Performance
7.9.2 Shared Address Space Operations
7.9.3 Message Passing Operations
7.9.4 Application Level Performance
7.10 Synchronization
7.10.1 Algorithms for Locks
7.10.2 Algorithms for Barriers
7.11 Concluding Remarks
7.12 References
7.13 Exercises
8 Directory-based Cache Coherence
8.1 Introduction
8.2 Scalable Cache Coherence
8.3 Overview of Directory-Based Approaches
8.3.1 Operation of a Simple Directory Scheme
8.3.2 Scaling
8.3.3 Alternatives for Organizing Directories
8.4 Assessing Directory Protocols and Tradeoffs
8.4.1 Data Sharing Patterns for Directory Schemes
8.4.2 Local versus Remote Traffic
8.4.3 Cache Block Size Effects
8.5 Design Challenges for Directory Protocols
8.5.1 Performance
8.5.2 Correctness
8.5.3 Overview of Origin2000 Hardware
8.6 Cache-based Directory Protocols: The Sequent NUMA-Q
8.6.1 Cache Coherence Protocol
8.6.2 Dealing with Correctness Issues
8.6.3 Protocol Extensions
8.6.4 Overview of NUMA-Q Hardware
8.6.5 Protocol Interactions with SMP Node
8.6.6 IQ-Link Implementation
8.6.7 Performance Characteristics
8.6.8 Comparison Case Study: The HAL S1 Multiprocessor
8.7 Performance Parameters and Protocol Performance
8.8 Synchronization
8.8.1 Performance of Synchronization Algorithms
8.8.2 Supporting Atomic Primitives
8.9 Implications for Parallel Software
8.10 Advanced Topics
8.10.1 Reducing Directory Storage Overhead
8.10.2 Hierarchical Coherence
8.11 Concluding Remarks
8.12 References
8.13 Exercises
9 Hardware-Software Tradeoffs
9.1 Introduction
9.2 Relaxed Memory Consistency Models 
9.2.1 System Specifications
9.2.2 The Programming Interface
9.2.3 Translation Mechanisms
9.2.4 Consistency Models in Real Systems
9.3 Overcoming Capacity Limitations
9.3.1 Tertiary Caches
9.3.2 Cache-only Memory Architectures (COMA)
9.4 Reducing Hardware Cost
9.4.1 Hardware Access Control with a Decoupled Assist
9.4.2 Access Control through Code Instrumentation
9.4.3 Page-based Access Control: Shared Virtual Memory
9.4.4 Access Control through Language and Compiler Support
9.5 Putting it All Together: A Taxonomy and Simple-COMA
9.5.1 Putting it All Together: Simple-COMA and Stache
9.6 Implications for Parallel Software
9.7 Advanced Topics
9.7.1 Flexibility and Address Constraints in CC-NUMA Systems
9.7.2 Implementing Relaxed Memory Consistency in Software
9.8 Concluding Remarks
9.9 References
9.10 Exercises
10 Interconnection Network Design
10.1 Introduction
10.1.1 Basic definitions
10.1.2 Basic communication performance
10.2 Organizational Structure
10.2.1 Links
10.2.2 Switches
10.2.3 Network Interfaces
10.3 Interconnection Topologies
10.3.1 Fully connected network
10.3.2 Linear arrays and rings
10.3.3 Multidimensional meshes and tori
10.3.4 Trees
10.3.5 Butterflies
10.3.6 Hypercubes
10.4 Evaluating Design Trade-offs in Network Topology
10.4.1 Unloaded Latency
10.4.2 Latency under load
10.5 Routing
10.5.1 Routing Mechanisms
10.5.2 Deterministic Routing
10.5.3 Deadlock Freedom
10.5.4 Virtual Channels
10.5.5 Up*-Down* Routing
10.5.6 Turn-model Routing
10.5.7 Adaptive Routing
10.6 Switch Design
10.6.1 Ports
10.6.2 Internal Datapath
10.6.3 Channel Buffers
10.6.4 Output Scheduling
10.6.5 Stacked Dimension Switches
10.7 Flow Control
10.7.1 Parallel Computer Networks vs. LANs and WANs
10.7.2 Link-level flow control
10.7.3 End-to-end flow control
10.8 Case Studies
10.8.1 Cray T3D Network
10.8.2 IBM SP-1, SP-2 Network
10.8.3 Scalable Coherent Interconnect
10.8.4 Myricom Network
10.9 Concluding Remarks 
10.10 References
10.11 Exercises
11 Latency Tolerance
11.1 Introduction
11.2 Overview of Latency Tolerance
11.2.1 Latency Tolerance and the Communication Pipeline
11.2.2 Approaches
11.2.3 Fundamental Requirements, Benefits and Limitations
11.3 Latency Tolerance in Explicit Message Passing
11.3.1 Structure of Communication
11.3.2 Block Data Transfer
11.3.3 Precommunication
11.3.4 Proceeding Past Communication in the Same Thread
11.3.5 Multithreading
11.4 Latency Tolerance in a Shared Address Space
11.4.1 Structure of Communication
11.5 Block Data Transfer in a Shared Address Space
11.5.1 Techniques and Mechanisms
11.5.2 Policy Issues and Tradeoffs
11.5.3 Performance Benefits
11.6 Proceeding Past Long-latency Events in a Shared Address Space
11.6.1 Proceeding Past Writes
11.6.2 Proceeding Past Reads
11.6.3 Implementation Issues
11.6.4 Summary
11.7 Precommunication in a Shared Address Space
11.7.1 Shared Address Space With No Caching of Shared Data
11.7.2 Cache-coherent Shared Address Space
11.7.3 Performance Benefits
11.7.4 Implementation Issues
11.7.5 Summary
11.8 Multithreading in a Shared Address Space
11.8.1 Techniques and Mechanisms
11.8.2 Performance Benefits
11.8.3 Implementation Issues for the Blocked Scheme
11.8.4 Implementation Issues for the Interleaved Scheme
11.8.5 Integrating Multithreading with Multiple-Issue Processors
11.9 Lockup-free Cache Design
11.10 Concluding Remarks
11.11 References
11.12 Exercises
12 Future Directions
12.1 Technology and Architecture
12.1.1 Evolutionary Scenario
12.1.2 Hitting a Wall
12.1.3 Potential Breakthroughs
12.2 Applications and System Software
12.2.1 Evolution
12.2.2 Hitting a Wall
12.2.3 Potential Breakthroughs
12.3 References
APPENDIX A Parallel Benchmark Suites
A.1 ScaLapack
A.2 TPC
A.3 SPLASH
A.4 NAS Parallel Benchmarks
A.5 PARKBENCH
A.6 Other Ongoing Efforts
