Embedded Systems: ARM Programming and Optimization (English Edition)
Author: Jason D. Bakos (USA)
Series: Classic Original Editions Library
Publication date: 2017-05-02
ISBN: 978-7-111-56528-4
Price: CNY 79.00
Additional Information
Language: English
Pages: 312
Format: 16mo
Original title: Embedded Systems: ARM Programming and Optimization
Original publisher: Elsevier (Singapore) Pte Ltd
Category: Textbook
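Book Description

The modern electronics industry owes much to two technologies: the ARM processor and the Linux operating system. Nearly all modern mobile devices are built on ARM processors, and most of those processors run Linux. Mastering the design and development of ARM- and Linux-based embedded systems is therefore essential. Combining the ARM architecture with Linux tools, this book explains how different characteristics of program design affect processor performance. It demonstrates that programmers can substantially change performance by modifying code without altering the program's semantics. The practical applications used to describe and demonstrate this approach include image transformation, fractal generation, image convolution, and computer vision. The book not only helps readers understand the fundamentals of computer architecture and application design but also offers practical techniques for designing embedded software.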

Preface
For many years I have worked in the area of reconfigurable computing, whose goal is to develop tools and methodologies to facilitate the use of field programmable gate arrays (FPGAs) as co-processors for high-performance computer systems.
One of the main challenges in this discipline is the “programming problem,” in which the practical application of FPGAs is fundamentally limited by their tedious and error-prone programming model. This is of particular concern because this problem is a consequence of the technology’s strengths: FPGAs operate with fine grain concurrency, where the programmer can control the simultaneous behavior of every circuit on the chip. Unfortunately, this control also requires that the programmer manage fine grain constraints such as on-chip memory usage and routing congestion. The CPU programmer, on the other hand, needs only consider the potential state of the CPU at each line of code, while on-chip resources are automatically managed by the hardware at runtime.
I recently realized that modern embedded systems may soon face a similar programming problem. Battery technology continues to remain relatively stagnant, and the slowing of Moore’s Law became painfully evident after the nearly 6-year gap between 65 and 28 nm fabrication technology. At the same time, consumers have come to expect the continued advancement of embedded system capabilities, such as being able to run real-time augmented reality software on a processor that fits in a pair of eyeglasses.
Given these demands for energy efficiency and performance, many embedded processor vendors are seeking more energy-efficient approaches to microarchitecture, often targeting the types of parallelism that cannot be automatically extracted from software. This will require the cooperation of programmers to write parallel code. This is a lot to ask of programmers, who will need to juggle both functionality and performance on a resource- and power-constrained platform that includes a wide range of potential sources of parallelism from multicores to GPU shader units.
Many universities have developed “unified” parallel programming courses that cover the spectrum of parallel programming from distributed systems to manycore processors. However, the topic is most often taught from the perspective of high-performance computing as opposed to embedded computing.
With the recent explosion of advanced embedded platforms such as the Raspberry Pi, I saw a need to develop a curriculum that combines topics from computer architecture and parallel programming for performance-oriented programming of embedded systems. I also wanted to include interesting and relevant projects and case studies for the course to avoid the traditional types of dull course projects associated with embedded systems courses (e.g., blink the light) and parallel programming courses (e.g., write and optimize a Fast Fourier Transform).
While using these ideas in my own embedded systems course, I often find the students competing among themselves to achieve the fastest image rotation or the fastest Mandelbrot set generator. This type of collegial competition cultivates excitement for the material.
USING THIS BOOK
This book is intended for use in a junior- or senior-level undergraduate course in a computer science or computer engineering curriculum. Although a course in embedded systems may focus on subtopics such as control theory, robotics, low power design, real-time systems, or other related topics, this book is intended as an introduction to performance-oriented programming for lightweight system-on-chip embedded processors.
This book should accompany an embedded design platform such as a Raspberry Pi, on which the student can evaluate the practices and methodologies described.
When using this text, students are expected to know the C programming language, have a basic knowledge of the Linux operating system, and understand basic concurrency such as task synchronization.
INSTRUCTOR SUPPORT
Lecture slides, exercise solutions, and errata are provided at the companion website:
textbooks.elsevier.com/9780128003428

Shelving Guide

Computers / Embedded Systems

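About the Author

Jason D. Bakos (USA) is an associate professor in the Department of Computer Science and Engineering at the University of South Carolina. His research interests include high-performance computing, heterogeneous networks, and computer architecture. He received the U.S. National Science Foundation (NSF) CAREER Award in 2009 and currently serves as an associate editor of ACM Transactions on Reconfigurable Technology and Systems.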

Table of Contents

Preface
Acknowledgments
CHAPTER 1 The Linux/ARM embedded platform
1.1 Performance-Oriented Programming
1.2 ARM Technology
1.3 Brief History of ARM
1.4 ARM Programming
1.5 ARM Instruction Set Architecture
1.5.1 ARM general purpose registers
1.5.2 Status register
1.5.3 Memory addressing modes
1.5.4 GNU ARM assembler
1.6 Assembly Optimization #1: Sorting
1.6.1 Reference implementation
1.6.2 Assembly implementation
1.6.3 Result verification
1.6.4 Analysis of compiler-generated code
1.7 Assembly Optimization #2: Bit Manipulation
1.8 Code Optimization Objectives
1.8.1 Reducing the number of executed instructions
1.8.2 Reducing average CPI
1.9 Runtime Profiling with Performance Counters
1.9.1 ARM performance monitoring unit
1.9.2 Linux Perf_Event
1.9.3 Performance counter infrastructure
1.10 Measuring Memory Bandwidth
1.11 Performance Results
1.12 Performance Bounds
1.13 Basic ARM Instruction Set
1.13.1 Integer arithmetic instructions
1.13.2 Bitwise logical instructions
1.13.3 Shift instructions
1.13.4 Movement instructions
1.13.5 Load and store instructions
1.13.6 Comparison instructions
1.13.7 Branch instructions
1.13.8 Floating-point instructions
1.14 Chapter Wrap-Up
Exercises
CHAPTER 2 Multicore and data-level optimization: OpenMP and SIMD
2.1 Optimization Techniques Covered by this Book
2.2 Amdahl’s Law
2.3 Test Kernel: Polynomial Evaluation
2.4 Using Multiple Cores: OpenMP
2.4.1 OpenMP directives
2.4.2 Scope
2.4.3 Other OpenMP directives
2.4.4 OpenMP synchronization
2.4.5 Debugging OpenMP code
2.4.6 The OpenMP parallel for pragma
2.4.7 OpenMP with performance counters
2.4.8 OpenMP support for the Horner kernel
2.5 Performance Bounds
2.6 Performance Analysis
2.7 Inline Assembly Language in GCC
2.8 Optimization #1: Reducing Instructions per Flop
2.9 Optimization #2: Reducing CPI
2.9.1 Software pipelining
2.9.2 Software pipelining Horner’s method
2.10 Optimization #3: Multiple Flops per Instruction with Single Instruction, Multiple Data
2.10.1 ARM11 VFP short vector instructions
2.10.2 ARM Cortex NEON instructions
2.10.3 NEON intrinsics
2.11 Chapter Wrap-Up
Exercises
CHAPTER 3 Arithmetic optimization and the Linux Framebuffer
3.1 The Linux Framebuffer
3.2 Affine Image Transformations
3.3 Bilinear Interpolation
3.4 Floating-Point Image Transformation
3.4.1 Loading the image
3.4.2 Rendering frames
3.5 Analysis of Floating-Point Performance
3.6 Fixed-Point Arithmetic
3.6.1 Fixed point versus floating point: Accuracy
3.6.2 Fixed point versus floating point: Range
3.6.3 Fixed point versus floating point: Precision
3.6.4 Using fixed point
3.6.5 Efficient fixed-point addition
3.6.6 Efficient fixed-point multiplication
3.6.7 Determining radix point position
3.6.8 Range and accuracy requirements for image transformation
3.6.9 Converting from floating-point to fixed-point arithmetic
3.7 Fixed-Point Performance
3.8 Real-Time Fractal Generation
3.8.1 Pixel coloring
3.8.2 Zooming in
3.8.3 Range and accuracy requirements
3.9 Chapter Wrap-Up
Exercises
CHAPTER 4 Memory optimization and video processing
4.1 Stencil Loops
4.2 Example Stencil: The Mean Filter
4.3 Separable Filters
4.3.1 Gaussian blur
4.3.2 The Sobel filter
4.3.3 The Harris corner detector
4.3.4 Lucas-Kanade optical flow
4.4 Memory Access Behavior of 2D Filters
4.4.1 2D data representation
4.4.2 Filtering along the row
4.4.3 Filtering along the column
4.5 Loop Tiling
4.6 Tiling and the Stencil Halo Region
4.7 Example 2D Filter Implementation
4.8 Capturing and Converting Video Frames
4.8.1 YUV and chroma subsampling
4.8.2 Exporting tiles to the frame buffer
4.9 Video4Linux Driver and API
4.10 Applying the 2D Tiled Filter
4.11 Applying the Separated 2D Tiled Filter
4.12 Top-Level Loop
4.13 Performance Results
4.14 Chapter Wrap-Up
Exercises
CHAPTER 5 Embedded heterogeneous programming with OpenCL
5.1 GPU Microarchitecture
5.2 OpenCL
5.3 OpenCL Programming Model, Idioms, and Abstractions
5.3.1 The host/device programming model
5.3.2 Error checking
5.3.3 Platform layer: Initializing the platforms
5.3.4 Platform layer: Initializing the devices
5.3.5 Platform layer: Initializing the context
5.3.6 Platform layer: Kernel control
5.3.7 Platform layer: Kernel compilation
5.3.8 Platform layer: Device memory allocation
5.4 Kernel Workload Distribution
5.4.1 Device memory
5.4.2 Kernel parameters
5.4.3 Kernel vectorization
5.4.4 Parameter space for Horner kernel
5.4.5 Kernel attributes
5.4.6 Kernel dispatch
5.5 OpenCL Implementation of Horner’s Method: Device Code
5.5.1 Verification
5.6 Performance Results
5.6.1 Parameter exploration
5.6.2 Number of workgroups
5.6.3 Workgroup size
5.6.4 Vector size
5.7 Chapter Wrap-Up
Exercises
Appendix A Adding PMU support to Raspbian for the Generation 1 Raspberry Pi
Appendix B NEON intrinsic reference
Appendix C OpenCL reference
Index
