
Spark内核设计的艺术:架构设计与实现 (The Art of Spark Kernel Design: Architecture and Implementation)
Author: Geng Jia'an
Series: Big Data Technology Series
Publication date: 2017-12-11
ISBN: 978-7-111-58439-1
List price: 139.00 CNY
Extended information
Language: Simplified Chinese
Pages: 703
Format: 16K
Original title:
Original publisher:
Category: In-store
Includes CD: No
Out of print: No
Book Introduction

The book consists of 10 chapters:
Chapter 1 covers setting up a Spark learning environment. It gives readers a first impression of Spark, builds confidence for studying it, and lays the groundwork for the chapters that follow.
Chapter 2 introduces Spark's fundamentals and architecture. It provides a macro-level view of Spark that readers can return to while reading later chapters, helping them digest and understand each component.
Chapter 3 introduces the low-level infrastructure of the Spark kernel. Its source code touches on many design patterns and programming techniques that readers will enjoy, which should raise interest in the rest of the book.
Chapter 4 covers the initialization of SparkContext. The analysis here is relatively simple; the chapter serves as a bridge between earlier and later material, introducing the various components of the Spark kernel along the way.
Chapter 5 covers Spark's execution environment, SparkEnv. Analyzing SparkEnv shows that wherever tasks are executed in Spark, SparkEnv is involved. This chapter also points ahead to later material.
Chapter 6 covers Spark's storage system. Unlike other books, which treat each storage component as a separate topic, this chapter presents the storage system as a cohesive whole.
Chapter 7 covers Spark's scheduling system. It consolidates job scheduling so that the theory of task scheduling is more cohesive; readers can even skip the earlier chapters and read this one directly. It provides a deeper understanding of scheduling algorithms, DAG scheduling, and task scheduling.
Chapter 8 covers Spark's computation engine. Since computation cannot be separated from memory, this book gathers all of Spark's memory management topics into this chapter. Readers will learn how the memory pools, Tungsten, and the memory manager manage the memory needed for task execution.
Chapter 9 covers Spark's deployment modes. To stay focused on each chapter's theme, everything before this chapter defaults to local mode. This chapter steps out of local mode to show the overall design of Spark's deployment modes.
Chapter 10 covers Spark's APIs. On the surface, the biggest difference between this version of Spark and earlier versions is the API, so this chapter takes a few individual APIs as examples, analyzes their source code, and explains the similarities and differences between the old and new APIs.

Book Highlights

This book offers a deep analysis of Spark's kernel design: the highly abstract RDD data structure, the distributed DAG scheduler/driver, and the efficient non-blocking-IO-based distributed communication frameworks Akka/Netty. It is a rare find, a reference well suited to architects of large distributed computing systems and seasoned open-source contributors.
——Cai Dong, Assistant President, Chief Data Officer, and Chief Architect, Wanda Internet Technology Group

In just a few short years, Spark rolled out RDD, Spark Streaming, Spark SQL, GraphX, MLlib, and a whole series of other modules at lightning speed, shaking up the big data world. Based on the latest Spark 2.x, this book strikes an excellent balance between design ideas and code analysis, and offers open-source enthusiasts and source-code readers useful methods for reading source code.
——Dong Fei, COO of datatist; former senior engineer at LinkedIn

There are many big data books, but few are written with a sense of artistry. This book should give you one more perspective on value along the long road of big data, and one more lighthouse amid the towering waves at big data's summit.
——Yu Jun, big data expert at iFLYTEK

This book provides an excellent learning path for anyone who wants to become a competent Spark engineer or a technical leader in the big data industry. Master Spark, and you can navigate the ocean of big data. Thanks to the author for his contribution to the big data industry!
——Zhang Hancheng, Deputy Secretary-General, Zhongguancun Big Data Trading Industry Alliance

This book's explanation and analysis of Spark's principles are highly instructive. The author analyzes every key detail of the Spark source code, offering guidance to beginners as well as intermediate and advanced users.
——Wang Huan, Technical Director, Shanghai Tianxi Information Technology Co., Ltd.

Content Summary
Recommended by multiple experts and written by a big data expert from 360, this book dissects the essence of Spark's architecture and implementation based on Spark 2.1.0. It drills down to the method level, distills numerous flow charts, and presents seven core designs in the round: architecture, environment, scheduling, storage, computation, deployment, and API. The book has 10 chapters, organized as follows.
Preparation (Chapters 1-2): A brief introduction to setting up Spark and its basic principles. The detailed descriptions effectively lower the barrier to entering the Spark world while giving a macro-level view of Spark's background and overall design.
Foundations (Chapters 3-5): Covers Spark's infrastructure (including configuration, RPC, and metrics), the initialization of SparkContext, and the environment Spark needs to execute. After this part, readers will deeply understand the design of the RPC framework and the functions of the execution environment, a prerequisite for the core material.
Core (Chapters 6-9): The heart of Spark, including the storage system, the scheduling system, the computation engine, and deployment modes. After this part, readers will fully understand the details of Spark's data processing and be able to extend Spark's core functionality, optimize performance, and precisely diagnose production issues.
API (Chapter 10): Compares Spark's old and new APIs and briefly introduces the new API.

About the Author
Geng Jia'an
More than 10 years of experience in the IT industry, with positions at Alibaba, eLong, and 360, focusing on open source and big data. Through extensive hands-on work he has studied J2EE, the JVM, Tomcat, Spring, Hadoop, Spark, MySQL, and Redis in depth, and he particularly enjoys dissecting the source code of open-source projects. Early in his career he developed J2EE enterprise applications and formed his own views on Java-related technologies. He is the author of 《深入理解Spark:核心思想与源码分析》 (Deep Understanding of Spark: Core Ideas and Source Code Analysis).

Preface

Why I Wrote This Book
Writing the preface for this book reminded me of writing the preface for 《深入理解Spark:核心思想与源码分析》 two years ago. I could not help thinking of Cui Hu's poem 《题都城南庄》 (Inscribed in a Village South of the Capital):
On this day last year, within this gate,
a face and peach blossoms glowed together red.
The face has gone, I know not where,
yet the peach blossoms still smile in the spring breeze.
In its core ideas and architecture, Spark is still the same Spark, but I have moved to a new "home". Without my noticing, I am two years older, and in the big data field Spark has gone from "rising star" to "veteran". It also took Spark roughly those two years to go from version 0.x.x to 2.x.x.
After 《深入理解Spark:核心思想与源码分析》 was published, it drew a market response and, more valuably, a great deal of reader feedback. Enthusiastic readers pointed out many shortcomings via WeChat or email: typos, incorrect descriptions, code analysis that read like a running ledger, too little material tying things together at a high level, and an outdated code version. Some errors were fixed in revised printings; other corrections were published as separate blog posts. Talking with readers also corrected some misunderstandings of my own. Once again, my deep thanks to all readers for their support and help!
Some readers were very positive about the earlier book and hoped for a second edition, and my editor Gao Jingya repeatedly "egged me on", but I never planned to write one. I hoped someone would find a better way to write a book introducing and analyzing the Spark 2.0 source code, because I felt my earlier approach really was not ideal. I have always faced a dilemma: with too little source code, a source-analysis book degenerates into a book of bare principles, which is not enough for readers who want a deep understanding of Spark's implementation; with too much, it feels like piling up code to pad the page count. Many source-analysis books merely describe what an interface or method does, leaving readers forever "looking at flowers through fog". So I kept hoping for a better way to write this kind of book.
After more than a year of waiting, no such book appeared, so I decided to make another attempt. This time I abandoned the earlier book's approach of following the code's execution flow. Instead, I first introduce a system as a whole, then analyze the function of each component, and finally tie the components together with flow charts. This book still suffers from the old "ailment" of too much code, but I hope it brings some fresh air all the same.
Key Features of This Book
The book is organized the way source-code analysis naturally proceeds: from script analysis to initialization and then to the core content, moving from the shallow to the deep.
Each chapter first gives an overview of its content, then analyzes the implementation principles of each component in depth, and finally shows how the components relate to one another through execution flows.
Diagrams are used wherever possible to illustrate principles and speed up the reader's grasp of the material.
Many of the implementations and principles explained here are worth borrowing from, and can help readers improve their skills in architecture and program design.
As much source code as possible is retained, so that beginners can read comfortably even away from a work environment (e.g. on the subway or the bus).
Target Audience
Reading source code is hard work, costly in both effort and time, especially for those who have just come to Spark. This book retains as much source code as possible so the analysis does not feel disjointed; the goal is to lower the learning barrier for most readers. If you are an IT newcomer with 1-3 years of experience, or you want to start learning Spark's core concepts, this book suits you well. If you already know or use Spark and want to improve further, it suits you even better. If you are a development novice with little grounding in basics such as Java and Linux, this book may not be for you. If you have already studied Spark in depth, it may still serve as a reference.
Overall, this book suits the following readers:
people who know Spark but want a deeper understanding of how it is implemented;
big data technology enthusiasts;
operations engineers and architects interested in performance optimization and deployment options;
open-source enthusiasts: readers who like studying source code can learn ways and means of reading source code from this book.
This book will not teach you how to develop Spark applications; it only uses the classic word count example as a demonstration. It briefly introduces Hadoop MapReduce, Hadoop YARN, Mesos, Alluxio (Tachyon), ZooKeeper, HDFS, Akka, Jetty, and Netty, but does not go deeply into how to use these frameworks, since the market already offers plenty of books on them. Nor does it dwell on the syntax of Scala, Java, or Shell; readers can pick suitable books for those. This book is made for those who want to pry open the "Pandora's box"!
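The word count dataflow mentioned above can be sketched without a Spark cluster. The following plain-Python sketch is an illustration only, not Spark's actual API: the helper names `flat_map`, `map_pairs`, and `reduce_by_key` are hypothetical stand-ins that mimic the behavior of RDD's `flatMap`, `map`, and `reduceByKey`.

```python
from collections import defaultdict
from functools import reduce

# Toy, single-process imitation of the classic Spark word count pipeline.
# These helpers are illustrative stand-ins, not Spark's API.

def flat_map(f, data):
    """Apply f to each element and flatten the results (like RDD.flatMap)."""
    return [y for x in data for y in f(x)]

def map_pairs(f, data):
    """Apply f to each element (like RDD.map)."""
    return [f(x) for x in data]

def reduce_by_key(f, pairs):
    """Group (key, value) pairs by key and fold each group's values with f
    (like RDD.reduceByKey)."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return {k: reduce(f, vs) for k, vs in grouped.items()}

lines = ["hello spark", "hello world"]
words = flat_map(str.split, lines)            # split lines into words
pairs = map_pairs(lambda w: (w, 1), words)    # pair each word with a count of 1
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts)                                 # {'hello': 2, 'spark': 1, 'world': 1}
```

In Spark the same three steps run distributed across partitions, with `reduceByKey` triggering a shuffle; the toy version only preserves the shape of the dataflow.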
How to Read This Book
The book has 10 chapters, organized as follows.
Preparation (Chapters 1-2): a brief introduction to setting up Spark and its basic principles, providing background knowledge.
Foundations (Chapters 3-5): Spark's infrastructure, the initialization of SparkContext, the Spark execution environment, and related topics.
Core (Chapters 6-9): the most central part of Spark, including the storage system, the scheduling system, the computation engine, and deployment modes.
API (Chapter 10): compares Spark's old and new APIs and introduces the new API.
The appendices at the end of the book cover the following: Appendix A introduces Utils, the most commonly used utility class in Spark; Appendix B is a brief introduction to Akka; Appendix C briefly introduces Jetty and the utility class JettyUtils; Appendix D briefly introduces the Metrics library and part of its API; Appendix E demonstrates the word count example under Hadoop 1.0; Appendix F covers the common methods of the utility class CommandUtils; Appendix G briefly introduces Netty and the utility class NettyUtils; Appendix H introduces RpcUtils, Spark's RPC utility class.
To lower the barrier to reading and understanding the Spark source code, this book retains as much of the source as possible. It is based primarily on Spark 2.1.0; interested readers can apply the same approach to reading the latest Spark source.
Errata
This book covers a great deal of material, and given the limits of my ability, errors are inevitable. If you have any questions or comments about the book, please contact me at beliefer@163.com or via my blog at http://blog.csdn.net/beliefer and send me your suggestions or ideas; I will humbly improve together with everyone.
Acknowledgments
I am grateful that we live in the information age, with the chance to work with the internet and big data; to my parents for their help and support in study, work, and life over the years; and to my wife for her care and forbearance in daily life.
Thanks to my editor Gao Jingya for her strong support in getting this book published.
Thanks to He Zhong, my guide on the big data road; to Wang Huan, a devoted technologist, for his valuable suggestions on the book's content; to Yu Yaoyao and Ma Xiaobo for reviewing the manuscript; and to the readers who have helped with this book.

Geng Jia'an

Expert Comments

Back when I was doing big data work in the UK, I often visited big data companies in Silicon Valley. One of the most important was Databricks, founded by the creators of Spark; my first visit was in October 2013, when Databricks was just starting out and its new office was still being prepared.
Four years on, we have more and more high-quality open-source options in big data, stream computing, graph computing, distributed machine learning, and deep learning, yet Spark remains one of the tools data scientists use most. Anyone who understands even a little of Spark's underlying technology cannot help but admire its design and the theoretical foundations of its distributed computing.
This book offers a deep analysis of Spark's kernel design: the highly abstract RDD data structure, the distributed DAG scheduler/driver, and the efficient non-blocking-IO-based distributed communication frameworks Akka/Netty. It is a rare find, a reference well suited to architects of large distributed computing systems and seasoned open-source contributors.
—Cai Dong, Assistant President, Chief Data Officer, and Chief Architect, Wanda Internet Technology Group
The big data ecosystem is a colorful world of its own. When learning the technology, what matters most is the ability to condense the thick into the thin, to classify and abstract a welter of information. Across the big data stack, for all the technologies blooming and multiplying, big data essentially solves four core problems: storage, computation, query, and mining. In just a few short years, Spark rolled out RDD, Spark Streaming, Spark SQL, GraphX, MLlib, and a whole series of other modules at lightning speed, shaking up the big data world. Based on the latest Spark 2.x, this book strikes an excellent balance between design ideas and code analysis, and offers open-source enthusiasts and source-code readers useful methods for reading source code.
—Dong Fei, COO of datatist; former senior engineer at LinkedIn
On first reading, this book feels familiar: Spark is still the same Spark, but the book carries the marks of passing years, adds a touch of art on top of the technology, and pays more attention to the reader's taste. There are many big data books, but few are written with a sense of artistry. This book should give you one more perspective on value along the long road of big data, and one more lighthouse amid the towering waves at big data's summit.
—Yu Jun, big data expert at iFLYTEK
As institutions become information and information becomes tools, Spark provides powerful technical support for putting the big data industry into practice! With in-memory computing at its core, its general-purpose, fast, and complete data tooling has formed a highly competitive data ecosystem, making it an outstanding part of big data solutions, and more and more enterprises are deploying Spark. This book provides an excellent learning path for anyone who wants to become a competent Spark engineer or a technical leader in the big data industry. Master Spark, and you can navigate the ocean of big data. Thanks to the author for his contribution to the big data industry!
—Zhang Hancheng, Deputy Secretary-General, Zhongguancun Big Data Trading Industry Alliance
This book's explanation and analysis of Spark's principles are highly instructive. The author analyzes every key detail of the Spark source code, offering guidance to beginners as well as intermediate and advanced users.
—Wang Huan, Technical Director, Shanghai Tianxi Information Technology Co., Ltd.

Shelving Guide

Computer Science / Big Data Analysis and Processing


Table of Contents

Praise for This Book
Preface
Chapter 1: Preparing the Environment
1.1 Preparing the runtime environment
1.1.1 Installing the JDK
1.1.2 Installing Scala
1.1.3 Installing Spark
1.2 First taste of Spark
1.2.1 Running spark-shell
1.2.2 Running word count
1.2.3 Dissecting spark-shell
1.3 Preparing the source-reading environment
1.3.1 Installing SBT
1.3.2 Installing Git
1.3.3 Installing the Eclipse Scala IDE plugin
1.4 Building and debugging the Spark source
1.5 Summary
Chapter 2: Design Philosophy and Basic Architecture
2.1 A first look at Spark
2.1.1 Limitations of Hadoop MRv1
2.1.2 Characteristics of Spark
2.1.3 Spark use cases
2.2 Spark fundamentals
2.3 Spark's basic design ideas
2.3.1 Spark module design
2.3.2 Spark model design
2.4 Spark's basic architecture
2.5 Summary
Chapter 3: Spark Infrastructure
3.1 Spark configuration
3.1.1 Configuration via system properties
3.1.2 The SparkConf configuration API
3.1.3 Cloning a SparkConf configuration
3.2 Spark's built-in RPC framework
3.2.1 The RPC configuration TransportConf
3.2.2 The RPC client factory TransportClientFactory
3.2.3 The RPC server TransportServer
3.2.4 Pipeline initialization
3.2.5 TransportChannelHandler in detail
3.2.6 The server-side RpcHandler in detail
3.2.7 The server bootstrap TransportServerBootstrap
3.2.8 The client TransportClient in detail
3.3 The event bus
3.3.1 The ListenerBus hierarchy
3.3.2 SparkListenerBus in detail
3.3.3 LiveListenerBus in detail
3.4 The metrics system
3.4.1 The Source hierarchy
3.4.2 The Sink hierarchy
3.5 Summary
Chapter 4: Initialization of SparkContext
4.1 Overview of SparkContext
4.2 Creating the Spark environment
4.3 The implementation of SparkUI
4.3.1 Overview of SparkUI
4.3.2 The WebUI framework
4.3.3 Creating the SparkUI
4.4 Creating the heartbeat receiver
4.5 Creating and starting the scheduling system
4.6 Initializing the block manager BlockManager
4.7 Starting the metrics system
4.8 Creating the event-logging listener
4.9 Creating and starting the ExecutorAllocationManager
4.10 Creating and starting the ContextCleaner
4.10.1 Creating the ContextCleaner
4.10.2 Starting the ContextCleaner
4.11 Extra SparkListeners and starting the event bus
4.12 Updating the Spark environment
4.13 Finishing SparkContext initialization
4.14 Common methods provided by SparkContext
4.15 The SparkContext companion object
4.16 Summary
Chapter 5: The Spark Execution Environment
5.1 Overview of SparkEnv
5.2 The security manager SecurityManager
5.3 The RPC environment
5.3.1 The RPC endpoint RpcEndpoint
5.3.2 The RPC endpoint reference RpcEndpointRef
5.3.3 Creating the transport configuration TransportConf
5.3.4 The message dispatcher Dispatcher
5.3.5 Creating the transport context TransportContext
5.3.6 Creating the transport client factory TransportClientFactory
5.3.7 Creating the TransportServer
5.3.8 Sending client requests
5.3.9 Common methods of NettyRpcEnv
5.4 The serializer manager SerializerManager
5.5 The broadcast manager BroadcastManager
5.6 The map task output tracker
5.6.1 The implementation of MapOutputTracker
5.6.2 How MapOutputTrackerMaster works
5.7 Building the storage system
5.8 Creating the metrics system
5.8.1 MetricsConfig in detail
5.8.2 Common methods of MetricsSystem
5.8.3 Starting the MetricsSystem
5.9 The output commit coordinator
5.9.1 The implementation of OutputCommitCoordinatorEndpoint
5.9.2 The implementation of OutputCommitCoordinator
5.9.3 How OutputCommitCoordinator works
5.10 Creating the SparkEnv
5.11 Summary
Chapter 6: The Storage System
6.1 Overview of the storage system
6.1.1 Storage system architecture
6.1.2 Basic concepts
6.2 The block info manager
6.2.1 Basic concepts of block locks
6.2.2 The implementation of block locks
6.3 The disk block manager
6.3.1 Local directory structure
6.3.2 Methods provided by DiskBlockManager
6.4 Disk storage DiskStore
6.5 The memory manager
6.5.1 The memory pool model
6.5.2 StorageMemoryPool in detail
6.5.3 The MemoryManager model
6.5.4 UnifiedMemoryManager in detail
6.6 Memory storage MemoryStore
6.6.1 The memory model of MemoryStore
6.6.2 Methods provided by MemoryStore
6.7 The block manager BlockManager
6.7.1 Initialization of BlockManager
6.7.2 Methods provided by BlockManager
6.8 How BlockManagerMaster manages BlockManagers
6.8.1 Responsibilities of BlockManagerMaster
6.8.2 BlockManagerMasterEndpoint in detail
6.8.3 BlockManagerSlaveEndpoint in detail
6.9 The block transfer service
6.9.1 Initializing NettyBlockTransferService
6.9.2 NettyBlockRpcServer in detail
6.9.3 The shuffle client
6.10 DiskBlockObjectWriter in detail
6.11 Summary
Chapter 7: The Scheduling System
7.1 Overview of the scheduling system
7.2 RDDs in detail
7.2.1 Why RDDs are needed
7.2.2 A first analysis of the RDD implementation
7.2.3 RDD dependencies
7.2.4 The partition calculator Partitioner
7.2.5 RDDInfo
7.3 Stages in detail
7.3.1 The implementation of ResultStage
7.3.2 The implementation of ShuffleMapStage
7.3.3 StageInfo
7.4 The DAG-oriented scheduler DAGScheduler
7.4.1 JobListener and JobWaiter
7.4.2 ActiveJob in detail
7.4.3 A brief introduction to DAGSchedulerEventProcessLoop
7.4.4 The components of DAGScheduler
7.4.5 Common methods provided by DAGScheduler
7.4.6 DAGScheduler and job submission
7.4.7 Building stages
7.4.8 Submitting a ResultStage
7.4.9 Submitting not-yet-computed tasks
7.4.10 The DAGScheduler scheduling flow
7.4.11 Handling task execution results
7.5 The scheduling pool Pool
7.5.1 Scheduling algorithms
7.5.2 The implementation of Pool
7.5.3 Scheduling pool builders
7.6 The task set manager TaskSetManager
7.6.1 Task sets
7.6.2 Member attributes of TaskSetManager
7.6.3 Scheduling pools and speculative execution
7.6.4 Task locality
7.6.5 Common methods of TaskSetManager
7.7 The launcher backend interface LauncherBackend
7.7.1 The implementation of BackendConnection
7.7.2 The implementation of LauncherBackend
7.8 The scheduler backend interface SchedulerBackend
7.8.1 The definition of SchedulerBackend
7.8.2 An analysis of LocalSchedulerBackend
7.9 The task result getter TaskResultGetter
7.9.1 Handling successful tasks
7.9.2 Handling failed tasks
7.10 The task scheduler TaskScheduler
7.10.1 Attributes of TaskSchedulerImpl
7.10.2 Initialization of TaskSchedulerImpl
7.10.3 Starting TaskSchedulerImpl
7.10.4 TaskSchedulerImpl and task submission
7.10.5 TaskSchedulerImpl and resource allocation
7.10.6 The TaskSchedulerImpl scheduling flow
7.10.7 How TaskSchedulerImpl handles execution results
7.10.8 Common methods of TaskSchedulerImpl
7.11 Summary
Chapter 8: The Computation Engine
8.1 Overview of the computation engine
8.2 The memory manager and execution memory
8.2.1 ExecutionMemoryPool in detail
8.2.2 The MemoryManager model and execution memory
8.2.3 UnifiedMemoryManager and execution memory
8.3 The memory manager and Tungsten
8.3.1 MemoryBlock in detail
8.3.2 The MemoryManager model and Tungsten
8.3.3 Tungsten's memory allocators
8.4 The task memory manager
8.4.1 TaskMemoryManager in detail
8.4.2 Memory consumers
8.4.3 The overall architecture of execution memory
8.5 Tasks in detail
8.5.1 The task context TaskContext
8.5.2 The definition of Task
8.5.3 The implementation of ShuffleMapTask
8.5.4 The implementation of ResultTask
8.6 IndexShuffleBlockResolver in detail
8.7 Sampling and estimation
8.7.1 An analysis of the SizeTracker implementation
8.7.2 How SizeTracker works
8.8 The trait WritablePartitionedPairCollection
8.9 An analysis of AppendOnlyMap
8.9.1 Capacity growth in AppendOnlyMap
8.9.2 Data updates in AppendOnlyMap
8.9.3 AppendOnlyMap's cached aggregation algorithm
8.9.4 AppendOnlyMap's built-in sorting
8.9.5 Extensions of AppendOnlyMap
8.10 An analysis of PartitionedPairBuffer
8.10.1 Capacity growth in PartitionedPairBuffer
8.10.2 Insertion into PartitionedPairBuffer
8.10.3 The PartitionedPairBuffer iterator
8.11 External sorters
8.11.1 ExternalSorter in detail
8.11.2 ShuffleExternalSorter in detail
8.12 The shuffle manager
8.12.1 ShuffleWriter in detail
8.12.2 ShuffleBlockFetcherIterator in detail
8.12.3 BlockStoreShuffleReader in detail
8.12.4 SortShuffleManager in detail
8.13 Combining the map-side and reduce-side shuffle
8.14 Summary
Chapter 9: Deployment Modes
9.1 The heartbeat receiver HeartbeatReceiver
9.2 An analysis of the Executor implementation
9.2.1 The Executor's heartbeat reports
9.2.2 Running tasks
9.3 The local deployment mode
9.4 The persistence engine PersistenceEngine
9.4.1 The file-system-based persistence engine
9.4.2 The ZooKeeper-based persistence engine
9.5 The leader election agent
9.6 The Master in detail
9.6.1 Starting the Master
9.6.2 Checking for Worker timeouts
9.6.3 Handling election as leader
9.6.4 First-level resource scheduling
9.6.5 Registering Workers
9.6.6 Updating a Worker's latest state
9.6.7 Handling Worker heartbeats
9.6.8 Registering Applications
9.6.9 Handling Executor requests
9.6.10 Handling Executor state changes
9.6.11 Common methods of the Master
9.7 The Worker in detail
9.7.1 Starting a Worker
9.7.2 Registering the Worker with the Master
9.7.3 Sending heartbeats to the Master
9.7.4 Workers and leader election
9.7.5 Running a Driver
9.7.6 Running an Executor
9.7.7 Handling Executor state changes
9.8 The StandaloneAppClient implementation
9.8.1 An analysis of ClientEndpoint
9.8.2 An analysis of StandaloneAppClient
9.9 An analysis of StandaloneSchedulerBackend
9.9.1 Attributes of StandaloneSchedulerBackend
9.9.2 An analysis of DriverEndpoint
9.9.3 Starting StandaloneSchedulerBackend
9.9.4 Stopping StandaloneSchedulerBackend
9.9.5 StandaloneSchedulerBackend and resource allocation
9.10 CoarseGrainedExecutorBackend in detail
9.10.1 The CoarseGrainedExecutorBackend process
9.10.2 A functional analysis of CoarseGrainedExecutorBackend
9.11 The local-cluster deployment mode
9.11.1 Starting a local cluster
9.11.2 The local-cluster startup process
9.11.3 Executor allocation in local-cluster mode
9.11.4 Task submission and execution in local-cluster mode
9.12 The Standalone deployment mode
9.12.1 The Standalone startup process
9.12.2 Executor allocation in Standalone mode
9.12.3 Resource reclamation in Standalone mode
9.12.4 Fault tolerance in Standalone mode
9.13 Other deployment options
9.13.1 YARN
9.13.2 Mesos
9.14 Summary
Chapter 10: Spark API
10.1 Basic concepts
10.2 The data source DataSource
10.2.1 DataSourceRegister in detail
10.2.2 DataSource in detail
10.3 The implementation of checkpoints
10.3.1 The implementation of CheckpointRDD
10.3.2 The implementation of RDDCheckpointData
10.3.3 The implementation of ReliableRDDCheckpointData
10.4 RDDs revisited
10.4.1 Transformation APIs
10.4.2 Action APIs
10.4.3 An analysis of the checkpoint API implementation
10.4.4 Iterative computation
10.5 The dataset Dataset
10.6 DataFrameReader in detail
10.7 SparkSession in detail
10.7.1 The SparkSession Builder
10.7.2 The SparkSession API
10.8 The word count example
10.8.1 The job preparation phase
10.8.2 Job submission and scheduling
10.9 Summary
Appendices

Recommended Teaching Resources
Author: Brian W. Kernighan, Dennis M. Ritchie
Author: John R. Hubbard
Author: John M. Stewart
Author: Zheng Aqi (editor-in-chief), Gu Yunhua et al.
Recommended Reference Readings