If you need to analyze and understand data, this book and the Weka toolkit are an excellent place to start. It is both an essential textbook for newcomers and a resource from which experts like me can benefit.
—— Jim Gray, winner of the 1998 Turing Award
This book is a classic, best-selling textbook on data mining and machine learning, adopted by many leading universities worldwide. The fourth edition fully reflects the latest technical developments in the field, including important new chapters on probabilistic methods and deep learning. In addition, the popular machine learning software Weka has been upgraded once again: readers can carry out data mining tasks in a friendly interactive interface and deepen their understanding of the algorithms through intuitive results.
While keeping pace with cutting-edge techniques, the fourth edition retains the style and strengths of its predecessors: clear, detailed coverage of the fundamentals and concrete, practical guidance on tools and techniques. From preparing input, interpreting output, and evaluating results to the core algorithms of data mining, everything is presented concisely and elegantly.
About the Authors
Ian H. Witten is a professor in the Department of Computer Science at the University of Waikato, New Zealand, a Fellow of the ACM and of the Royal Society of New Zealand, and the recipient of the 2004 Namur Award of the International Federation for Information Processing (IFIP).
Eibe Frank is an associate professor in the Department of Computer Science at the University of Waikato, New Zealand. For the success of the Weka software he received, together with Witten and Hall, the 2005 ACM SIGKDD Service Award.
Mark A. Hall is an honorary research associate at the University of Waikato, New Zealand, and a core developer of the Weka software.
Christopher J. Pal is an associate professor at Polytechnique Montréal, whose research interests include artificial intelligence, computer vision, and pattern recognition.
The convergence of computing and communication has produced a society that feeds on information. Yet most of the information is in its raw form: data. If data is characterized as recorded facts, then information is the set of patterns, or expectations, that underlie the data. There is a huge amount of information locked up in databases—information that is potentially important but has not yet been discovered or articulated. Our mission is to bring it forth.
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Of course, there will be problems. Many patterns will be banal and uninteresting. Others will be spurious, contingent on accidental coincidences in the particular dataset used. And real data is imperfect: some parts will be garbled, some missing. Anything that is discovered will be inexact: there will be exceptions to every rule and cases not covered by any rule. Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful.
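To make this concrete, here is a minimal sketch (not code from the book) of the kind of pattern-seeking just described: a stripped-down version of the simple 1R scheme covered in Chapter 4, Algorithms: the basic methods. For one attribute, it assigns each attribute value its majority class and counts how many training instances the resulting rule gets wrong; the (outlook, play) pairs below are an illustrative fragment modeled on the book's toy weather data.

```python
from collections import Counter, defaultdict

def one_r(pairs):
    """Build a one-attribute rule: map each attribute value to its
    majority class, and count the training instances it misclassifies."""
    by_value = defaultdict(Counter)
    for value, cls in pairs:
        by_value[value][cls] += 1
    rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
    errors = sum(sum(counts.values()) - counts[rule[v]]
                 for v, counts in by_value.items())
    return rule, errors

# Illustrative (outlook, play) pairs in the style of the weather dataset
weather = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
           ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
           ("overcast", "yes"), ("sunny", "no"), ("sunny", "yes"),
           ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
           ("overcast", "yes"), ("rainy", "no")]

rule, errors = one_r(weather)
print(rule, errors)  # -> {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'} 4
```

The rule is inexact, as the text warns: it misclassifies 4 of the 14 instances, but it is still a useful regularity extracted automatically from the data.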
Machine learning provides the technical basis of data mining. It is used to extract information from the raw data in databases—information that, ideally, is expressed in a comprehensible form and can be used for a variety of purposes. The process is one of abstraction: taking the data, warts and all, and inferring whatever structure underlies it. This book is about the tools and techniques of machine learning that are used in practical data mining for finding, and if possible describing, structural patterns in data.
As with any burgeoning new technology that enjoys intense commercial attention, the use of machine learning is surrounded by a great deal of hype in the technical—and sometimes the popular—press. Exaggerated reports appear of the secrets that can be uncovered by setting learning algorithms loose on oceans of data. But there is no magic in machine learning, no hidden power, no alchemy. Instead there is an identifiable body of simple and practical techniques that can often extract useful information from raw data. This book describes these techniques and shows how they work.
In many applications machine learning enables the acquisition of structural descriptions from examples. The kind of descriptions that are found can be used for prediction, explanation, and understanding. Some data mining applications focus on prediction: forecasting what will happen in new situations from data that describe what happened in the past, often by guessing the classification of new examples. But we are equally—perhaps more—interested in applications where the result of “learning” is an actual description of a structure that can be used to classify examples. This structural description supports explanation and understanding as well as prediction. In our experience, insights gained by the user are of most interest in the majority of practical data mining applications; indeed, this is one of machine learning’s major advantages over classical statistical modeling.
The book explains a wide variety of machine learning methods. Some are pedagogically motivated: simple schemes that are designed to explain clearly how the basic ideas work. Others are practical: real systems that are used in applications today. Many are contemporary and have been developed only in the last few years.
A comprehensive software resource has been created to illustrate the ideas in the book. Called the Waikato Environment for Knowledge Analysis, or WEKA for short, it is available as Java source code at www.cs.waikato.ac.nz/ml/weka. It is a full, industrial-strength implementation of most of the techniques that are covered in this book. It includes illustrative code and working implementations of machine learning methods. It offers clean, spare implementations of the simplest techniques, designed to aid understanding of the mechanisms involved. It also provides a workbench that includes full, working, state-of-the-art implementations of many popular learning schemes that can be used for practical data mining or for research. Finally, it contains a framework, in the form of a Java class library, that supports applications that use embedded machine learning and even the implementation of new learning schemes.
The objective of this book is to introduce the tools and techniques for machine learning that are used in data mining. After reading it, you will understand what these techniques are and appreciate their strengths and applicability. If you wish to experiment with your own data, you will be able to do this easily with the WEKA software. But WEKA is by no means the only choice. For example, the freely available statistical computing environment R includes many machine learning algorithms. Devotees of the Python programming language might look at a popular library called scikit-learn. Modern “big data” frameworks for distributed computing, such as Apache Spark, include support for machine learning. There is a plethora of options for deploying machine learning in practice. This book discusses fundamental learning algorithms without delving into software-specific implementation details. When appropriate, we point out where the algorithms we discuss can be found in the WEKA software. We also briefly introduce other machine learning software for so-called “deep learning” from high-dimensional data. However, most software-specific information is relegated to appendices.
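As a brief illustration of one of these alternatives (this is a sketch of scikit-learn's API, not of WEKA, and it assumes scikit-learn is installed), the following trains a decision tree classifier on the classic iris dataset and estimates its accuracy on held-out data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the classic iris dataset, a small numeric dataset also
# used as an introductory example in this book
X, y = load_iris(return_X_y=True)

# Hold out a test set so the accuracy estimate reflects
# generalization to unseen data, not memorization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

The same train/evaluate workflow applies whatever software you choose; Chapter 5, Credibility: evaluating what's been learned, explains why holding out test data matters.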
The book spans the gulf between the intensely practical approach taken by trade books that provide case studies on data mining and the more theoretical, principle-driven exposition found in current textbooks on machine learning. (A brief description of these books appears in the Further reading section at the end of Chapter 1, What’s it all about?) This gulf is rather wide. To apply machine learning techniques productively, you need to understand something about how they work; this is not a technology that you can apply blindly and expect to get good results. Different problems yield to different techniques, but it is rarely obvious which techniques are suitable for a given situation: you need to know something about the range of possible solutions. And we cover an extremely wide range of techniques. We can do this because, unlike many trade books, this volume does not promote any particular commercial software or approach. We include a large number of examples, but they use illustrative datasets that are small enough to allow you to follow what is going on. Real datasets are far too large to show this (and in any case are usually company confidential). Our datasets are chosen not to illustrate actual large-scale practical problems, but to help you understand what the different techniques do, how they work, and what their range of application is.
Found only on the islands of New Zealand, the weka (pronounced to rhyme with “Mecca”) is a flightless bird with an inquisitive nature.
The book is aimed at the technically aware general reader who is interested in the principles and ideas underlying the current practice of machine learning. It will also be of interest to information professionals who need to become acquainted with this new technology, and to all those who wish to gain a detailed technical understanding of what machine learning involves. It is written for an eclectic audience of information systems practitioners, programmers, consultants, developers, data scientists, information technology managers, specification writers, patent examiners, curious lay people—as well as students and professors—who need an easy-to-read book with lots of illustrations that describes what the major machine learning techniques are, what they do, how they are used, and how they work. It is practically oriented, with a strong “how to” flavor, and includes algorithms, and often pseudo-code. All those involved in practical data mining will benefit directly from the techniques described. The book is aimed at people who want to cut through to the reality that underlies the hype about machine learning and who seek a practical, nonacademic, unpretentious approach. In most of the book we have avoided requiring any specific theoretical or mathematical knowledge. However, recognizing the growing complexity of the subject as it matures, we have included substantial theoretical material in Chapter 9, Probabilistic methods, and Chapter 10, Deep learning, because this is necessary for a full appreciation of recent practical techniques, in particular deep learning.
The book is organized in layers that make the ideas accessible to readers who are interested in grasping the basics, as well as to those who would like more depth of treatment, along with full details on the techniques covered. We believe that consumers of machine learning need to have some idea of how the algorithms they use work. It is often observed that data models are only as good as the person who interprets them, and that person needs to know something about how the models are produced to appreciate the strengths, and limitations, of the technology. However, it is not necessary for all users to have a deep understanding of the finer details of the algorithms.
We address this situation by describing machine learning methods at successive levels of detail. The book is divided into two parts. Part I is an introduction to machine learning for data mining. The reader will learn the basic ideas, the topmost level, by reading the first three chapters. Chapter 1, What’s it all about?, describes, through examples, what machine learning is and where it can be used; it also provides actual practical applications. Chapter 2, Input: concepts, instances, attributes, and Chapter 3, Output: knowledge representation, cover the different kinds of input and output—or knowledge representation—that are involved. Different kinds of output dictate different styles of algorithm, and Chapter 4, Algorithms: the basic methods, describes the basic methods of machine learning, simplified to make them easy to comprehend. Here the principles involved are conveyed in a variety of algorithms without getting involved in intricate details or tricky implementation issues. To make progress in the application of machine learning techniques to particular data mining problems, it is essential to be able to measure how well you are doing. Chapter 5, Credibility: evaluating what’s been learned, which can be read out of sequence, equips the reader to evaluate the results that are obtained from machine learning, addressing the sometimes complex issues involved in performance evaluation.
Part II introduces advanced techniques of machine learning for data mining. At the lowest and most detailed level, Chapter 6, Trees and rules, and Chapter 7, Extending instance-based and linear models, expose in naked detail the nitty-gritty issues of implementing a spectrum of machine learning algorithms, including the complexities that are necessary for them to work well in practice (but omitting the heavy mathematical machinery that is required for a few of the algorithms). Although many readers may want to ignore such detailed information, it is at this level that full working implementations of machine learning schemes are written. Chapter 8, Data transformations, describes practical topics involved with engineering the input and output to machine learning—e.g., selecting and discretizing attributes. Chapter 9, Probabilistic methods, and Chapter 10, Deep learning, provide a rigorous account of probabilistic methods for machine learning and deep learning respectively. Chapter 11, Beyond supervised and unsupervised learning, looks at semisupervised and multi-instance learning, while Chapter 12, Ensemble learning, covers techniques of “ensemble learning,” which combine the output from different learning techniques. Chapter 13, Moving on: applications and beyond, looks to the future.
The book describes most methods used in practical machine learning. However, it does not cover reinforcement learning because it is rarely applied in practical data mining; nor genetic algorithm approaches because these are really just optimization techniques that are not specific to machine learning; nor relational learning and inductive logic programming because they are not very commonly used in mainstream data mining applications.
An Appendix covers some mathematical background needed to follow the material in Chapter 9, Probabilistic methods, and Chapter 10, Deep learning. Another Appendix introduces the WEKA data mining workbench, which provides implementations of most of the ideas described in Parts I and II. We have done this in order to clearly separate conceptual material from the practical aspects of how to use it. At the end of each chapter in Parts I and II are pointers to related WEKA algorithms. You can ignore these, or look at them as you go along, or skip directly to the WEKA material if you are in a hurry to get on with analyzing your data and don’t want to be bothered with the technical details of how the algorithms work.
UPDATED AND REVISED CONTENT
We finished writing the first edition of this book in 1999, the second and third in 2005 and 2011 respectively, and now, in May 2016, are just polishing this fourth edition. How things have changed over the past couple of decades! While the basic core of material remains the same, we have made the most of opportunities to update it and add new material, and as a result the book has doubled in size to reflect the changes that have taken place. Of course, there have also been errors to fix, errors that we had accumulated in our publicly available errata file (available through the book’s home page at http://www.cs.waikato.ac.nz/ml/weka/book.html).
SECOND EDITION
The major change in the second edition of the book was a separate part at the end of the book that included all the material on the WEKA machine learning workbench. This allowed the main part of the book to stand alone, independent of the workbench. At that time WEKA, a widely used and popular feature of the first edition, had just acquired a radical new look in the form of an interactive graphical user interface—or rather, three separate interactive interfaces—which made it far easier to use. The primary one is the “Explorer,” which gives access to all of WEKA’s facilities using menu selection and form filling. The others are the Knowledge Flow interface, which allows you to design configurations for streamed data processing, and the Experimenter, with which you set up automated experiments that run selected machine learning algorithms with different parameter settings on a corpus of datasets, collect performance statistics, and perform significance tests on the results. These interfaces lower the bar for becoming a practitioner of machine learning, and the second edition included a full description of how to use them.
It also contained much new material that we briefly mention here. We extended the sections on rule learning and cost-sensitive evaluation. Bowing to popular demand, we added information on neural networks: the perceptron and the closely related Winnow algorithm; the multilayer perceptron and backpropagation algorithm. Logistic regression was also included. We described how to implement nonlinear decision boundaries using both the kernel perceptron and radial basis function networks, and also included support vector machines for regression. We incorporated a new section on Bayesian networks, again in response to readers’ requests and WEKA’s new capabilities in this regard, with a description of how to learn classifiers based on these networks, and how to implement them efficiently using AD trees.
The previous 5 years (1999–2004) had seen great interest in data mining for text, and this was reflected in the introduction of string attributes in WEKA, multinomial Bayes for document classification, and text transformations. We also described efficient data structures for searching the instance space: kD-trees and ball trees for finding nearest neighbors efficiently, and for accelerating distance-based clustering. We described new attribute selection schemes such as race search and the use of support vector machines; new methods for combining models such as additive regression, additive logistic regression, logistic model trees, and option trees. We also covered recent developments in using unlabeled data to improve classification, including the cotraining and co-EM methods.
THIRD EDITION
For the third edition, we thoroughly edited the second edition and brought it up to date, including a great many new methods and algorithms. WEKA and the book were closely linked together—pretty well everything in WEKA was covered in the book. We also included far more references to the literature, practically tripling the number of references that were in the first edition.
As well as becoming far easier to use, WEKA had grown beyond recognition over the previous decade, and matured enormously in its data mining capabilities. It incorporates an unparalleled range of machine learning algorithms and related techniques. The growth has been partly stimulated by recent developments in the field, and is partly user-led and demand-driven. This puts us in a position where we know a lot about what actual users of data mining want, and we have capitalized on this experience when deciding what to include in this book.
Here are a few of the highlights of the material that was added in the third edition. A section on web mining was included, and, under ethics, a discussion of how individuals can often be “reidentified” from supposedly anonymized data. Other additions included techniques for multi-instance learning, new material on interactive cost-benefit analysis, cost-complexity pruning, advanced association rule algorithms that use extended prefix trees to store a compressed version of the dataset in main memory, kernel ridge regression, stochastic gradient descent, and hierarchical clustering methods. We added new data transformations: partial least squares regression, reservoir sampling, one-class learning, decomposing multiclass classification problems into ensembles of nested dichotomies, and calibrating class probabilities. We added new information on ensemble learning techniques: randomization vs. bagging, and rotation forests. A new section on data stream learning was added as well.
FOURTH EDITION
One of the main drivers behind this fourth edition was a desire to add comprehensive material on the topic of deep learning, a new development that is essentially enabled by the emergence of truly vast data resources in domains like image and speech processing, and the availability of truly vast computational resources, including server farms and graphics processing units. However, deep learning techniques are heavily based on a potent mix of theory and practice. And we had also received other requests asking us to include more, and more rigorous, theoretical material.
This forced us to rethink the role of theory in the book. We bit the bullet and added two new theoretically oriented chapters. Chapter 10, Deep learning, covers deep learning itself, and its predecessor, Chapter 9, Probabilistic methods, gives a principled theoretical development of probabilistic methods that is necessary to understand a host of other new algorithms. We recognize that many of our readers will not want to stomach all this theory, and we assure them that the remainder of the book has intentionally been left at a far simpler mathematical level. But this additional theoretical base puts some key material in the hands of readers who aspire to understand rapidly advancing techniques from the research world.
Developments in WEKA have proceeded apace. It now provides ways of reaching out and incorporating other languages and systems, such as the popular R statistical computing language, the Spark and Hadoop frameworks for distributed computing, the Python and Groovy languages for scripting, and the MOA system for stream-oriented learning—to name but a few. Recognizing that it is not possible, and perhaps not desirable, to document such a comprehensive and fast-evolving system in a printed book, we have created a series of open online courses, Data Mining with Weka, More Data Mining with Weka, and Advanced Data Mining with Weka, to accompany the book (at https://weka.waikato.ac.nz).
The fourth edition contains numerous other updates and additions, and far more references to the literature. But enough of this: dive in and see for yourself.
ACKNOWLEDGMENTS
Writing the acknowledgments is always the nicest part! A lot of people have helped us, and we relish this opportunity to thank them. This book has arisen out of the machine learning research project in the Computer Science Department at the University of Waikato, New Zealand. We have received generous encouragement and assistance from the academic staff members early on in that project: John Cleary, Sally Jo Cunningham, Matt Humphrey, Lyn Hunt, Bob McQueen, Lloyd Smith, and Tony Smith. We also benefited greatly from interactions with staff members who arrived later: Michael Mayo and Robert Durrant. Special thanks go to Geoff Holmes, who led the project for many years, and Bernhard Pfahringer, who had significant input into many different aspects of the WEKA software. All who have worked on the machine learning project here have contributed to our thinking: we would particularly like to mention early students Steve Garner, Stuart Inglis and Craig Nevill-Manning for helping us to get the project off the ground in the beginning when success was less certain and things were more difficult.
The WEKA system that illustrates the ideas in this book forms a crucial component of it. It was conceived by the authors and designed and implemented principally by Eibe Frank, Mark Hall, Peter Reutemann, and Len Trigg, but many people in the machine learning laboratory at Waikato made significant contributions. Since the first edition of the book the WEKA team has expanded considerably: so many people have contributed that it is impossible to acknowledge everyone properly. We are grateful to Chris Beckham, for contributing several packages to WEKA, Remco Bouckaert for his Bayes net package and many other contributions, Lin Dong for her implementations of multi-instance learning methods, Dale Fletcher for many database-related aspects, Kurt Driessens for his implementation of Gaussian process regression, James Foulds for his work on multi-instance filtering, Anna Huang for information bottleneck clustering, Martin Gütlein for his work on feature selection, Kathryn Hempstalk for her one-class classifier, Ashraf Kibriya and Richard Kirkby for contributions far too numerous to list, Nikhil Kishore for his implementation of elastic net regression, Niels Landwehr for logistic model trees, Chi-Chung Lau for creating all the icons for the Knowledge Flow interface, Abdelaziz Mahoui for the implementation of K*, Jonathan Miles for his implementation of kernel filtering, Stefan Mutter for association rule mining, Malcolm Ware for numerous miscellaneous contributions, Haijian Shi for his implementations of tree learners, Marc Sumner for his work on speeding up logistic model trees, Tony Voyle for least-median-of-squares regression, Yong Wang for Pace regression and the original implementation of M5′, Benjamin Weber for his great unification of WEKA parsing modules, and Xin Xu for his multi-instance learning package, JRip, logistic regression and many other contributions.
Our sincere thanks go to all these people for their dedicated work, and also to the many contributors to WEKA from outside our group at Waikato.
Tucked away as we are in a remote (but very pretty) corner of the southern hemisphere, we greatly appreciate the visitors to our department who play a crucial role in acting as sounding boards and helping us to develop our thinking. We would like to mention in particular Rob Holte, Carl Gutwin, and Russell Beale, each of whom visited us for several months; David Aha, who although he only came for a few days did so at an early and fragile stage of the project and performed a great service by his enthusiasm and encouragement; and Kai Ming Ting, who worked with us for 2 years on many of the topics in this book, and helped to bring us into the mainstream of machine learning. More recent visitors included Arie Ben-David, Carla Brodley, Gregory Butler, Stefan Kramer, Johannes Schneider, Jan van Rijn and Michalis Vlachos, and many others who have given talks at our department. We would particularly like to thank Albert Bifet, who gave us detailed feedback on a draft version of the third edition—most of which we have incorporated.
Students at Waikato have played a significant role in the development of the project. Many of them are in the above list of WEKA contributors, but they have also contributed in other ways. In the early days, Jamie Littin worked on ripple-down rules and relational learning. Brent Martin explored instance-based learning and nested instance-based representations. Murray Fife slaved over relational learning; Nadeeka Madapathage investigated the use of functional languages for expressing machine learning algorithms. Kathryn Hempstalk worked on one-class learning and Richard Kirkby on data streams. Gabi Schmidberger worked on density estimation trees, Lan Huang on concept-based text clustering, and Alyona Medelyan on keyphrase extraction. More recently, Felipe Bravo has worked on sentiment classification for Twitter, Mi Li on fast clustering methods, and Tim Leathart on ensembles of nested dichotomies. Other graduate students have influenced us in numerous ways, particularly Gordon Paynter, YingYing Wen, and Zane Bray, who have worked with us on text mining, and Quan Sun and Xiaofeng Yu. Colleagues Steve Jones and Malika Mahoui have also made far-reaching contributions to these and other machine learning projects. We have also learned much from our many visiting students from Freiburg, including Nils Weidmann.
Ian Witten would like to acknowledge the formative role of his former students at Calgary, particularly Brent Krawchuk, Dave Maulsby, Thong Phan, and Tanja Mitrovic, all of whom helped him develop his early ideas in machine learning, as did faculty members Bruce MacDonald, Brian Gaines, and David Hill at Calgary, and John Andreae at the University of Canterbury.
Eibe Frank is indebted to his former supervisor at the University of Karlsruhe, Klaus-Peter Huber, who infected him with the fascination of machines that learn. On his travels Eibe has benefited from interactions with Peter Turney, Joel Martin, and Berry de Bruijn in Canada; Luc de Raedt, Christoph Helma, Kristian Kersting, Stefan Kramer, Ulrich Rückert, and Ashwin Srinivasan in Germany.
Mark Hall thanks his former supervisor Lloyd Smith, now at Missouri State University, who exhibited the patience of Job when his thesis drifted from its original topic into the realms of machine learning. The many and varied people who have been part of, or have visited, the machine learning group at the University of Waikato over the years deserve a special thanks for their valuable insights and stimulating discussions.
Chris Pal thanks his coauthors for the invitation to help write this fourth edition; and his family for accommodating the extra time spent writing. He thanks Polytechnique Montréal for the sabbatical that allowed him to travel to New Zealand, leading to this fruitful collaboration; and the University of Waikato Department of Computer Science for hosting him. He also thanks his many mentors, academic family, coauthors, and colleagues over the years, whose perspectives have been enriching and have influenced the presentation here, including: Brendan Frey, Geoff Hinton, Yoshua Bengio, Sam Roweis, Andrew McCallum, and Charles Sutton among many others. Particular thanks goes to Hugo Larochelle for his pedagogical perspectives on deep learning. Chris also thanks his friends and colleagues at the Montréal Institute for Learning Algorithms, the Theano development team and all his current and former students for helping to create such a great environment for machine learning research. Particular thanks goes to Chris Beckham who provided excellent feedback on earlier drafts of the new chapters in this edition.
Charlie Kent and Tim Pitts of Morgan Kaufmann have worked hard to shape this book, and Nicky Carter, our production editor, has made the process go very smoothly. We would like to thank the librarians of the Repository of Machine Learning Databases at the University of California, Irvine, whose carefully collected datasets have been invaluable in our research.
Our research has been funded by the New Zealand Foundation for Research, Science and Technology and the Royal Society of New Zealand Marsden Fund. The Department of Computer Science at the University of Waikato has generously supported us in all sorts of ways, and we owe a particular debt of gratitude to Mark Apperley for his enlightened leadership and warm encouragement. Part of the first edition was written while both authors were visiting the University of Calgary, Canada, and the support of the Computer Science department there is gratefully acknowledged—as well as the positive and helpful attitude of the long-suffering students in the machine learning course, on whom we experimented. Part of the second edition was written at the University of Lethbridge in Southern Alberta on a visit supported by Canada’s Informatics Circle of Research Excellence.
Last, and most of all, we are grateful to our families and partners. Pam, Anna and Nikki were all too well aware of the implications of having an author in the house (“not again!”) but let Ian go ahead and write the book anyway. Julie was always supportive, even when Eibe had to burn the midnight oil in the machine learning lab, and Immo and Ollig provided exciting diversions. Bernadette too was very supportive, somehow managing to keep the combined noise output of Charlotte, Luke, Zach, Kyle, and Francesca to a level that allowed Mark to concentrate. Between us we hail from Canada, England, Germany, Ireland, New Zealand, and Samoa: New Zealand has brought us together and provided an ideal, even idyllic, place to do this work.
Computers / Data Mining
By Ian H. Witten, Eibe Frank, Mark A. Hall (New Zealand) and Christopher J. Pal (Canada)
Preface ............................................................ iv
PART I INTRODUCTION TO DATA MINING
CHAPTER 1 What’s it all about? ..................................................... 3
1.1 Data Mining and Machine Learning..............................................4
Describing Structural Patterns....................................................... 6
Machine Learning.......................................................................... 7
Data Mining ................................................................................... 9
1.2 Simple Examples: The Weather Problem and Others...................9
The Weather Problem.................................................................. 10
Contact Lenses: An Idealized Problem....................................... 12
Irises: A Classic Numeric Dataset .............................................. 14
CPU Performance: Introducing Numeric Prediction .................. 16
Labor Negotiations: A More Realistic Example......................... 16
Soybean Classification: A Classic Machine Learning Success ............................................ 19
1.3 Fielded Applications ....................................................................21
Web Mining ................................................................................. 21
Decisions Involving Judgment .................................................... 22
Screening Images......................................................................... 23
Load Forecasting ......................................................................... 24
Diagnosis...................................................................................... 25
Marketing and Sales .................................................................... 26
Other Applications....................................................................... 27
1.4 The Data Mining Process.............................................................28
1.5 Machine Learning and Statistics..................................................30
1.6 Generalization as Search..............................................................31
Enumerating the Concept Space ................................................. 32
Bias ............................................................................... 33
1.7 Data Mining and Ethics...............................................................35
Reidentification............................................................................ 36
Using Personal Information......................................................... 37
Wider Issues................................................................................. 38
1.8 Further Reading and Bibliographic Notes...................................38
CHAPTER 2 Input: concepts, instances, attributes......................... 43
2.1 What’s a Concept ......................................................................44
2.2 What’s in an Example .................................................................46
Relations ................................................................... 47
Other Example Types ......................................................... 51
2.3 What’s in an Attribute ...............................................................53
2.4 Preparing the Input...................................................................56
Gathering the Data Together ................................................. 56
ARFF Format ................................................................. 57
Sparse Data ................................................................. 60
Attribute Types ............................................................. 61
Missing Values .............................................................. 62
Inaccurate Values ........................................................... 63
Unbalanced Data ............................................................. 64
Getting to Know Your Data ................................................... 65
2.5 Further Reading and Bibliographic Notes ...................................65
CHAPTER 3 Output: knowledge representation ............................... 67
3.1 Tables ................................................................................68
3.2 Linear Models .........................................................................68
3.3 Trees .................................................................................70
3.4 Rules .................................................................................75
Classification Rules ........................................................ 75
Association Rules ........................................................... 79
Rules With Exceptions ....................................................... 80
More Expressive Rules ....................................................... 82
3.5 Instance-Based Representation ....................................................84
3.6 Clusters ..............................................................................87
3.7 Further Reading and Bibliographic Notes ...................................88
CHAPTER 4 Algorithms: the basic methods ..................................... 91
4.1 Inferring Rudimentary Rules .......................................................93
Missing Values and Numeric Attributes ....................................... 94
4.2 Simple Probabilistic Modeling ....................................................96
Missing Values and Numeric Attributes ...................................... 100
Naive Bayes for Document Classification .................................... 103
Remarks .................................................................... 105
4.3 Divide-and-Conquer: Constructing Decision Trees ..................105
Calculating Information............... ......................... ..................... 108
Highly Branching Attributes ..................................................... 110
4.4 Covering Algorithms: Constructing Rules..............................113
Rules Versus Trees .................................................................. 114
A Simple Covering Algorithm ................................................ 115
Rules Versus Decision Lists....................................................119
4.5 Mining Association Rules........................................................120
Item Sets .................................................................................. 120
Association Rules .................................................................... 122
Generating Rules Efficiently ................................................... 124
4.6 Linear Models..........................................................................128
Numeric Prediction: Linear Regression .................................. 128
Linear Classification: Logistic Regression ............................. 129
Linear Classification Using the Perceptron ............................ 131
Linear Classification Using Winnow ...................................... 133
4.7 Instance-Based Learning..........................................................135
The Distance Function............................................................. 135
Finding Nearest Neighbors Efficiently ................................... 136
Remarks ........................................................................... 141
4.8 Clustering .................................................................141
Iterative Distance-Based Clustering........................................142
Faster Distance Calculations ................................................... 144
Choosing the Number of Clusters...........................................146
Hierarchical Clustering............................................................147
Example of Hierarchical Clustering........................................148
Incremental Clustering............................................................. 150
Category Utility ....................................................................... 154
Remarks ................................................................................... 156
4.9 Multi-instance Learning...........................................................156
Aggregating the Input..............................................................157
Aggregating the Output ........................................................... 157
4.10 Further Reading and Bibliographic Notes...............................158
4.11 WEKA Implementations..........................................................160
CHAPTER 5 Credibility: evaluating what’s been learned............. 161
5.1 Training and Testing..................................................................163
5.2 Predicting Performance..............................................................165
5.3 Cross-Validation.........................................................................167
5.4 Other Estimates..........................................................................169
Leave-One-Out .......................................................................... 169
The Bootstrap............................................................................. 169
5.5 Hyperparameter Selection..........................................................171
5.6 Comparing Data Mining Schemes...........................................172
5.7 Predicting Probabilities............................................................176
Quadratic Loss Function.......................................................... 177
Informational Loss Function ................................................... 178
Remarks ................................................................................... 179
5.8 Counting the Cost ....................................................................179
Cost-Sensitive Classification ................................................... 182
Cost-Sensitive Learning........................................................... 183
Lift Charts................................................................................183
ROC Curves.............................................................................186
Recall-Precision Curves........................................................... 190
Remarks ................................................................................... 190
Cost Curves..............................................................................192
5.9 Evaluating Numeric Prediction ...............................................194
5.10 The MDL Principle..................................................................197
5.11 Applying the MDL Principle to Clustering.............................200
5.12 Using a Validation Set for Model Selection...........................201
5.13 Further Reading and Bibliographic Notes...............................202
PART II MORE ADVANCED MACHINE LEARNING SCHEMES
CHAPTER 6 Trees and rules ........................................... 209
6.1 Decision Trees............................................................................210
Numeric Attributes .................................................................... 210
Missing Values .......................................................................... 212
Pruning....................................................................................... 213
Estimating Error Rates .............................................................. 215
Complexity of Decision Tree Induction.................................... 217
From Trees to Rules .................................................................. 219
C4.5: Choices and Options........................................................ 219
Cost-Complexity Pruning .......................................................... 220
Discussion .................................................................................. 221
6.2 Classification Rules....................................................................221
Criteria for Choosing Tests ....................................................... 222
Missing Values, Numeric Attributes......................................... 223
Generating Good Rules ............................................................. 224
Using Global Optimization........................................................ 226
Obtaining Rules From Partial Decision Trees .......................... 227
Rules With Exceptions .............................................................. 231
Discussion .................................................................................. 233
6.3 Association Rules.......................................................................234
Building a Frequent Pattern Tree .............................................. 235
Finding Large Item Sets ............................................................ 240
Discussion .................................................................................. 241
6.4 WEKA Implementations............................................................242
CHAPTER 7 Extending instance-based and linear models .......... 243
7.1 Instance-Based Learning............................................................244
Reducing the Number of Exemplars......................................... 245
Pruning Noisy Exemplars.......................................................... 245
Weighting Attributes ................................................................. 246
Generalizing Exemplars............................................................. 247
Distance Functions for Generalized Exemplars........................ 248
Generalized Distance Functions ................................................ 250
Discussion .................................................................................. 250
7.2 Extending Linear Models...........................................................252
The Maximum Margin Hyperplane........................................... 253
Nonlinear Class Boundaries ...................................................... 254
Support Vector Regression........................................................ 256
Kernel Ridge Regression........................................................... 258
The Kernel Perceptron............................................................... 260
Multilayer Perceptrons............................................................... 261
Radial Basis Function Networks ............................................... 270
Stochastic Gradient Descent...................................................... 270
Discussion .................................................................................. 272
7.3 Numeric Prediction With Local Linear Models........................273
Model Trees ............................................................................... 274
Building the Tree....................................................................... 275
Pruning the Tree ........................................................................ 275
Nominal Attributes .................................................................... 276
Missing Values .......................................................................... 276
Pseudocode for Model Tree Induction...................................... 277
Rules From Model Trees........................................................... 281
Locally Weighted Linear Regression........................................ 281
Discussion .................................................................................. 283
7.4 WEKA Implementations............................................................284
CHAPTER 8 Data transformations .................................................... 285
8.1 Attribute Selection .....................................................................288
Scheme-Independent Selection.................................................. 289
Searching the Attribute Space ................................................... 292
Scheme-Specific Selection ........................................................ 293
8.2 Discretizing Numeric Attributes................................................296
Unsupervised Discretization...................................................... 297
Entropy-Based Discretization.................................................... 298
Other Discretization Methods.................................................... 301
Entropy-Based Versus Error-Based Discretization................... 302
Converting Discrete to Numeric Attributes .............................. 303
8.3 Projections .................................................................304
Principal Component Analysis .................................................. 305
Random Projections................................................................... 307
Partial Least Squares Regression .............................................. 307
Independent Component Analysis............................................. 309
Linear Discriminant Analysis.................................................... 310
Quadratic Discriminant Analysis .............................................. 310
Fisher’s Linear Discriminant Analysis...................................... 311
Text to Attribute Vectors........................................................... 313
Time Series ................................................................................ 314
8.4 Sampling.....................................................................................315
Reservoir Sampling ................................................................... 315
8.5 Cleansing ....................................................................................316
Improving Decision Trees ......................................................... 316
Robust Regression ..................................................................... 317
Detecting Anomalies ................................................................. 318
One-Class Learning ................................................................... 319
Outlier Detection ....................................................................... 320
Generating Artificial Data ......................................................... 321
8.6 Transforming Multiple Classes to Binary Ones........................322
Simple Methods ......................................................................... 323
Error-Correcting Output Codes ................................................. 324
Ensembles of Nested Dichotomies............................................ 326
8.7 Calibrating Class Probabilities...................................................328
8.8 Further Reading and Bibliographic Notes.................................331
8.9 WEKA Implementations............................................................334
CHAPTER 9 Probabilistic methods .................................................. 335
9.1 Foundations ...............................................................336
Maximum Likelihood Estimation ............................................. 338
Maximum a Posteriori Parameter Estimation ........................... 339
9.2 Bayesian Networks.....................................................................339
Making Predictions.................................................................... 340
Learning Bayesian Networks .................................................... 344
Specific Algorithms ................................................................... 347
Data Structures for Fast Learning ............................................. 349
9.3 Clustering and Probability Density Estimation.........................352
The Expectation Maximization Algorithm for a Mixture of Gaussians ........... 353
Extending the Mixture Model ................................................... 356
Clustering Using Prior Distributions......................................... 358
Clustering With Correlated Attributes ...................................... 359
Kernel Density Estimation ........................................................ 361
Comparing Parametric, Semiparametric and Nonparametric Density Models for Classification ............ 362
9.4 Hidden Variable Models............................................................363
Expected Log-Likelihoods and Expected Gradients................. 364
The Expectation Maximization Algorithm ............................... 365
Applying the Expectation Maximization Algorithm to Bayesian Networks ........ 366
9.5 Bayesian Estimation and Prediction..........................................367
Probabilistic Inference Methods................................................ 368
9.6 Graphical Models and Factor Graphs........................................370
Graphical Models and Plate Notation ....................................... 371
Probabilistic Principal Component Analysis............................. 372
Latent Semantic Analysis.......................................................... 376
Using Principal Component Analysis for Dimensionality Reduction ............. 377
Probabilistic LSA....................................................................... 378
Latent Dirichlet Allocation........................................................ 379
Factor Graphs............................................................................. 382
Markov Random Fields ............................................................. 385
Computing Using the Sum-Product and Max-Product Algorithms .................. 386
9.7 Conditional Probability Models.................................................392
Linear and Polynomial Regression as Probability Models ...................... 392
Using Priors on Parameters ....................................................... 393
Multiclass Logistic Regression.................................................. 396
Gradient Descent and Second-Order Methods.......................... 400
Generalized Linear Models ....................................................... 400
Making Predictions for Ordered Classes................................... 402
Conditional Probabilistic Models Using Kernels...................... 402
9.8 Sequential and Temporal Models............................................403
Markov Models and N-gram Methods .................................... 403
Hidden Markov Models........................................................... 404
Conditional Random Fields.....................................................406
9.9 Further Reading and Bibliographic Notes...............................410
Software Packages and Implementations................................414
9.10 WEKA Implementations..........................................................416
CHAPTER 10 Deep learning .................................................. 417
10.1 Deep Feedforward Networks...................................................420
The MNIST Evaluation ........................................................... 421
Losses and Regularization.......................................................422
Deep Layered Network Architecture ...................................... 423
Activation Functions................................................................ 424
Backpropagation Revisited......................................................426
Computation Graphs and Complex Network Structures ........ 429
Checking Backpropagation Implementations ......................... 430
10.2 Training and Evaluating Deep Networks................................431
Early Stopping ......................................................................... 431
Validation, Cross-Validation, and Hyperparameter Tuning ................ 432
Mini-Batch-Based Stochastic Gradient Descent.....................433
Pseudocode for Mini-Batch Based Stochastic Gradient Descent ................. 434
Learning Rates and Schedules................................................. 434
Regularization With Priors on Parameters.............................. 435
Dropout .................................................................................... 436
Batch Normalization................................................................436
Parameter Initialization............................................................ 436
Unsupervised Pretraining......................................................... 437
Data Augmentation and Synthetic Transformations...............437
10.3 Convolutional Neural Networks..............................................437
The ImageNet Evaluation and Very Deep Convolutional Networks ........................................438
From Image Filtering to Learnable Convolutional Layers..................439
Convolutional Layers and Gradients.......................................443
Pooling and Subsampling Layers and Gradients .................... 444
Implementation ........................................................................ 445
10.4 Autoencoders............................................................................445
Pretraining Deep Autoencoders With RBMs..........................448
Denoising Autoencoders and Layerwise Training...............................448
Combining Reconstructive and Discriminative Learning................................. 449
10.5 Stochastic Deep Networks.......................................................449
Boltzmann Machines ............................................................... 449
Restricted Boltzmann Machines.............................................. 451
Contrastive Divergence ........................................................... 452
Categorical and Continuous Variables....................................452
Deep Boltzmann Machines...................................................... 453
Deep Belief Networks ............................................................. 455
10.6 Recurrent Neural Networks.....................................................456
Exploding and Vanishing Gradients ....................................... 457
Other Recurrent Network Architectures ................................. 459
10.7 Further Reading and Bibliographic Notes...............................461
10.8 Deep Learning Software and Network Implementations.....................464
Theano...................................................................................... 464
TensorFlow ............................................................... 464
Torch ........................................................................................ 465
Computational Network Toolkit.............................................. 465
Caffe......................................................................................... 465
Deeplearning4j......................................................................... 465
Other Packages: Lasagne, Keras, and cuDNN........................ 465
10.9 WEKA Implementations..........................................................466
CHAPTER 11 Beyond supervised and unsupervised learning .................. 467
11.1 Semisupervised Learning.........................................................468
Clustering for Classification....................................................468
Cotraining ................................................................................ 470
EM and Cotraining .................................................................. 471
Neural Network Approaches ................................................... 471
11.2 Multi-instance Learning...........................................................472
Converting to Single-Instance Learning ................................. 472
Upgrading Learning Algorithms ............................................. 475
Dedicated Multi-instance Methods.......................................... 475
11.3 Further Reading and Bibliographic Notes...............................477
11.4 WEKA Implementations..........................................................478
CHAPTER 12 Ensemble learning...................................................... 479
12.1 Combining Multiple Models....................................................480
12.2 Bagging ....................................................................................481
Bias-Variance Decomposition ............................................... 482
Bagging With Costs................................................................. 483
12.3 Randomization .........................................................................484
Randomization Versus Bagging .............................................. 485
Rotation Forests ....................................................................... 486
12.4 Boosting ...................................................................................486
AdaBoost.................................................................................. 487
The Power of Boosting............................................................489
12.5 Additive Regression.................................................................490
Numeric Prediction..................................................................491
Additive Logistic Regression .................................................. 492
12.6 Interpretable Ensembles...........................................................493
Option Trees ............................................................................ 494
Logistic Model Trees............................................................... 496
12.7 Stacking.................................................................................497
12.8 Further Reading and Bibliographic Notes...............................499
12.9 WEKA Implementations..........................................................501
CHAPTER 13 Moving on: applications and beyond..................... 503
13.1 Applying Machine Learning..................................................504
13.2 Learning From Massive Datasets..........................................506
13.3 Data Stream Learning............................................................509
13.4 Incorporating Domain Knowledge........................................512
13.5 Text Mining ...........................................................................515
Document Classification and Clustering............................... 516
Information Extraction........................................................... 517
Natural Language Processing ................................................ 518
13.6 Web Mining...........................................................................519
Wrapper Induction ................................................................. 519
Page Rank .............................................................................. 520
13.7 Images and Speech ................................................................522
Images .................................................................................... 523
Speech .................................................................................... 524
13.8 Adversarial Situations............................................................524
13.9 Ubiquitous Data Mining........................................................527
13.10 Further Reading and Bibliographic Notes ............................529
13.11 WEKA Implementations .......................................................532
Appendix A: Theoretical foundations................................................................533
Appendix B: The WEKA workbench ...................................................................553
References...........................................................................573
Index ..................................................................................601
List of Figures
Figure 1.1 Rules for the contact lens data. 13
Figure 1.2 Decision tree for the contact lens data. 14
Figure 1.3 Decision trees for the labor negotiations data. 18
Figure 1.4 Life cycle of a data mining project. 29
Figure 2.1 A family tree and two ways of expressing the sister-of relation. 48
Figure 2.2 ARFF file for the weather data. 58
Figure 2.3 Multi-instance ARFF file for the weather data. 60
Figure 3.1 A linear regression function for the CPU performance data. 69
Figure 3.2 A linear decision boundary separating Iris setosas from Iris versicolors. 70
Figure 3.3 Constructing a decision tree interactively: (A) creating a rectangular test involving petallength and petalwidth; (B) the resulting (unfinished) decision tree. 73
Figure 3.4 Models for the CPU performance data: (A) linear regression; (B) regression tree; (C) model tree. 74
Figure 3.5 Decision tree for a simple disjunction. 76
Figure 3.6 The exclusive-or problem. 77
Figure 3.7 Decision tree with a replicated subtree. 77
Figure 3.8 Rules for the iris data. 81
Figure 3.9 The shapes problem. 82
Figure 3.10 Different ways of partitioning the instance space. 86
Figure 3.11 Different ways of representing clusters. 88
Figure 4.1 Pseudocode for 1R. 93
Figure 4.2 Tree stumps for the weather data. 106
Figure 4.3 Expanded tree stumps for the weather data. 108
Figure 4.4 Decision tree for the weather data. 109
Figure 4.5 Tree stump for the ID code attribute. 111
Figure 4.6 Covering algorithm: (A) covering the instances; (B) decision tree for the same problem. 113
Figure 4.7 The instance space during operation of a covering algorithm. 115
Figure 4.8 Pseudocode for a basic rule learner. 118
Figure 4.9 (A) Finding all item sets with sufficient coverage; (B) finding all sufficiently accurate association rules for a k-item set. 127
Figure 4.10 Logistic regression: (A) the logit transform; (B) example logistic regression function. 130
Figure 4.11 The perceptron: (A) learning rule; (B) representation as a neural network. 132
Figure 4.12 The Winnow algorithm: (A) unbalanced version; (B) balanced version. 134
Figure 4.13 A kD-tree for four training instances: (A) the tree; (B) instances and splits. 137
Figure 4.14 Using a kD-tree to find the nearest neighbor of the star. 137
Figure 4.15 Ball tree for 16 training instances: (A) instances and balls; (B) the tree. 139
Figure 4.16 Ruling out an entire ball (gray) based on a target point (star) and its current nearest neighbor. 140
Figure 4.17 Iterative distance-based clustering. 143
Figure 4.18 A ball tree: (A) two cluster centers and their dividing line; (B) corresponding tree. 145
Figure 4.19 Hierarchical clustering displays. 149
Figure 4.20 Clustering the weather data. 151
Figure 4.21 Hierarchical clusterings of the iris data. 153
Figure 5.1 A hypothetical lift chart. 185
Figure 5.2 Analyzing the expected benefit of a mailing campaign when the cost of mailing is (A) $0.50 and (B) $0.80. 187
Figure 5.3 A sample ROC curve. 188
Figure 5.4 ROC curves for two learning schemes. 189
Figure 5.5 Effect of varying the probability threshold: (A) error curve; (B) cost curve. 193
Figure 6.1 Example of subtree raising, where node C is “raised” to subsume node B. 214
Figure 6.2 Pruning the labor negotiations decision tree. 216
Figure 6.3 Algorithm for forming rules by incremental reduced-error pruning. 226
Figure 6.4 RIPPER: (A) algorithm for rule learning; (B) meaning of symbols. 228
Figure 6.5 Algorithm for expanding examples into a partial tree. 229
Figure 6.6 Example of building a partial tree. 230
Figure 6.7 Rules with exceptions for the iris data. 232
Figure 6.8 Extended prefix trees for the weather data: (A) the full data; (B) the data conditional on temperature = mild; (C) the data conditional on humidity = normal. 238
Figure 7.1 A boundary between two rectangular classes. 249
Figure 7.2 A maximum margin hyperplane. 253
Figure 7.3 Support vector regression: (A) ε = 1; (B) ε = 2; (C) ε = 0.5. 257
Figure 7.4 Example data sets and corresponding perceptrons. 262
Figure 7.5 Step vs sigmoid: (A) step function; (B) sigmoid function. 264
Figure 7.6 Gradient descent using the error function w² + 1. 265
Figure 7.7 Multilayer perceptron with a hidden layer (omitting bias inputs). 267
Figure 7.8 Hinge, squared, and 0–1 loss functions. 271
Figure 7.9 Pseudocode for model tree induction. 278
Figure 7.10 Model tree for a data set with nominal attributes. 279
Figure 8.1 Attribute space for the weather dataset. 292
Figure 8.2 Discretizing the temperature attribute using the entropy method. 299
Figure 8.3 The result of discretizing the temperature attribute. 299
Figure 8.4 Class distribution for a two-class, two-attribute problem. 302
Figure 8.5 Principal component transform of a dataset: (A) variance of each component; (B) variance plot. 306
Figure 8.6 Comparing principal component analysis and Fisher’s linear discriminant analysis. 312
Figure 8.7 Number of international phone calls from Belgium, 1950–1973. 318
Figure 8.8 Overoptimistic probability estimation for a two-class problem. 329
Figure 9.1 A simple Bayesian network for the weather data. 341
Figure 9.2 Another Bayesian network for the weather data. 342
Figure 9.3 The Markov blanket for variable x6 in a 10-variable Bayesian network. 348
Figure 9.4 The weather data: (A) reduced version; (B) corresponding AD tree. 350
Figure 9.5 A two-class mixture model. 354
Figure 9.6 DensiTree showing possible hierarchical clusterings of a given data set. 360
Figure 9.7 Probability contours for three types of model, all based on Gaussians. 362
Figure 9.8 (A) Bayesian network for a mixture model; (B) multiple copies of the Bayesian network, one for each observation; (C) plate notation version of (B). 371
Figure 9.9 (A) Bayesian network for probabilistic PCA; (B) equal-probability contour for a Gaussian distribution along with its covariance matrix’s principal eigenvector. 372
Figure 9.10 The singular value decomposition of a t by d matrix. 377
Figure 9.11 Graphical models for (A) pLSA, (B) LDA, and (C) smoothed LDA. 379
Figure 9.12 (A) Bayesian network and (B) corresponding factor graph. 382
Figure 9.13 The Markov blanket for variable x6 in a 10-variable factor graph. 383
Figure 9.14 (A) and (B) Bayesian network and corresponding factor graph; (C) and (D) Naive Bayes model and corresponding factor graph. 384
Figure 9.15 (A) Bayesian network representing the joint distribution of y and its parents; (B) factor graph for a logistic regression for the conditional distribution of y given its parents. 384
Figure 9.16 (A) Undirected graph representing a Markov random field structure; (B) corresponding factor graph. 385
Figure 9.17 Message sequence in an example factor graph. 389
Figure 9.18 (A) and (B) First- and second-order Markov models for a sequence of variables; (C) hidden Markov model; (D) Markov random field. 404
Figure 9.19 Mining emails for meeting details. 406
Figure 9.20 (A) Dynamic Bayesian network representation of a hidden Markov model; (B) similarly structured Markov random field; (C) factor graph for (A); and (D) factor graph for a linear chain conditional random field. 407
Figure 10.1 A feedforward neural network. 424
Figure 10.2 Computation graph showing forward propagation in a deep network. 426
Figure 10.3 Backpropagation in a deep network (the forward computation is shown with gray arrows). 429
Figure 10.4 Parameter updates that follow the forward and backward propagation steps (shown with gray arrows). 430
Figure 10.5 Typical learning curves for the training and validation sets. 431
Figure 10.6 Pseudocode for mini-batch based stochastic gradient descent. 435
Figure 10.7 Typical convolutional neural network architecture. 439
Figure 10.8 Original image; filtered with the two Sobel operators; magnitude of the result. 441
Figure 10.9 Examples of what random neurons detect in different layers of a convolutional neural network using the visualization approach of Zeiler and Fergus (2013). Underlying imagery kindly provided by Matthew Zeiler. 442
Figure 10.10 Example of the convolution, pooling, and decimation operations used in convolutional neural networks. 443
Figure 10.11 A simple autoencoder. 445
Figure 10.12 A deep autoencoder with multiple layers of transformation. 447
Figure 10.13 Low-dimensional principal component space (left) compared with one learned by a deep autoencoder (right). 447
Figure 10.14 Boltzmann machines: (A) fully connected; (B) restricted; (C) more general form of (B). 450
Figure 10.15 (A) Deep Boltzmann machine and (B) deep belief network. 453
Figure 10.16 (A) Feedforward network transformed into a recurrent network; (B) hidden Markov model; and (C) recurrent network obtained by unwrapping (A). 456
Figure 10.17 Structure of a “long short-term memory” unit. 459
Figure 10.18 Recurrent neural networks: (A) bidirectional, (B) encoder-decoder. 460
Figure 10.19 A deep encoder-decoder recurrent network. 460
Figure 12.1 Algorithm for bagging. 483
Figure 12.2 Algorithm for boosting. 488
Figure 12.3 Algorithm for additive logistic regression. 493
Figure 12.4 Simple option tree for the weather data. 494
Figure 12.5 Alternating decision tree for the weather data. 495
Figure 13.1 A tangled web. 521
List of Tables
Table 1.1 The Contact Lens Data 7
Table 1.2 The Weather Data 11
Table 1.3 Weather Data With Some Numeric Attributes 12
Table 1.4 The Iris Data 15
Table 1.5 The CPU Performance Data 16
Table 1.6 The Labor Negotiations Data 17
Table 1.7 The Soybean Data 20
Table 2.1 Iris Data as a Clustering Problem 46
Table 2.2 Weather Data With a Numeric Class 47
Table 2.3 Family Tree 48
Table 2.4 The Sister-of Relation 49
Table 2.5 Another Relation 52
Table 3.1 A New Iris Flower 80
Table 3.2 Training Data for the Shapes Problem 83
Table 4.1 Evaluating the Attributes in the Weather Data 94
Table 4.2 The Weather Data, With Counts and Probabilities 97
Table 4.3 A New Day 98
Table 4.4 The Numeric Weather Data With Summary Statistics 101
Table 4.5 Another New Day 102
Table 4.6 The Weather Data with Identification Codes 111
Table 4.7 Gain Ratio Calculations for the Tree Stumps of Fig. 4.2 112
Table 4.8 Part of the Contact Lens Data for which Astigmatism = Yes 116
Table 4.9 Part of the Contact Lens Data for Which Astigmatism = Yes and Tear Production Rate = Normal 117
Table 4.10 Item Sets for the Weather Data With Coverage 2 or Greater 121
Table 4.11 Association Rules for the Weather Data 123
Table 5.1 Confidence Limits for the Normal Distribution 166
Table 5.2 Confidence Limits for Student’s Distribution With 9 Degrees of Freedom 174
Table 5.3 Different Outcomes of a Two-Class Prediction 180
Table 5.4 Different Outcomes of a Three-Class Prediction: (A) Actual; (B) Expected 181
Table 5.5 Default Cost Matrixes: (A) Two-Class Case; (B) Three-Class Case 182
Table 5.6 Data for a Lift Chart 184
Table 5.7 Different Measures Used to Evaluate the False Positive Versus False Negative Tradeoff 191
Table 5.8 Performance Measures for Numeric Prediction 195
Table 5.9 Performance Measures for Four Numeric Prediction Models 197
Table 6.1 Preparing the Weather Data for Insertion Into an FP-tree: (A) The Original Data; (B) Frequency Ordering of Items With Frequent Item Sets in Bold; (C) The Data With Each Instance Sorted Into Frequency Order; (D) The Two Multiple-Item Frequent Item Sets 236
Table 7.1 Linear Models in the Model Tree 280
Table 8.1 The First Five Instances From the CPU Performance Data: (A) Original Values; (B) The First Partial Least Squares Direction; (C) Residuals From the First Direction 308
Table 8.2 Transforming a Multiclass Problem Into a Two-Class One: (A) Standard Method; (B) Error-Correcting Code 324
Table 8.3 A Nested Dichotomy in the Form of a Code Matrix 327
Table 9.1 Highest Probability Words and User Tags From a Sample of Topics Extracted From a Collection of Scientific Articles 381
Table 9.2 Link Functions, Mean Functions, and Distributions Used in Generalized Linear Models 401
Table 10.1 Summary of Performance on the MNIST Evaluation 421
Table 10.2 Loss Functions, Corresponding Distributions, and Activation Functions 423
Table 10.3 Activation Functions and Their Derivatives 425
Table 10.4 Convolutional Neural Network Performance on the ImageNet Challenge 439
Table 10.5 Components of a “Long Short-Term Memory” Recurrent Neural Network 459
Table 13.1 The Top 10 Algorithms in Data Mining, According to a 2006 Poll 504