If you need to analyze and understand data, this book and the Weka toolkit are an excellent place to start. It is both an essential textbook for newcomers and a resource from which experts like me can benefit.
—— Jim Gray, winner of the 1998 Turing Award
This book is a classic, best-selling textbook on data mining and machine learning, adopted by many leading universities worldwide. The fourth edition fully reflects the latest technical developments in the field, including important new chapters on probabilistic methods and deep learning. In addition, the popular machine learning software Weka has been upgraded once again: readers can carry out data mining tasks in a friendly interactive interface and deepen their understanding of the algorithms through intuitive results.
While keeping pace with cutting-edge techniques, the fourth edition retains the style and strengths of its predecessors: clear, detailed coverage of the fundamentals and concrete, practical guidance on tools and techniques. From preparing input, interpreting output, and evaluating results to the core algorithms of data mining, everything is presented concisely and elegantly.
About the Authors
Ian H. Witten is a professor in the Department of Computer Science at the University of Waikato, New Zealand, a Fellow of the ACM and of the Royal Society of New Zealand, and the recipient of the 2004 Namur Award of the International Federation for Information Processing (IFIP).
Eibe Frank is an associate professor in the Department of Computer Science at the University of Waikato, New Zealand. For the success of the Weka software he received, together with Witten and Hall, the 2005 ACM SIGKDD Service Award.
Mark A. Hall is an honorary research associate at the University of Waikato, New Zealand, and a core developer of the Weka software.
Christopher J. Pal is an associate professor at Polytechnique Montréal, whose research interests include artificial intelligence, computer vision, and pattern recognition.
The convergence of computing and communication has produced a society that feeds on information. Yet most of the information is in its raw form: data. If data is characterized as recorded facts, then information is the set of patterns, or expectations, that underlie the data. There is a huge amount of information locked up in databases—information that is potentially important but has not yet been discovered or articulated. Our mission is to bring it forth.
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Of course, there will be problems. Many patterns will be banal and uninteresting. Others will be spurious, contingent on accidental coincidences in the particular dataset used. And real data is imperfect: some parts will be garbled, some missing. Anything that is discovered will be inexact: there will be exceptions to every rule and cases not covered by any rule. Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful.
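To make this concrete, here is a minimal sketch (not code from the book) of the kind of pattern-seeking just described: a stripped-down version of the simple 1R scheme covered in Chapter 4, Algorithms: the basic methods. For one attribute, it assigns each attribute value its majority class and counts how many training instances the resulting rule gets wrong; the (outlook, play) pairs below are an illustrative fragment modeled on the book's toy weather data.

```python
from collections import Counter, defaultdict

def one_r(pairs):
    """Build a one-attribute rule: map each attribute value to its
    majority class, and count the training instances it misclassifies."""
    by_value = defaultdict(Counter)
    for value, cls in pairs:
        by_value[value][cls] += 1
    rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
    errors = sum(sum(counts.values()) - counts[rule[v]]
                 for v, counts in by_value.items())
    return rule, errors

# Illustrative (outlook, play) pairs in the style of the weather dataset
weather = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
           ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
           ("overcast", "yes"), ("sunny", "no"), ("sunny", "yes"),
           ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
           ("overcast", "yes"), ("rainy", "no")]

rule, errors = one_r(weather)
print(rule, errors)  # -> {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'} 4
```

The rule is inexact, as the text warns: it misclassifies 4 of the 14 instances, but it is still a useful regularity extracted automatically from the data.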
Machine learning provides the technical basis of data mining. It is used to extract information from the raw data in databases—information that, ideally, is expressed in a comprehensible form and can be used for a variety of purposes. The process is one of abstraction: taking the data, warts and all, and inferring whatever structure underlies it. This book is about the tools and techniques of machine learning that are used in practical data mining for finding, and if possible describing, structural patterns in data.
As with any burgeoning new technology that enjoys intense commercial attention, the use of machine learning is surrounded by a great deal of hype in the technical—and sometimes the popular—press. Exaggerated reports appear of the secrets that can be uncovered by setting learning algorithms loose on oceans of data. But there is no magic in machine learning, no hidden power, no alchemy. Instead there is an identifiable body of simple and practical techniques that can often extract useful information from raw data. This book describes these techniques and shows how they work.
In many applications machine learning enables the acquisition of structural descriptions from examples. The kind of descriptions that are found can be used for prediction, explanation, and understanding. Some data mining applications focus on prediction: forecasting what will happen in new situations from data that describe what happened in the past, often by guessing the classification of new examples. But we are equally—perhaps more—interested in applications where the result of “learning” is an actual description of a structure that can be used to classify examples. This structural description supports explanation and understanding as well as prediction. In our experience, insights gained by the user are of most interest in the majority of practical data mining applications; indeed, this is one of machine learning’s major advantages over classical statistical modeling.
The book explains a wide variety of machine learning methods. Some are pedagogically motivated: simple schemes that are designed to explain clearly how the basic ideas work. Others are practical: real systems that are used in applications today. Many are contemporary and have been developed only in the last few years.
A comprehensive software resource has been created to illustrate the ideas in the book. Called the Waikato Environment for Knowledge Analysis, or WEKA for short, it is available as Java source code at www.cs.waikato.ac.nz/ml/weka. It is a full, industrial-strength implementation of most of the techniques that are covered in this book. It includes illustrative code and working implementations of machine learning methods. It offers clean, spare implementations of the simplest techniques, designed to aid understanding of the mechanisms involved. It also provides a workbench that includes full, working, state-of-the-art implementations of many popular learning schemes that can be used for practical data mining or for research. Finally, it contains a framework, in the form of a Java class library, that supports applications that use embedded machine learning and even the implementation of new learning schemes.
The objective of this book is to introduce the tools and techniques for machine learning that are used in data mining. After reading it, you will understand what these techniques are and appreciate their strengths and applicability. If you wish to experiment with your own data, you will be able to do this easily with the WEKA software. But WEKA is by no means the only choice. For example, the freely available statistical computing environment R includes many machine learning algorithms. Devotees of the Python programming language might look at a popular library called scikit-learn. Modern “big data” frameworks for distributed computing, such as Apache Spark, include support for machine learning. There is a plethora of options for deploying machine learning in practice. This book discusses fundamental learning algorithms without delving into software-specific implementation details. When appropriate, we point out where the algorithms we discuss can be found in the WEKA software. We also briefly introduce other machine learning software for so-called “deep learning” from high-dimensional data. However, most software-specific information is relegated to appendices.
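As a brief illustration of one of these alternatives (this is a sketch of scikit-learn's API, not of WEKA, and it assumes scikit-learn is installed), the following trains a decision tree classifier on the classic iris dataset and estimates its accuracy on held-out data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the classic iris dataset, a small numeric dataset also
# used as an introductory example in this book
X, y = load_iris(return_X_y=True)

# Hold out a test set so the accuracy estimate reflects
# generalization to unseen data, not memorization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

The same train/evaluate workflow applies whatever software you choose; Chapter 5, Credibility: evaluating what's been learned, explains why holding out test data matters.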
The book spans the gulf between the intensely practical approach taken by trade books that provide case studies on data mining and the more theoretical, principle-driven exposition found in current textbooks on machine learning. (A brief description of these books appears in the Further reading section at the end of Chapter 1, What’s it all about?) This gulf is rather wide. To apply machine learning techniques productively, you need to understand something about how they work; this is not a technology that you can apply blindly and expect to get good results. Different problems yield to different techniques, but it is rarely obvious which techniques are suitable for a given situation: you need to know something about the range of possible solutions. And we cover an extremely wide range of techniques. We can do this because, unlike many trade books, this volume does not promote any particular commercial software or approach. We include a large number of examples, but they use illustrative datasets that are small enough to allow you to follow what is going on. Real datasets are far too large to show this (and in any case are usually company confidential). Our datasets are chosen not to illustrate actual large-scale practical problems, but to help you understand what the different techniques do, how they work, and what their range of application is.
Found only on the islands of New Zealand, the weka (pronounced to rhyme with “Mecca”) is a flightless bird with an inquisitive nature.
The book is aimed at the technically aware general reader who is interested in the principles and ideas underlying the current practice of machine learning. It will also be of interest to information professionals who need to become acquainted with this new technology, and to all those who wish to gain a detailed technical understanding of what machine learning involves. It is written for an eclectic audience of information systems practitioners, programmers, consultants, developers, data scientists, information technology managers, specification writers, patent examiners, curious lay people—as well as students and professors—who need an easy-to-read book with lots of illustrations that describes what the major machine learning techniques are, what they do, how they are used, and how they work. It is practically oriented, with a strong “how to” flavor, and includes algorithms, and often pseudo-code. All those involved in practical data mining will benefit directly from the techniques described. The book is aimed at people who want to cut through to the reality that underlies the hype about machine learning and who seek a practical, nonacademic, unpretentious approach. In most of the book we have avoided requiring any specific theoretical or mathematical knowledge. However, recognizing the growing complexity of the subject as it matures, we have included substantial theoretical material in Chapter 9, Probabilistic methods, and Chapter 10, Deep learning, because this is necessary for a full appreciation of recent practical techniques, in particular deep learning.
The book is organized in layers that make the ideas accessible to readers who are interested in grasping the basics, as well as to those who would like more depth of treatment, along with full details on the techniques covered. We believe that consumers of machine learning need to have some idea of how the algorithms they use work. It is often observed that data models are only as good as the person who interprets them, and that person needs to know something about how the models are produced to appreciate the strengths, and limitations, of the technology. However, it is not necessary for all users to have a deep understanding of the finer details of the algorithms.
We address this situation by describing machine learning methods at successive levels of detail. The book is divided into two parts. Part I is an introduction to machine learning for data mining. The reader will learn the basic ideas, the topmost level, by reading the first three chapters. Chapter 1, What’s it all about?, describes, through examples, what machine learning is and where it can be used; it also provides actual practical applications. Chapter 2, Input: concepts, instances, attributes, and Chapter 3, Output: knowledge representation, cover the different kinds of input and output—or knowledge representation—that are involved. Different kinds of output dictate different styles of algorithm, and Chapter 4, Algorithms: the basic methods, describes the basic methods of machine learning, simplified to make them easy to comprehend. Here the principles involved are conveyed in a variety of algorithms without getting involved in intricate details or tricky implementation issues. To make progress in the application of machine learning techniques to particular data mining problems, it is essential to be able to measure how well you are doing. Chapter 5, Credibility: evaluating what’s been learned, which can be read out of sequence, equips the reader to evaluate the results that are obtained from machine learning, addressing the sometimes complex issues involved in performance evaluation.
Part II introduces advanced techniques of machine learning for data mining. At the lowest and most detailed level, Chapter 6, Trees and rules, and Chapter 7, Extending instance-based and linear models, expose in naked detail the nitty-gritty issues of implementing a spectrum of machine learning algorithms, including the complexities that are necessary for them to work well in practice (but omitting the heavy mathematical machinery that is required for a few of the algorithms). Although many readers may want to ignore such detailed information, it is at this level that full working implementations of machine learning schemes are written. Chapter 8, Data transformations, describes practical topics involved with engineering the input and output to machine learning—e.g., selecting and discretizing attributes. Chapter 9, Probabilistic methods, and Chapter 10, Deep learning, provide a rigorous account of probabilistic methods for machine learning and deep learning respectively. Chapter 11, Beyond supervised and unsupervised learning, looks at semisupervised and multi-instance learning, while Chapter 12, Ensemble learning, covers techniques of “ensemble learning,” which combine the output from different learning techniques. Chapter 13, Moving on: applications and beyond, looks to the future.
The book describes most methods used in practical machine learning. However, it does not cover reinforcement learning because it is rarely applied in practical data mining; nor genetic algorithm approaches because these are really just optimization techniques that are not specific to machine learning; nor relational learning and inductive logic programming because they are not very commonly used in mainstream data mining applications.
An Appendix covers some mathematical background needed to follow the material in Chapter 9, Probabilistic methods, and Chapter 10, Deep learning. Another Appendix introduces the WEKA data mining workbench, which provides implementations of most of the ideas described in Parts I and II. We have done this in order to clearly separate conceptual material from the practical aspects of how to use it. At the end of each chapter in Parts I and II are pointers to related WEKA algorithms. You can ignore these, or look at them as you go along, or skip directly to the WEKA material if you are in a hurry to get on with analyzing your data and don’t want to be bothered with the technical details of how the algorithms work.
UPDATED AND REVISED CONTENT
We finished writing the first edition of this book in 1999, the second and third in 2005 and 2011 respectively, and now, in May 2016, are just polishing this fourth edition. How things have changed over the past couple of decades! While the basic core of material remains the same, we have made the most of opportunities to update it and add new material, and as a result the book has doubled in size to reflect the changes that have taken place. Of course, there have also been errors to fix, errors that we had accumulated in our publicly available errata file (available through the book’s home page at http://www.cs.waikato.ac.nz/ml/weka/book.html).
SECOND EDITION
The major change in the second edition of the book was a separate part at the end of the book that included all the material on the WEKA machine learning workbench. This allowed the main part of the book to stand alone, independent of the workbench. At that time WEKA, a widely used and popular feature of the first edition, had just acquired a radical new look in the form of an interactive graphical user interface—or rather, three separate interactive interfaces—which made it far easier to use. The primary one is the “Explorer,” which gives access to all of WEKA’s facilities using menu selection and form filling. The others are the Knowledge Flow interface, which allows you to design configurations for streamed data processing, and the Experimenter, with which you set up automated experiments that run selected machine learning algorithms with different parameter settings on a corpus of datasets, collect performance statistics, and perform significance tests on the results. These interfaces lower the bar for becoming a practitioner of machine learning, and the second edition included a full description of how to use them.
It also contained much new material that we briefly mention here. We extended the sections on rule learning and cost-sensitive evaluation. Bowing to popular demand, we added information on neural networks: the perceptron and the closely related Winnow algorithm; the multilayer perceptron and backpropagation algorithm. Logistic regression was also included. We described how to implement nonlinear decision boundaries using both the kernel perceptron and radial basis function networks, and also included support vector machines for regression. We incorporated a new section on Bayesian networks, again in response to readers’ requests and WEKA’s new capabilities in this regard, with a description of how to learn classifiers based on these networks, and how to implement them efficiently using AD trees.
The previous 5 years (1999–2004) had seen great interest in data mining for text, and this was reflected in the introduction of string attributes in WEKA, multinomial Bayes for document classification, and text transformations. We also described efficient data structures for searching the instance space: kD-trees and ball trees for finding nearest neighbors efficiently, and for accelerating distance-based clustering. We described new attribute selection schemes such as race search and the use of support vector machines; new methods for combining models such as additive regression, additive logistic regression, logistic model trees, and option trees. We also covered recent developments in using unlabeled data to improve classification, including the cotraining and co-EM methods.
THIRD EDITION
For the third edition, we thoroughly edited the second edition and brought it up to date, including a great many new methods and algorithms. WEKA and the book were closely linked together—pretty well everything in WEKA was covered in the book. We also included far more references to the literature, practically tripling the number of references that were in the first edition.
As well as becoming far easier to use, WEKA had grown beyond recognition over the previous decade, and matured enormously in its data mining capabilities. It incorporates an unparalleled range of machine learning algorithms and related techniques. The growth has been partly stimulated by recent developments in the field, and is partly user-led and demand-driven. This puts us in a position where we know a lot about what actual users of data mining want, and we have capitalized on this experience when deciding what to include in this book.
Here are a few of the highlights of the material that was added in the third edition. A section on web mining was included, and, under ethics, a discussion of how individuals can often be “reidentified” from supposedly anonymized data. Other additions included techniques for multi-instance learning, new material on interactive cost-benefit analysis, cost-complexity pruning, advanced association rule algorithms that use extended prefix trees to store a compressed version of the dataset in main memory, kernel ridge regression, stochastic gradient descent, and hierarchical clustering methods. We added new data transformations: partial least squares regression, reservoir sampling, one-class learning, decomposing multiclass classification problems into ensembles of nested dichotomies, and calibrating class probabilities. We added new information on ensemble learning techniques: randomization vs. bagging, and rotation forests. A new section on data stream learning was added as well.
FOURTH EDITION
One of the main drivers behind this fourth edition was a desire to add comprehensive material on the topic of deep learning, a new development that is essentially enabled by the emergence of truly vast data resources in domains like image and speech processing, and the availability of truly vast computational resources, including server farms and graphics processing units. However, deep learning techniques are heavily based on a potent mix of theory and practice. And we had also received other requests asking us to include more, and more rigorous, theoretical material.
This forced us to rethink the role of theory in the book. We bit the bullet and added two new theoretically oriented chapters. Chapter 10, Deep learning, covers deep learning itself, and its predecessor, Chapter 9, Probabilistic methods, gives a principled theoretical development of probabilistic methods that is necessary to understand a host of other new algorithms. We recognize that many of our readers will not want to stomach all this theory, and we assure them that the remainder of the book has intentionally been left at a far simpler mathematical level. But this additional theoretical base puts some key material in the hands of readers who aspire to understand rapidly advancing techniques from the research world.
Developments in WEKA have proceeded apace. It now provides ways of reaching out and incorporating other languages and systems, such as the popular R statistical computing language, the Spark and Hadoop frameworks for distributed computing, the Python and Groovy languages for scripting, and the MOA system for stream-oriented learning—to name but a few. Recognizing that it is not possible, and perhaps not desirable, to document such a comprehensive and fast-evolving system in a printed book, we have created a series of open online courses, Data Mining with Weka, More Data Mining with Weka, and Advanced Data Mining with Weka, to accompany the book (at https://weka.waikato.ac.nz).
The fourth edition contains numerous other updates and additions, and far more references to the literature. But enough of this: dive in and see for yourself.
ACKNOWLEDGMENTS
Writing the acknowledgments is always the nicest part! A lot of people have helped us, and we relish this opportunity to thank them. This book has arisen out of the machine learning research project in the Computer Science Department at the University of Waikato, New Zealand. We have received generous encouragement and assistance from the academic staff members early on in that project: John Cleary, Sally Jo Cunningham, Matt Humphrey, Lyn Hunt, Bob McQueen, Lloyd Smith, and Tony Smith. We also benefited greatly from interactions with staff members who arrived later: Michael Mayo and Robert Durrant. Special thanks go to Geoff Holmes, who led the project for many years, and Bernhard Pfahringer, who had significant input into many different aspects of the WEKA software. All who have worked on the machine learning project here have contributed to our thinking: we would particularly like to mention early students Steve Garner, Stuart Inglis and Craig Nevill-Manning for helping us to get the project off the ground in the beginning when success was less certain and things were more difficult.
The WEKA system that illustrates the ideas in this book forms a crucial component of it. It was conceived by the authors and designed and implemented principally by Eibe Frank, Mark Hall, Peter Reutemann, and Len Trigg, but many people in the machine learning laboratory at Waikato made significant contributions. Since the first edition of the book the WEKA team has expanded considerably: so many people have contributed that it is impossible to acknowledge everyone properly. We are grateful to Chris Beckham, for contributing several packages to WEKA, Remco Bouckaert for his Bayes net package and many other contributions, Lin Dong for her implementations of multi-instance learning methods, Dale Fletcher for many database-related aspects, Kurt Driessens for his implementation of Gaussian process regression, James Foulds for his work on multi-instance filtering, Anna Huang for information bottleneck clustering, Martin Gütlein for his work on feature selection, Kathryn Hempstalk for her one-class classifier, Ashraf Kibriya and Richard Kirkby for contributions far too numerous to list, Nikhil Kishore for his implementation of elastic net regression, Niels Landwehr for logistic model trees, Chi-Chung Lau for creating all the icons for the Knowledge Flow interface, Abdelaziz Mahoui for the implementation of K*, Jonathan Miles for his implementation of kernel filtering, Stefan Mutter for association rule mining, Malcolm Ware for numerous miscellaneous contributions, Haijian Shi for his implementations of tree learners, Marc Sumner for his work on speeding up logistic model trees, Tony Voyle for least-median-of-squares regression, Yong Wang for Pace regression and the original implementation of M5′, Benjamin Weber for his great unification of WEKA parsing modules, and Xin Xu for his multi-instance learning package, JRip, logistic regression and many other contributions.
Our sincere thanks go to all these people for their dedicated work, and also to the many contributors to WEKA from outside our group at Waikato.
Tucked away as we are in a remote (but very pretty) corner of the southern hemisphere, we greatly appreciate the visitors to our department who play a crucial role in acting as sounding boards and helping us to develop our thinking. We would like to mention in particular Rob Holte, Carl Gutwin, and Russell Beale, each of whom visited us for several months; David Aha, who although he only came for a few days did so at an early and fragile stage of the project and performed a great service by his enthusiasm and encouragement; and Kai Ming Ting, who worked with us for 2 years on many of the topics in this book, and helped to bring us into the mainstream of machine learning. More recent visitors included Arie Ben-David, Carla Brodley, Gregory Butler, Stefan Kramer, Johannes Schneider, Jan van Rijn and Michalis Vlachos, and many others who have given talks at our department. We would particularly like to thank Albert Bifet, who gave us detailed feedback on a draft version of the third edition—most of which we have incorporated.
Students at Waikato have played a significant role in the development of the project. Many of them are in the above list of WEKA contributors, but they have also contributed in other ways. In the early days, Jamie Littin worked on ripple-down rules and relational learning. Brent Martin explored instance-based learning and nested instance-based representations. Murray Fife slaved over relational learning; Nadeeka Madapathage investigated the use of functional languages for expressing machine learning algorithms. Kathryn Hempstalk worked on one-class learning and Richard Kirkby on data streams. Gabi Schmidberger worked on density estimation trees, Lan Huang on concept-based text clustering, and Alyona Medelyan on keyphrase extraction. More recently, Felipe Bravo has worked on sentiment classification for Twitter, Mi Li on fast clustering methods, and Tim Leathart on ensembles of nested dichotomies. Other graduate students have influenced us in numerous ways, particularly Gordon Paynter, YingYing Wen, and Zane Bray, who have worked with us on text mining, and Quan Sun and Xiaofeng Yu. Colleagues Steve Jones and Malika Mahoui have also made far-reaching contributions to these and other machine learning projects. We have also learned much from our many visiting students from Freiburg, including Nils Weidmann.
Ian Witten would like to acknowledge the formative role of his former students at Calgary, particularly Brent Krawchuk, Dave Maulsby, Thong Phan, and Tanja Mitrovic, all of whom helped him develop his early ideas in machine learning, as did faculty members Bruce MacDonald, Brian Gaines, and David Hill at Calgary, and John Andreae at the University of Canterbury.
Eibe Frank is indebted to his former supervisor at the University of Karlsruhe, Klaus-Peter Huber, who infected him with the fascination of machines that learn. On his travels Eibe has benefited from interactions with Peter Turney, Joel Martin, and Berry de Bruijn in Canada; Luc de Raedt, Christoph Helma, Kristian Kersting, Stefan Kramer, Ulrich Rückert, and Ashwin Srinivasan in Germany.
Mark Hall thanks his former supervisor Lloyd Smith, now at Missouri State University, who exhibited the patience of Job when his thesis drifted from its original topic into the realms of machine learning. The many and varied people who have been part of, or have visited, the machine learning group at the University of Waikato over the years deserve a special thanks for their valuable insights and stimulating discussions.
Chris Pal thanks his coauthors for the invitation to help write this fourth edition; and his family for accommodating the extra time spent writing. He thanks Polytechnique Montréal for the sabbatical that allowed him to travel to New Zealand, leading to this fruitful collaboration; and the University of Waikato Department of Computer Science for hosting him. He also thanks his many mentors, academic family, coauthors, and colleagues over the years, whose perspectives have been enriching and have influenced the presentation here, including: Brendan Frey, Geoff Hinton, Yoshua Bengio, Sam Roweis, Andrew McCallum, and Charles Sutton among many others. Particular thanks goes to Hugo Larochelle for his pedagogical perspectives on deep learning. Chris also thanks his friends and colleagues at the Montréal Institute for Learning Algorithms, the Theano development team and all his current and former students for helping to create such a great environment for machine learning research. Particular thanks goes to Chris Beckham who provided excellent feedback on earlier drafts of the new chapters in this edition.
Charlie Kent and Tim Pitts of Morgan Kaufmann have worked hard to shape this book, and Nicky Carter, our production editor, has made the process go very smoothly. We would like to thank the librarians of the Repository of Machine Learning Databases at the University of California, Irvine, whose carefully collected datasets have been invaluable in our research.
Our research has been funded by the New Zealand Foundation for Research, Science and Technology and the Royal Society of New Zealand Marsden Fund. The Department of Computer Science at the University of Waikato has generously supported us in all sorts of ways, and we owe a particular debt of gratitude to Mark Apperley for his enlightened leadership and warm encouragement. Part of the first edition was written while both authors were visiting the University of Calgary, Canada, and the support of the Computer Science department there is gratefully acknowledged—as well as the positive and helpful attitude of the long-suffering students in the machine learning course, on whom we experimented. Part of the second edition was written at the University of Lethbridge in Southern Alberta on a visit supported by Canada’s Informatics Circle of Research Excellence.
Last, and most of all, we are grateful to our families and partners. Pam, Anna and Nikki were all too well aware of the implications of having an author in the house (“not again!”) but let Ian go ahead and write the book anyway. Julie was always supportive, even when Eibe had to burn the midnight oil in the machine learning lab, and Immo and Ollig provided exciting diversions. Bernadette too was very supportive, somehow managing to keep the combined noise output of Charlotte, Luke, Zach, Kyle, and Francesca to a level that allowed Mark to concentrate. Between us we hail from Canada, England, Germany, Ireland, New Zealand, and Samoa: New Zealand has brought us together and provided an ideal, even idyllic, place to do this work.
Computers / Data Mining
By Ian H. Witten, Eibe Frank, Mark A. Hall (New Zealand) and Christopher J. Pal (Canada)
Preface ............................................................ iv
PART I INTRODUCTION TO DATA MINING
CHAPTER 1 What’s it all about? ..................................................... 3
1.1 Data Mining and Machine Learning..............................................4
Describing Structural Patterns....................................................... 6
Machine Learning.......................................................................... 7
Data Mining ................................................................................... 9
1.2 Simple Examples: The Weather Problem and Others...................9
The Weather Problem.................................................................. 10
Contact Lenses: An Idealized Problem....................................... 12
Irises: A Classic Numeric Dataset .............................................. 14
CPU Performance: Introducing Numeric Prediction .................. 16
Labor Negotiations: A More Realistic Example......................... 16
Soybean Classification: A Classic Machine Learning Success ............................................ 19
1.3 Fielded Applications ....................................................................21
Web Mining ................................................................................. 21
Decisions Involving Judgment .................................................... 22
Screening Images......................................................................... 23
Load Forecasting ......................................................................... 24
Diagnosis...................................................................................... 25
Marketing and Sales .................................................................... 26
Other Applications....................................................................... 27
1.4 The Data Mining Process.............................................................28
1.5 Machine Learning and Statistics..................................................30
1.6 Generalization as Search..............................................................31
Enumerating the Concept Space ................................................. 32
Bias ............................................................................... 33
1.7 Data Mining and Ethics...............................................................35
Reidentification............................................................................ 36
Using Personal Information......................................................... 37
Wider Issues................................................................................. 38
1.8 Further Reading and Bibliographic Notes...................................38
CHAPTER 2 Input: concepts, instances, attributes......................... 43
2.1 What’s a Concept ......................................................................44
2.2 What’s in an Example .................................................................46
Relations ................................................................... 47
Other Example Types ......................................................... 51
2.3 What’s in an Attribute ...............................................................53
2.4 Preparing the Input...................................................................56
Gathering the Data Together ................................................. 56
ARFF Format ................................................................. 57
Sparse Data ................................................................. 60
Attribute Types ............................................................. 61
Missing Values .............................................................. 62
Inaccurate Values ........................................................... 63
Unbalanced Data ............................................................. 64
Getting to Know Your Data ................................................... 65
2.5 Further Reading and Bibliographic Notes ...................................65
CHAPTER 3 Output: knowledge representation ............................... 67
3.1 Tables ................................................................................68
3.2 Linear Models .........................................................................68
3.3 Trees .................................................................................70
3.4 Rules .................................................................................75
Classification Rules ........................................................ 75
Association Rules ........................................................... 79
Rules With Exceptions ....................................................... 80
More Expressive Rules ....................................................... 82
3.5 Instance-Based Representation ....................................................84
3.6 Clusters ..............................................................................87
3.7 Further Reading and Bibliographic Notes ...................................88
CHAPTER 4 Algorithms: the basic methods ..................................... 91
4.1 Inferring Rudimentary Rules .......................................................93
Missing Values and Numeric Attributes ....................................... 94
4.2 Simple Probabilistic Modeling ....................................................96
Missing Values and Numeric Attributes ...................................... 100
Naive Bayes for Document Classification .................................... 103
Remarks .................................................................... 105
4.3 Divide-and-Conquer: Constructing Decision Trees ..................105
Calculating Information............... ......................... ..................... 108
Highly Branching Attributes ..................................................... 110
4.4 Covering Algorithms: Constructing Rules..............................113
Rules Versus Trees .................................................................. 114
A Simple Covering Algorithm ................................................ 115
Rules Versus Decision Lists....................................................119
4.5 Mining Association Rules........................................................120
Item Sets .................................................................................. 120
Association Rules .................................................................... 122
Generating Rules Efficiently ................................................... 124
4.6 Linear Models..........................................................................128
Numeric Prediction: Linear Regression .................................. 128
Linear Classification: Logistic Regression ............................. 129
Linear Classification Using the Perceptron ............................ 131
Linear Classification Using Winnow ...................................... 133
4.7 Instance-Based Learning..........................................................135
The Distance Function............................................................. 135
Finding Nearest Neighbors Efficiently ................................... 136
Remarks ........................................................................... 141
4.8 Clustering .................................................................141
Iterative Distance-Based Clustering........................................142
Faster Distance Calculations ................................................... 144
Choosing the Number of Clusters...........................................146
Hierarchical Clustering............................................................147
Example of Hierarchical Clustering........................................148
Incremental Clustering............................................................. 150
Category Utility ....................................................................... 154
Remarks ................................................................................... 156
4.9 Multi-instance Learning...........................................................156
Aggregating the Input..............................................................157
Aggregating the Output ........................................................... 157
4.10 Further Reading and Bibliographic Notes...............................158
4.11 WEKA Implementations..........................................................160
CHAPTER 5 Credibility: evaluating what’s been learned............. 161
5.1 Training and Testing..................................................................163
5.2 Predicting Performance..............................................................165
5.3 Cross-Validation.........................................................................167
5.4 Other Estimates..........................................................................169
Leave-One-Out .......................................................................... 169
The Bootstrap............................................................................. 169
5.5 Hyperparameter Selection..........................................................171
5.6 Comparing Data Mining Schemes...........................................172
5.7 Predicting Probabilities............................................................176
Quadratic Loss Function.......................................................... 177
Informational Loss Function ................................................... 178
Remarks ................................................................................... 179
5.8 Counting the Cost ....................................................................179
Cost-Sensitive Classification ................................................... 182
Cost-Sensitive Learning........................................................... 183
Lift Charts................................................................................183
ROC Curves.............................................................................186
Recall-Precision Curves........................................................... 190
Remarks ................................................................................... 190
Cost Curves..............................................................................192
5.9 Evaluating Numeric Prediction ...............................................194
5.10 The MDL Principle..................................................................197
5.11 Applying the MDL Principle to Clustering.............................200
5.12 Using a Validation Set for Model Selection...........................201
5.13 Further Reading and Bibliographic Notes...............................202
PART II MORE ADVANCED MACHINE LEARNING SCHEMES
CHAPTER 6 Trees and rules ........................................... 209
6.1 Decision Trees............................................................................210
Numeric Attributes .................................................................... 210
Missing Values .......................................................................... 212
Pruning....................................................................................... 213
Estimating Error Rates .............................................................. 215
Complexity of Decision Tree Induction.................................... 217
From Trees to Rules .................................................................. 219
C4.5: Choices and Options........................................................ 219
Cost-Complexity Pruning .......................................................... 220
Discussion .................................................................................. 221
6.2 Classification Rules....................................................................221
Criteria for Choosing Tests ....................................................... 222
Missing Values, Numeric Attributes......................................... 223
Generating Good Rules ............................................................. 224
Using Global Optimization........................................................ 226
Obtaining Rules From Partial Decision Trees .......................... 227
Rules With Exceptions .............................................................. 231
Discussion .................................................................................. 233
6.3 Association Rules.......................................................................234
Building a Frequent Pattern Tree .............................................. 235
Finding Large Item Sets ............................................................ 240
Discussion .................................................................................. 241
6.4 WEKA Implementations............................................................242
CHAPTER 7 Extending instance-based and linear models .......... 243
7.1 Instance-Based Learning............................................................244
Reducing the Number of Exemplars......................................... 245
Pruning Noisy Exemplars.......................................................... 245
Weighting Attributes ................................................................. 246
Generalizing Exemplars............................................................. 247
Distance Functions for Generalized Exemplars........................ 248
Generalized Distance Functions ................................................ 250
Discussion .................................................................................. 250
7.2 Extending Linear Models...........................................................252
The Maximum Margin Hyperplane........................................... 253
Nonlinear Class Boundaries ...................................................... 254
Support Vector Regression........................................................ 256
Kernel Ridge Regression........................................................... 258
The Kernel Perceptron............................................................... 260
Multilayer Perceptrons............................................................... 261
Radial Basis Function Networks ............................................... 270
Stochastic Gradient Descent...................................................... 270
Discussion .................................................................................. 272
7.3 Numeric Prediction With Local Linear Models........................273
Model Trees ............................................................................... 274
Building the Tree....................................................................... 275
Pruning the Tree ........................................................................ 275
Nominal Attributes .................................................................... 276
Missing Values .......................................................................... 276
Pseudocode for Model Tree Induction...................................... 277
Rules From Model Trees........................................................... 281
Locally Weighted Linear Regression........................................ 281
Discussion .................................................................................. 283
7.4 WEKA Implementations............................................................284
CHAPTER 8 Data transformations .................................................... 285
8.1 Attribute Selection .....................................................................288
Scheme-Independent Selection.................................................. 289
Searching the Attribute Space ................................................... 292
Scheme-Specific Selection ........................................................ 293
8.2 Discretizing Numeric Attributes................................................296
Unsupervised Discretization...................................................... 297
Entropy-Based Discretization.................................................... 298
Other Discretization Methods.................................................... 301
Entropy-Based Versus Error-Based Discretization................... 302
Converting Discrete to Numeric Attributes .............................. 303
8.3 Projections .................................................................304
Principal Component Analysis .................................................. 305
Random Projections................................................................... 307
Partial Least Squares Regression .............................................. 307
Independent Component Analysis............................................. 309
Linear Discriminant Analysis.................................................... 310
Quadratic Discriminant Analysis .............................................. 310
Fisher’s Linear Discriminant Analysis...................................... 311
Text to Attribute Vectors........................................................... 313
Time Series ................................................................................ 314
8.4 Sampling.....................................................................................315
Reservoir Sampling ................................................................... 315
8.5 Cleansing ....................................................................................316
Improving Decision Trees ......................................................... 316
Robust Regression ..................................................................... 317
Detecting Anomalies ................................................................. 318
One-Class Learning ................................................................... 319
Outlier Detection ....................................................................... 320
Generating Artificial Data ......................................................... 321
8.6 Transforming Multiple Classes to Binary Ones........................322
Simple Methods ......................................................................... 323
Error-Correcting Output Codes ................................................. 324
Ensembles of Nested Dichotomies............................................ 326
8.7 Calibrating Class Probabilities...................................................328
8.8 Further Reading and Bibliographic Notes.................................331
8.9 WEKA Implementations............................................................334
CHAPTER 9 Probabilistic methods .................................................. 335
9.1 Foundations ...............................................................336
Maximum Likelihood Estimation ............................................. 338
Maximum a Posteriori Parameter Estimation ........................... 339
9.2 Bayesian Networks.....................................................................339
Making Predictions.................................................................... 340
Learning Bayesian Networks .................................................... 344
Specific Algorithms ................................................................... 347
Data Structures for Fast Learning ............................................. 349
9.3 Clustering and Probability Density Estimation.........................352
The Expectation Maximization Algorithm for a Mixture of Gaussians ........... 353
Extending the Mixture Model ................................................... 356
Clustering Using Prior Distributions......................................... 358
Clustering With Correlated Attributes ...................................... 359
Kernel Density Estimation ........................................................ 361
Comparing Parametric, Semiparametric and Nonparametric Density Models for Classification ............ 362
9.4 Hidden Variable Models............................................................363
Expected Log-Likelihoods and Expected Gradients................. 364
The Expectation Maximization Algorithm ............................... 365
Applying the Expectation Maximization Algorithm to Bayesian Networks ........ 366
9.5 Bayesian Estimation and Prediction..........................................367
Probabilistic Inference Methods................................................ 368
9.6 Graphical Models and Factor Graphs........................................370
Graphical Models and Plate Notation ....................................... 371
Probabilistic Principal Component Analysis............................. 372
Latent Semantic Analysis.......................................................... 376
Using Principal Component Analysis for Dimensionality Reduction ............. 377
Probabilistic LSA....................................................................... 378
Latent Dirichlet Allocation........................................................ 379
Factor Graphs............................................................................. 382
Markov Random Fields ............................................................. 385
Computing Using the Sum-Product and Max-Product Algorithms .................. 386
9.7 Conditional Probability Models.................................................392
Linear and Polynomial Regression as Probability Models ...................... 392
Using Priors on Parameters ....................................................... 393
Multiclass Logistic Regression.................................................. 396
Gradient Descent and Second-Order Methods.......................... 400
Generalized Linear Models ....................................................... 400
Making Predictions for Ordered Classes................................... 402
Conditional Probabilistic Models Using Kernels...................... 402
9.8 Sequential and Temporal Models............................................403
Markov Models and N-gram Methods .................................... 403
Hidden Markov Models........................................................... 404
Conditional Random Fields.....................................................406
9.9 Further Reading and Bibliographic Notes...............................410
Software Packages and Implementations................................414
9.10 WEKA Implementations..........................................................416
CHAPTER 10 Deep learning .................................................. 417
10.1 Deep Feedforward Networks...................................................420
The MNIST Evaluation ........................................................... 421
Losses and Regularization.......................................................422
Deep Layered Network Architecture ...................................... 423
Activation Functions................................................................ 424
Backpropagation Revisited......................................................426
Computation Graphs and Complex Network Structures ........ 429
Checking Backpropagation Implementations ......................... 430
10.2 Training and Evaluating Deep Networks................................431
Early Stopping ......................................................................... 431
Validation, Cross-Validation, and Hyperparameter Tuning ................ 432
Mini-Batch-Based Stochastic Gradient Descent.....................433
Pseudocode for Mini-Batch Based Stochastic Gradient Descent ................. 434
Learning Rates and Schedules................................................. 434
Regularization With Priors on Parameters.............................. 435
Dropout .................................................................................... 436
Batch Normalization................................................................436
Parameter Initialization............................................................ 436
Unsupervised Pretraining......................................................... 437
Data Augmentation and Synthetic Transformations...............437
10.3 Convolutional Neural Networks..............................................437
The ImageNet Evaluation and Very Deep Convolutional Networks ........................................438
From Image Filtering to Learnable Convolutional Layers..................439
Convolutional Layers and Gradients.......................................443
Pooling and Subsampling Layers and Gradients .................... 444
Implementation ........................................................................ 445
10.4 Autoencoders............................................................................445
Pretraining Deep Autoencoders With RBMs..........................448
Denoising Autoencoders and Layerwise Training...............................448
Combining Reconstructive and Discriminative Learning................................. 449
10.5 Stochastic Deep Networks.......................................................449
Boltzmann Machines ............................................................... 449
Restricted Boltzmann Machines.............................................. 451
Contrastive Divergence ........................................................... 452
Categorical and Continuous Variables....................................452
Deep Boltzmann Machines...................................................... 453
Deep Belief Networks ............................................................. 455
10.6 Recurrent Neural Networks.....................................................456
Exploding and Vanishing Gradients ....................................... 457
Other Recurrent Network Architectures ................................. 459
10.7 Further Reading and Bibliographic Notes...............................461
10.8 Deep Learning Software and Network Implementations.....................464
Theano...................................................................................... 464
TensorFlow ............................................................... 464
Torch ........................................................................................ 465
Computational Network Toolkit.............................................. 465
Caffe......................................................................................... 465
Deeplearning4j......................................................................... 465
Other Packages: Lasagne, Keras, and cuDNN........................ 465
10.9 WEKA Implementations..........................................................466
CHAPTER 11 Beyond supervised and unsupervised learning .................. 467
11.1 Semisupervised Learning.........................................................468
Clustering for Classification....................................................468
Cotraining ................................................................................ 470
EM and Cotraining .................................................................. 471
Neural Network Approaches ................................................... 471
11.2 Multi-instance Learning...........................................................472
Converting to Single-Instance Learning ................................. 472
Upgrading Learning Algorithms ............................................. 475
Dedicated Multi-instance Methods.......................................... 475
11.3 Further Reading and Bibliographic Notes...............................477
11.4 WEKA Implementations..........................................................478
CHAPTER 12 Ensemble learning...................................................... 479
12.1 Combining Multiple Models....................................................480
12.2 Bagging ....................................................................................481
Bias-Variance Decomposition ............................................... 482
Bagging With Costs................................................................. 483
12.3 Randomization .........................................................................484
Randomization Versus Bagging .............................................. 485
Rotation Forests ....................................................................... 486
12.4 Boosting ...................................................................................486
AdaBoost.................................................................................. 487
The Power of Boosting............................................................489
12.5 Additive Regression.................................................................490
Numeric Prediction..................................................................491
Additive Logistic Regression .................................................. 492
12.6 Interpretable Ensembles...........................................................493
Option Trees ............................................................................ 494
Logistic Model Trees............................................................... 496
12.7 Stacking.................................................................................497
12.8 Further Reading and Bibliographic Notes...............................499
12.9 WEKA Implementations..........................................................501
CHAPTER 13 Moving on: applications and beyond..................... 503
13.1 Applying Machine Learning..................................................504
13.2 Learning From Massive Datasets..........................................506
13.3 Data Stream Learning............................................................509
13.4 Incorporating Domain Knowledge........................................512
13.5 Text Mining ...........................................................................515
Document Classification and Clustering............................... 516
Information Extraction........................................................... 517
Natural Language Processing ................................................ 518
13.6 Web Mining...........................................................................519
Wrapper Induction ................................................................. 519
Page Rank .............................................................................. 520
13.7 Images and Speech ................................................................522
Images .................................................................................... 523
Speech .................................................................................... 524
13.8 Adversarial Situations............................................................524
13.9 Ubiquitous Data Mining........................................................527
13.10 Further Reading and Bibliographic Notes ............................529
13.11 WEKA Implementations .......................................................532
Appendix A: Theoretical foundations................................................................533
Appendix B: The WEKA workbench ...................................................................553
References...........................................................................573
Index ..................................................................................601
List of Figures
Figure 1.1 Rules for the contact lens data. 13
Figure 1.2 Decision tree for the contact lens data. 14
Figure 1.3 Decision trees for the labor negotiations data. 18
Figure 1.4 Life cycle of a data mining project. 29
Figure 2.1 A family tree and two ways of expressing the sister-of relation. 48
Figure 2.2 ARFF file for the weather data. 58
Figure 2.3 Multi-instance ARFF file for the weather data. 60
Figure 3.1 A linear regression function for the CPU performance data. 69
Figure 3.2 A linear decision boundary separating Iris setosas from Iris versicolors. 70
Figure 3.3 Constructing a decision tree interactively: (A) creating a rectangular test involving petallength and petalwidth; (B) the resulting (unfinished) decision tree. 73
Figure 3.4 Models for the CPU performance data: (A) linear regression; (B) regression tree; (C) model tree. 74
Figure 3.5 Decision tree for a simple disjunction. 76
Figure 3.6 The exclusive-or problem. 77
Figure 3.7 Decision tree with a replicated subtree. 77
Figure 3.8 Rules for the iris data. 81
Figure 3.9 The shapes problem. 82
Figure 3.10 Different ways of partitioning the instance space. 86
Figure 3.11 Different ways of representing clusters. 88
Figure 4.1 Pseudocode for 1R. 93
Figure 4.2 Tree stumps for the weather data. 106
Figure 4.3 Expanded tree stumps for the weather data. 108
Figure 4.4 Decision tree for the weather data. 109
Figure 4.5 Tree stump for the ID code attribute. 111
Figure 4.6 Covering algorithm: (A) covering the instances; (B) decision tree for the same problem. 113
Figure 4.7 The instance space during operation of a covering algorithm. 115
Figure 4.8 Pseudocode for a basic rule learner. 118
Figure 4.9 (A) Finding all item sets with sufficient coverage; (B) finding all sufficiently accurate association rules for a k-item set. 127
Figure 4.10 Logistic regression: (A) the logit transform; (B) example logistic regression function. 130
Figure 4.11 The perceptron: (A) learning rule; (B) representation as a neural network. 132
Figure 4.12 The Winnow algorithm: (A) unbalanced version; (B) balanced version. 134
Figure 4.13 A kD-tree for four training instances: (A) the tree; (B) instances and splits. 137
Figure 4.14 Using a kD-tree to find the nearest neighbor of the star. 137
Figure 4.15 Ball tree for 16 training instances: (A) instances and balls; (B) the tree. 139
Figure 4.16 Ruling out an entire ball (gray) based on a target point (star) and its current nearest neighbor. 140
Figure 4.17 Iterative distance-based clustering. 143
Figure 4.18 A ball tree: (A) two cluster centers and their dividing line; (B) corresponding tree. 145
Figure 4.19 Hierarchical clustering displays. 149
Figure 4.20 Clustering the weather data. 151
Figure 4.21 Hierarchical clusterings of the iris data. 153
Figure 5.1 A hypothetical lift chart. 185
Figure 5.2 Analyzing the expected benefit of a mailing campaign when the cost of mailing is (A) $0.50 and (B) $0.80. 187
Figure 5.3 A sample ROC curve. 188
Figure 5.4 ROC curves for two learning schemes. 189
Figure 5.5 Effect of varying the probability threshold: (A) error curve; (B) cost curve. 193
Figure 6.1 Example of subtree raising, where node C is “raised” to subsume node B. 214
Figure 6.2 Pruning the labor negotiations decision tree. 216
Figure 6.3 Algorithm for forming rules by incremental reduced-error pruning. 226
Figure 6.4 RIPPER: (A) algorithm for rule learning; (B) meaning of symbols. 228
Figure 6.5 Algorithm for expanding examples into a partial tree. 229
Figure 6.6 Example of building a partial tree. 230
Figure 6.7 Rules with exceptions for the iris data. 232
Figure 6.8 Extended prefix trees for the weather data: (A) the full data; (B) the data conditional on temperature = mild; (C) the data conditional on humidity = normal. 238
Figure 7.1 A boundary between two rectangular classes. 249
Figure 7.2 A maximum margin hyperplane. 253
Figure 7.3 Support vector regression: (A) ε = 1; (B) ε = 2; (C) ε = 0.5. 257
Figure 7.4 Example data sets and corresponding perceptrons. 262
Figure 7.5 Step vs sigmoid: (A) step function; (B) sigmoid function. 264
Figure 7.6 Gradient descent using the error function w² + 1. 265
Figure 7.7 Multilayer perceptron with a hidden layer (omitting bias inputs). 267
Figure 7.8 Hinge, squared, and 0–1 loss functions. 271
Figure 7.9 Pseudocode for model tree induction. 278
Figure 7.10 Model tree for a data set with nominal attributes. 279
Figure 8.1 Attribute space for the weather dataset. 292
Figure 8.2 Discretizing the temperature attribute using the entropy method. 299
Figure 8.3 The result of discretizing the temperature attribute. 299
Figure 8.4 Class distribution for a two-class, two-attribute problem. 302
Figure 8.5 Principal component transform of a dataset: (A) variance of each component; (B) variance plot. 306
Figure 8.6 Comparing principal component analysis and Fisher’s linear discriminant analysis. 312
Figure 8.7 Number of international phone calls from Belgium, 1950–1973. 318
Figure 8.8 Overoptimistic probability estimation for a two-class problem. 329
Figure 9.1 A simple Bayesian network for the weather data. 341
Figure 9.2 Another Bayesian network for the weather data. 342
Figure 9.3 The Markov blanket for variable x6 in a 10-variable Bayesian network. 348
Figure 9.4 The weather data: (A) reduced version; (B) corresponding AD tree. 350
Figure 9.5 A two-class mixture model. 354
Figure 9.6 DensiTree showing possible hierarchical clusterings of a given data set. 360
Figure 9.7 Probability contours for three types of model, all based on Gaussians. 362
Figure 9.8 (A) Bayesian network for a mixture model; (B) multiple copies of the Bayesian network, one for each observation; (C) plate notation version of (B). 371
Figure 9.9 (A) Bayesian network for probabilistic PCA; (B) equal-probability contour for a Gaussian distribution along with its covariance matrix’s principal eigenvector. 372
Figure 9.10 The singular value decomposition of a t by d matrix. 377
Figure 9.11 Graphical models for (A) pLSA, (B) LDA, and (C) smoothed LDA. 379
Figure 9.12 (A) Bayesian network and (B) corresponding factor graph. 382
Figure 9.13 The Markov blanket for variable x6 in a 10-variable factor graph. 383
Figure 9.14 (A) and (B) Bayesian network and corresponding factor graph; (C) and (D) Naive Bayes model and corresponding factor graph. 384
Figure 9.15 (A) Bayesian network representing the joint distribution of y and its parents; (B) factor graph for a logistic regression for the conditional distribution of y given its parents. 384
Figure 9.16 (A) Undirected graph representing a Markov random field structure; (B) corresponding factor graph. 385
Figure 9.17 Message sequence in an example factor graph. 389
Figure 9.18 (A) and (B) First- and second-order Markov models for a sequence of variables; (C) hidden Markov model; (D) Markov random field. 404
Figure 9.19 Mining emails for meeting details. 406
Figure 9.20 (A) Dynamic Bayesian network representation of a hidden Markov model; (B) similarly structured Markov random field; (C) factor graph for (A); and (D) factor graph for a linear chain conditional random field. 407
Figure 10.1 A feedforward neural network. 424
Figure 10.2 Computation graph showing forward propagation in a deep network. 426
Figure 10.3 Backpropagation in a deep network (the forward computation is shown with gray arrows). 429
Figure 10.4 Parameter updates that follow the forward and backward propagation steps (shown with gray arrows). 430
Figure 10.5 Typical learning curves for the training and validation sets. 431
Figure 10.6 Pseudocode for mini-batch based stochastic gradient descent. 435
Figure 10.7 Typical convolutional neural network architecture. 439
Figure 10.8 Original image; filtered with the two Sobel operators; magnitude of the result. 441
Figure 10.9 Examples of what random neurons detect in different layers of a convolutional neural network using the visualization approach of Zeiler and Fergus (2013). Underlying imagery kindly provided by Matthew Zeiler. 442
Figure 10.10 Example of the convolution, pooling, and decimation operations used in convolutional neural networks. 443
Figure 10.11 A simple autoencoder. 445
Figure 10.12 A deep autoencoder with multiple layers of transformation. 447
Figure 10.13 Low-dimensional principal component space (left) compared with one learned by a deep autoencoder (right). 447
Figure 10.14 Boltzmann machines: (A) fully connected; (B) restricted; (C) more general form of (B). 450
Figure 10.15 (A) Deep Boltzmann machine and (B) deep belief network. 453
Figure 10.16 (A) Feedforward network transformed into a recurrent network; (B) hidden Markov model; and (C) recurrent network obtained by unwrapping (A). 456
Figure 10.17 Structure of a “long short-term memory” unit. 459
Figure 10.18 Recurrent neural networks: (A) bidirectional, (B) encoder-decoder. 460
Figure 10.19 A deep encoder-decoder recurrent network. 460
Figure 12.1 Algorithm for bagging. 483
Figure 12.2 Algorithm for boosting. 488
Figure 12.3 Algorithm for additive logistic regression. 493
Figure 12.4 Simple option tree for the weather data. 494
Figure 12.5 Alternating decision tree for the weather data. 495
Figure 13.1 A tangled web. 521
List of Tables
Table 1.1 The Contact Lens Data 7
Table 1.2 The Weather Data 11
Table 1.3 Weather Data With Some Numeric Attributes 12
Table 1.4 The Iris Data 15
Table 1.5 The CPU Performance Data 16
Table 1.6 The Labor Negotiations Data 17
Table 1.7 The Soybean Data 20
Table 2.1 Iris Data as a Clustering Problem 46
Table 2.2 Weather Data With a Numeric Class 47
Table 2.3 Family Tree 48
Table 2.4 The Sister-of Relation 49
Table 2.5 Another Relation 52
Table 3.1 A New Iris Flower 80
Table 3.2 Training Data for the Shapes Problem 83
Table 4.1 Evaluating the Attributes in the Weather Data 94
Table 4.2 The Weather Data, With Counts and Probabilities 97
Table 4.3 A New Day 98
Table 4.4 The Numeric Weather Data With Summary Statistics 101
Table 4.5 Another New Day 102
Table 4.6 The Weather Data with Identification Codes 111
Table 4.7 Gain Ratio Calculations for the Tree Stumps of Fig. 4.2 112
Table 4.8 Part of the Contact Lens Data for which Astigmatism = Yes 116
Table 4.9 Part of the Contact Lens Data for Which Astigmatism = Yes and Tear Production Rate = Normal 117
Table 4.10 Item Sets for the Weather Data With Coverage 2 or Greater 121
Table 4.11 Association Rules for the Weather Data 123
Table 5.1 Confidence Limits for the Normal Distribution 166
Table 5.2 Confidence Limits for Student’s Distribution With 9 Degrees of Freedom 174
Table 5.3 Different Outcomes of a Two-Class Prediction 180
Table 5.4 Different Outcomes of a Three-Class Prediction: (A) Actual; (B) Expected 181
Table 5.5 Default Cost Matrixes: (A) Two-Class Case; (B) Three-Class Case 182
Table 5.6 Data for a Lift Chart 184
Table 5.7 Different Measures Used to Evaluate the False Positive Versus False Negative Tradeoff 191
Table 5.8 Performance Measures for Numeric Prediction 195
Table 5.9 Performance Measures for Four Numeric Prediction Models 197
Table 6.1 Preparing the Weather Data for Insertion Into an FP-tree: (A) The Original Data; (B) Frequency Ordering of Items With Frequent Item Sets in Bold; (C) The Data With Each Instance Sorted Into Frequency Order; (D) The Two Multiple-Item Frequent Item Sets 236
Table 7.1 Linear Models in the Model Tree 280
Table 8.1 The First Five Instances From the CPU Performance Data: (A) Original Values; (B) The First Partial Least Squares Direction; (C) Residuals From the First Direction 308
Table 8.2 Transforming a Multiclass Problem Into a Two-Class One: (A) Standard Method; (B) Error-Correcting Code 324
Table 8.3 A Nested Dichotomy in the Form of a Code Matrix 327
Table 9.1 Highest Probability Words and User Tags From a Sample of Topics Extracted From a Collection of Scientific Articles 381
Table 9.2 Link Functions, Mean Functions, and Distributions Used in Generalized Linear Models 401
Table 10.1 Summary of Performance on the MNIST Evaluation 421
Table 10.2 Loss Functions, Corresponding Distributions, and Activation Functions 423
Table 10.3 Activation Functions and Their Derivatives 425
Table 10.4 Convolutional Neural Network Performance on the ImageNet Challenge 439
Table 10.5 Components of a “Long Short-Term Memory” Recurrent Neural Network 459
Table 13.1 The Top 10 Algorithms in Data Mining, According to a 2006 Poll 504