This book is a milestone in the synthesis of data mining, data analysis, information theory, and machine learning.
--Jim Gray, Microsoft Research, Turing Award winner

This is an excellent textbook that combines data mining algorithms with data mining practice. Drawing on their rich experience, the authors give an accessible yet thorough introduction to the concepts of data mining and the techniques it uses (machine learning in particular), and offer sound advice on applying machine learning tools to data mining. The key elements of data mining are illustrated throughout with numerous examples.
The book also introduces Weka, a Java-based software system that can be used to analyze datasets, find applicable patterns, and perform sound analyses, as well as to develop your own machine learning solutions.
Key features of this book:
- Explains the principles behind data mining algorithms. Uses examples to help readers choose an algorithm appropriate to the situation at hand, and to compare and evaluate the results obtained by different methods.
- Describes techniques for improving performance, including data preprocessing and combining the outputs of different methods.
- Provides the Weka software used in the book together with supplementary learning material, downloadable from http://www.mkp.com/datamining.
Ian Witten is professor of computer science at the University of Waikato in Hamilton, New Zealand. He has taught at Essex University and at the University of Calgary, where he was head of computer science from 1982 to 1985. He holds degrees in mathematics from Cambridge University, computer science from the University of Calgary, and a Ph.D. in electrical engineering from Essex University, England. He has published extensively in academic conferences and journals on machine learning.
The underlying theme of his current research is the exploitation of information about a user's past behavior to expedite interaction in the future. In pursuit of this theme, he has been drawn into machine learning, which seeks ways to summarize, restructure, and generalize past experience; adaptive text compression, that is, using information about past text to encode upcoming characters; and user modeling, which is the general area of characterizing user behavior.
He directs a large project at Waikato on machine learning and its application to agriculture and has also been active recently in the area of document compression, indexing, and retrieval. He has also written many books over the last 15 years, the most recent of which is Managing Gigabytes: Compressing and Indexing Documents and Images, second edition (Morgan Kaufmann, 1999), with A. Moffat and T. C. Bell.
Eibe Frank is a Ph.D. candidate in computer science at the University of Waikato. His research focus is machine learning. He holds a degree in computer science from the University of Karlsruhe in Germany and is the author of several papers presented at machine learning conferences and published in journals.
The convergence of computing and communication has produced a society that feeds on information. Yet most of the information is in its raw form: data. If data is characterized as recorded facts, then information is the set of patterns, or expectations, that underlie the data. There is a huge amount of information locked up in databases--information that is potentially important but has not yet been discovered or articulated. Our mission is to bring it forth.
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Of course, there will be problems. Many patterns will be banal and uninteresting. Others will be spurious, contingent on accidental coincidences in the particular dataset used. And real data is imperfect: some parts are garbled, some missing. Anything that is discovered will be inexact: there will be exceptions to every rule and cases not covered by any rule. Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful.
Machine learning provides the technical basis of data mining. It is used to extract information from the raw data in databases--information that is expressed in a comprehensible form and can be used for a variety of purposes.
The process is one of abstraction: taking the data, warts and all, and inferring whatever structure underlies it. This book is about the tools and techniques of machine learning that are used in practical data mining for finding, and describing, structural patterns in data.
As with any burgeoning new technology that enjoys intense commercial attention, the use of data mining is surrounded by a great deal of hype in the technical--and sometimes the popular--press. Exaggerated reports appear of the secrets that can be uncovered by setting learning algorithms loose on oceans of data. But there is no magic in machine learning, no hidden power, no alchemy. Instead there is an identifiable body of simple and practical techniques that can often extract useful information from raw data. This book describes these techniques and shows how they work.
We interpret machine learning as the acquisition of structural descriptions from examples. The kind of descriptions that are found can be used for prediction, explanation, and understanding. Some data mining applications focus on prediction: forecasting what will happen in new situations from data that describe what happened in the past, often by guessing the classification of new examples. But we are equally--perhaps more--interested in applications where the result of "learning" is an actual description of a structure that can be used to classify examples. This structural description supports explanation and understanding as well as prediction. In our experience, insights gained by the user are of most interest in the majority of practical data mining applications; indeed, this is one of machine learning's major advantages over classical statistical modeling.
The book explains a wide variety of machine learning methods. Some are pedagogically motivated: simple schemes designed to explain clearly how the basic ideas work. Others are practical: real systems that are used in applications today. Many are contemporary and have been developed only in the last few years.
A comprehensive software resource, written in the Java language, has been created to illustrate the ideas in the book. Called the Waikato Environment for Knowledge Analysis, or Weka for short, this utility is available as source code on the World Wide Web via www.mkp.com/datamining or at www.cs.waikato.ac.nz/ml/weka. It is a full, industrial-strength implementation of essentially all the techniques that are covered in this book. It includes illustrative code and working implementations of machine learning methods. It offers clean, spare implementations of the simplest techniques, designed to aid understanding of the mechanisms involved. It also provides a workbench that includes full, working, state-of-the-art implementations of many popular learning schemes that can be used for practical data mining or for research. Finally, it contains a framework, in the form of a Java class library, that supports applications that use embedded machine learning and even the implementation of new learning schemes.
The objective of this book is to introduce the tools and techniques for machine learning that are used in data mining. After reading it, you will understand what these techniques are and appreciate their strengths and applicability.
If you wish to experiment with your own data, you will be able to do this with the Weka software. To use such tools sensibly, however, users need some understanding of how the algorithms they use work. It is often observed that data models are only as good as the person who interprets them, and that person needs to know something about how the models are produced in order to appreciate the strengths, and limitations, of the technology. However, it is not necessary for all users to have a deep understanding of the finer details of the algorithms.
We address this situation by describing machine learning methods at successive levels of detail. The reader will learn the basic ideas, the topmost level, by reading the first three chapters. Chapter 1 describes, through examples, what machine learning is and where it can be used; it also provides actual practical applications. Chapters 2 and 3 cover the different kinds of input and output--or knowledge representation--that are involved. Different kinds of output dictate different styles of algorithm, and at the next level, Chapter 4 describes the basic methods of machine learning, simplified to make them easy to comprehend.
Here the principles involved are conveyed in a variety of algorithms without getting bogged down in intricate details or tricky implementation issues. To make progress in the application of machine learning techniques to particular data mining problems, it is essential to be able to measure how well you are doing.
Chapter 5, which can be read out of sequence, equips the reader to evaluate the results that are obtained from machine learning, addressing the sometimes complex issues involved in performance evaluation.
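As a small taste of the evaluation machinery Chapter 5 covers, here is a minimal, self-contained Java sketch of the fold-splitting step behind k-fold cross-validation. The class and method names are our own illustrative inventions, not part of the Weka library, and the sketch omits the stratification by class that Chapter 5 recommends:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative helper: partitions instance indices into k cross-validation folds. */
public class FoldSplitter {

    /** Shuffles indices 0..n-1 with a fixed seed and deals them round-robin into k folds. */
    public static List<List<Integer>> split(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<Integer>());
        }
        for (int i = 0; i < n; i++) {
            // Deal the i-th shuffled index into fold i % k.
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        // Ten instances, three folds: every instance lands in exactly one fold.
        List<List<Integer>> folds = split(10, 3, 1L);
        int total = 0;
        for (List<Integer> fold : folds) {
            total += fold.size();
        }
        System.out.println(folds.size() + " folds covering " + total + " instances");
    }
}
```

Each fold in turn serves as the test set while the remaining folds are used for training, and the k error estimates are averaged; fixing the shuffle seed makes the experiment repeatable.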
At the lowest and most detailed level, Chapter 6 exposes in naked detail the nitty-gritty issues of implementing a spectrum of machine learning algorithms, including the complexities that are necessary for them to work well in practice. Although many readers may want to ignore this detailed information, it is at this level that the full, working, tested Java implementations of machine learning schemes are written. Chapter 7 discusses practical topics involved with engineering the input to machine learning--for example, selecting and discretizing attributes--and covers several more advanced techniques for refining and combining the output from different learning techniques. Chapter 8 describes the Java code that accompanies the book. You can skip to this chapter directly from Chapter 4 if you are in a hurry to get on with analyzing your data and don't want to be bothered with the technical details. Finally, Chapter 9 looks to the future.
The book does not cover all machine learning methods. In particular, we do not discuss neural nets because this technique produces predictions rather than structural descriptions; also, it is well described in some recent books on data mining. Nor do we cover reinforcement learning since it is rarely applied in practical data mining; nor genetic algorithm approaches since these are really just an optimization technique; nor Bayesian networks because algorithms for learning them are not yet robust enough to be deployed; nor relational learning and inductive logic programming since they are rarely used in mainstream data mining applications.
Java has been chosen for the implementations of machine learning techniques that accompany this book because, as an object-oriented programming language, it allows a uniform interface to learning schemes and methods for pre- and post-processing. We have chosen Java instead of C++, Smalltalk, or other object-oriented languages because programs written in Java can be run on almost any computer without having to be recompiled, or having to go through complicated installation procedures, or--worst of all--having to change the code itself. A Java program is compiled into byte-code that can be executed on any computer equipped with an appropriate interpreter. This interpreter is called the Java virtual machine. Java virtual machines--and, for that matter, Java compilers--are freely available for all important platforms.
Like all widely used programming languages, Java has received its share of criticism. Although this is not the place to elaborate on such issues, in several cases the critics are clearly right. However, of all currently available programming languages that are widely supported, standardized, and extensively documented, Java seems to be the best choice for the purpose of this book. Its main disadvantage is speed of execution--or lack of it. Executing a Java program is several times slower than running a corresponding program written in C because the virtual machine has to translate the byte-code into machine code before it can be executed. In our experience the difference is a factor of three to five if the virtual machine uses a just-in-time compiler. Instead of translating each byte-code individually, a just-in-time compiler translates whole chunks of byte-code into machine code, thereby achieving significant speedup. However, if this is still too slow for your application, there are compilers that translate Java programs directly into machine code, bypassing the byte-code step. Of course, this code cannot be executed on other platforms, thereby sacrificing one of Java's most important advantages.
Ian H. Witten: Professor in the Department of Computer Science at the University of Waikato, New Zealand; Fellow of the ACM and of the Royal Society of New Zealand; and recipient of the 2004 Namur Award of the International Federation for Information Processing (IFIP). His books include Managing Gigabytes: Compressing and Indexing Documents and Images and How to Build a Digital Library, along with numerous journal and conference articles.
Eibe Frank: Senior lecturer in the Department of Computer Science at the University of Waikato, New Zealand. He has published extensively in the area of machine learning and serves on the editorial boards of the Machine Learning Journal and the Journal of Artificial Intelligence Research. He has served on the program committees of many data mining and machine learning conferences, and is a core developer of the Weka machine learning software that accompanies this book.
Foreword vii
Preface xvii
1 What's it all about? 1
1.1 Data mining and machine learning 2
Describing structural patterns 4
Machine learning 5
Data mining 7
1.2 Simple examples: The weather problem and others 8
The weather problem 8
Contact lenses: An idealized problem 11
Irises: A classic numeric dataset 13
CPU performance: Introducing numeric prediction 15
Labor negotiations: A more realistic example 16
Soybean classification: A classic machine learning success 17
1.3 Fielded applications 20
Decisions involving judgment 21
Screening images 22
Load forecasting 23
Diagnosis 24
Marketing and sales 25
1.4 Machine learning and statistics 26
1.5 Generalization as search 27
Enumerating the concept space 28
Bias 29
1.6 Data mining and ethics 32
1.7 Further reading 34
2 Input: Concepts, instances, attributes 37
2.1 What's a concept? 38
2.2 What's in an example? 41
2.3 What's in an attribute? 45
2.4 Preparing the input 48
Gathering the data together 48
ARFF format 49
Attribute types 51
Missing values 52
Inaccurate values 53
Getting to know your data 54
2.5 Further reading 55
3 Output: Knowledge representation 57
3.1 Decision tables 58
3.2 Decision trees 58
3.3 Classification rules 59
3.4 Association rules 63
3.5 Rules with exceptions 64
3.6 Rules involving relations 67
3.7 Trees for numeric prediction 70
3.8 Instance-based representation 72
3.9 Clusters 75
3.10 Further reading 76
4 Algorithms: The basic methods 77
4.1 Inferring rudimentary rules 78
Missing values and numeric attributes 80
Discussion 81
4.2 Statistical modeling 82
Missing values and numeric attributes 85
Discussion 88
4.3 Divide-and-conquer: Constructing decision trees 89
Calculating information 93
Highly branching attributes 94
Discussion 97
4.4 Covering algorithms: Constructing rules 97
Rules versus trees 98
A simple covering algorithm 98
Rules versus decision lists 103
4.5 Mining association rules 104
Item sets 105
Association rules 105
Generating rules efficiently 108
Discussion 111
4.6 Linear models 112
Numeric prediction 112
Classification 113
Discussion 113
4.7 Instance-based learning 114
The distance function 114
Discussion 115
4.8 Further reading 116
5 Credibility: Evaluating what's been learned 119
5.1 Training and testing 120
5.2 Predicting performance 123
5.3 Cross-validation 125
5.4 Other estimates 127
Leave-one-out 127
The bootstrap 128
5.5 Comparing data mining schemes 129
5.6 Predicting probabilities 133
Quadratic loss function 134
Informational loss function 135
Discussion 136
5.7 Counting the cost 137
Lift charts 139
ROC curves 141
Cost-sensitive learning 144
Discussion 145
5.8 Evaluating numeric prediction 147
5.9 The minimum description length principle 150
5.10 Applying MDL to clustering 154
5.11 Further reading 155
6 Implementations: Real machine learning schemes 157
6.1 Decision trees 159
Numeric attributes 159
Missing values 161
Pruning 162
Estimating error rates 164
Complexity of decision tree induction 167
From trees to rules 168
C4.5: Choices and options 169
Discussion 169
6.2 Classification rules 170
Criteria for choosing tests 171
Missing values, numeric attributes 172
Good rules and bad rules 173
Generating good rules 174
Generating good decision lists 175
Probability measure for rule evaluation 177
Evaluating rules using a test set 178
Obtaining rules from partial decision trees 181
Rules with exceptions 184
Discussion 187
6.3 Extending linear classification: Support vector machines 188
The maximum margin hyperplane 189
Nonlinear class boundaries 191
Discussion 193
6.4 Instance-based learning 193
Reducing the number of exemplars 194
Pruning noisy exemplars 194
Weighting attributes 195
Generalizing exemplars 196
Distance functions for generalized exemplars 197
Generalized distance functions 199
Discussion 200
6.5 Numeric prediction 201
Model trees 202
Building the tree 202
Pruning the tree 203
Nominal attributes 204
Missing values 204
Pseudo-code for model tree induction 205
Locally weighted linear regression 208
Discussion 209
6.6 Clustering 210
Iterative distance-based clustering 211
Incremental clustering 212
Category utility 217
Probability-based clustering 218
The EM algorithm 221
Extending the mixture model 223
Bayesian clustering 225
Discussion 226
7 Moving on: Engineering the input and output 229
7.1 Attribute selection 232
Scheme-independent selection 233
Searching the attribute space 235
Scheme-specific selection 236
7.2 Discretizing numeric attributes 238
Unsupervised discretization 239
Entropy-based discretization 240
Other discretization methods 243
Entropy-based versus error-based discretization 244
Converting discrete to numeric attributes 246
7.3 Automatic data cleansing 247
Improving decision trees 247
Robust regression 248
Detecting anomalies 249
7.4 Combining multiple models 250
Bagging 251
Boosting 254
Stacking 258
Error-correcting output codes 260
7.5 Further reading 263
8 Nuts and bolts: Machine learning algorithms in Java 265
8.1 Getting started 267
8.2 Javadoc and the class library 271
Classes, instances, and packages 272
The weka.core package 272
The weka.classifiers package 274
Other packages 276
Indexes 277
8.3 Processing datasets using the machine learning programs 277
Using M5' 277
Generic options 279
Scheme-specific options 282
Classifiers 283
Meta-learning schemes 286
Filters 289
Association rules 294
Clustering 296
8.4 Embedded machine learning 297
A simple message classifier 299
8.5 Writing new learning schemes 306
An example classifier 306
Conventions for implementing classifiers 314
Writing filters 314
An example filter 316
Conventions for writing filters 317
9 Looking forward 321
9.1 Learning from massive datasets 322
9.2 Visualizing machine learning 325
Visualizing the input 325
Visualizing the output 327
9.3 Incorporating domain knowledge 329
9.4 Text mining 331
Finding key phrases for documents 331
Finding information in running text 333
Soft parsing 334
9.5 Mining the World Wide Web 335
9.6 Further reading 336
References 339
Index 351
About the authors 371