This book is a milestone in the synthesis of data mining, data analysis, information theory, and machine learning.
--Jim Gray, Microsoft Research, Turing Award winner

This is an excellent textbook that combines data mining algorithms with data mining practice. Drawing on their rich experience, the authors give an accessible yet thorough introduction to the concepts of data mining and the techniques it uses (machine learning in particular), and offer sound advice on applying machine learning tools to data mining. The key elements of data mining are illustrated throughout with numerous examples.
The book also introduces Weka, a Java-based software system that can be used to analyze datasets, find applicable patterns, and perform sound analyses, as well as to develop your own machine learning solutions.
Key features of this book:
- Explains the principles behind data mining algorithms. Uses examples to help readers choose an algorithm appropriate to the situation at hand, and to compare and evaluate the results obtained by different methods.
- Describes techniques for improving performance, including data preprocessing and combining the outputs of different methods.
- Provides the Weka software used in the book together with supplementary learning material, downloadable from http://www.mkp.com/datamining.
Ian Witten is professor of computer science at the University of Waikato in Hamilton, New Zealand. He has taught at Essex University and at the University of Calgary, where he was head of computer science from 1982 to 1985. He holds degrees in mathematics from Cambridge University, computer science from the University of Calgary, and a Ph.D. in electrical engineering from Essex University, England. He has published extensively in academic conferences and journals on machine learning.
The underlying theme of his current research is the exploitation of information about a user's past behavior to expedite interaction in the future. In pursuit of this theme, he has been drawn into machine learning, which seeks ways to summarize, restructure, and generalize past experience; adaptive text compression, that is, using information about past text to encode upcoming characters; and user modeling, which is the general area of characterizing user behavior.
He directs a large project at Waikato on machine learning and its application to agriculture and has also been active recently in the area of document compression, indexing, and retrieval. He has also written many books over the last 15 years, the most recent of which is Managing Gigabytes: Compressing and Indexing Documents and Images, second edition (Morgan Kaufmann, 1999), with A. Moffat and T. C. Bell.
Eibe Frank is a Ph.D. candidate in computer science at the University of Waikato. His research focus is machine learning. He holds a degree in computer science from the University of Karlsruhe in Germany and is the author of several papers presented at machine learning conferences and published in journals.
The convergence of computing and communication has produced a society that feeds on information. Yet most of the information is in its raw form: data. If data is characterized as recorded facts, then information is the set of patterns, or expectations, that underlie the data. There is a huge amount of information locked up in databases--information that is potentially important but has not yet been discovered or articulated. Our mission is to bring it forth.
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Of course, there will be problems. Many patterns will be banal and uninteresting. Others will be spurious, contingent on accidental coincidences in the particular dataset used. And real data is imperfect: some parts are garbled, some missing. Anything that is discovered will be inexact: there will be exceptions to every rule and cases not covered by any rule. Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful.
Machine learning provides the technical basis of data mining. It is used to extract information from the raw data in databases--information that is expressed in a comprehensible form and can be used for a variety of purposes.
The process is one of abstraction: taking the data, warts and all, and inferring whatever structure underlies it. This book is about the tools and techniques of machine learning that are used in practical data mining for finding, and describing, structural patterns in data.
As with any burgeoning new technology that enjoys intense commercial attention, the use of data mining is surrounded by a great deal of hype in the technical--and sometimes the popular--press. Exaggerated reports appear of the secrets that can be uncovered by setting learning algorithms loose on oceans of data. But there is no magic in machine learning, no hidden power, no alchemy. Instead there is an identifiable body of simple and practical techniques that can often extract useful information from raw data. This book describes these techniques and shows how they work.
We interpret machine learning as the acquisition of structural descriptions from examples. The kind of descriptions that are found can be used for prediction, explanation, and understanding. Some data mining applications focus on prediction: forecasting what will happen in new situations from data that describe what happened in the past, often by guessing the classification of new examples. But we are equally--perhaps more--interested in applications where the result of "learning" is an actual description of a structure that can be used to classify examples. This structural description supports explanation and understanding as well as prediction. In our experience, insights gained by the user are of most interest in the majority of practical data mining applications; indeed, this is one of machine learning's major advantages over classical statistical modeling.
The book explains a wide variety of machine learning methods. Some are pedagogically motivated: simple schemes designed to explain clearly how the basic ideas work. Others are practical: real systems that are used in applications today. Many are contemporary and have been developed only in the last few years.
A comprehensive software resource, written in the Java language, has been created to illustrate the ideas in the book. Called the Waikato Environment for Knowledge Analysis, or Weka for short, this utility is available as source code on the World Wide Web via www.mkp.com/datamining or at www.cs.waikato.ac.nz/ml/weka. It is a full, industrial-strength implementation of essentially all the techniques that are covered in this book. It includes illustrative code and working implementations of machine learning methods. It offers clean, spare implementations of the simplest techniques, designed to aid understanding of the mechanisms involved. It also provides a workbench that includes full, working, state-of-the-art implementations of many popular learning schemes that can be used for practical data mining or for research. Finally, it contains a framework, in the form of a Java class library, that supports applications that use embedded machine learning and even the implementation of new learning schemes.
The objective of this book is to introduce the tools and techniques for machine learning that are used in data mining. After reading it, you will understand what these techniques are and appreciate their strengths and applicability.
If you wish to experiment with your own data, you will be able to do this with the Weka software. To use such tools sensibly, however, users need some understanding of how the algorithms they use work. It is often observed that data models are only as good as the person who interprets them, and that person needs to know something about how the models are produced in order to appreciate the strengths, and limitations, of the technology. However, it is not necessary for all users to have a deep understanding of the finer details of the algorithms.
We address this situation by describing machine learning methods at successive levels of detail. The reader will learn the basic ideas, the topmost level, by reading the first three chapters. Chapter 1 describes, through examples, what machine learning is and where it can be used; it also provides actual practical applications. Chapters 2 and 3 cover the different kinds of input and output--or knowledge representation--that are involved. Different kinds of output dictate different styles of algorithm, and at the next level, Chapter 4 describes the basic methods of machine learning, simplified to make them easy to comprehend.
Here the principles involved are conveyed in a variety of algorithms without getting bogged down in intricate details or tricky implementation issues. To make progress in the application of machine learning techniques to particular data mining problems, it is essential to be able to measure how well you are doing.
Chapter 5, which can be read out of sequence, equips the reader to evaluate the results that are obtained from machine learning, addressing the sometimes complex issues involved in performance evaluation.
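As a small taste of the evaluation machinery Chapter 5 covers, here is a minimal, self-contained Java sketch of the fold-splitting step behind k-fold cross-validation. The class and method names are our own illustrative inventions, not part of the Weka library, and the sketch omits the stratification by class that Chapter 5 recommends:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative helper: partitions instance indices into k cross-validation folds. */
public class FoldSplitter {

    /** Shuffles indices 0..n-1 with a fixed seed and deals them round-robin into k folds. */
    public static List<List<Integer>> split(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<Integer>());
        }
        for (int i = 0; i < n; i++) {
            // Deal the i-th shuffled index into fold i % k.
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        // Ten instances, three folds: every instance lands in exactly one fold.
        List<List<Integer>> folds = split(10, 3, 1L);
        int total = 0;
        for (List<Integer> fold : folds) {
            total += fold.size();
        }
        System.out.println(folds.size() + " folds covering " + total + " instances");
    }
}
```

Each fold in turn serves as the test set while the remaining folds are used for training, and the k error estimates are averaged; fixing the shuffle seed makes the experiment repeatable.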
At the lowest and most detailed level, Chapter 6 exposes in naked detail the nitty-gritty issues of implementing a spectrum of machine learning algorithms, including the complexities that are necessary for them to work well in practice. Although many readers may want to ignore this detailed information, it is at this level that the full, working, tested Java implementations of machine learning schemes are written. Chapter 7 discusses practical topics involved with engineering the input to machine learning--for example, selecting and discretizing attributes--and covers several more advanced techniques for refining and combining the output from different learning techniques. Chapter 8 describes the Java code that accompanies the book. You can skip to this chapter directly from Chapter 4 if you are in a hurry to get on with analyzing your data and don't want to be bothered with the technical details. Finally, Chapter 9 looks to the future.
The book does not cover all machine learning methods. In particular, we do not discuss neural nets because this technique produces predictions rather than structural descriptions; also, it is well described in some recent books on data mining. Nor do we cover reinforcement learning since it is rarely applied in practical data mining; nor genetic algorithm approaches since these are really just an optimization technique; nor Bayesian networks because algorithms for learning them are not yet robust enough to be deployed; nor relational learning and inductive logic programming since they are rarely used in mainstream data mining applications.
Java has been chosen for the implementations of machine learning techniques that accompany this book because, as an object-oriented programming language, it allows a uniform interface to learning schemes and methods for pre- and post-processing. We have chosen Java instead of C++, Smalltalk, or other object-oriented languages because programs written in Java can be run on almost any computer without having to be recompiled, or having to go through complicated installation procedures, or--worst of all--having to change the code itself. A Java program is compiled into byte-code that can be executed on any computer equipped with an appropriate interpreter. This interpreter is called the Java virtual machine. Java virtual machines--and, for that matter, Java compilers--are freely available for all important platforms.
Like all widely used programming languages, Java has received its share of criticism. Although this is not the place to elaborate on such issues, in several cases the critics are clearly right. However, of all currently available programming languages that are widely supported, standardized, and extensively documented, Java seems to be the best choice for the purpose of this book. Its main disadvantage is speed of execution--or lack of it. Executing a Java program is several times slower than running a corresponding program written in C because the virtual machine has to translate the byte-code into machine code before it can be executed. In our experience the difference is a factor of three to five if the virtual machine uses a just-in-time compiler. Instead of translating each byte-code individually, a just-in-time compiler translates whole chunks of byte-code into machine code, thereby achieving significant speedup. However, if this is still too slow for your application, there are compilers that translate Java programs directly into machine code, bypassing the byte-code step. Of course, this code cannot be executed on other platforms, thereby sacrificing one of Java's most important advantages.
Ian H. Witten: Professor in the Department of Computer Science at the University of Waikato, New Zealand; Fellow of the ACM and of the Royal Society of New Zealand; and recipient of the 2004 Namur Award of the International Federation for Information Processing (IFIP). His books include Managing Gigabytes: Compressing and Indexing Documents and Images and How to Build a Digital Library, along with numerous journal and conference articles.
Eibe Frank: Senior lecturer in the Department of Computer Science at the University of Waikato, New Zealand. He has published extensively in the area of machine learning and serves on the editorial boards of the Machine Learning Journal and the Journal of Artificial Intelligence Research. He has served on the program committees of many data mining and machine learning conferences, and is a core developer of the Weka machine learning software that accompanies this book.
Foreword vii
Preface xvii
1 What's it all about? 1
1.1 Data mining and machine learning 2
Describing structural patterns 4
Machine learning 5
Data mining 7
1.2 Simple examples: The weather problem and others 8
The weather problem 8
Contact lenses: An idealized problem 11
Irises: A classic numeric dataset 13
CPU performance: Introducing numeric prediction 15
Labor negotiations: A more realistic example 16
Soybean classification: A classic machine learning success 17
1.3 Fielded applications 20
Decisions involving judgment 21
Screening images 22
Load forecasting 23
Diagnosis 24
Marketing and sales 25
1.4 Machine learning and statistics 26
1.5 Generalization as search 27
Enumerating the concept space 28
Bias 29
1.6 Data mining and ethics 32
1.7 Further reading 34
2 Input: Concepts, instances, attributes 37
2.1 What's a concept? 38
2.2 What's in an example? 41
2.3 What's in an attribute? 45
2.4 Preparing the input 48
Gathering the data together 48
ARFF format 49
Attribute types 51
Missing values 52
Inaccurate values 53
Getting to know your data 54
2.5 Further reading 55
3 Output: Knowledge representation 57
3.1 Decision tables 58
3.2 Decision trees 58
3.3 Classification rules 59
3.4 Association rules 63
3.5 Rules with exceptions 64
3.6 Rules involving relations 67
3.7 Trees for numeric prediction 70
3.8 Instance-based representation 72
3.9 Clusters 75
3.10 Further reading 76
4 Algorithms: The basic methods 77
4.1 Inferring rudimentary rules 78
Missing values and numeric attributes 80
Discussion 81
4.2 Statistical modeling 82
Missing values and numeric attributes 85
Discussion 88
4.3 Divide-and-conquer: Constructing decision trees 89
Calculating information 93
Highly branching attributes 94
Discussion 97
4.4 Covering algorithms: Constructing rules 97
Rules versus trees 98
A simple covering algorithm 98
Rules versus decision lists 103
4.5 Mining association rules 104
Item sets 105
Association rules 105
Generating rules efficiently 108
Discussion 111
4.6 Linear models 112
Numeric prediction 112
Classification 113
Discussion 113
4.7 Instance-based learning 114
The distance function 114
Discussion 115
4.8 Further reading 116
5 Credibility: Evaluating what's been learned 119
5.1 Training and testing 120
5.2 Predicting performance 123
5.3 Cross-validation 125
5.4 Other estimates 127
Leave-one-out 127
The bootstrap 128
5.5 Comparing data mining schemes 129
5.6 Predicting probabilities 133
Quadratic loss function 134
Informational loss function 135
Discussion 136
5.7 Counting the cost 137
Lift charts 139
ROC curves 141
Cost-sensitive learning 144
Discussion 145
5.8 Evaluating numeric prediction 147
5.9 The minimum description length principle 150
5.10 Applying MDL to clustering 154
5.11 Further reading 155
6 Implementations: Real machine learning schemes 157
6.1 Decision trees 159
Numeric attributes 159
Missing values 161
Pruning 162
Estimating error rates 164
Complexity of decision tree induction 167
From trees to rules 168
C4.5: Choices and options 169
Discussion 169
6.2 Classification rules 170
Criteria for choosing tests 171
Missing values, numeric attributes 172
Good rules and bad rules 173
Generating good rules 174
Generating good decision lists 175
Probability measure for rule evaluation 177
Evaluating rules using a test set 178
Obtaining rules from partial decision trees 181
Rules with exceptions 184
Discussion 187
6.3 Extending linear classification: Support vector machines 188
The maximum margin hyperplane 189
Nonlinear class boundaries 191
Discussion 193
6.4 Instance-based learning 193
Reducing the number of exemplars 194
Pruning noisy exemplars 194
Weighting attributes 195
Generalizing exemplars 196
Distance functions for generalized exemplars 197
Generalized distance functions 199
Discussion 200
6.5 Numeric prediction 201
Model trees 202
Building the tree 202
Pruning the tree 203
Nominal attributes 204
Missing values 204
Pseudo-code for model tree induction 205
Locally weighted linear regression 208
Discussion 209
6.6 Clustering 210
Iterative distance-based clustering 211
Incremental clustering 212
Category utility 217
Probability-based clustering 218
The EM algorithm 221
Extending the mixture model 223
Bayesian clustering 225
Discussion 226
7 Moving on: Engineering the input and output 229
7.1 Attribute selection 232
Scheme-independent selection 233
Searching the attribute space 235
Scheme-specific selection 236
7.2 Discretizing numeric attributes 238
Unsupervised discretization 239
Entropy-based discretization 240
Other discretization methods 243
Entropy-based versus error-based discretization 244
Converting discrete to numeric attributes 246
7.3 Automatic data cleansing 247
Improving decision trees 247
Robust regression 248
Detecting anomalies 249
7.4 Combining multiple models 250
Bagging 251
Boosting 254
Stacking 258
Error-correcting output codes 260
7.5 Further reading 263
8 Nuts and bolts: Machine learning algorithms in Java 265
8.1 Getting started 267
8.2 Javadoc and the class library 271
Classes, instances, and packages 272
The weka.core package 272
The weka.classifiers package 274
Other packages 276
Indexes 277
8.3 Processing datasets using the machine learning programs 277
Using M5' 277
Generic options 279
Scheme-specific options 282
Classifiers 283
Meta-learning schemes 286
Filters 289
Association rules 294
Clustering 296
8.4 Embedded machine learning 297
A simple message classifier 299
8.5 Writing new learning schemes 306
An example classifier 306
Conventions for implementing classifiers 314
Writing filters 314
An example filter 316
Conventions for writing filters 317
9 Looking forward 321
9.1 Learning from massive datasets 322
9.2 Visualizing machine learning 325
Visualizing the input 325
Visualizing the output 327
9.3 Incorporating domain knowledge 329
9.4 Text mining 331
Finding key phrases for documents 331
Finding information in running text 333
Soft parsing 334
9.5 Mining the World Wide Web 335
9.6 Further reading 336
References 339
Index 351
About the authors 371