Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (English Edition)
Authors: Ian H. Witten, Eibe Frank (New Zealand)
Series: Classic Original-Edition Library (经典原版书库)
Publication date: 2003-09-01
ISBN: 7-111-12769-2
Price: ¥40.00
Extended information
Language: English
Pages: 369
Trim size: 16开
Original title: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
Original publisher: Morgan Kaufmann Publishers
Category: Textbook
Includes CD:
Out of print:
About the Book

This book is a milestone in the integrated application of data mining, data analysis, information theory, and machine learning.
              -- Jim Gray, Turing Award winner, Microsoft Research
  This is an excellent textbook that seamlessly combines data mining algorithms with data mining practice. Drawing on their extensive experience, the authors give an accessible yet thorough introduction to the concepts of data mining and the techniques it employs (machine learning in particular), and offer sound advice on applying machine learning tools to data mining. The key elements of data mining are also illustrated through numerous examples.
  The book also introduces Weka, a Java-based software system that can be used to analyze datasets, find applicable patterns, and perform sound analyses, as well as to develop your own machine learning schemes.
  
  Key features of the book:
  Explains how data mining algorithms work. Uses examples to help readers choose an appropriate algorithm for the situation at hand and to compare and evaluate the results obtained from different methods. Presents techniques for improving performance, including data preprocessing and combining the output of different methods.
  Provides the Weka software used in the book along with supplementary learning materials, all of which can be downloaded from http://www.mkp.com/datamining.

Book Features

Ian Witten is professor of computer science at the University of Waikato in Hamilton, New Zealand. He has taught at Essex University and at the University of Calgary, where he was head of computer science from 1982 to 1985. He holds degrees in mathematics from Cambridge University, computer science from the University of Calgary, and a Ph.D. in electrical engineering from Essex University, England. He has published extensively in academic conferences and journals on machine learning.
  The underlying theme of his current research is the exploitation of information about a user's past behavior to expedite interaction in the future. In pursuit of this theme, he has been drawn into machine learning, which seeks ways to summarize, restructure, and generalize past experience; adaptive text compression, that is, using information about past text to encode upcoming characters; and user modeling, which is the general area of characterizing user behavior.
  He directs a large project at Waikato on machine learning and its application to agriculture and has also been active recently in the area of document compression, indexing, and retrieval. He has also written many books over the last 15 years, the most recent of which is Managing Gigabytes: Compressing and Indexing Documents and Images, second edition (Morgan Kaufmann 1999) with A. Moffat and T. C. Bell.
Eibe Frank is a Ph.D. candidate in computer science at the University of Waikato. His research focus is machine learning. He holds a degree in computer science from the University of Karlsruhe in Germany and is the author of several papers presented at machine learning conferences and published in journals.

Preface

The convergence of computing and communication has produced a society that feeds on information. Yet most of the information is in its raw form: data. If data is characterized as recorded facts, then information is the set of patterns, or expectations, that underlie the data. There is a huge amount of information locked up in databases--information that is potentially important but has not yet been discovered or articulated. Our mission is to bring it forth.
  Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. Of course, there will be problems. Many patterns will be banal and uninteresting. Others will be spurious, contingent on accidental coincidences in the particular dataset used. And real data is imperfect: some parts are garbled, some missing. Anything that is discovered will be inexact: there will be exceptions to every rule and cases not covered by any rule. Algorithms need to be robust enough to cope with imperfect data and to extract regularities that are inexact but useful.
  Machine learning provides the technical basis of data mining. It is used to extract information from the raw data in databases--information that is expressed in a comprehensible form and can be used for a variety of purposes.
  The process is one of abstraction: taking the data, warts and all, and inferring whatever structure underlies it. This book is about the tools and techniques of machine learning that are used in practical data mining for finding, and describing, structural patterns in data.
  As with any burgeoning new technology that enjoys intense commercial attention, the use of data mining is surrounded by a great deal of hype in the technical--and sometimes the popular--press. Exaggerated reports appear of the secrets that can be uncovered by setting learning algorithms loose on oceans of data. But there is no magic in machine learning, no hidden power, no alchemy. Instead there is an identifiable body of simple and practical techniques that can often extract useful information from raw data. This book describes these techniques and shows how they work.
  We interpret machine learning as the acquisition of structural descriptions from examples. The kind of descriptions that are found can be used for prediction, explanation, and understanding. Some data mining applications focus on prediction: forecasting what will happen in new situations from data that describe what happened in the past, often by guessing the classification of new examples. But we are equally--perhaps more--interested in applications where the result of "learning" is an actual description of a structure that can be used to classify examples. This structural description supports explanation and understanding as well as prediction. In our experience, insights gained by the user are of most interest in the majority of practical data mining applications; indeed, this is one of machine learning's major advantages over classical statistical modeling.
  The book explains a wide variety of machine learning methods. Some are pedagogically motivated: simple schemes designed to explain clearly how the basic ideas work. Others are practical: real systems that are used in applications today. Many are contemporary and have been developed only in the last few years.
  A comprehensive software resource, written in the Java language, has been created to illustrate the ideas in the book. Called the Waikato Environment for Knowledge Analysis, or Weka for short, this utility is available as source code on the World Wide Web via www.mkp.com/datamining or at www.cs.waikato.ac.nz/ml/weka. It is a full, industrial-strength implementation of essentially all the techniques that are covered in this book. It includes illustrative code and working implementations of machine learning methods. It offers clean, spare implementations of the simplest techniques, designed to aid understanding of the mechanisms involved. It also provides a workbench that includes full, working, state-of-the-art implementations of many popular learning schemes that can be used for practical data mining or for research. Finally, it contains a framework, in the form of a Java class library, that supports applications that use embedded machine learning and even the implementation of new learning schemes.
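  To give a flavor of how the Weka class library is used from Java, the following minimal sketch loads a dataset, builds a decision tree, and cross-validates it. The package and class names (weka.core.converters.ConverterUtils.DataSource, weka.classifiers.trees.J48) follow the current Weka 3.x API and may differ from the release distributed with this edition of the book; the file name weather.arff is a placeholder.

// Minimal Weka sketch (assumes the Weka 3.x API; names may differ in older releases).
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaSketch {
    public static void main(String[] args) throws Exception {
        // Load an ARFF dataset and declare the last attribute as the class.
        Instances data = DataSource.read("weather.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Build a C4.5-style decision tree (J48) and print the learned structure.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);

        // Estimate predictive accuracy with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}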
  The objective of this book is to introduce the tools and techniques for machine learning that are used in data mining. After reading it, you will understand what these techniques are and appreciate their strengths and applicability.
  If you wish to experiment with your own data, you will be able to do this with the Weka software. To apply these tools sensibly, however, you need some appreciation of how the algorithms they use work. It is often observed that data models are only as good as the person who interprets them, and that person needs to know something about how the models are produced in order to appreciate the strengths, and limitations, of the technology. However, it is not necessary for all users to have a deep understanding of the finer details of the algorithms.
  We address this situation by describing machine learning methods at successive levels of detail. The reader will learn the basic ideas, the topmost level, by reading the first three chapters. Chapter 1 describes, through examples, what machine learning is and where it can be used; it also provides actual practical applications. Chapters 2 and 3 cover the different kinds of input and output--or knowledge representation--that are involved. Different kinds of output dictate different styles of algorithm, and at the next level, Chapter 4 describes the basic methods of machine learning, simplified to make them easy to comprehend.
  Here the principles involved are conveyed in a variety of algorithms without getting bogged down in intricate details or tricky implementation issues. To make progress in the application of machine learning techniques to particular data mining problems, it is essential to be able to measure how well you are doing.
  Chapter 5, which can be read out of sequence, equips the reader to evaluate the results that are obtained from machine learning, addressing the sometimes complex issues involved in performance evaluation.
  At the lowest and most detailed level, Chapter 6 exposes in naked detail the nitty-gritty issues of implementing a spectrum of machine learning algorithms, including the complexities that are necessary for them to work well in practice. Although many readers may want to ignore this detailed information, it is at this level that the full, working, tested Java implementations of machine learning schemes are written. Chapter 7 discusses practical topics involved with engineering the input to machine learning--for example, selecting and discretizing attributes--and covers several more advanced techniques for refining and combining the output from different learning techniques. Chapter 8 describes the Java code that accompanies the book. You can skip to this chapter directly from Chapter 4 if you are in a hurry to get on with analyzing your data and don't want to be bothered with the technical details. Finally, Chapter 9 looks to the future.
  The book does not cover all machine learning methods. In particular, we do not discuss neural nets because this technique produces predictions rather than structural descriptions; also, it is well described in some recent books on data mining. Nor do we cover reinforcement learning since it is rarely applied in practical data mining; nor genetic algorithm approaches since these are really just an optimization technique; nor Bayesian networks because algorithms for learning them are not yet robust enough to be deployed; nor relational learning and inductive logic programming since they are rarely used in mainstream data mining applications.
  Java has been chosen for the implementations of machine learning techniques that accompany this book because, as an object-oriented programming language, it allows a uniform interface to learning schemes and methods for pre- and post-processing. We have chosen Java instead of C++, Smalltalk, or other object-oriented languages because programs written in Java can be run on almost any computer without having to be recompiled, or having to go through complicated installation procedures, or--worst of all--having to change the code itself. A Java program is compiled into byte-code that can be executed on any computer equipped with an appropriate interpreter. This interpreter is called the Java virtual machine. Java virtual machines--and, for that matter, Java compilers--are freely available for all important platforms.
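  The compile-once, run-anywhere model described above can be illustrated with a minimal example (file and class names here are purely illustrative). Compiling the source file produces platform-independent byte-code, which any Java virtual machine can then execute without recompilation:

// Hello.java -- illustrative example of Java's byte-code model.
// Compile:  javac Hello.java   (produces Hello.class, i.e., byte-code)
// Run:      java Hello         (the virtual machine interprets or JIT-compiles the byte-code)
public class Hello {
    public static void main(String[] args) {
        System.out.println("The same byte-code runs on any platform with a Java virtual machine.");
    }
}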
  Like all widely used programming languages, Java has received its share of criticism. Although this is not the place to elaborate on such issues, in several cases the critics are clearly right. However, of all currently available programming languages that are widely supported, standardized, and extensively documented, Java seems to be the best choice for the purpose of this book. Its main disadvantage is speed of execution--or lack of it. Executing a Java program is several times slower than running a corresponding program written in C because the virtual machine has to translate the byte-code into machine code before it can be executed. In our experience the difference is a factor of three to five if the virtual machine uses a just-in-time compiler. Instead of translating each byte-code individually, a just-in-time compiler translates whole chunks of byte-code into machine code, thereby achieving significant speedup. However, if this is still too slow for your application, there are compilers that translate Java programs directly into machine code, bypassing the byte-code step. Of course, this code cannot be executed on other platforms, thereby sacrificing one of Java's most important advantages.


About the Authors

Ian H. Witten: Professor in the Department of Computer Science at the University of Waikato, New Zealand; a member of the ACM and of the Royal Society of New Zealand; and winner of the 2004 Namur Award of the International Federation for Information Processing (IFIP). His books include Managing Gigabytes: Compressing and Indexing Documents and Images and How to Build a Digital Library, along with numerous journal and conference articles.
Eibe Frank: Senior lecturer in the Department of Computer Science at the University of Waikato, New Zealand. He has published widely in the field of machine learning and serves on the editorial boards of the Machine Learning Journal and the Journal of Artificial Intelligence Research. He is also a program committee member for many data mining and machine learning conferences and a core developer of the Weka machine learning software that accompanies this book.

Table of Contents

Foreword vii
Preface  xvii
1 What's it all about?  1
1.1 Data mining and machine learning  2
Describing structural patterns  4
Machine learning 5
Data mining 7
1.2 Simple examples: The weather problem and others  8
The weather problem  8
Contact lenses: An idealized problem  11
Irises: A classic numeric dataset  13
CPU performance: Introducing numeric prediction  15
Labor negotiations: A more realistic example  16
Soybean classification: A classic machine learning success 17
1.3 Fielded applications  20
Decisions involving judgment 21
Screening images  22
Load forecasting  23
Diagnosis  24
Marketing and sales  25
1.4 Machine learning and statistics  26
1.5 Generalization as search 27
Enumerating the concept space  28
Bias 29
1.6 Data mining and ethics  32
1.7 Further reading  34
2 Input: Concepts, instances, attributes 37
2.1 What's a concept?  38
2.2 What's in an example?  41
2.3 What's in an attribute?  45
2.4 Preparing the input 48
Gathering the data together 48
ARFF format  49
Attribute types 51
Missing values  52
Inaccurate values  53
Getting to know your data 54
2.5 Further reading  55
3 Output: Knowledge representation 57
3.1 Decision tables 58
3.2 Decision trees  58
3.3 Classification rules  59
3.4 Association rules  63
3.5 Rules with exceptions  64
3.6 Rules involving relations  67
3.7 Trees for numeric prediction  70
3.8 Instance-based representation 72
3.9 Clusters 75
3.10 Further reading  76
4 Algorithms: The basic methods 77
4.1 Inferring rudimentary rules  78
Missing values and numeric attributes  80
Discussion  81
4.2 Statistical modeling 82
Missing values and numeric attributes  85
Discussion  88
4.3 Divide-and-conquer: Constructing decision trees  89
Calculating information  93
Highly branching attributes  94
Discussion  97
4.4 Covering algorithms: Constructing rules  97
Rules versus trees  98
A simple covering algorithm  98
Rules versus decision lists  103
4.5 Mining association rules  104
Item sets  105
Association rules  105
Generating rules efficiently  108
Discussion  111
4.6 Linear models  112
Numeric prediction  112
Classification  113
Discussion  113
4.7 Instance-based learning  114
The distance function  114
Discussion  115
4.8 Further reading  116
5 Credibility: Evaluating what's been learned 119
5.1 Training and testing  120
5.2 Predicting performance 123
5.3 Cross-validation  125
5.4 Other estimates  127
Leave-one-out  127
The bootstrap  128
5.5 Comparing data mining schemes  129
5.6 Predicting probabilities  133
Quadratic loss function  134
Informational loss function  135
Discussion  136
5.7 Counting the cost  137
Lift charts  139
ROC curves  141
Cost-sensitive learning  144
Discussion  145
5.8 Evaluating numeric prediction  147
5.9 The minimum description length principle  150
5.10 Applying MDL to clustering 154
5.11 Further reading  155
6 Implementations: Real machine learning schemes 157
6.1 Decision trees  159
Numeric attributes  159
Missing values  161
Pruning  162
Estimating error rates  164
Complexity of decision tree induction  167
From trees to rules  168
C4.5: Choices and options  169
Discussion  169
6.2 Classification rules  170
Criteria for choosing tests  171
Missing values, numeric attributes  172
Good rules and bad rules  173
Generating good rules  174
Generating good decision lists  175
Probability measure for rule evaluation  177
Evaluating rules using a test set  178
Obtaining rules from partial decision trees  181
Rules with exceptions  184
Discussion  187
6.3 Extending linear classification: Support vector machines  188
The maximum margin hyperplane  189
Nonlinear class boundaries  191
Discussion  193
6.4 Instance-based learning 193
Reducing the number of exemplars  194
Pruning noisy exemplars  194
Weighting attributes  195
Generalizing exemplars  196
Distance functions for generalized exemplars  197
Generalized distance functions  199
Discussion  200
6.5 Numeric prediction  201
Model trees  202
Building the tree 202
Pruning the tree  203
Nominal attributes  204
Missing values  204
Pseudo-code for model tree induction  205
Locally weighted linear regression  208
Discussion  209
6.6 Clustering 210
Iterative distance-based clustering  211
Incremental clustering  212
Category utility  217
Probability-based clustering 218
The EM algorithm  221
Extending the mixture model  223
Bayesian clustering  225
Discussion  226
7 Moving on: Engineering the input and output 229
7.1 Attribute selection  232
Scheme-independent selection  233
Searching the attribute space 235
Scheme-specific selection  236
7.2 Discretizing numeric attributes  238
Unsupervised discretization  239
Entropy-based discretization  240
Other discretization methods  243
Entropy-based versus error-based discretization  244
Converting discrete to numeric attributes  246
7.3 Automatic data cleansing  247
Improving decision trees  247
Robust regression  248
Detecting anomalies  249
7.4 Combining multiple models  250
Bagging 251
Boosting 254
Stacking 258
Error-correcting output codes  260
7.5 Further reading  263
8 Nuts and bolts: Machine learning algorithms in Java 265
8.1 Getting started 267
8.2 Javadoc and the class library  271
Classes, instances, and packages  272
The weka.core package  272
The weka.classifiers package 274
Other packages  276
Indexes  277
8.3 Processing datasets using the machine learning programs  277
Using M5' 277
Generic options  279
Scheme-specific options  282
Classifiers  283
Meta-learning schemes 286
Filters  289
Association rules  294
Clustering 296
8.4 Embedded machine learning 297
A simple message classifier  299
8.5 Writing new learning schemes  306
An example classifier  306
Conventions for implementing classifiers 314
Writing filters 314
An example filter  316
Conventions for writing filters 317
9 Looking forward 321
9.1 Learning from massive datasets  322
9.2 Visualizing machine learning  325
Visualizing the input 325
Visualizing the output  327
9.3 Incorporating domain knowledge  329
9.4 Text mining 331
Finding key phrases for documents 331
Finding information in running text 333
Soft parsing 334
9.5 Mining the World Wide Web  335
9.6 Further reading  336
References 339
Index 351
About the authors 371
