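Title | Web Data Mining (Discovering Knowledge from Hypertext Data, English edition) / Turing Original Computer Science Series |
Category | |
Author | Chakrabarti (India) |
Publisher | Posts & Telecom Press (人民邮电出版社) |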
Synopsis | Editor's recommendation

This book is a classic in the field of Web mining and search engines. It has been well received since publication and has been adopted as a textbook by leading universities such as Stanford, Princeton, and Carnegie Mellon. It first introduces fundamental topics such as Web crawling and search, then builds on them to explain in depth the machine learning techniques involved in solving the hard problems of Web mining. It presents many applications of machine learning in systems that acquire, store, and analyze data, and discusses the strengths, weaknesses, and prospects of those applications.

The analysis is thorough and forward-looking, and it lays a theoretical and practical foundation for building innovative Web mining applications. The book suits researchers, faculty, and students in information retrieval and machine learning, and is also an excellent reference for Web developers.

Content overview

This book is a well-known work in information retrieval that explains in depth the techniques for extracting and producing knowledge from large volumes of unstructured Web data. It first covers the foundations of the Web (including Web crawling, Web indexing, and keyword-based or similarity-based search), then systematically describes the fundamentals of Web mining, with emphasis on hypertext-based machine learning and data mining methods such as clustering, collaborative filtering, supervised learning, and semi-supervised learning. It closes by showing how these principles are applied in Web mining. The book gives readers a solid technical background and up-to-date knowledge.

It is an ideal reference for professionals engaged in data mining research and development, and is also suitable as a textbook for graduate students in computer science and related fields.

Contents

1 INTRODUCTION
  1.1 Crawling and Indexing
  1.2 Topic Directories
  1.3 Clustering and Classification
  1.4 Hyperlink Analysis
  1.5 Resource Discovery and Vertical Portals
  1.6 Structured vs. Unstructured Data Mining
  1.7 Bibliographic Notes
PART Ⅰ INFRASTRUCTURE
2 CRAWLING THE WEB
  2.1 HTML and HTTP Basics
  2.2 Crawling Basics
  2.3 Engineering Large-Scale Crawlers
    2.3.1 DNS Caching, Prefetching, and Resolution
    2.3.2 Multiple Concurrent Fetches
    2.3.3 Link Extraction and Normalization
    2.3.4 Robot Exclusion
    2.3.5 Eliminating Already-Visited URLs
    2.3.6 Spider Traps
    2.3.7 Avoiding Repeated Expansion of Links on Duplicate Pages
    2.3.8 Load Monitor and Manager
    2.3.9 Per-Server Work-Queues
    2.3.10 Text Repository
    2.3.11 Refreshing Crawled Pages
  2.4 Putting Together a Crawler
    2.4.1 Design of the Core Components
    2.4.2 Case Study: Using w3c-libwww
  2.5 Bibliographic Notes
3 WEB SEARCH AND INFORMATION RETRIEVAL
  3.1 Boolean Queries and the Inverted Index
    3.1.1 Stopwords and Stemming
    3.1.2 Batch Indexing and Updates
    3.1.3 Index Compression Techniques
  3.2 Relevance Ranking
    3.2.1 Recall and Precision
    3.2.2 The Vector-Space Model
    3.2.3 Relevance Feedback and Rocchio's Method
    3.2.4 Probabilistic Relevance Feedback Models
    3.2.5 Advanced Issues
  3.3 Similarity Search
    3.3.1 Handling "Find-Similar" Queries
    3.3.2 Eliminating Near Duplicates via Shingling
    3.3.3 Detecting Locally Similar Subgraphs of the Web
  3.4 Bibliographic Notes
PART Ⅱ LEARNING
4 SIMILARITY AND CLUSTERING
  4.1 Formulations and Approaches
    4.1.1 Partitioning Approaches
    4.1.2 Geometric Embedding Approaches
    4.1.3 Generative Models and Probabilistic Approaches
  4.2 Bottom-Up and Top-Down Partitioning Paradigms
    4.2.1 Agglomerative Clustering
    4.2.2 The k-Means Algorithm
  4.3 Clustering and Visualization via Embeddings
    4.3.1 Self-Organizing Maps (SOMs)
    4.3.2 Multidimensional Scaling (MDS) and FastMap
    4.3.3 Projections and Subspaces
    4.3.4 Latent Semantic Indexing (LSI)
  4.4 Probabilistic Approaches to Clustering
    4.4.1 Generative Distributions for Documents
    4.4.2 Mixture Models and Expectation Maximization (EM)
    4.4.3 Multiple Cause Mixture Model (MCMM)
    4.4.4 Aspect Models and Probabilistic LSI
    4.4.5 Model and Feature Selection
  4.5 Collaborative Filtering
    4.5.1 Probabilistic Models
    4.5.2 Combining Content-Based and Collaborative Features
  4.6 Bibliographic Notes
5 SUPERVISED LEARNING
  5.1 The Supervised Learning Scenario
  5.2 Overview of Classification Strategies
  5.3 Evaluating Text Classifiers
    5.3.1 Benchmarks
    5.3.2 Measures of Accuracy
  5.4 Nearest Neighbor Learners
    5.4.1 Pros and Cons
    5.4.2 Is TFIDF Appropriate?
  5.5 Feature Selection
    5.5.1 Greedy Inclusion Algorithms
    5.5.2 Truncation Algorithms
    5.5.3 Comparison and Discussion
  5.6 Bayesian Learners
    5.6.1 Naive Bayes Learners
    5.6.2 Small-Degree Bayesian Networks
  5.7 Exploiting Hierarchy among Topics
    5.7.1 Feature Selection
    5.7.2 Enhanced Parameter Estimation
    5.7.3 Training and Search Strategies
  5.8 Maximum Entropy Learners
  5.9 Discriminative Classification
    5.9.1 Linear Least-Square Regression
    5.9.2 Support Vector Machines
  5.10 Hypertext Classification
    5.10.1 Representing Hypertext for Supervised Learning
    5.10.2 Rule Induction
  5.11 Bibliographic Notes
6 SEMISUPERVISED LEARNING
  6.1 Expectation Maximization
    6.1.1 Experimental Results
    6.1.2 Reducing the Belief in Unlabeled Documents
    6.1.3 Modeling Labels Using Many Mixture Components
  6.2 Labeling Hypertext Graphs
    6.2.1 Absorbing Features from Neighboring Pages
    6.2.2 A Relaxation Labeling Algorithm
    6.2.3 A Metric Graph-Labeling Problem
  6.3 Co-training
  6.4 Bibliographic Notes
PART Ⅲ APPLICATIONS
7 SOCIAL NETWORK ANALYSIS
  7.1 Social Sciences and Bibliometry
    7.1.1 Prestige
    7.1.2 Centrality
    7.1.3 Co-citation
  7.2 PageRank and HITS
    7.2.1 PageRank
    7.2.2 HITS
    7.2.3 Stochastic HITS and Other Variants
  7.3 Shortcomings of the Coarse-Grained Graph Model
    7.3.1 Artifacts of Web Authorship
    7.3.2 Topic Contamination and Drift
  7.4 Enhanced Models and Techniques
    7.4.1 Avoiding Two-Party Nepotism
    7.4.2 Outlier Elimination
    7.4.3 Exploiting Anchor Text
    7.4.4 Exploiting Document Markup Structure
  7.5 Evaluation of Topic Distillation
    7.5.1 HITS and Related Algorithms
    7.5.2 Effect of Exploiting Other Hypertext Features
  7.6 Measuring and Modeling the Web
    7.6.1 Power-Law Degree Distributions
    7.6.2 The "Bow Tie" Structure and Bipartite Cores
    7.6.3 Sampling Web Pages at Random
  7.7 Bibliographic Notes
8 RESOURCE DISCOVERY
  8.1 Collecting Important Pages Preferentially
    8.1.1 Crawling as Guided Search in a Graph
    8.1.2 Keyword-Based Graph Search
  8.2 Similarity Search Using Link Topology
  8.3 Topical Locality and Focused Crawling
    8.3.1 Focused Crawling
    8.3.2 Identifying and Exploiting Hubs
    8.3.3 Learning Context Graphs
    8.3.4 Reinforcement Learning
  8.4 Discovering Communities
    8.4.1 Bipartite Cores as Communities
    8.4.2 Network Flow/Cut-Based Notions of Communities
  8.5 Bibliographic Notes
9 THE FUTURE OF WEB MINING
  9.1 Information Extraction
  9.2 Natural Language Processing
    9.2.1 Lexical Networks and Ontologies
    9.2.2 Part-of-Speech and Sense Tagging
    9.2.3 Parsing and Knowledge Representation
  9.3 Question Answering
  9.4 Profiles, Personalization, and Collaboration
References
Index |